Streaming Data IO in R

June 29th, 2015

Streaming Data IO in R – curl, jsonlite, mongolite by Jeroen Ooms.


The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users who are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistence of data.
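The streaming additions mentioned in the abstract revolve around line-delimited JSON (NDJSON), where each record occupies one line, so large data can be processed without loading it all at once. The talk covers the R side; as a language-neutral illustration of the idea only, here is a minimal Python sketch (the function name and batch size are mine, not part of jsonlite):

```python
import io
import json

def stream_ndjson(fp, batch_size=2):
    """Yield batches of records from a line-delimited JSON (NDJSON) stream.

    Mirrors the idea behind streaming JSON IO: instead of parsing one
    huge JSON array in memory, records arrive one per line and are
    handled in fixed-size batches.
    """
    batch = []
    for line in fp:
        line = line.strip()
        if not line:
            continue
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch

# Three records streamed in batches of two.
data = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')
batches = list(stream_ndjson(data))
```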

R, JSON, MongoDB, what’s there not to like? ;-)

From UseR! 2015.


ChemistryWorld Podcasts: Compounds (Phosgene)

June 29th, 2015

Chemistry in its elements: Compounds is a weekly podcast sponsored by ChemistryWorld, which features a chemical compound or group of compounds every week.

Matthew Gunter has a podcast entitled: Phosgene.

In case your recent history is a bit rusty, phosgene was one of the terror weapons of World War I. It accounted for 85% of the 100,000 deaths from chemical gas. Not as effective as, say, sarin, but no slouch.

Don’t run to the library, online guides or the FBI for recipes to make phosgene at home. Its use in industrial applications should give you a clue to an alternative to home-made phosgene. Using phosgene violates the laws of war, so being a thief as well should not trouble you.

No, I don’t have a list of locations that make or use phosgene, but then DHS probably doesn’t either. They are more concerned with terrorists using “nuclear weapons” or “gamma-ray bursts“. One is mechanically and technically difficult to do well and the other is impossible to control.

The idea of someone using a dual-wheel pickup and a plant pass to pick up and deliver phosgene gas is too simple to have occurred to them.

If you are pitching topic maps to a science/chemistry oriented audience, these podcasts make a nice starting point for expansion. To date there are two hundred and forty-two (242) of them.


A Critical Review of Recurrent Neural Networks for Sequence Learning

June 29th, 2015

A Critical Review of Recurrent Neural Networks for Sequence Learning by Zachary C. Lipton.


Countless learning tasks require awareness of time. Image captioning, speech synthesis, and video game playing all require that a model generate sequences of outputs. In other domains, such as time series prediction, video analysis, and music information retrieval, a model must learn from sequences of inputs. Significantly more interactive tasks, such as natural language translation, engaging in dialogue, and robotic control, often demand both.

Recurrent neural networks (RNNs) are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Unlike feedforward neural networks, recurrent networks can process examples one at a time, retaining a state, or memory, that reflects an arbitrarily long context window. While these networks have long been difficult to train and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled large-scale learning with recurrent nets.

Over the past few years, systems based on state of the art long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN) architectures have demonstrated record-setting performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this review of the literature we synthesize the body of research that over the past three decades has yielded and reduced to practice these powerful models. When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a mostly self-contained explication of state of the art systems, together with a historical perspective and ample references to the primary research.
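The “state, or memory” mentioned above is just a vector carried across time steps: h_t = tanh(W_xh·x_t + W_hh·h_{t-1}). A toy, pure-Python sketch with hand-picked weights (illustrative only, no biases, no training):

```python
import math

def rnn_step(x, h, W_xh, W_hh):
    """One step of a vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1}).

    x: input vector, h: previous hidden state, W_*: weight matrices
    given as lists of rows.
    """
    def matvec(W, v):
        return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
    return [math.tanh(p) for p in pre]

def run_sequence(xs, h0, W_xh, W_hh):
    """Process a sequence one input at a time, carrying the state forward."""
    h = h0
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh)
    return h

# Tiny example: 1-d input, 1-d state, hand-picked weights.
W_xh = [[0.5]]
W_hh = [[1.0]]
h = run_sequence([[1.0], [0.0], [0.0]], [0.0], W_xh, W_hh)
# h still reflects the input seen three steps earlier -- the "memory"
# the abstract describes.
```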

Lipton begins with an all too common lament:

The literature on recurrent neural networks can seem impenetrable to the uninitiated. Shorter papers assume familiarity with a large body of background literature. Diagrams are frequently underspecified, failing to indicate which edges span time steps and which don’t. Worse, jargon abounds while notation is frequently inconsistent across papers or overloaded within papers. Readers are frequently in the unenviable position of having to synthesize conflicting information across many papers in order to understand but one. For example, in many papers subscripts index both nodes and time steps. In others, h simultaneously stands for link functions and a layer of hidden nodes. The variable t simultaneously stands for both time indices and targets, sometimes in the same equation. Many terrific breakthrough papers have appeared recently, but clear reviews of recurrent neural network literature are rare.

Unfortunately, Lipton gives no pointers to where the variant practices occur, leaving the reader forewarned but not forearmed.

Still, this is a survey paper with seventy-three (73) references over thirty-three (33) pages, so I assume you will encounter various notation practices if you follow the references and current literature.

Capturing variations in notation, along with where they have been seen, won’t win the Turing Award but may improve the CS field overall.

BBC News Labs (and other news labs)

June 29th, 2015

BBC News Labs

I saw a tweet from the BBC News Labs saying:

We were News labs before it was cool.

cc @googlenewslab

Which was followed by this lively exchange:


From the about page:

This Jekyll-powered blog is pitched at interested Journalists, Technologists and Hacker Journalists, and provides regular updates on News Labs' activities.

We hope it will open new opportunities for collaborative work, by attracting attention from like-minded people in this space.

You can still find our major updates pitched at a broader audience here on the BBC Internet Blog.

About BBC News Labs

BBC News Labs is an incubator powered by BBC Connected Studio, and is charged with driving innovation for BBC News.

Our M.O.

We work as a multi-discipline incubator, exploring scalable opportunities at the intersection of:

  1. Journalism
  2. Technology
  3. Data

Our goals

  1. Harness BBC talent & creativity to drive Innovation
  2. Open new opportunities for Story-driven Journalism
  3. Support Innovation Transfer into Production
  4. Drive open standards through News Industry collaboration
  5. Raise BBC News’ Profile as an Innovator

You can find out more on the BBC News Labs corporate website here

Get in touch

We'd be delighted to hear from you or to see if you can contribute to one of our projects. Give us a shout at:

News Labs Links

For your Twitter following pleasure, news labs mentioned in this post:







Other news labs that should be added to this list?

PS: I would include @Journalism2ls. – Journalism Tools in a more general list.

Update: @NiemanLab: Nieman Journalism Lab at Harvard.

More Analytics Needed in Cyberdefense: [The first step towards cybersecurity is…]

June 28th, 2015

More Analytics Needed in Cyberdefense by David Stegon.

Before you credit this report too much, consider the following points:

Crunching the Survey Numbers

MeriTalk, on behalf of Splunk, conducted an online survey of 150 Federal and 152 State and Local cyber security pros in March 2015. The report has a margin of error of ±5.6% at a 95% confidence level. (slide 15)

Federal Computer Week (FCW) has 80,057 subscribers, approximately 21% of whom are Senior IT Management.

That’s 16,812 of the subscriber total and MeriTalk captured opinions from 150 “cyber security pros.”

Roughly, that means MeriTalk captured opinions from about 0.9% (150 of 16,812) of the senior IT management subscribers to Federal Computer Week.

A survey reaching less than 1% of cyber security pros doesn’t fill me with confidence about these survey “results.”
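The back-of-the-envelope arithmetic above can be checked in a couple of lines (figures taken from the post):

```python
subscribers = 80_057        # Federal Computer Week subscribers
senior_share = 0.21         # share who are Senior IT Management
respondents = 150           # federal cyber security pros surveyed

senior_it = round(subscribers * senior_share)   # 16,812
share_surveyed = respondents / senior_it        # about 0.009 as a fraction,
                                                # i.e. under 1% of the pool
```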

Big Data analytics for Cyberdefense

In addition to being a tiny portion of “cyber security pros,” you have to wonder what “big data” the respondents thought would be analyzed?

OPM wasn’t running any logging on its servers! (The Absence of Proof Against China on OPM Hacks)

Care to wager that other federal agencies and contractors are not running logging on their networks? I didn’t think so.

Big data techniques, properly understood and applied, can lead to valuable insights for cybersecurity. But note the qualifiers, “properly understood and applied…”

The first step towards cybersecurity is recognizing when vendors are taking your money and not improving your IT security.

Medical Sieve [Information Sieve]

June 28th, 2015

Medical Sieve

An effort to capture anomalies from medical imaging, package those with other data, and deliver it for use by clinicians.

If you think of each medical image as representing a large amount of data, the underlying idea is to filter out all but the most relevant data, so that clinicians are not confronting an overload of information.

In network terms, rather than displaying all of the current connections to a network (the ever popular eye-candy view of connections), displaying only those connections that are different from all the rest.

The same technique could be usefully applied in a number of “big data” areas.
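That “different from all the rest” filter is plain anomaly detection. A minimal sketch using a median-absolute-deviation rule (the connection fields and threshold here are hypothetical, not from the Medical Sieve project):

```python
from statistics import median

def unusual(connections, key, threshold=3.5):
    """Keep only connections whose `key` value deviates strongly from
    the median (median-absolute-deviation rule) -- the sieve idea:
    suppress the typical, surface the anomalous."""
    values = [c[key] for c in connections]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values identical except possible outliers.
        return [c for c in connections if c[key] != med]
    return [c for c in connections if abs(c[key] - med) / mad > threshold]

conns = [{"host": "a", "bytes": 100}, {"host": "b", "bytes": 110},
         {"host": "c", "bytes": 95},  {"host": "d", "bytes": 10_000}]
flagged = unusual(conns, "bytes")  # only the anomalous connection survives
```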

From the post:

Medical Sieve is an ambitious long-term exploratory grand challenge project to build a next generation cognitive assistant with advanced multimodal analytics, clinical knowledge and reasoning capabilities that is qualified to assist in clinical decision making in radiology and cardiology. It will exhibit a deep understanding of diseases and their interpretation in multiple modalities (X-ray, Ultrasound, CT, MRI, PET, Clinical text) covering various radiology and cardiology specialties. The project aims at producing a sieve that filters essential clinical and diagnostic imaging information to form anomaly-driven summaries and recommendations that tremendously reduce the viewing load of clinicians without negatively impacting diagnosis.

Statistics show that eye fatigue is a common problem with radiologists as they visually examine a large number of images per day. An emergency room radiologist may look at as many as 200 cases a day, and some of these imaging studies, particularly lower body CT angiography, can be as many as 3,000 images per study. Due to the volume overload, and the limited amount of clinical information available as part of imaging studies, diagnosis errors, particularly relating to coincidental diagnosis cases, can occur. With radiologists also being a scarce resource in many countries, it will be even more important to reduce the volume of data to be seen by clinicians, particularly when they have to be sent over low-bandwidth teleradiology networks.

MedicalSieve is an image-guided informatics system that acts as a medical sieve, filtering the essential clinical information physicians need to know about the patient for diagnosis and treatment planning. The system gathers clinical data about the patient from a variety of enterprise systems in hospitals including EMR, pharmacy, labs, ADT, and radiology/cardiology PACS systems using HL7 and DICOM adapters. It then uses sophisticated medical text and image processing, pattern recognition and machine learning techniques guided by advanced clinical knowledge to process clinical data about the patient to extract meaningful summaries indicating the anomalies. Finally, it creates advanced summaries of imaging studies capturing the salient anomalies detected in various viewpoints.

Medical Sieve is leading the way in diagnostic interpretation of medical imaging datasets guided by clinical knowledge, with many first-time inventions including (a) the first fully automatic spatio-temporal coronary stenosis detection and localization from 2D X-ray angiography studies, (b) novel methods for highly accurate benign/malignant discrimination in breast imaging, and (c) the first automated production of the AHA guideline 17-segment model for cardiac MRI diagnosis.

For more details on the project, please contact Tanveer Syeda-Mahmood.

You can watch a demo of our Medical Sieve Cognitive Assistant Application here.

Curious: How would you specify the exclusions of information? So that you could replicate the “filtered” view of the data?

Replication is a major issue in publicly funded research these days. No reason for that to be any different for data science.


Domain Modeling: Choose your tools

June 28th, 2015

Kirk Borne posted to Twitter:

Great analogy by @wcukierski at #GEOINT2015 on #DataScience Domain Modeling > bulldozers: toy model versus the real thing.



Does your tool adapt to the data? (The real bulldozer above.)

Or, do you adapt your data to the tool? (The toy bulldozer above.)

No, I’m not going there. That is like a “the best editor” flame war. You have to decide that question for yourself and your project.

Good luck!

The Week’s Most Popular Data Journalism Links [June 22nd]

June 28th, 2015

Top Ten #ddj: The Week’s Most Popular Data Journalism Links by GIJN Staff and Connected Action.

From the post:

What’s the data-driven journalism crowd tweeting? Here are the Top Ten links for Jun 11-18: mapping global tax evasion (@grandjeanmartin), vote for best data journalism site (@GENinnovate); data viz examples (@visualoop, @OKFN), data retention (@Frontal21) and more.

A number of compelling visualizations and in particular: SwissLeaks: the map of the globalized tax evasion. Imaginative visualization of countries but not with the typical global map.

A great first step but I don’t find country level visualizations (or agency level accountability) all that compelling. There is $X amount of tax avoidance in country Y but that lacks the impact of naming the people who are evading the taxes, perhaps along with a photo for the society pages and their current location.

BTW, you should start following #ddj on Twitter.

New York Philharmonic Performance History

June 28th, 2015

New York Philharmonic Performance History

From the post:

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

In an effort to make this data available for study, analysis, and reuse, the New York Philharmonic joins organizations like The Tate and the Cooper-Hewitt Smithsonian National Design Museum in making its own contribution to the Open Data movement.

The metadata here is released under the Creative Commons Public Domain CC0 licence. Please see the enclosed LICENCE file for more detail.

The data:

General Info (applies to the entire program):

  • id: GUID
  • ProgramID: Local NYP ID
  • Orchestra: Full orchestra name
  • Season: Defined as Sep 1 – Aug 31, displayed as “1842-43”

Concert Info (repeated for each individual performance within a program):

  • eventType: See term definitions
  • Location: Geographic location of the concert (countries are identified by their current name; for example, even though the orchestra played in Czechoslovakia, it is now identified in the data as the Czech Republic)
  • Venue: Name of the hall, theater, or building where the concert took place
  • Date: Full ISO date used, but ignore the TIME part (1842-12-07T05:00:00Z = Dec. 7, 1842)
  • Time: Actual time of the concert, e.g. “8:00PM”

Works Info (the fields below are repeated for each work performed on a program; by matching the index number of each field, you can tell which soloist(s) and conductor(s) performed a specific work on each of the concerts listed above):

  • worksConductorName: Last name, first name
  • worksComposerTitle: Composer last name, first / TITLE (NYP short titles used)
  • worksSoloistName: Last name, first name (if multiple soloists on a single work, delimited by semicolons)
  • worksSoloistInstrument: Instrument (if multiple soloists on a single work, delimited by semicolons)
  • worksSoloistRole: “S” means “Soloist”; “A” means “Assisting Artist” (if multiple soloists on a single work, delimited by semicolons)
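The index-matching rule under “Works Info” is easy to get wrong, so here is a short sketch of it. It assumes each program record carries its per-work fields as parallel lists (a hypothetical shape; check the actual dump format before relying on it), and the example record is invented:

```python
def pair_works(program):
    """Align parallel per-work fields by index: the i-th conductor,
    composer/title and soloists all belong to the i-th work on the
    program. Soloist fields may carry several values delimited by
    semicolons."""
    works = []
    for conductor, title, soloists, instruments in zip(
            program["worksConductorName"],
            program["worksComposerTitle"],
            program["worksSoloistName"],
            program["worksSoloistInstrument"]):
        works.append({
            "conductor": conductor,
            "composerTitle": title,
            "soloists": list(zip(soloists.split(";"), instruments.split(";")))
                        if soloists else [],
        })
    return works

program = {
    "worksConductorName": ["Hill, Ureli Corelli"],
    "worksComposerTitle": ["Beethoven, Ludwig van / SYMPHONY NO. 5"],
    "worksSoloistName": ["Otto, Antoinette;Horn, Charles Edward"],
    "worksSoloistInstrument": ["Soprano;Tenor"],
}
works = pair_works(program)
```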

A great starting place for a topic map for performances of the New York Philharmonic or for combination with topic maps for composers or soloists.

I first saw this in a tweet by Anna Kijas.

The Absence of Proof Against China on OPM Hacks

June 27th, 2015

The Obama Administration has failed to release any evidence connecting China to the OPM hacks.

Now we know why: Hacked OPM and Background Check Contractors Lacked Logs, DHS Says.

From the post:

Tracking everyday network traffic requires an investment and some managers decide the expense outweighs the risk of a breach going undetected, security experts say.

In this case, taking chances has delayed a probe into the exposure of secrets on potentially 18 million national security personnel.

Hopefully congressional hearings will expand “some managers” into a list of identified individuals.

That is a level of incompetence that verges on the criminal.

Not having accountability for government employees has not led to a secure IT infrastructure. Time to try something new. Like holding all employees accountable for their incompetence.

Running Lisp in Production

June 27th, 2015

Running Lisp in Production by Vsevolod Dyomkin and Kevin McIntire.

From the post:

At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.

We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provides several unique, albeit obscure, capabilities to support production systems (for the impatient, they are described in the final chapter).

An inspirational story about Lisp, along with tips on features you are unlikely to find elsewhere. A good read and worth the time.

Since the OPM is still running COBOL, I am sure one of your favorite agencies is still crunching Lisp. You might need to get them to upgrade.

Linked Data Repair and Certification

June 27th, 2015

1st International Workshop on Linked Data Repair and Certification (ReCert 2015) is a half-day workshop at the 8th International Conference on Knowledge Capture (K-CAP 2015).

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. ;-)

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, up to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible of receiving a certification stamp, which together with reputation metrics of the sources can lead to having trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Monday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.

As far as “trusted linked data sources,” I think the correct phrasing is: “less untrusted data sources than others.”

You know the phrase: “In God we trust, all others pay cash.”

Same is true for data. It may be a “trusted” source, but verify the data first, then trust.

Subjects For Less Obscure Topic Maps?

June 27th, 2015

A new window into our world with real-time trends

From the post:

Every journey we take on the web is unique. Yet looked at together, the questions and topics we search for can tell us a great deal about who we are and what we care about. That’s why today we’re announcing the biggest expansion of Google Trends since 2012. You can now find real-time data on everything from the FIFA scandal to Donald Trump’s presidential campaign kick-off, and get a sense of what stories people are searching for. Many of these changes are based on feedback we’ve collected through conversations with hundreds of journalists and others around the world—so whether you’re a reporter, a researcher, or an armchair trend-tracker, the new site gives you a faster, deeper and more comprehensive view of our world through the lens of Google Search.

Real-time data

You can now explore minute-by-minute, real-time data behind the more than 100 billion searches that take place on Google every month, getting deeper into the topics you care about. During major events like the Oscars or the NBA Finals, you’ll be able to track the stories most people are searching for and where in the world interest is peaking. Explore this data by selecting any time range in the last week from the date picker.

Follow @GoogleTrends for tweets about new data sets and trends.

See GoogleTrends at:

This has been sitting in a browser tab for several days. I could not decide if it was eye candy or something more serious.

After all, we are talking about searches ranging from the expert to the vulgar.

I went and visited today’s results at Google Trends, and found:

  • 5 Crater of Diamonds State Park, Arkansas
  • 17 Ted 2, Jurassic World
  • 22 World’s Ugliest Dog Contest [It doesn’t say if Trump entered or not.]
  • 35 Episcopal Church
  • 48 Grace Lee Boggs
  • 59 Raquel Welch
  • 68 Dodge, Mopar, Dodge Challenger
  • 79 Xbox One, Xbox, Television
  • 86 Escobar: Paradise Lost, Pablo Escobar, Benicio del Toro
  • 98 Islamic State of Iraq and the Levant

I was glad to see Raquel Welch was in the top 100 but saddened that she was outscored by the Episcopal Church. That has to sting.

When I think of topic maps that I can give you as examples, they involve taxes, Castrati, and other obscure topics. My favorite use case is an ancient text annotated with commentaries and comparative linguistics based on languages no longer spoken.

I know what interests me but not what interests other people.

Thoughts on using Google Trends to pick “hot” topics for topic mapping?

Celebrity Porn Alert!

June 27th, 2015

A security blog I was reading mentioned that when porn of a named celebrity leaks, the number of searches for that celebrity’s name plus “nude,” etc., jumps.

The blog also pointed out that malware-hosting sites rapidly adapt to appear among the top “hits” for such searches.

Not only do you run the risk of being discovered looking for celebrity porn, you may get an infection as well.

I wonder if you could trap CIA operatives by claiming to have compromising photos of Putin? ;-)

Is there a startup opportunity here? Safe celebrity porn?

FBI Builds Silencers For The Mentally Ill

June 26th, 2015

North Carolina Man Charged with Attempting to Provide Material Support to ISIL and Weapon Offenses

If you read the press release, you will miss these goodies from the complaint:

28. The FBI built a functional silencer at Sullivan’s request. That silencer does not bear the required serial number,7 and is not registered to Sullivan or any person in the National Firearms Registration and Transfer Record.

29. The FBI sent a package containing the silencer to Sullivan’s home at 5470 Rose Carswell Road, Morganton, North Carolina, according to Sullivan’s instructions. At approximately 4:15 p.m. on June 19, 2015, Sullivan’s mother picked up the mail, to include the package containing the silencer, from the mailbox and returned to the house. FBI surveillance confirmed Sullivan was in the house when his mother entered with the silencer.

30. On June 19, 2015, the FBI conducted a search of 5470 Carswell Road, Morganton, North Carolina, pursuant to the consent of Sullivan’s mother and a federal search warrant. Among other things, the FBI found the silencer delivered to Sullivan earlier that day, which was hidden under plastic in a crawlspace accessible from the basement of the home….

How did all this start?

10. On April 21, 2015, Sullivan’s father placed a “911” call to request police assistance at the family residence at 5470 Rose Carswell Road, Morganton, North Carolina. Sullivan’s father said: “I don’t know if it is ISIS or what, but he [Sullivan] is destroying Buddhas, and figurines and stuff.” He stated that Sullivan was destroying their “religious” items, had done so before, and this time Sullivan poured gasoline on some such items to burn them. Sullivan’s father added: “I mean, we are scared to leave the house.” Sullivan could be heard in the background stating: “why are you trying to say I am a terrorist?” and words to that effect, multiple times. Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well. Sullivan could be heard stating that “they” were going to put Sullivan “in jail my whole life,” or, alternatively: “they are not going to put me in jail. They are going to kill me.”

Of course, rather than a referral to mental health services, an FBI undercover agent made contact with Sullivan on June 6, 2015. You can read the recounting of the bizarre conversations with Sullivan in the complaint. It is an image file, so I have had to re-type anything that appears in this post.

According to the news release Sullivan was charged with:

one count of attempting to provide material support to ISIL,

one count of transporting and receiving a silencer in interstate commerce with intent to commit a felony, and

one count of receipt and possession of an unregistered silencer, unidentified by a serial number.

True enough, this is a person disturbed enough to do the following:

Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well.

How’s that for an answer to the complaint you are destroying religious items? You want to point out to the police you are destroying other stuff too?

Sullivan was suffering from paranoid delusions but rather than getting him help, the FBI set him up for being charged with attempting to assist ISIS and two silencer violations that occurred only because the FBI built and mailed him a silencer.

Victimizing the mentally ill pads the FBI’s terrorism statistics and serves to further the fictional war on terrorism.

DuckDuckGo search traffic soars 600% post-Snowden

June 26th, 2015

DuckDuckGo search traffic soars 600% post-Snowden by Lee Munson.

From the post:

When Gabriel Weinberg launched a new search engine in 2008 I doubt even he thought it would gain any traction in an online world dominated by Google.

Now, seven years on, Philadelphia-based startup DuckDuckGo – a search engine that launched with a promise to respect user privacy – has seen a massive increase in traffic, thanks largely to ex-NSA contractor Edward Snowden’s revelations.

Since Snowden began dumping documents two years ago, DuckDuckGo has seen a 600% increase in traffic (but not in China – just like its larger brethren, it’s blocked there), thanks largely to its unique selling point of not recording any information about its users or their previous searches.

Such a huge rise in traffic means DuckDuckGo now handles around 3 billion searches per year.

DuckDuckGo does not track its users. Instead, it makes money by displaying ads based on key words from your search string.

Hmmm, what if instead of key words from your search string, you pre-qualified yourself for ads?

Say for example I have a topic map fragment that pre-qualifies me for new books on computer science, bread baking, and waxed dental floss. When I use a search site, it uses those “topics” or key words to display ads to me.

That avoids displaying to me ads for new cars (don’t own one, don’t want one), hair replacement ads (not interested) and ski resorts (don’t ski).

Advertisers benefit because their ads are displayed to people who have qualified themselves as interested in their products. I don’t know what the difference in click-through rate would be but I suspect it would be substantial.
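A minimal sketch of that pre-qualification idea: the interest profile never leaves the client, and candidate ads are matched against it locally (every name and the scoring rule below are invented for illustration):

```python
def match_ads(interests, ads):
    """Rank ads by overlap with the user's self-declared interest topics.

    Matching happens client-side: the ad network ships a candidate list,
    and the user's interest profile never leaves the machine.
    """
    profile = {t.lower() for t in interests}
    scored = []
    for ad in ads:
        overlap = profile & {t.lower() for t in ad["topics"]}
        if overlap:
            scored.append((len(overlap), ad["id"]))
    # Most overlapping topics first.
    return [ad_id for _, ad_id in sorted(scored, reverse=True)]

interests = ["computer science", "bread baking", "waxed dental floss"]
ads = [
    {"id": "new-cars", "topics": ["cars"]},
    {"id": "cs-books", "topics": ["computer science", "books"]},
    {"id": "flour", "topics": ["bread baking"]},
]
ranked = match_ads(interests, ads)  # car ad never shown
```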


Top 10 data mining algorithms in plain R

June 26th, 2015

Top 10 data mining algorithms in plain R by Raymond Li.

From the post:

Knowing the top 10 most influential data mining algorithms is awesome.

Knowing how to USE the top 10 data mining algorithms in R is even more awesome.

That’s when you can slap a big ol’ “S” on your chest…

…because you’ll be unstoppable!

Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

By the end of this post…

You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away.

The table of contents follows his Top 10 data mining algorithms in plain English, with additions for R:

I would not be at all surprised to see these top ten (10) algorithms show up in other popular data mining languages.
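The post works the ten through in R; for a sense of how small some of these algorithms are at their core, here is one of the ten, k-nearest neighbors, stripped down to plain Python:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """k-nearest neighbors: label a query point by majority vote among
    its k closest training points (Euclidean distance)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated clusters with made-up labels.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (0.5, 0.5))  # lands in cluster "a"
```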


BBC Pages Censored by the EU

June 26th, 2015

List of BBC web pages which have been removed from Google’s search results by Neil McIntosh.

From the post:

Since a European Court of Justice ruling last year, individuals have the right to request that search engines remove certain web pages from their search results. Those pages usually contain personal information about individuals.

Following the ruling, Google removed a large number of links from its search results, including some to BBC web pages, and continues to delist pages from BBC Online.

The BBC has decided to make clear to licence fee payers which pages have been removed from Google’s search results by publishing this list of links. Each month, we’ll republish this list with new removals added at the top.

We are doing this primarily as a contribution to public policy. We think it is important that those with an interest in the “right to be forgotten” can ascertain which articles have been affected by the ruling. We hope it will contribute to the debate about this issue. We also think the integrity of the BBC’s online archive is important and, although the pages concerned remain published on BBC Online, removal from Google searches makes parts of that archive harder to find.

The pages affected by delinking may disappear from Google searches, but they do still exist on BBC Online. David Jordan, the BBC’s Director of Editorial Policy and Standards, has written a blog post which explains how we view that archive as “a matter of historic public record” and, thus, something we alter only in exceptional circumstances. The BBC’s rules on deleting content from BBC Online are strict; in general, unless content is specifically made available only for a limited time, the assumption is that what we publish on BBC Online will become part of a permanently accessible archive. To do anything else risks reducing transparency and damaging trust.

Kudos to the BBC for demonstrating the extent of censorship implied by the EU’s “right to be forgotten.” The “right to be forgotten” combines ignorance of technology with eurocentrism at its very worst. Not to mention being futile when directed at a search engine.

Just to get you started, here are the links from the post:

One caveat: when looking through this list it is worth noting that we are not told who has requested the delisting, and we should not leap to conclusions as to who is responsible. The request may not have come from the obvious subject of a story.

May 2015

April 2015

March 2015

February 2015

January 2015

December 2014

November 2014

October 2014

September 2014

August 2014

July 2014

One consequence of this listing is that I will have to follow the BBC blog to catch the new list of deletions, month by month. The writing is always enjoyable but it’s one more thing to track.

The thought does occur to me that analysis of the EU censored pages may reveal patterns of what materials are the most likely subjects of censorship.
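That analysis could start with something as simple as term frequencies over the slugs of the delisted URLs. A sketch, with hypothetical URLs standing in for the BBC's monthly lists:

```python
from collections import Counter
import re

# Hypothetical delisted URLs; substitute the links from the BBC's monthly lists.
urls = [
    "http://news.bbc.co.uk/2/hi/uk_news/court-fraud-conviction",
    "http://news.bbc.co.uk/2/hi/uk_news/court-appeal-dismissed",
    "http://news.bbc.co.uk/2/hi/business/bank-fraud-inquiry",
]

# Count the words in each URL's final path segment (its slug).
terms = Counter(
    word
    for url in urls
    for word in re.findall(r"[a-z]+", url.rsplit("/", 1)[-1])
)
print(terms.most_common(3))
```

Even this crude count would show whether, say, court reporting dominates the removals.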

In addition to the BBC list, one can imagine a search engine that only indexes EU censored pages. Would ad revenue sustain such an index or would it be pay-per-view?

It would be very ironic if EU censorship resulted in more publicity for people exercising their “right to be forgotten.” Not only ironic, but appropriate as well.

PS: You can follow the BBC Internet Blog on Twitter: @bbcinternetblog.

Topic Maps For Sharing (or NOT!)

June 26th, 2015

This is one slide (#38) out of several but I saw it posted by PBBsRealm (Brad M) and thought it was worth transcribing part of it:

From the slide:

Why is Cyber Security so Hard?

No common taxonomy

  • Information is power; sharing is seen as loss of power

[Searching on several phrases and NERC (North American Electricity Reliability Corporation), I have been unable to find the entire slide deck.]

Did you catch the line:

Information is power; sharing is seen as loss of power

You can use topic maps for sharing, but how much sharing you choose to do is up to you.

For example, assume your department is responsible for mapping data for ETL operations. Each analyst is using state of the art software to create mappings from field to field. In the process of creating those mappings, each analyst learns enough about those fields to make sure the mapping is correct.

Now one or more of your analysts leave for other positions. All the ad hoc knowledge they had of the data fields has been lost. With a topic map, you could have been accumulating power as each analyst discovered information about each data field.

If management requests the mapping you are using, you output the standard field to field mapping, with none of the extra information that you have accumulated for each field in a topic map. The underlying descriptions remain solely in your possession.

With topic maps, you can share a little or a lot, your call.
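In code terms, the analysts' extra knowledge rides along as private notes on each mapping, and the export to management simply drops them. A minimal sketch (the field names and note structure are invented for illustration):

```python
# Each mapping carries private notes (the analysts' accumulated knowledge)
# alongside the public source -> target association.
mappings = [
    {"source": "CUST_NO", "target": "customer_id",
     "notes": ["legacy IDs below 1000 are test accounts"]},
    {"source": "DOB", "target": "birth_date",
     "notes": ["stored as DDMMYY before the 1999 migration"]},
]

def export_public(mappings):
    """Emit only the field-to-field mapping; the notes stay in-house."""
    return [{"source": m["source"], "target": m["target"]} for m in mappings]

print(export_public(mappings))
```

Management gets a correct answer to its request; the department keeps the knowledge that makes the mapping maintainable.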

PS: You can also encrypt the values you use for merging in your topic map. Which could enable different levels of merging for one map, based upon a level of security clearance. An example would be a topic map resource accessible by people with varying security clearances. (CIA/NSA take note.)
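One way to implement that: HMAC the subject identifiers used for merging, with a different key per clearance level, so topics merge only for readers holding the matching key. A sketch (the keys and identifier are illustrative):

```python
import hashlib
import hmac

def merge_key(subject_identifier: str, level_key: bytes) -> str:
    """Topics merge when their digests match, i.e. only under the same level key."""
    return hmac.new(level_key, subject_identifier.encode(), hashlib.sha256).hexdigest()

secret_key = b"level-secret"   # held by cleared readers (illustrative)
public_key = b"level-public"   # held by everyone (illustrative)

# The same subject yields the same digest under the same key...
assert merge_key("agent-007", secret_key) == merge_key("agent-007", secret_key)
# ...but different digests under different keys, so no cross-level merging.
assert merge_key("agent-007", secret_key) != merge_key("agent-007", public_key)
```

Readers without the secret key see two unmergeable topics where cleared readers see one.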

Internationalization & Unicode Conference ICU 39

June 25th, 2015

Internationalization & Unicode Conference ICU 39

October 26-28, 2015 – Santa Clara, CA USA

From the webpage:

The Internationalization and Unicode® Conference (IUC) is the premier event covering the latest in industry standards and best practices for bringing software and Web applications to worldwide markets. This annual event focuses on software and Web globalization, bringing together internationalization experts, tools vendors, software implementers, and business and program managers from around the world. 

Expert practitioners and industry leaders present detailed recommendations for businesses looking to expand to new international markets and those seeking to improve time to market and cost-efficiency of supporting existing markets. Recent conferences have provided specific advice on designing software for European countries, Latin America, China, India, Japan, Korea, the Middle East, and emerging markets.

This highly rated conference features excellent technical content, industry-tested recommendations and updates on the latest standards and technology. Subject areas include web globalization, programming practices, endangered languages and un-encoded scripts, integrating with social networking software, and implementing mobile apps. This year’s conference will also highlight new features in Unicode and other relevant standards. 

In addition, please join us in welcoming over 20 first-time speakers to the program! This is just another reason to attend; fresh talks, fresh faces, and fresh ideas!

(emphasis and colors in original)

If you want your software to be an edge case and hard to migrate in the future, go ahead, don’t support Unicode. Unicode libraries exist in all the major and many minor programming languages. Not supporting Unicode isn’t simpler, it’s just dumber.

Sorry, I have been a long time follower of the Unicode work and an occasional individual member of the Consortium. Those of us old enough to remember pre-Unicode days want to lessen the burden of interchanging texts, not increase it.

Enjoy the conference!

1.5 Million Slavery Era Documents Will Be Digitized…

June 25th, 2015

1.5 Million Slavery Era Documents Will Be Digitized, Helping African Americans to Learn About Their Lost Ancestors

From the post:

The Freedmen’s Bureau Project — a new initiative spearheaded by the Smithsonian, the National Archives, the Afro-American Historical and Genealogical Society, and the Church of Jesus Christ of Latter-Day Saints — will make available online 1.5 million historical documents, finally allowing ancestors [sic. descendants] of former African-American slaves to learn more about their family roots. Near the end of the US Civil War, The Freedmen’s Bureau was created to help newly-freed slaves find their footing in postbellum America. The Bureau “opened schools to educate the illiterate, managed hospitals, rationed food and clothing for the destitute, and even solemnized marriages.” And, along the way, the Bureau gathered handwritten records on roughly 4 million African Americans. Now, those documents are being digitized with the help of volunteers, and, by the end of 2016, they will be made available in a searchable database. According to Hollis Gentry, a Smithsonian genealogist, this archive “will give African Americans the ability to explore some of the earliest records detailing people who were formerly enslaved,” finally giving us a sense “of their voice, their dreams.”

You can learn more about the project by watching the video below, and you can volunteer your own services here.

A crowd sourced project that has a great deal of promise with regard to records on 4 million African Americans, who were previously held as slaves.

Making the documents “searchable” will be of immense value. However, imagine capturing the myriad relationships documented in these records so that subsequent searchers can more quickly find relationships you have already documented.

Finding former slaves with a common owner or other commonalities, could be the clues others need to untangle a past we only see dimly.
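Once the records are structured, finding those commonalities is a grouping problem. A sketch with invented record fields (the real fields would come from the digitized documents):

```python
from collections import defaultdict

# Invented records standing in for transcribed Freedmen's Bureau entries.
records = [
    {"name": "Ella Johnson", "former_owner": "J. Whitfield", "county": "Greene"},
    {"name": "Moses Carter", "former_owner": "J. Whitfield", "county": "Greene"},
    {"name": "Ruth Lee",     "former_owner": "T. Bell",      "county": "Macon"},
]

# Group people by former owner; those sharing an owner are candidates
# for further relationship research.
by_owner = defaultdict(list)
for r in records:
    by_owner[r["former_owner"]].append(r["name"])

print(dict(by_owner))
```

A topic map goes further by recording *why* two entries were judged to refer to the same person, so later researchers inherit that reasoning instead of redoing it.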

Topic maps are a nice fit for this work.

Eidyia (Scientific Python)

June 25th, 2015


From the webpage:

A scientific Python 3 environment configured with Vagrant. This environment is designed to be used by professionals and students, with ease of access a priority.

Libraries included:


Eidyia also includes MongoDB and PostgreSQL

Getting Started

With Vagrant and VirtualBox installed:

Beware: the Vagrant link on the GitHub page is broken. The correct link appears above. (I am posting an issue about the link to GitHub.)

The more experience I have with virtual environments, the more I like them. Mostly from a configuration perspective. I don’t have to worry about library upgrades stepping on other programs, port confusion, etc.


Flash Audit on OPM Infrastructure Update Plan

June 24th, 2015

Flash Audit Alert – U.S. Office of Personnel Management’s Infrastructure Improvement Project (Report No. 4A-CI-00-15-055)

Hot off the presses! Just posted online today!

From the report:

The U.S. Office of Personnel Management (OPM) Office of the Inspector General (OIG) is issuing this Flash Audit Alert to bring to your immediate attention serious concerns we have regarding the Office of the Chief Information Officer’s (OCIO) infrastructure improvement project (Project). This Project includes a full overhaul of the agency’s technical infrastructure by implementing additional information technology (IT) security controls and then migrating the entire infrastructure into a completely new environment (referred to as Shell).

Our primary concern is that the OCIO has not followed U.S. Office of Management and Budget (OMB) requirements and project management best practices. The OCIO has initiated this project without a complete understanding of the scope of OPM’s existing technical infrastructure or the scale and costs of the effort required to migrate it to the new environment.

In addition, we have concerns with the nontraditional Government procurement vehicle that was used to secure a sole-source contract with a vendor to manage the infrastructure overhaul. While we agree that the sole-source contract may have been appropriate for the initial phases of securing the existing technical environment, we do not agree that it is appropriate to use this vehicle for the long-term system migration efforts.

How bad is it?

Several examples of critical processes that OPM has not completed for this project include:

  • Project charter;
  • Comprehensive list of project stakeholders;
  • Feasibility study to address scope and timeline in concert with budgetary justification/cost estimates;
  • Impact assessment for existing systems and stakeholders;
  • Quality assurance plan and procedures for contractor oversight;
  • Technological infrastructure acquisition plan;
  • High-level test plan; and,
  • Implementation plan to include resource planning, readiness assessment plan, success factors, conversion plan, and back-out plan.

The report isn’t that long, six (6) pages in total, but it is a snapshot of bad project management in its essence.

I helped torpedo a project once upon a time where management defended a one paragraph email description of a proposed CMS system as being “agile.” The word they were looking for was “juvenile,” but they were unwilling to admit to years of mistakes in allowing the “programmer” (used very loosely) to remain employed.

What do you think of inspector generals as an audience for topic maps? They investigate large and disorganized agencies, repeatedly over time, with lots of players and documents. Thoughts?

PS: I read about the flash audit report several days ago but didn’t want to post about it until I could share a source for it. Would make great example material for a course on project management.

World Factbook 2015 (paper, online, downloadable)

June 24th, 2015

World Factbook 2015 (GPO)

From the webpage:

The Central Intelligence Agency’s World Factbook provides brief information on the history, geography, people, government, economy, communications, transportation, military, and transnational issues for 267 countries and regions around world.

The CIA’s World Factbook also contains several appendices and maps of major world regions, which are located at the very end of the publication. The appendices cover abbreviations, international organizations and groups, selected international environmental agreements, weights and measures, cross-reference lists of country and hydrographic data codes, and geographic names.

For maps, it provides a country map for each country entry and a total of 12 regional reference maps that display the physical features and political boundaries of each world region. It also includes a pull-out Flags of the World, a Physical Map of the World, a Political Map of the World, and a Standard Time Zones of the World map.

Who should read The World Factbook? It is a great one-stop reference for anyone looking for an expansive body of international data on world statistics, and has been a must-have publication for:

  • US Government officials and diplomats
  • News organizations and researchers
  • Corporations and geographers
  • Teachers, professors, librarians, and students
  • Anyone who travels abroad or who is interested in foreign countries

The print version is $89.00 (U.S.), is 923 pages long and weighs in at 5.75 lb. in paperback.

A convenient and frequently updated alternative is the online CIA World Factbook.

I can’t compare the two versions because I am not going to spend $89.00 for an arm wrecker. ;-)

You can also download a copy of the HTML version.

I downloaded and unzipped the file, only to find that the last update was in June, 2014.

That may be updated soon or it may not. I really don’t know.

If you just need background information that is unlikely to change or you want to avoid surveillance on what countries you look at and for how long, download the 2014 HTML version or pony up for the 2015 paper version.

Semi-nude Photos and iPhones

June 24th, 2015

Graham Cluley has advice in Dear politicians, here’s some advice before you check out semi-nude photos on your iPhone… that works for everyone viewing semi-nude photos on their iPhones, not just politicians.

In a prior post, Nude Heather Morris pictures – hacker blamed, Graham has this advice on taking nude photos of yourself (iPhone or not):


Keeping that in your wallet may or may not help.

Startup idea: App that prevents nude or semi-nude photos of the phone owner. ;-)

Would you choose a 1,425% or 0% ROI?

June 24th, 2015

The 2015 Trustwave Global Security Report calculates that attacks on end users enjoy an ROI of 1,425%.

Can you guess the liability for producing software that allows attacks on end users?

It’s the same as the ROI on making software secure: 0%.
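The ROI arithmetic is simple enough to check for yourself. The attacker figures below are illustrative, chosen only to reproduce a 1,425% result; the report derives its number from estimated attacker costs and revenues:

```python
def roi(gain, cost):
    """Return on investment as a percentage: (gain - cost) / cost * 100."""
    return (gain - cost) / cost * 100

# Illustrative attacker economics: spend 1,000, take in 15,250.
assert roi(15_250, 1_000) == 1425.0

# A vendor whose security spending generates no extra revenue at best
# breaks even on it -- the 0% side of the comparison.
assert roi(1_000, 1_000) == 0.0
```

With incentives that lopsided, the surprise would be if attacks on end users *didn't* keep growing.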

No doubt the Obama administration will spend $millions if not $billions in its multi-year cyber egg roll to improve cybersecurity for government networks, but the result will be:


an insecure IT stack topped off by insecure security software.

Unless and until there are economic incentives and hence meaningful ROIs for secure software, cyberinsecurity will continue.

Given the near idolatry of capitalism and economic incentives in the United States, it is truly surprising that lesson remains unlearned.

Well, save for the realization that secure software requires more investment in tools, training, and testing than current approaches to building commercial software.

Customers demanding more secure software, customers willing to pay more for it, and liability for producing insecure software are all keys to solving (over time) the current state of cyberinsecurity.

Country Reports on Terrorism 2014

June 23rd, 2015

Country Reports on Terrorism 2014 by United States Department of State. (June 2015)

The report runs some three hundred and eighty-eight (388) pages, but I thought you would find the criteria for inclusion (starting on page 386) rather amusing.

Note in particular that:

Section 2656f(d) of Title 22 of the United States Code defines certain key terms used in Section 2656f(a) as follows:

(2) the term “terrorism” means premeditated, politically motivated violence perpetrated against non-combatant targets by subnational groups or clandestine agents; and (emphasis added)

I am guessing that means that US drone pilots are not terrorists because they are not part of “subnational groups or clandestine agents.”

The rest of the report is fairly disheartening. If you have been following the attempts of the Obama White House to say nice things about the slavers operating out of Malaysia, you will find a number of laudatory comments about Malaysia in this report.

You will notice that the number of those “killed” by “terrorists” figures prominently in the report, but no mention is made of civilian casualties inflicted by the United States over the same years.

As a political theater document you may find this useful for detecting shifts in who is a “favorite” of the State Department when this report is published every year.

The report originates under the following authority:

Section 2656f(a) of Title 22 of the United States Code states as follows:

(a) … The Secretary of State shall transmit to the Speaker of the House of Representatives and the Committee on Foreign Relations of the Senate, by April 30 of each year, a full and complete report providing –

(1) (A) detailed assessments with respect to each foreign country –

(i) in which acts of international terrorism occurred which were, in the opinion of the Secretary, of major significance;

(ii) about which the Congress was notified during the preceding five years pursuant to Section 2405(j) of the Export Administration Act of 1979; and

(iii) which the Secretary determines should be the subject of such report; and

(B) detailed assessments with respect to each foreign country whose territory is being used as a sanctuary for terrorist organizations;

(2) all relevant information about the activities during the preceding year of any terrorist group, and any umbrella group under which such terrorist group falls, known to be responsible for the kidnapping or death of an American citizen during the preceding five years, any terrorist group known to have obtained or developed, or to have attempted to obtain or develop, weapons of mass destruction, any terrorist group known to be financed by countries about which Congress was notified during the preceding year pursuant to section 2405(j) of the Export Administration Act of 1979, any group designated by the Secretary as a foreign terrorist organization under section 219 of the Immigration and Nationality Act (8 U.S.C. 1189), and any other known international terrorist group which the Secretary determines should be the subject of such report;

(3) with respect to each foreign country from which the United States Government has sought cooperation during the previous five years in the investigation or prosecution of an act of international terrorism against United States citizens or interests, information on –

(A) the extent to which the government of the foreign country is cooperating with the United States Government in apprehending, convicting, and punishing the individual or individuals responsible for the act; and

(B) the extent to which the government of the foreign country is cooperating in preventing further acts of terrorism against United States citizens in the foreign country; and

(4) with respect to each foreign country from which the United States Government has sought cooperation during the previous five years in the prevention of an act of international terrorism against such citizens or interests, the information described in paragraph (3)(B).

Section 2656f(d) of Title 22 of the United States Code defines certain key terms used in Section 2656f(a) as follows:

(1) the term “international terrorism” means terrorism involving citizens or the territory of more than one country;

(2) the term “terrorism” means premeditated, politically motivated violence perpetrated against non-combatant targets by subnational groups or clandestine agents; and

(3) the term “terrorist group” means any group practicing, or which has significant subgroups which practice, international terrorism.

I first saw this in a tweet by switched.


LuxRender

June 22nd, 2015

LuxRender – Physically Based Renderer.

From the webpage:

LuxRender is a physically based and unbiased rendering engine. Based on state of the art algorithms, LuxRender simulates the flow of light according to physical equations, thus producing realistic images of photographic quality.

LuxRender is now a member project of the Software Freedom Conservancy which provides administrative and financial support to FOSS projects. This allows us to receive donations, which can be tax deductible in the US.

Physically based spectral rendering

LuxRender is built on physically based equations that model the transportation of light. This allows it to accurately capture a wide range of phenomena which most other rendering programs are simply unable to reproduce. This also means that it fully supports high-dynamic range (HDR) rendering.


LuxRender features a variety of material types. Apart from generic materials such as matte and glossy, physically accurate representations of metal, glass, and car paint are present. Complex properties such as absorption, dispersive refraction and thin film coating are available.

Fleximage (virtual film)

The virtual film allows you to pause and continue a rendering at any time. The current state of the rendering can even be written to a file, so that the computer (or even another computer) can continue rendering at a later moment.

Free for everyone

LuxRender is and will always be free software, both for private and commercial use. It is being developed by people with a passion for programming and for computer graphics who like sharing their work. We encourage you to download LuxRender and use it to express your artistic ideas. (learn more)

Too advanced for my graphic skills but I thought some of you might find this useful in populating your topic maps with high-end visualizations.
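For the curious, the “physical equations” the page refers to are, at bottom, Kajiya's rendering equation, which unbiased engines like LuxRender estimate by Monte Carlo sampling:

```latex
L_o(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\,
    (\omega_i \cdot n)\, d\omega_i
```

Outgoing radiance at a surface point equals emitted radiance plus incoming radiance from every direction, weighted by the material's BRDF and the cosine of the angle of incidence.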

I first saw this in a tweet by David Bucciarelli that announced the LuxRender v1.5RC1 release.

Learning to Execute

June 22nd, 2015

Learning to Execute by Wojciech Zaremba and Ilya Sutskever.


Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.
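The addition sub-task is easy to reproduce as data: sample operand lengths up to a curriculum cap, render the problem as characters, and pair it with the evaluated answer. A sketch (the exact formatting in the paper differs):

```python
import random

def addition_example(max_digits, rng):
    """One (input, target) character pair for the addition task."""
    # Curriculum: draw the operand length up to the current difficulty cap.
    d = rng.randint(1, max_digits)
    a = rng.randint(0, 10 ** d - 1)
    b = rng.randint(0, 10 ** d - 1)
    return f"{a}+{b}.", str(a + b)

rng = random.Random(0)
# Early in training the cap is small; it grows toward 9 digits later.
for cap in (2, 5, 9):
    x, y = addition_example(cap, rng)
    lhs, rhs = x.rstrip(".").split("+")
    assert int(lhs) + int(rhs) == int(y)   # the target is the evaluated input
    print(x, "->", y)
```

The paper's finding is that *how* you raise that cap matters: their blended curriculum (mixing easy and hard examples) trains where a strictly increasing one stalls.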

Code to replicate the experiments:

A step towards generation of code that conforms to coding standards?

I first saw this in a tweet by samin.