Computation + Journalism Symposium 2015

October 3rd, 2015

Computation + Journalism Symposium 2015

From the webpage:

Data and computation drive our world, often without sufficient critical assessment or accountability. Journalism is adapting responsibly—finding and creating new kinds of stories that respond directly to our new societal condition. Join us for a two-day conference exploring the interface between journalism and computing.

Papers are up! Papers are up!

Many excellent papers but one caught my eye in particular:

DeScipher: A Text Simplification Tool for Science Journalism, Yea Seul Kim, Jessica Hullman and Eytan Adar.

High on my reading list after spending a day with “almost” explanations in technical documentation.

This could be very useful for anyone authoring technical documentation, not to mention anyone writing for the general public.

Workflow for R & Shakespeare

October 2nd, 2015

A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2

From the post:

Over the last year I have changed my data processing and manipulation workflow in R dramatically. Thanks to some great new packages like dplyr, tidyr and magrittr (as well as the less-new ggplot2) I've been able to streamline code and speed up processing. Up until 2014, I had used essentially the same R workflow (aggregate, merge, apply/tapply, reshape etc) for more than 10 years. I have added a few improvements over the years in the form of functions in packages doBy, reshape2 and plyr and I also flirted with the package data.table (which I found to be much faster for big datasets but the syntax made it difficult to work with) — but the basic flow has remained remarkably similar. Until now…

Given how much I've enjoyed the speed and clarity of the new workflow, I thought I would share a quick demonstration.

In this example, I am going to grab data from a sample SQL database provided by Google via Google BigQuery and then give examples of manipulation using dplyr, magrittr and tidyr (and ggplot2 for visualization).

This is a great introduction to a work flow in R that you can generalize for your own purposes.

Word counts won’t impress your English professor but you will have a base for deeper analysis of Shakespeare.
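The post works through the pipeline in R. For readers without R at hand, here is a rough pure-Python sketch of the same group-and-count step over a toy corpus (the lines below are invented, not the BigQuery sample):

```python
from collections import Counter

# Toy lines standing in for the corpus pulled from BigQuery.
lines = [
    ("hamlet", "to be or not to be"),
    ("hamlet", "the rest is silence"),
    ("macbeth", "out out brief candle"),
]

# The analogue of dplyr's group_by(play, word) %>% summarise(n = n()).
counts = Counter()
for play, text in lines:
    for word in text.split():
        counts[(play, word)] += 1

# The analogue of arrange(desc(n)); ties broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top[0])  # (('hamlet', 'be'), 2)
```

In the post's workflow, dplyr collapses those two loops and the sort into a single readable pipe, which is much of its appeal.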

I first saw this in a tweet by Christophe Lalanne.

Emacs Mini Manual, etc.

October 2nd, 2015

Emacs Mini Manual, etc.

From the webpage:

Very strong resources on Emacs for programmers.

The animated graphics are a real treat.

I first saw this in a tweet by Christophe Lalanne.

Debugging XQuery Advice

October 2nd, 2015

If you are ever called upon to diagnose or repair network connectivity issues, you know that the first thing to check is the network cable.

Well, much to my chagrin, there is a similar principle to follow when debugging XQuery statements.

If you type an element name incorrectly, you may not get an error from the query and it will happily complete, sans your expected content.

To broaden that a bit, the first thing to check, outside of reported syntax and type errors, are your XPath expressions, including the spelling of element names.

Especially for no errors but also not the expected result cases.
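The failure mode is easy to reproduce outside an XQuery engine. A minimal sketch using Python's ElementTree (XPath rather than XQuery, but the principle carries over):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<play><title>Hamlet</title></play>")

# Correct element name: one match.
hits = [e.text for e in doc.findall("title")]
print(hits)    # ['Hamlet']

# Misspelled element name: no error is raised, the query just
# "happily completes," returning nothing at all.
misses = [e.text for e in doc.findall("titel")]
print(misses)  # []
```

An empty result with no error is exactly the case where checking your path expressions first pays off.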

Thus ends the XQuery lesson for the day.

Stagefright Bug 2.0 [/bettertargets.txt ?]

October 1st, 2015

Stagefright Bug 2.0 – One Billion Android SmartPhones Vulnerable to Hacking by Mohit Kumar.

From the post:

Attention Android users!

More than 1 Billion Android devices are vulnerable to hackers once again – thanks to two newly disclosed Android Stagefright vulnerabilities.

Yes, Android Stagefright bug is Back…

…and this time, the flaw allows an attacker to hack Android smartphones just by tricking users into visiting a website that contains a malicious multimedia file, either MP3 or MP4.

For all the talk about better software, better security procedures, etc., nothing seems to be really cost-effective at stopping hacking.

Instead of putting our limited fingers into the increasing number of cyber vulnerabilities, may I suggest we take a page from the history of /robots.txt?

In addition to your robots.txt file at the root of your web server, create a bettertargets.txt file also at the root of your file system.

List other organizations, government agencies, etc. that have more valuable information assets than you and any information you have that could be used to breach those sites.

Hackers should appreciate the assist and the higher quality assets they can obtain at other sites. At the least it will get them to move away from your machine, which is the point of cybersecurity, at least from a personal point of view.

As a suggested format, a plain tab-delimited text file where each line begins with the domain-name tab IP-address tab assets-(summary of information assets) tab vulnerability-(description of vulnerability).
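Assuming that format, reading such a file is a one-liner with Python's csv module. Every value below is invented purely to illustrate the proposed tab-delimited layout:

```python
import csv
import io

# A hypothetical bettertargets.txt entry -- domain, IP, assets,
# and vulnerability are all made up for illustration.
sample = ("example.org\t192.0.2.10\t"
          "assets-(payroll records)\t"
          "vulnerability-(unpatched CMS)\n")

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
domain, ip, assets, vuln = rows[0]
print(domain, ip)  # example.org 192.0.2.10
```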

Suggestions for enhancements?

Is the term “tease” still in fashion?

October 1st, 2015

I ask if “tease” is still in fashion (or its more sexist equivalent) because I keep running across partial O’Reilly publications that are touted as “free,” but are, in reality, just extended ads for forthcoming books.

A case in point is “Transforms in CSS” which isn’t really a book but an excerpt from the fourth edition of CSS: The Definitive Guide.

A forty-page book?

Social media will light up with posts and reposts about this “free” title.

Save your time and disk space. If anything, get a preview copy of the fourth edition of CSS: The Definitive Guide when it is available.

Make no mistake, I like O’Reilly publications and I am presently reading what I suspect is the best O’Reilly title in a number of years, XQuery by Priscilla Walmsley.

O’Reilly shouldn’t waste bandwidth with disconnected excerpts for its titles.

Federal Cybersecurity: More Holes Than Swiss Cheese

October 1st, 2015

Agencies Need to Correct Weaknesses and Fully Implement Security Programs GAO-15-714: Published: Sep 29, 2015.

From the webpage:

Persistent weaknesses at 24 federal agencies illustrate the challenges they face in effectively applying information security policies and practices. Most agencies continue to have weaknesses in (1) limiting, preventing, and detecting inappropriate access to computer resources; (2) managing the configuration of software and hardware; (3) segregating duties to ensure that a single individual does not have control over all key aspects of a computer-related operation; (4) planning for continuity of operations in the event of a disaster or disruption; and (5) implementing agency-wide security management programs that are critical to identifying control deficiencies, resolving problems, and managing risks on an ongoing basis (see fig.). These deficiencies place critical information and information systems used to support the operations, assets, and personnel of federal agencies at risk, and can impair agencies’ efforts to fully implement effective information security programs. In prior reports, GAO and inspectors general have made hundreds of recommendations to agencies to address deficiencies in their information security controls and weaknesses in their programs, but many of these recommendations remain unimplemented.

Can you guess why “…many of these recommendations remain unimplemented”?

The first and foremost reason is that disregarding a recommendation by the GAO or inspectors general has no consequences, none.

Can you imagine being in charge of maintaining your corporate firewall and when it is breached telling your boss, “yeah, I know you said to fix the old one but I got busy and just never did it.”

What do you think the consequences for you personally would be? (You have only one guess.)

It doesn’t appear to work like that at federal agencies. The same people make the same mistakes, over and over again, with no consequences whatsoever.

The only way to change the current cybersecurity state of federal agencies is to provide consequences for failure to improve.

The GAO and inspectors general should be given day to day control over agency spending and personnel decisions as they relate to cybersecurity priorities. And empowered to hire and fire staff as they see fit.

Any other remedy is a recipe for federal security that will barely test script kiddies, much less more serious international opponents.

“No Sir, Mayor Daley no longer dines here. He’s dead sir.” The Limits of Linked Data

September 30th, 2015

That line from the Blues Brothers came to mind when I read OCLC to launch linked data pilot with seven leading libraries, which reads in part:

DUBLIN, Ohio, 11 September 2015 – OCLC is working with seven leading libraries in a pilot program designed to learn more about how linked data will influence library workflows in the future.

The Person Entity Lookup pilot will help library professionals reduce redundant data by linking related sets of person identifiers and authorities. Pilot participants will be able to surface WorldCat Person entities, including 109 million brief descriptions of authors, directors, musicians and others that have been mined from WorldCat, the world’s largest resource of library metadata.

By submitting one of a number of identifiers, such as VIAF, ISNI and LCNAF, the pilot service will respond with a WorldCat Person identifier and mappings to additional identifiers for the same person.

The pilot will begin in September and is expected to last several months. The seven participating libraries include Cornell University, Harvard University, the Library of Congress, the National Library of Medicine, the National Library of Poland, Stanford University and the University of California, Davis.
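From the description, the service takes one known identifier and returns a WorldCat Person identifier plus mappings to the others. A purely hypothetical, offline sketch of that shape (the identifiers, ids, and table below are all invented; the pilot's actual API is not described in the release):

```python
# Invented lookup table -- not real VIAF/ISNI/LCNAF values.
MAPPINGS = {
    "viaf:12345": {
        "worldcat": "wcp:0001",        # WorldCat Person identifier
        "isni": "0000 0000 0000 0000",
        "lcnaf": "n00000000",
    },
}

def lookup(identifier):
    """Return the WorldCat Person id and sibling identifiers, or None."""
    return MAPPINGS.get(identifier)

print(lookup("viaf:12345")["worldcat"])  # wcp:0001
```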

If you happen to use one of the known identifiers and, like Mayor Daley, your subject is one of the 109 million authors, directors, musicians, etc., and you are at one of these seven participants, you’re in luck!

If your subject is one of the 253 million vehicles on U.S. roads, or one of the 123.4 million people employed full time in the U.S., or one or more of the 73.9 billion credit card transactions in 2012, or one of the 3 billion cellphone calls made every day in the U.S., then linked data and the OCLC pilot project will leave you high and dry. (Feel free to add in subjects of interest to you that aren’t captured by linked data.)

It’s not a bad pilot project but it does serve to highlight the primary weakness of linked data: It doesn’t include any subjects of interest to you.

You want to talk about your employees, your products, your investments, your trades, etc.

That’s understandable. That will drive your ROI from semantic technologies.

OCLC linked data can help you with dead people and some famous ones, but that doesn’t begin to satisfy your needs.

What you need is a semantic technology that puts the fewest constraints on you and at the same time enables you to talk about your subjects, using your terms.


Tracking Congressional Whores

September 30th, 2015

Introducing legis-graph – US Congressional Data With Govtrack and Neo4j by William Lyon.

From the post:

Interactions among members of any large organization are naturally a graph, yet the tools we use to collect and analyze data about these organizations often ignore the graphiness of the data and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time to analyze the data. Collaboration networks are a perfect example. So let’s focus on one of the most powerful collaboration networks in the world: the US Congress.

Introducing legis-graph: US Congress in a graph

The goal of legis-graph is to provide a way of converting the data provided by Govtrack into a rich property-graph model that can be inserted into Neo4j for analysis or integrating with other datasets.

The code and instructions are available in this Github repo. The ETL workflow works like this:

  1. A shell script is used to rsync data from Govtrack for a specific Congress (i.e. the 114th Congress). The Govtrack data is a mix of JSON, CSV, and YAML files. It includes information about legislators, committees, votes, bills, and much more.
  2. To extract the pieces of data we are interested in for legis-graph a series of Python scripts are used to extract and transform the data from different formats into a series of flat CSV files.
  3. The third component is a series of Cypher statements that make use of LOAD CSV to efficiently import this data into Neo4j.

To get started with legis-graph in Neo4j you can follow the instructions here. Alternatively, a Neo4j data store dump is available here for use without having to execute the import scripts. We are currently working to streamline the ETL process so this may change in the future, but any updates will be appear in the Github README.

This project began during preparation for a graph database focused Meetup presentation. George Lesica and I wanted to provide an interesting dataset in Neo4j for new users to explore.
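Step 2 of that workflow, in miniature: a Python sketch that flattens one Govtrack-style legislator record into a CSV row ready for LOAD CSV. The field names are illustrative, not the actual legis-graph schema:

```python
import csv
import io
import json

# A made-up record in the shape of Govtrack's legislator data.
record = json.loads(
    '{"id": {"bioguide": "A000001"},'
    ' "name": {"first": "Ada", "last": "Example"},'
    ' "terms": [{"type": "rep", "state": "WV", "party": "Independent"}]}'
)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["bioguide", "first", "last", "state", "party"])
term = record["terms"][-1]  # most recent term
writer.writerow([record["id"]["bioguide"], record["name"]["first"],
                 record["name"]["last"], term["state"], term["party"]])
print(out.getvalue())
```

A Cypher `LOAD CSV WITH HEADERS FROM ... CREATE (:Legislator {...})` statement then picks up flat files like this, which is the third step in William's pipeline.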

Whenever the U.S. Congress is mentioned, I am reminded of the Obi-Wan Kenobi’s line about Mos Eisley:

You will never find a more wretched hive of scum and villainy. We must be cautious.

The data model for William’s graph:


As you can see from the diagram above, the data model is rich and captures quite a bit of information about the actions of legislators, committees, and bills in Congress. Information about what properties are currently included is available here.

A great starting place that can be extended and enriched.

In terms of the data model, note that “subject” is now the title of a bill. Definitely could use some enrichment there.

Another association for the bill, “who_benefits.”

If you are really ambitious, try developing information on what individuals or groups make donations to the legislator on an annual basis.

To clear noise out of the data set, drop everyone who doesn’t contribute annually and even then, any total less than $5,000. Remember that members of congress depend on regular infusions of cash so erratic or one-time donors may get a holiday card but they are not on the ready access list.

The need for annual cash is one reason why episodic movements may make the news but they rarely make a difference. To make a difference requires years of steady funding and grooming of members of congress and improving your access, plus your influence.

Don’t be disappointed if you can “prove” member of congress X is in the pocket of Y or Z organization/corporation and nothing happens. More likely than not, such proof will increase their credibility for fund raising.

As Leonard Cohen says (Everybody Knows):

Everybody knows the dice are loaded, Everybody rolls with their fingers crossed

BBC News Labs Project List

September 29th, 2015

Not only does the BBC News Lab have a cool logo:


They have an impressive list of projects as well:

BBC News Labs Project List

  • News Switcher – News Switcher lets BBC journalists easily switch between the different editions of the News website
  • Pool of Video – BBC News Labs is looking into some new research questions based on AV curation.
  • Suggestr – connecting the News industry with BBC tags – This prototype, by BBC News Labs, is about enabling local News organisations to tag with BBC News tags
  • Linked data on the TV – LDPTV is a project for surfacing more News content via smart TVs
  • #newsHACK – Industry Collaboration – #newsHACK is all about intense multi-discipline collaboration on Journalism innovation.
  • BBC Rewind Collaboration – The News Labs team is working with BBC Rewind – a project liberating the BBC archive – to share tech and approaches.
  • The News Slicer – The News Slicer takes MOS running orders and transcripts from BBC Broadcast Playout, then auto tags and auto segments the stories.
  • News Rig – The future of multilingual workflow – A prototype workflow for reversioning content into multiple languages, and controlling an "on demand" multilingual news service.
  • Atomised News – with BBC R&D – News Labs has been working with BBC R&D to explore a mobile-focussed, structured breadth & depth approach to News experiences
  • Connected Studio – World Service – A programme of international innovation activities, aimed at harnessing localised industry talent to explore new opportunities.
  • Language Technology – BBC News Labs started a stream of Language Technology projects in 2014, in order to scale our storytelling globally
  • Blue Labs – BBC Blue Room and News Labs are working together to more efficiently demonstrate innovation opportunities with emerging consumer Tech.
  • Immersive News – 360 Video & VR – We are looking into the craft and applications around 360 video filming and VR for Immersive News.
  • The Journalist Toolbox – In June 2015, a hack team at #newsHACK created a concept called The Journalist Toolbox, proposing that newsroom content publishing & presentation tools needed to be more accessible for journalists.
  • SUMMA – Scalable Understanding of Multilingual MediA – This Big Data project will leverage machines to do the heavy lifting in multilingual media monitoring.
  • Window on the Newsroom – Window on the Newsroom is a prototype discovery interface to help Journalists look quickly across Newsroom systems by story or metadata properties
  • The Juicer – The Juicer takes news content, automatically semantically tags it, then provides a fully featured API to access this content and data.

This list is current as of 29 September 2015 so check with the project page for new and updated project information from the BBC.

The BBC is a British institution that merits admiration. Counting English, its content is available in twenty-eight (28) languages.

Alas, “breathless superlative American English,” a staple of the Fox network, is not one of them.

Discovering Likely Mappings between APIs using Text Mining [Likely Merging?]

September 28th, 2015

Discovering Likely Mappings between APIs using Text Mining by Rahul Pandita, Raoul Praful Jetley, Sithu D Sudarsan, Laurie Williams.

Abstract:

Developers often release different versions of their applications to support various platform/programming-language application programming interfaces (APIs). To migrate an application written using one API (source) to another API (target), a developer must know how the methods in the source API map to the methods in the target API. Given a typical platform or language exposes a large number of API methods, manually writing API mappings is prohibitively resource-intensive and may be error prone. Recently, researchers proposed to automate the mapping process by mining API mappings from existing codebases. However, these approaches require as input a manually ported (or at least functionally similar) code across source and target APIs. To address the shortcoming, this paper proposes TMAP: Text Mining based approach to discover likely API mappings using the similarity in the textual description of the source and target API documents. To evaluate our approach, we used TMAP to discover API mappings for 15 classes across: 1) Java and C# API, and 2) Java ME and Android API. We compared the discovered mappings with state-of-the-art source code analysis based approaches: Rosetta and StaMiner. Our results indicate that TMAP on average found relevant mappings for 57% more methods compared to previous approaches. Furthermore, our results also indicate that TMAP on average found exact mappings for 6.5 more methods per class with a maximum of 21 additional exact mappings for a single class as compared to previous approaches.

From the introduction:

Our intuition is: since the API documents are targeted towards developers, there may be an overlap in the language used to describe similar concepts that can be leveraged.

There are a number of insights in this paper but this statement of intuition alone is enough to justify reading the paper.

What if instead of API documents we were talking about topics that had been written for developers? Isn’t it fair to assume that concepts would have the same or similar vocabularies?

The evidence from this paper certainly suggests that to be the case.
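TMAP's core bet, that textual similarity between API descriptions signals a likely mapping, can be sketched in a few lines of Python. The doc strings below are invented, and TMAP itself uses more sophisticated retrieval than raw term counts:

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words term-count vector for a short description."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented one-line descriptions standing in for real API docs.
java_doc = "returns the length of this string"
cs_match = "gets the number of characters in the current string"
cs_other = "removes all leading and trailing white space"

# The better textual match should score higher.
print(cosine(vec(java_doc), vec(cs_match)) >
      cosine(vec(java_doc), vec(cs_other)))  # True
```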

Of course, merging rules would have to allow for “likely” merging of topics, which could then be refined by readers.

Readers who hopefully contribute more information to make “likely” merging more “precise.” (At least in their view.)

That’s one of the problems with most semantic technologies, isn’t it?

“Precision” can only be defined from a point of view, which by definition varies from user to user.

What would it look like to allow users to determine their desired degree of semantic precision?


Neo4j 2.3.0 Milestone 3 Release

September 28th, 2015

Neo4j 2.3.0 Milestone 3 Release by Andreas Kollegger.

From the post:

Great news for all those Neo4j alpha-developers out there: The third (and final) milestone release of Neo4j 2.3 is now ready for download! Get your copy of Neo4j 2.3.0-M03 here.

Quick clarification: Milestone releases like this one are for development and experimentation only, and not all features are in their finalized form. Click here for the most fully stable version of Neo4j (2.2.5).

So, what cool new features made it into this milestone release of Neo4j? Let’s dive in.

If you want to “kick the tires” on Neo4j 2.3.0 before the production version arrives, now would be a good time.

Andreas covers new features in Neo4j 2.3.0, such as triadic selection, constraints, deletes and others.
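For readers new to the term, “triadic selection” trades on triadic closure: recommend nodes that share a neighbor with you but aren’t yet connected to you. A toy Python sketch of the idea (Neo4j does this in Cypher, over far larger graphs):

```python
# Tiny undirected friendship graph as an adjacency map.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def recommend(node):
    """Friends-of-friends who aren't already friends (or the node itself)."""
    friends = graph[node]
    candidates = set()
    for friend in friends:
        candidates |= graph[friend]       # two hops out
    return candidates - friends - {node}  # drop existing ties and self

print(recommend("alice"))  # {'dave'}
```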

As I read Andreas’ explanation of “triadic selection” and its role in recommendations, I started to wonder how effective recommendations are outside of movies, books, fast food, etc.

The longer I thought about it, the harder I was pressed to come up with a buying decision that isn’t influenced by recommendations, either explicit or implicit.

Can a “rational” market exist if we are all more or less like lemmings when it comes to purchasing (and other) decisions?

You don’t have to decide that question to take Neo4j 2.3.0 for a spin.

Balisage 2016!

September 28th, 2015

From my inbox this morning:

Mark Your Calendars: Balisage 2016
 - pre-conference symposium 1 August 2016
 - Balisage: The Markup Conference 2-5 August 2016

Bethesda North Marriott Hotel & Conference Center
5701 Marinelli Road  
North Bethesda, Maryland  20852

This much advance notice makes me think someone had a toe curling good time at Balisage 2015.

Was Bill Clinton there? ;-)

Attend Balisage 2016, look for Bill Clinton or someone else with a silly grin on their faces!

What Does Probability Mean in Your Profession? [Divergences in Meaning]

September 27th, 2015

What Does Probability Mean in Your Profession? by Ben Orlin.

Impressive drawings that illustrate the divergence in meaning of “probability” for various professions.

I’m not sold on the “actual meaning” drawing because if everyone in a discipline understands “probability” to mean something else, on what basis can you argue for the “actual meaning?”

If I am reading a paper by someone who subscribes to a different meaning than the claimed “actual” one, then I am going to reach erroneous conclusions about their paper. Yes?

That is, in order to understand a paper I have to understand the words as they are being used by the author. Yes?

If I understand “democracy and freedom” to mean “serves the interest of U.S.-based multinational corporations,” then calls for “democracy and freedom” in other countries isn’t going to impress me all that much.

Enjoy the drawings!

1150 Free Online Courses from Top Universities (update) [Collating Content]

September 27th, 2015

1150 Free Online Courses from Top Universities (update).

From the webpage:

Get 1150 free online courses from the world’s leading universities — Stanford, Yale, MIT, Harvard, Berkeley, Oxford and more. You can download these audio & video courses (often from iTunes, YouTube, or university web sites) straight to your computer or mp3 player. Over 30,000 hours of free audio & video lectures, await you now.

An ever improving resource!

As of last January (2015), it listed 1100 courses.

Another fifty courses have been added and I discovered a course in Hittite!

The same problem with collating content across resources that I mentioned for data science books obtains here as you take courses in the same discipline or read primary/secondary literature.

What if I find references that are helpful in the Hittite course in the image PDFs of the Chicago Assyrian Dictionary? How do I combine that with the information from the Hittite course so if you take Hittite, you don’t have to duplicate my search?

That’s the ticket, isn’t it? Not having different users performing the same task over and over again? One user finds the answer and for all other users, it is simply “there.”

Quite a different view of the world of information than the repetitive, non-productive, ad-laden and often irrelevant results from the typical search engine.

The World’s First $9 Computer is Shipping Today!

September 27th, 2015

The World’s First $9 Computer is Shipping Today! by Khyati Jain.

From the post:

Remember Project: C.H.I.P. ?

A $9 Linux-based, super-cheap computer that raised some $2 Million beyond a pledge goal of just $50,000 on Kickstarter will be soon in your pockets.

Four months ago, Dave Rauchwerk, CEO of Next Thing Co., utilized the global crowd-funding corporation ‘Kickstarter’ for backing his project C.H.I.P., a fully functioning computer that offers more than what you could expect for just $9.

See Khyati’s post for technical specifications.

Security by secrecy is meaningless when potential hackers (ages 14-64) number 4.8 billion.

With enough hackers, all bugs can be found.

Writing “Python Machine Learning”

September 26th, 2015

Writing “Python Machine Learning” by Sebastian Raschka.

From the post:

It’s been about time. I am happy to announce that “Python Machine Learning” was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing “Python Machine Learning” really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

A delightful tale for those of us who have authored books and an inspiration (with some practical suggestions) for anyone who hopes to write a book.

Sebastian’s productivity hints will ring familiar for those with similar habits and bear study by those who hope to become more productive.

Sebastian never comes out and says it but his writing approach breaks each stage of the book into manageable portions. It is far easier to say (and do) “write an outline” than to “write the complete and fixed outline for an almost 500 page book.”

If the task is too large (the complete and immutable outline), you won’t build up enough momentum to make a reasonable start.

After reading Sebastian’s post, what book are you thinking about writing?

Free Data Science Books (Update, + 53 books, 117 total)

September 26th, 2015

Free Data Science Books (Update).

From the post:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

While every single book in this list is provided for free, if you find any particularly helpful consider purchasing the printed version. The authors spent a great deal of time putting these resources together and I’m sure they would all appreciate the support!

Note: Updated books as of 9/21/15 are post-fixed with an asterisk (*). Scroll to updates

Great news but also more content.

Unlike big data, this content has to be read in detail to obtain any benefit from it.

And books in the same area are going to have overlapping content as well as some unique content.

Imagine how useful it would be to compose a free standing work with the “best” parts from several works.

Copyright laws would be a larger barrier but no more than if you cut-n-pasted your own version for personal use.

If such an approach could be made easy enough, the resulting value would drown out dissenting voices.

I think PDF is the principal practical barrier.

Do you suspect others?

I first saw this in a tweet by Kirk Borne.

Data Science Glossary

September 26th, 2015

Data Science Glossary by Bob DuCharme.

From the about page:

Terms included in this glossary are the kind that typically come up in data science discussions and job postings. Most are from the worlds of statistics, machine learning, and software development. A Wikipedia entry icon links to the corresponding Wikipedia entry, although these are often quite technical. Email corrections and suggestions to bob at this domain name.

Is your favorite term included?

You can follow Bob on Twitter @bobdc.

Or read his blog at:

Thanks Bob!

Attention Law Students: You Can Change the Way People Interact with the Law…

September 25th, 2015

Attention Law Students: You Can Change the Way People Interact with the Law…Even Without a J.D. by Katherine Anton.

From the post:

A lot of people go to law school hoping to change the world and make their mark on the legal field. What if we told you that you could accomplish that, even as a 1L?

Today we’re launching the WeCite contest: an opportunity for law students to become major trailblazers in the legal field. WeCite is a community effort to explain the relationship between judicial cases, and will be a driving force behind making the law free and understandable.

To get involved, all you have to do is go to and choose the treatment that best describes a newer case’s relationship with an older case. Law student contributors, as well as the top contributing schools, will be recognized and rewarded for their contributions to WeCite.

Read on to learn why WeCite will quickly become your new favorite pastime and how to get started!

Shepard’s Citations began publication in 1873 and, by modern times, had such an insurmountable lead that the cost of creating a competing service was a barrier to anyone else entering the field.

To be useful to lawyers, a citation index can’t index just some of the citations; it must index all of them.

The WeCite project, based on crowd-sourcing, is poised to demonstrate that creating a public law citation index is doable.

While the present project is focused on law students, I am hopeful that the project opens up for contributions from more senior survivors of law school, practicing or not.

Three Reasons You May Not Want to Learn Clojure [One of these reasons applies to XQuery]

September 25th, 2015

Three Reasons You May Not Want to Learn Clojure by Mark Bastian.

From the post:

I’ve been coding in Clojure for over a year now and not everything is unicorns and rainbows. Whatever language you choose will affect how you think and work and Clojure is no different. Read on to see what some of these effects are and why you might want to reconsider learning Clojure.

If you are already coding in Clojure, you will find this amusing.

If you are not already coding in Clojure, you may find this compelling.

I won’t say which one of these reasons applies to XQuery, at least not today. Watch this blog on Monday of next week.

Apache Lucene 5.3.1, Solr 5.3.1 Available

September 24th, 2015

Apache Lucene 5.3.1, Solr 5.3.1 Available

From the post:

The Lucene PMC is pleased to announce the release of Apache Lucene 5.3.1 and Apache Solr 5.3.1

Lucene can be downloaded from
and Solr can be downloaded from

Highlights of this Lucene release include:

Bug Fixes

  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

Highlights of this Solr release include:

Bug Fixes

  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes “*” now works properly in ‘web’ mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with ‘/admin/’ prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter’s $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

Time to upgrade!


Data Analysis for the Life Sciences… [No Ur-Data Analysis Book?]

September 24th, 2015

Data Analysis for the Life Sciences – a book completely written in R markdown by Rafael Irizarry.

From the post:

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

Have you ever wondered about the growing proliferation of data analysis books?

The absence of one Ur-Data Analysis book that everyone could read and use?

I have a longer post coming on this idea, but if each discipline needs its own view of data analysis, is it really surprising that no one system of semantics satisfies all communities?

In other words, is the evidence of heterogeneous semantics so strong that we should abandon attempts at uniform semantics and focus on communicating across systems of semantics?

I’m sure there are other examples of where every niche has its own vocabulary, tables in relational databases or column headers in spreadsheets for example.

What is your favorite example of heterogeneous semantics?

Assuming heterogeneous semantics are here to stay (they have been around since the start of human to human communication, possibly earlier), what solution do you suggest?

I first saw this in a tweet by Christophe Lalanne.

Guesstimating the Future

September 24th, 2015

I ran across some introductory slides on Neo4j with the line:

Forrester estimates that over 25% of enterprises will be using graph databases by 2017.

Well, Forrester also predicted that tablet sales would overtake laptop sales in 2015: Forrester: Tablet Sales Will Eclipse Laptop Sales by 2015.

You might want to check that prediction against: Laptop sales ‘stronger than ever’ versus tablets – PCR Retail Advisory Board.

The adage “It is difficult to make predictions, especially about the future,” remains appropriate.

Neo4j doesn’t need lemming-like behavior among consumers of technology to make a case for itself.

Compare Neo4j and its query language, Cypher, to your use cases and I think you will agree.

A review of learning vector quantization classifiers

September 23rd, 2015

A review of learning vector quantization classifiers by David Nova, Pablo A. Estevez.


From the abstract:

In this work we present a review of the state of the art of Learning Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates the most relevant LVQ approaches to date. The main concepts associated with modern LVQ approaches are defined. A comparison is made among eleven LVQ classifiers using one real-world and two artificial datasets.

From the introduction:

Learning Vector Quantization (LVQ) is a family of algorithms for statistical pattern classification, which aims at learning prototypes (codebook vectors) representing class regions. The class regions are defined by hyperplanes between prototypes, yielding Voronoi partitions. In the late 80’s Teuvo Kohonen introduced the algorithm LVQ1 [36, 38], and over the years produced several variants. Since their inception LVQ algorithms have been researched by a small but active community. A search on the ISI Web of Science in November, 2013, found 665 journal articles with the keywords “Learning Vector Quantization” or “LVQ” in their titles or abstracts. This paper is a review of the progress made in the field during the last 25 years.

Heavy sledding but if you want to review the development of a classification algorithm with a manageable history, this is a likely place to start.
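The LVQ1 update rule the introduction describes (pull the winning prototype toward a same-class sample, push it away from a different-class one) is simple enough to sketch in plain Python. This is an illustrative toy, not the authors’ code; the dataset, learning rate, and epoch count are made up:

```python
import math
import random

def train_lvq1(samples, labels, prototypes, proto_labels, lr=0.1, epochs=20):
    """LVQ1: for each training sample, find the nearest prototype (the
    "winner"); pull it toward the sample if their classes match, push it
    away otherwise."""
    protos = [list(p) for p in prototypes]
    rng = random.Random(0)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, cls = samples[i], labels[i]
            dists = [math.dist(p, x) for p in protos]
            w = dists.index(min(dists))          # winning prototype
            sign = 1.0 if proto_labels[w] == cls else -1.0
            protos[w] = [pj + sign * lr * (xj - pj)
                         for pj, xj in zip(protos[w], x)]
    return protos

def predict(x, protos, proto_labels):
    dists = [math.dist(p, x) for p in protos]
    return proto_labels[dists.index(min(dists))]

# Toy data: two 2-D clusters, one prototype per class, both started midway.
samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 1, 1]
protos = train_lvq1(samples, labels, [(0.4, 0.4), (0.6, 0.6)], [0, 1])
print([predict(x, protos, [0, 1]) for x in samples])  # → [0, 0, 1, 1]
```

The trained prototypes end up near the class means, yielding the Voronoi-style decision boundary the paper describes.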


Fast k-NN search

September 23rd, 2015

Fast k-NN search by Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Liang Wang, Teemu Roos, Jukka Corander.


From the abstract:

Random projection trees have proven to be effective for approximate nearest neighbor searches in high dimensional spaces where conventional methods are not applicable due to excessive usage of memory and computational time. We show that building multiple trees on the same data can improve the performance even further, without significantly increasing the total computational cost of queries when executed in a modern parallel computing environment. Our experiments identify suitable parameter values to achieve accurate searches with extremely fast query times, while also retaining a feasible complexity for index construction.

Not a quick read but an important one if you want to use multiple dimensions for calculation of similarity or sameness between two or more topics.
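To make the abstract concrete, here is a minimal single random projection tree in plain Python: points are split recursively by projecting onto a random direction and thresholding at the median, and a query descends to one leaf where a brute-force search finishes the job. This is a toy sketch of the general technique, not the authors’ implementation (which builds many trees in parallel and tunes the parameters):

```python
import random

def build_rp_tree(points, leaf_size=2, rng=None):
    """Split points recursively: project onto a random direction, then send
    points below the median projection left and the rest right."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return {"leaf": points}
    dim = len(points[0][1])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    proj = [sum(d * x for d, x in zip(direction, p[1])) for p in points]
    median = sorted(proj)[len(proj) // 2]
    left = [p for p, pr in zip(points, proj) if pr < median]
    right = [p for p, pr in zip(points, proj) if pr >= median]
    if not left or not right:        # all projections equal: stop splitting
        return {"leaf": points}
    return {"dir": direction, "median": median,
            "left": build_rp_tree(left, leaf_size, rng),
            "right": build_rp_tree(right, leaf_size, rng)}

def query(tree, q, k=1):
    """Route the query point to a single leaf, then brute-force inside it."""
    node = tree
    while "leaf" not in node:
        pr = sum(d * x for d, x in zip(node["dir"], q))
        node = node["left"] if pr < node["median"] else node["right"]
    sq_dist = lambda p: sum((a - b) ** 2 for a, b in zip(p[1], q))
    return [name for name, _ in sorted(node["leaf"], key=sq_dist)[:k]]

# Points are (name, vector) pairs; two tight clusters far apart.
points = [("a", (0.0, 0.0)), ("b", (0.1, 0.0)),
          ("c", (5.0, 5.0)), ("d", (5.1, 5.0))]
tree = build_rp_tree(points)
print(query(tree, (0.04, 0.0)))  # the query lands in the {a, b} leaf
```

The paper’s insight is that a single tree can route a query to the wrong leaf; building several trees with different random directions and merging their candidate sets recovers the lost accuracy cheaply.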

The technique requires you to choose a degree of similarity that works for your use case.

This paper makes a nice jumping-off point for discussing how much precision a particular topic map application needs. Absolute precision is possible, but only in a limited number of cases and, I suspect, at high cost.

For some use cases, such as searching for possible suspects in crimes, some lack of precision is necessary to build up a large enough pool of suspects to include the guilty parties.

Any examples of precision and topic maps that come to mind?


SymbolHound

September 23rd, 2015


From the about page:

SymbolHound is a search engine that doesn’t ignore special characters. This means you can easily search for symbols like &, %, and π. We hope SymbolHound will help programmers find information about their chosen languages and frameworks more easily.

SymbolHound was started by David Crane and Thomas Feldtmose while students at the University of Connecticut. Another project by them is Toned Ear, a website for ear training.

I first saw SymbolHound mentioned in a discussion of delimiter options for a future XQuery feature.

For syntax drafting you need to have SymbolHound on your toolbar, not just bookmarked.

Government Travel Cards at Casinos or Adult Entertainment Establishments

September 23rd, 2015

Audit of DoD Cardholders Who Used Government Travel Cards at Casinos or Adult Entertainment Establishments by Michael J. Roark, Assistant Inspector General, Contract Management and Payments, Department of Defense.

From the memorandum:

We plan to begin the subject audit in September 2015. The Senate Armed Services Committee requested this audit as a follow-on review of transactions identified in Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments,” May 19, 2015. Our objective is to determine whether DoD cardholders who used government travel cards at casinos and adult entertainment establishments for personal use sought or received reimbursement for the charges. In addition, we will determine whether disciplinary actions have been taken in cases of personal use and if the misuse was reported to the appropriate security office. We will consider suggestions from management on additional or revised objectives.

This project is a follow up to: Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments” (May 19, 2015), which summarizes its findings as:

We are providing this report for your review and comment. We considered management comments on a draft of this report when preparing the final report. DoD cardholders improperly used their Government Travel Charge Card for personal use at casinos and adult entertainment establishments. From July 1, 2013, through June 30, 2014, DoD cardholders had 4,437 transactions totaling $952,258, where they likely used their travel cards at casinos for personal use and had 900 additional transactions for $96,576 at adult entertainment establishments. We conducted this audit in accordance with generally accepted government auditing standards.

Let me highlight that for you:

July 1, 2013 through June 30, 2014, DoD cardholders:

4,437 transactions at casinos for $952,258

900 transactions at adult entertainment establishments for $96,576

Are lap dances that cheap? ;-)
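The per-transaction averages behind that quip are easy to check against the report’s figures:

```python
# Figures from Report No. DODIG-2015-125 (July 1, 2013 - June 30, 2014).
casino_avg = 952_258 / 4_437
adult_avg = 96_576 / 900
print(f"casino: ${casino_avg:.2f} per transaction")             # $214.62
print(f"adult entertainment: ${adult_avg:.2f} per transaction")  # $107.31
```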

Almost no one goes to a casino or adult entertainment establishment alone, so topic maps would be a perfect fit for finding “associations” between DoD personnel.

The current project is to track the outcome of the earlier report, that is, what actions, if any, resulted.

What do you think?

Will the DoD personnel claim they were doing off the record surveillance of suspected information leaks? Or just checking their resistance to temptation?

Before I forget, here is the breakdown by service (from the May 19, 2015 report, page 6):


I don’t know what to make of the distribution of “adult transactions” between the services.


5.6 Million Fingerprints Stolen in OPM Hack [Still No Competence or Transparency]

September 23rd, 2015

5.6 Million Fingerprints Stolen in OPM Hack by Chris Brook.

The management follies continue at the Office of Personnel Management (OPM), which I mentioned the other day had declined to use modern project management practices.

A snippet from Chris’ post, which you should read in its entirety:

OPM said at the beginning of September that it would begin sending letters to victims of the breach “in a few weeks,” yet the agency’s recent statement reiterates that an interagency team is still working in tandem with the Department of Defense to prep the letters.

“An interagency team will continue to analyze and refine the data as it prepares to mail notification letters to impacted individuals,” Schumach wrote.

Did you read between the lines to intuit the cause of the delay in letter preparation?

The next big shoe to drop, either on prodding by Congress or news media:

The Office of Personnel Management doesn’t have current addresses on all 21.5 million government workers.

When a data breach occurs at a major bank, credit card company, etc., sending the breach letter is a matter of composing it and hiring a mail house to do the mailing.

This is going on four months after OPM admitted the hack and still no letters?

I may be overestimating the competency of OPM management when it comes to letter writing, but my bet would be on a lack of current addresses for a large portion of the employees impacted.

FYI, hiring former OPM staff has a name. It’s called assumption of risk.

Sharing Economy – Repeating the Myth of Code Reuse

September 23rd, 2015

Bitcoin and sharing economy pave the way for new ‘digital state’ by Sophie Curtis.

Sophie quotes Cabinet Office minister Matthew Hancock MP as saying:

For example, he said that Cloud Foundry, a Californian company that provides platform-as-a-service technology, could help to create a code library for digital public services, helping governments around the world to share their software.

“Governments across the world need the same sorts of services for their citizens, and if we write them in an open source way, there’s no need to start from scratch each time,” he said.

“So if the Estonian government writes a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

However, he emphasised that this is about sharing programs, code and techniques – not about sharing data. Citizens’ personal data will remain the responsibility of the government in question, and will not be shared across borders, he said.

I’m guessing that “The Rt Hon Matt Hancock MP” hasn’t read:

The code reuse myth: why internal software reuse initiatives tend to fail by Ben Morris

The largest single barrier to effective code reuse is that it is difficult. It doesn’t just happen by accident. Reusable code has to be specifically designed for a generalised purpose and it is unlikely to appear spontaneously as a natural by-product of development projects.

Reusable components are usually designed to serve an abstract purpose and this absence of tangible requirements can make them unusually difficult to design, develop and test. Their development requires specific skills and knowledge of design patterns that is not commonly found in development teams. Developing for reuse is an art in itself and it takes experience to get the level of abstraction right without making components too specific for general use or too generalised to add any real value.

These design challenges can be exasperated by organisational issues in larger and more diffused development environments. If you are going to develop common components then you will need a very deep understanding of a range of constantly evolving requirements. As the number of projects and teams involved in reuse grow it can be increasingly difficult to keep track of these and assert any meaningful consistency.

Successful code reuse needs continuous effort to evolve shared assets in step with the changing business and technical landscape. This demands ownership and governance to ensure that assets don’t fall into disrepair after the initial burst of effort that established them. It also requires a mature development process that provides sufficient time to design, test, maintain and enhance reusable assets. Above all, you need a team of skilled architects and developers who are sufficiently motivated and empowered to take a lead in implementing code reuse.

Reuse Myth – can you afford reusable code? by Allan Kelly

In my Agile Business Conference presentation (“How much quality can we afford?”) I talked about the Reuse Myth; this is something I always touch on when I deliver a training course but I’ve never taken time to write it down. Until now.

Let’s take as our starting point Kevlin Henney’s observation that “there is no such thing as reusable code, only code that is reused.” Kevlin (given the opportunity!) goes on to examine what constitutes “reuse” over simple “use.” A good discussion in itself, but right now I want to suggest that an awful lot of code which is “designed for reuse” is never actually re-used.

In effect that design effort is over engineering, waste in other words. One of the reasons developers want to “design for reuse” is not so much because the code will be reused but rather because they desire a set of properties (modularity, high cohesion, low coupling, etc.) which are desirable engineering properties but sound a bit abstract.

In other words, striving for “re-usability” is a developers way of striving for well engineered code. Unfortunately in striving for re-usability we loose focus which brings us to the second consideration…. cost of re-usability.

In Mythical Man Month (1974) Fred Brooks suggests that re-usable code costs three times as much to develop as single use code. I haven’t seen any better estimates so I tend to go with this one. (If anyone has any better estimates please send them over.)

Think about this. This means that you have to use your “reusable” code three times before you break even. And it means you only see a profit (saving) on the fourth reuse.

How much code which is built for reuse is reused four times?
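Kelly’s break-even arithmetic, with costs normalized so single-use code costs one unit, can be sketched directly:

```python
SINGLE_USE_COST = 1.0   # writing throwaway code, normalized to 1 unit
REUSABLE_COST = 3.0     # Brooks's estimate: reusable code costs 3x

def net_saving(times_reused):
    """Saving versus writing fresh single-use code for each of the uses."""
    return times_reused * SINGLE_USE_COST - REUSABLE_COST

for n in range(1, 5):
    print(n, net_saving(n))
# n=1: -2.0, n=2: -1.0, n=3: 0.0 (break even), n=4: 1.0 (first profit)
```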

Those are two “hits” out of 393,000 that I got this afternoon searching on (with the quotes) “code reuse.”

Let’s take The Rt Hon Matt Hancock MP statement and re-write it a bit:

Hypothetical Statement – Not an actual statement by The Rt Hon Matt Hancock MP, he’s not that well informed:

“So if the Estonian government spends three (3) times as much to write a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

Will the Estonian government, like other governments, spend three (3) times as much developing software on the off chance that the UK may want to use it?

Would any government undertake software development on that basis?

Do you have an answer other than NO! to either of those questions?

There are lots of competent computer people in the UK but none of them are advising The Rt Hon Matt Hancock MP. Or he isn’t listening. Amounts to the same thing.