GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

September 19th, 2014

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

From the post:

Gov.UK Verify is designed to overcome concerns about government setting up a central database of citizens’ identities to enable access to online public services – similar criticism led to the demise of the hugely unpopular identity card scheme set up under the Labour government.

Instead, users will register their details with one of several independent identity assurance providers – certified companies which will establish and verify a user’s identity outside government systems. When the user then logs in to a digital public service, the Verify system will electronically “ask” the external third-party provider to confirm the person is who they claim to be.
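The brokered check described above can be sketched in a few lines. Everything here (the class names, the bare yes/no confirmation) is illustrative shorthand of my own, not the actual Gov.UK Verify protocol, which involves certified hubs, SAML exchanges and much more besides:

```python
# Toy sketch of a brokered identity check: the service holds no identity
# database of its own and only asks an external provider to vouch for a user.

class CertifiedProvider:
    """An independent identity assurance provider outside government."""
    def __init__(self, name):
        self.name = name
        self._verified = {}          # user id -> identity details, held here only

    def register(self, user_id, details):
        self._verified[user_id] = details

    def confirm(self, user_id):
        # The provider answers yes/no; it never hands the record over.
        return user_id in self._verified

class VerifyHub:
    """Routes a public service's question to the user's chosen provider."""
    def __init__(self, providers):
        self.providers = {p.name: p for p in providers}

    def is_verified(self, provider_name, user_id):
        provider = self.providers.get(provider_name)
        return provider is not None and provider.confirm(user_id)

provider = CertifiedProvider("ExampleIdP")
provider.register("alice", {"name": "Alice"})
hub = VerifyHub([provider])

print(hub.is_verified("ExampleIdP", "alice"))    # True
print(hub.is_verified("ExampleIdP", "mallory"))  # False
```

The point the design makes is visible in the sketch: the hub knows which provider vouched, but the identity details themselves never cross into the government side.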


Help me make sure I am reading this story of citizen identity correctly.

Citizens are fearful of their government having a central database of citizens’ identities but are comfortable with commercial firms, regulated by that same government, managing those identities?

Do you think citizens of the UK are aware that commercial firms betray their customers to the U.S. government at the drop of a secret subpoena every day?

To say nothing of the failures of commercial firms to protect their customers’ data, when they aren’t using that data to manipulate those same customers directly.

Strikes me as damned odd that anyone would trust commercial firms more than they would trust the government. Neither one is actually trustworthy.

Am I reading this story correctly?

I first saw this in a tweet by Richard Copley.

Named Entity Recognition: A Literature Survey

September 19th, 2014

Named Entity Recognition: A Literature Survey by Rahul Sharnagat.


In this report, we explore various methods that are applied to solve NER. In section 1, we introduce the named entity problem. In section 2, various named entity recognition methods are discussed in three broad categories of the machine learning paradigm, and we explore a few learning techniques in each. In the first part, we discuss various supervised techniques. Subsequently we move to semi-supervised and unsupervised techniques. In the end, we discuss methods from deep learning to solve NER.

If you are new to named entity recognition or want to pass along an introduction, this may be the paper for you. It covers all the high points, with a three-page bibliography to get you started in the literature.
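For readers brand new to the task, a deliberately naive sketch shows the input/output shape of NER. The capitalization rule below is a toy of my own invention, nothing like the supervised, semi-supervised or deep learning methods the survey actually covers:

```python
# Naive "NER": group runs of capitalized tokens as candidate named entities.
# Real systems learn entity boundaries and types from annotated data; this
# toy only illustrates what the task takes in and what it produces.

def toy_ner(text):
    tokens = text.split()
    entities, current = [], []
    for tok in tokens:
        word = tok.strip(".,;:!?")
        if word[:1].isupper() and word.isalpha():
            current.append(word)          # extend the current entity span
        else:
            if current:
                entities.append(" ".join(current))
                current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(toy_ner("Rahul Sharnagat surveys methods used at Stanford and beyond."))
# ['Rahul Sharnagat', 'Stanford']
```

Even this toy exposes the hard parts: sentence-initial words are always capitalized, and entity type (person vs. organization vs. location) is not recoverable from orthography alone, which is why the learning methods in the survey exist.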

I first saw this in a tweet by Christopher.

You can be a kernel hacker!

September 19th, 2014

You can be a kernel hacker! by Julia Evans.

From the post:

When I started Hacker School, I wanted to learn how the Linux kernel works. I’d been using Linux for ten years, but I still didn’t understand very well what my kernel did. While there, I found out that:

  • the Linux kernel source code isn’t all totally impossible to understand
  • kernel programming is not just for wizards, it can also be for me!
  • systems programming is REALLY INTERESTING
  • I could write toy kernel modules, for fun!
  • and, most surprisingly of all, all of this stuff was useful.

I hadn’t been doing low level programming at all – I’d written a little bit of C in university, and otherwise had been doing web development and machine learning. But it turned out that my newfound operating systems knowledge helped me solve regular programming tasks more easily.

A post by the same name as her presentation at Strange Loop 2014.

Another reason to study the Linux kernel: The closer to the metal your understanding, the more power you have over the results.

That’s true for the Linux kernel, machine learning algorithms, NLP, etc.

You can have a canned result prepared by someone else, which may be good enough, or you can bake something more to your liking.

I first saw this in a tweet by Felienne Hermans.

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples

September 19th, 2014

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples by Avinash Kaushik.

From the post:

The Core Problem: The Failure of Just Summarizing Performance.

I humbly believe the challenge is that in a world of too much data, with lots more on the way, there is a deep desire amongst executives to get “summarized data,” to get “just a snapshot,” or to get the “top-line view.” This is understandable of course.

But this summarization, snapshotting and toplining on your part does not actually change the business because of one foundational problem:

People who are closest to the data, the complexity, who’ve actually done lots of great analysis, are only providing data. They don’t provide insights and recommendations.

People who are receiving the summarized snapshot top-lined have zero capacity to understand the complexity, will never actually do analysis and hence are in no position to know what to do with the summarized snapshot they see.

The end result? Nothing.

Standstill. Gut based decision making. No real appreciation of the delicious opportunity in front of every single company on the planet right now to have a huger impact with data.

So what’s missing from this picture that will transform numbers into action?

I believe the solution is multi-fold (and when is it not? : )). We need to stop calling everything a dashboard. We need to create two categories of dashboards. For both categories, especially the valuable second kind of dashboards, we need words – lots of words and way fewer numbers.

Be aware that the implication of that last part I’m recommending is that you are going to become a lot more influential, and indispensable, to your organization. Not everyone is ready for that, but if you are this is going to be a fun ride!

A long post on “dashboards” but I find it relevant to the design of interfaces.

In particular the advice:

This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!

Hence: List the next steps. Assign responsibility for action items to people. Prioritize, prioritize, prioritize. Never forget to compute business impact.

Curious how exploration using a topic map could feed into an action process? Would you represent actors in the map and enable the creation of associations that represent assigned tasks? Other ideas?

I found this in a post, Don’t data puke, says Avinash Kaushik by Kaiser Fung and followed it to the original post.

Tokenizing and Named Entity Recognition with Stanford CoreNLP

September 19th, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP by Sujit Pal.

From the post:

I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Tool Kit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK’s, primarily as a way to learn NLP and Scala, but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java’s rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK’s. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP’s NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

Truly a hard-core way to learn NLP and Scala!


Looking forward to hearing more about this project.

Libraries may digitize books without permission, EU top court rules [Nation-wide Site Licenses?]

September 19th, 2014

Libraries may digitize books without permission, EU top court rules by Loek Essers.

From the post:

European libraries may digitize books and make them available at electronic reading points without first gaining consent of the copyright holder, the highest European Union court ruled Thursday.

The Court of Justice of the European Union (CJEU) ruled in a case in which the Technical University of Darmstadt digitized a book published by German publishing house Eugen Ulmer in order to make it available at its electronic reading points, but refused to license the publisher’s electronic textbooks.

A spot of good news to remember on the next 9/11 anniversary: a Member State may authorise libraries to digitise, without the consent of the rightholders, books they hold in their collection, so as to make them available at electronic reading points.

Users can’t make copies onto a USB stick, but under the contemporary fictions about property rights embodied in copyright statutes, that isn’t surprising.

What is surprising is that nations have not yet stumbled upon the idea of nation-wide site licenses for digital materials.

A nation acquiring a site license to the ACM Digital Library, IEEE, Springer and a dozen or so other resources/collections would have these positive impacts:

  1. Access to core computer science publications for everyone located in that nation
  2. Publishers would have one payor and could reduce/eliminate the staff that manage digital access subscriptions
  3. Universities and colleges would not require subscriptions nor the staff to manage those subscriptions (integration of those materials into collections would remain a library task)
  4. Simplify access software based on geographic IP location (fewer user/password issues)
  5. Universities and colleges could spend funds now dedicated to subscriptions for other materials
  6. Digitization of both periodical and monograph literature would be encouraged
  7. Avoids tiresome and not-likely-to-succeed arguments about balancing the public interest in IP rights discussions.

For me, #7 is the most important advantage of nation-wide licensing of digital materials. As you can tell by my reference to “contemporary fictions about property rights” I fall quite firmly on a particular side of the digital rights debate. However, I am more interested in gaining access to published materials for everyone than trying to convince others of the correctness of my position. Therefore, let’s adopt a new strategy: “Pay the man.”

As I outline above, there are obvious financial advantages to publishers from nation-wide site licenses, in the form of reduced internal costs, reduced infrastructure costs and a greater certainty in cash flow. There are advantages for the public as well as universities and colleges, so I would call that a win-win solution.

The Developing World Initiatives by Taylor & Francis is described as:

Taylor & Francis Group is committed to the widest distribution of its journals to non-profit institutions in developing countries. Through agreements with worldwide organisations, academics and researchers in more than 110 countries can access vital scholarly material, at greatly reduced or no cost.

Why limit access to materials to “non-profit institutions in developing countries?” Granted, the site-license fees for the United States would be higher than for Liberia, but the underlying principle is the same. The less you regulate access, the simpler the delivery model and the higher the profit to the publisher. What publisher would object to that?

There are armies of clerks currently invested in the maintenance of one-off subscription models but the greater public interest in access to materials consistent with publisher IP rights should carry the day.

If Tim O’Reilly and friends are serious about changing access models to information, let’s use nation-wide site licenses to eliminate firewalls and make effective linking and transclusion a present day reality.

Publishers get paid, readers get access. It’s really that simple. Just on a larger scale than is usually discussed.

PS: Before anyone raises the issue of cost for nation-wide site licenses, remember that the United States has spent more than $1 trillion on a “war” on terrorism that has made no progress in making the United States or its citizens more secure.

If the United States had decided to pay Springer Science+Business Media the €866m ($1,113.31m) total revenue it made in 2012, then for the cost of its “war” on terrorism it could have purchased a site license to all Springer Science+Business Media content for the entire United States for 898.47 years. (Check my math: 1,000,000,000,000 / 1,113,000,000 = 898.472.)
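The arithmetic is easy to check:

```python
# Back-of-the-envelope check of the post's math: a $1 trillion "war on
# terror" budget divided by Springer's 2012 revenue (€866m, taken at the
# post's conversion of roughly $1,113m).

war_on_terror_usd = 1_000_000_000_000
springer_2012_revenue_usd = 1_113_000_000

years = war_on_terror_usd / springer_2012_revenue_usd
print(round(years, 2))  # 898.47
```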

I first saw this in Nat Torkington’s Four short links: 15 September 2014.

Common Sense and Statistics

September 18th, 2014

Common Sense and Statistics by John D. Cook.

From the post:

…, common sense is vitally important in statistics. Attempts to minimize the need for common sense can lead to nonsense. You need common sense to formulate a statistical model and to interpret inferences from that model. Statistics is a layer of exact calculation sandwiched between necessarily subjective formulation and interpretation. Even though common sense can go badly wrong with probability, it can also do quite well in some contexts. Common sense is necessary to map probability theory to applications and to evaluate how well that map works.

No matter how technical or complex an analysis may appear, do not hesitate to ask for explanations if the data or results seem “off” to you. I witnessed a presentation several years ago in which the manual for a statistics package was cited as authority for the proposition that a result was significant.

I know you have never encountered that situation but you may know others who have.

Never fear asking questions about methods or results. Your colleagues are wondering the same things but are too afraid of appearing ignorant to ask questions.

Ignorance is curable. Willful ignorance is not.

If you aren’t already following John D. Cook, you should.

Learn Datalog Today

September 18th, 2014

Learn Datalog Today by Jonas Enlund.

From the homepage:

Learn Datalog Today is an interactive tutorial designed to teach you the Datomic dialect of Datalog. Datalog is a declarative database query language with roots in logic programming. Datalog has expressive power similar to that of SQL.

Datomic is a new database with an interesting and novel architecture, giving its users a unique set of features. You can read more about Datomic, and its architecture is described in some detail in this InfoQ article.


You have been meaning to learn Datalog but it just hasn’t happened.

Now is the time to break that cycle and do the deed!

This interactive tutorial should ease you on your way to learning Datalog.

It can’t learn Datalog for you but it can make the journey a little easier.
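To give a flavor of the style of query the tutorial teaches, here is a toy pattern matcher over (entity, attribute, value) triples. It mimics the shape of Datomic’s Datalog (variables, conjunctive patterns joined on shared variables) but none of its engine or actual syntax; the facts and names are invented:

```python
# Minimal Datalog-flavored querying: a database of (entity, attribute, value)
# triples, patterns where strings starting with "?" are variables, and a
# query that is a conjunction of patterns joined on shared variables.

DB = [
    ("sally", ":age", 21),
    ("fred", ":age", 42),
    ("ethel", ":age", 42),
    ("fred", ":likes", "pizza"),
    ("sally", ":likes", "opera"),
]

def is_var(x):
    return isinstance(x, str) and x.startswith("?")

def match(pattern, triple, env):
    """Unify one pattern against one triple under bindings env, or None."""
    env = dict(env)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in env and env[p] != t:
                return None               # conflicting binding
            env[p] = t
        elif p != t:
            return None                   # constant mismatch
    return env

def query(patterns, db):
    """Conjunctive query: thread bindings through each pattern in turn."""
    envs = [{}]
    for pattern in patterns:
        envs = [e2 for env in envs for triple in db
                if (e2 := match(pattern, triple, env)) is not None]
    return envs

# Who is 42, and what do they like?
results = query([("?e", ":age", 42), ("?e", ":likes", "?x")], DB)
print(results)  # [{'?e': 'fred', '?x': 'pizza'}]
```

The join on the shared variable `?e` is the essence of Datalog queries; the tutorial’s exercises build up exactly this idea in Datomic’s `[:find ... :where ...]` notation.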


From Frequency to Meaning: Vector Space Models of Semantics

September 18th, 2014

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.


Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010, but mastery of this paper will certainly give you the foundation to do so. Impressive work.
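The survey’s first category, term–document matrices, can be sketched with a minimal example. The tiny corpus and raw-count weighting below are mine, standing in for the tf-idf weighting and smoothing choices the paper discusses:

```python
# A minimal term-document VSM: rows are vocabulary terms, columns are
# documents, cells are raw counts, and document similarity is the cosine
# of the angle between column vectors.

import math
from collections import Counter

docs = {
    "d1": "cats chase mice",
    "d2": "dogs chase cats",
    "d3": "stocks fell sharply",
}

vocab = sorted({w for text in docs.values() for w in text.split()})

def vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2, v3 = (vector(docs[d]) for d in ("d1", "d2", "d3"))
print(round(cosine(v1, v2), 3))  # 0.667 -- d1 and d2 share terms
print(cosine(v1, v3))            # 0.0   -- no shared terms
```

This is the “frequency” end of the paper’s title; the “meaning” end comes from the observation that documents (or words) with similar vectors tend to be about similar things.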

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.

I don’t dispute that VSMs have been incredibly useful and like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)


I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which has no preordained metric. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.

2015 Medical Subject Headings (MeSH) Now Available

September 18th, 2014

2015 Medical Subject Headings (MeSH) Now Available

From the post:

Introduction to MeSH 2015
The Introduction to MeSH 2015 is now available, including information on its use and structure, as well as recent updates and availability of data.

MeSH Browser
The default year in the MeSH Browser remains 2014 MeSH for now, but the alternate link provides access to 2015 MeSH. The MeSH Section will continue to provide access via the MeSH Browser for two years of the vocabulary: the current year and an alternate year. Sometime in November or December, the default year will change to 2015 MeSH and the alternate link will provide access to the 2014 MeSH.

Download MeSH
Download 2015 MeSH in XML and ASCII formats. Also available for 2015 from the same MeSH download page are:

  • Pharmacologic Actions (Forthcoming)
  • New Headings with Scope Notes
  • MeSH Replaced Headings
  • MeSH MN (tree number) changes
  • 2015 MeSH in MARC format


Convince your boss to use Clojure

September 17th, 2014

Convince your boss to use Clojure by Eric Normand.

From the post:

Do you want to get paid to write Clojure? Let’s face it. Clojure is fun, productive, and more concise than many languages. And probably more concise than the one you’re using at work, especially if you are working in a large company. You might code in Clojure at home. Or maybe you want to get started in Clojure but don’t have time if it’s not for work.

One way to get paid for doing Clojure is to introduce Clojure into your current job. I’ve compiled a bunch of resources for getting Clojure into your company.

Take these resources and do your homework. Bringing a new language into an existing company is not easy. I’ve summarized some of the points that stood out to me, but the resources are excellent so please have a look yourself.

Great strategy and list of resources for Clojure folks.

How would you adapt this strategy to topic maps and what resources are we missing?

I first saw this in a tweet by Christophe Lalanne.

Elementary Applied Topology

September 17th, 2014

Elementary Applied Topology by Robert Ghrist.

From the introduction:

What topology can do

Topology was built to distinguish qualitative features of spaces and mappings. It is good for, inter alia:

  1. Characterization: Topological properties encapsulate qualitative signatures. For example, the genus of a surface, or the number of connected components of an object, give global characteristics important to classification.
  2. Continuation: Topological features are robust. The number of components or holes is not something that should change with a small error in measurement. This is vital to applications in scientific disciplines, where data is always noisy.
  3. Integration: Topology is the premiere tool for converting local data into global properties. As such, it is rife with principles and tools (Mayer-Vietoris, Excision, spectral sequences, sheaves) for integrating from local to global.
  4. Obstruction: Topology often provides tools for answering feasibility of certain problems, even when the answers to the problems themselves are hard to compute. These characteristics, classes, degrees, indices, or obstructions take the form of algebraic-topological entities.

What topology cannot do

Topology is fickle. There is no recourse to tweaking epsilons should desiderata fail to be found. If the reader is a scientist or applied mathematician hoping that topological tools are a quick fix, take this text with caution. The reward of reading this book with care may be limited to the realization of new questions as opposed to new answers. It is not uncommon that a new mathematical tool contributes to applications not by answering a pressing question-of-the-day but by revealing a different (and perhaps more significant) underlying principle.

The text will require more than casual interest but what a tool to add to your toolbox!
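To take the simplest invariant from the excerpt, the number of connected components of a graph can be computed with a few lines of union-find; the graph below is a made-up example, but the robustness point stands: small perturbations of the data that preserve the edges cannot change this count.

```python
# Count connected components of an undirected graph with union-find.
# This is the "number of connected components" invariant from the excerpt.

def count_components(n_vertices, edges):
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_vertices
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                # merge two components
            components -= 1
    return components

# Two triangles plus an isolated vertex: 3 components.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5)]
print(count_components(7, edges))  # 3
```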

I first saw this in a tweet from Topology Fact.

Stuff Goes Bad: Erlang in Anger

September 17th, 2014

Stuff Goes Bad: Erlang in Anger by Fred Hebert.

From the webpage:

This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

From the introduction:

This book is not for beginners. There is a gap left between most tutorials, books, training sessions, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production. There’s a fumbling phase implicit to a programmer’s learning of a new language and environment where they just have to figure out how to get out of the guidelines and step into the real world, with the community that goes with it.

This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit — usually when I consider them tricky — and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary.

What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices about deploying Erlang in a production environment. (footnote numbers omitted)

With exercises no less.

Reminds me of a book I had some years ago on causing and then debugging Solr core dumps. ;-) I don’t think it was ever a best seller but it was a fun read.

Great title by the way.

I first saw this in a tweet by Christopher Meiklejohn.

Understanding weak isolation is a serious problem

September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single-server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.
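To see concretely why weak isolation matters, here is a deterministic sketch of the classic lost-update anomaly, which serializability forbids but which weaker levels such as Read Committed can permit. The “transactions” below are just interleaved reads and writes on a dictionary, not a real database:

```python
# Lost update under weak isolation: two transactions read the same row,
# then both write back a value computed from their stale reads, so one
# increment silently disappears.

balance = {"acct": 100}

def deposit(read_value, amount):
    # Read-modify-write against a previously read (possibly stale) value.
    return read_value + amount

# Both transactions read before either commits its write.
t1_read = balance["acct"]   # T1 reads 100
t2_read = balance["acct"]   # T2 reads 100

balance["acct"] = deposit(t1_read, 10)   # T1 commits: 110
balance["acct"] = deposit(t2_read, 20)   # T2 commits: 120, clobbering T1

print(balance["acct"])  # 120, not the serializable result of 130
```

A serializable database would force one of the two transactions to abort or to see the other’s write; under weaker levels, whether this interleaving is possible depends on exactly the fine print Peter is pointing at.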

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

Call for Review: HTML5 Proposed Recommendation Published

September 17th, 2014

Call for Review: HTML5 Proposed Recommendation Published

From the post:

The HTML Working Group has published a Proposed Recommendation of HTML5. This specification defines the 5th major revision of the core language of the World Wide Web: the Hypertext Markup Language (HTML). In this version, new features are introduced to help Web application authors, new elements are introduced based on research into prevailing authoring practices, and special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability. Comments are welcome through 14 October. Learn more about the HTML Activity.

Now would be the time to submit comments, corrections, etc.

Deadline: 14 October 2014.

International Conference on Machine Learning 2014 Videos!

September 17th, 2014

International Conference on Machine Learning 2014 Videos!

You may recall my post on the ICML 2014 papers.

Speaking just for myself, I would prefer a resource with both the videos and relevant papers listed together.

Do you know of such a resource?

If not, when time permits I may conjure one up.

As with disclosed semantic mappings, it is more efficient if one person creates a mapping that is reused by many, as opposed to many separate and partial repetitions of the same mapping.

You may remember that one mapping, many reuses, is a central principle of indexes, library catalogs, filing systems, case citations, etc.

I first saw this in a tweet by EyeWire.

ODNI and the U.S. DOJ Commemorate 9/11

September 17th, 2014

Statement by the ODNI and the U.S. DOJ on the Declassification of Documents Related to the Protect America Act Litigation September 11, 2014

What better way to mark the anniversary of 9/11 than with a fuller account of another attack on the United States of America and its citizens. This attack came not from a small band of criminals but from a betrayal of the United States by those sworn to protect the rights of its citizens.

From the post:

On January 15, 2009, the U.S. Foreign Intelligence Surveillance Court of Review (FISC-R) published an unclassified version of its opinion in In Re: Directives Pursuant to Section 105B of the Foreign Intelligence Surveillance Act, 551 F.3d 1004 (Foreign Intel. Surv. Ct. Rev. 2008). The classified version of the opinion was issued on August 22, 2008, following a challenge by Yahoo! Inc. (Yahoo!) to directives issued under the Protect America Act of 2007 (PAA). Today, following a renewed declassification review, the Executive Branch is publicly releasing various documents from this litigation, including legal briefs and additional sections of the 2008 FISC-R opinion, with appropriate redactions to protect national security information. These documents are available at the website of the Office of the Director of National Intelligence (ODNI), and at ODNI’s public website dedicated to fostering greater public visibility into the intelligence activities of the U.S. Government. A summary of the underlying litigation follows.

In case you haven’t been following along, the crux of the case was Yahoo’s refusal on Fourth Amendment grounds to comply with a fishing expedition by the Director of National Intelligence and the Attorney General for information on one or more alleged foreign nationals. Motion to Compel Compliance with Directives of the Director of National Intelligence and Attorney General.

Not satisfied with violating their duties to uphold the Constitution, the DNI and AG decided to add strong-arming/extortion to their list of crimes. The government sought civil contempt fines that started at $250,000 per day and then doubled each week that Yahoo! failed to comply with the court’s judgment. Government’s Motion for an Order of Civil Contempt.

Take care to note that all of this occurred in absolute secrecy. It would not do to have other corporations or the American public aware that rogue elements in the government were deciding which rights citizens of the United States enjoy and which ones they don’t.

You may also want to read Highlights from the Newly Declassified FISCR Documents by Marc Zwillinger and Jacob Sommer. They are the lawyers who represented Yahoo in the challenge covered by the released documents.

We all owe them a debt of gratitude for their hard work, but we also have to acknowledge that Yahoo, Zwillinger and Sommer were complicit in enabling the Foreign Intelligence Surveillance Court (FISC) and the Foreign Intelligence Surveillance Court of Review (FISCR) to continue their secret work.

Yes, Yahoo, Zwillinger and Sommer would have faced life-changing consequences had they gone public with what they did know, but everyone has a choice when faced with oppressive government action. You can, as the parties did in this case, further the popular fiction that mining user metadata is an effective (rather than merely convenient) tool against terrorism.

Or you can decide to “blow the whistle” on wasteful and illegal activities by the government in question.

Had Yahoo, Zwillinger or Sommer known of any data in their possession that was direct evidence of a terrorist attack or plot, they would have called the Office of the Director of National Intelligence, or at least their local FBI office. Yes? Wouldn’t any sane person do the same?

Ah, but you see, that’s the point. There wasn’t any such data. Not then. Not now. Read the affidavits, at least the parts that aren’t blacked out, and you get the distinct impression that the government is not only fishing, but fishing hopefully. “There might be something, somewhere, that somehow might be useful to somebody, but we don’t know” is a fair summary of the government’s position in the Yahoo case.

A better way to commemorate 9/11 next year would be with numerous brave souls taking the moral responsibility to denounce those who have betrayed their constitutional duties in the cause of fighting terrorism. I prefer occasional terrorism over the destruction of the Constitution of the United States.


I started the trail that lead to this post from a tweet by Ben Gilbert.

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

September 16th, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)
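The excerpt’s point is easy to demonstrate. Here is a toy birth-death simulation (my own sketch, not from the article): two simulated “cells” are given identical production and degradation rates for a protein, yet the random timing of individual molecular events alone makes their counts diverge.

```python
import random

def simulate_protein(steps, birth_rate=0.6, death_rate=0.01, seed=None):
    """Toy birth-death process: identical parameters, random outcomes."""
    rng = random.Random(seed)
    count = 0
    for _ in range(steps):
        if rng.random() < birth_rate:
            count += 1  # a protein molecule is produced
        if rng.random() < death_rate * count:
            count -= 1  # one protein molecule degrades
    return count

# Two "identical twins": same rates, different random molecular bumping.
cell_a = simulate_protein(1000, seed=1)
cell_b = simulate_protein(1000, seed=2)
print(cell_a, cell_b)  # same parameters, yet the counts typically differ
```

Nothing about the two cells differs except which random events happened to fire when, which is exactly the “noise” the article describes.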

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. ;-)

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see on most mailing list discussions.

I first saw this in a tweet by John Horgan.

Getting Started with S4, The Self-Service Semantic Suite

September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP, but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.
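For flavor, here is roughly what a call against a keyed SPARQL-over-HTTP endpoint looks like. The endpoint URL, credential names, and request shape below are placeholders of my own, not the actual S4 API; consult Ontotext’s documentation for the real details.

```python
import base64
import json
import urllib.request

# Placeholder endpoint and credentials: NOT the real S4 URL or key format.
S4_ENDPOINT = "https://api.example.com/v1/sparql"
API_KEY = "your-api-key"
API_SECRET = "your-api-secret"

def build_sparql_request(query):
    """Build (but do not send) a keyed SPARQL-over-HTTP request."""
    body = json.dumps({"query": query}).encode("utf-8")
    req = urllib.request.Request(S4_ENDPOINT, data=body)
    req.add_header("Content-Type", "application/json")
    req.add_header("Accept", "application/sparql-results+json")
    # Basic auth with the developer key pair, a common pattern for keyed REST tiers.
    token = base64.b64encode(f"{API_KEY}:{API_SECRET}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

req = build_sparql_request(
    "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Ontotext> ?p ?o } LIMIT 5"
)
print(req.get_method(), req.get_full_url())
```

Passing the request to `urllib.request.urlopen` would send it; counting such calls against the 5,000-query monthly quota is left to you.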

I first saw this in a tweet by Tony Agresta.

Stephen Wolfram Launching Today: Mathematica Online! (w/ secret pricing)

September 16th, 2014

Launching Today: Mathematica Online! by Stephen Wolfram.

From the post:

It’s been many years in the making, and today I’m excited to announce the launch of Mathematica Online: a version of Mathematica that operates completely in the cloud—and is accessible just through any modern web browser.

In the past, using Mathematica has always involved first installing software on your computer. But as of today that’s no longer true. Instead, all you have to do is point a web browser at Mathematica Online, then log in, and immediately you can start to use Mathematica—with zero configuration.

Some of the advantages that Stephen outlines:

  • Manipulate can be embedded in any web page
  • Files are stored in the Cloud to be accessed from anywhere or easily shared
  • Mathematica can now be used on mobile devices

What’s the one thing that isn’t obvious from Stephen’s post?

The pricing for access to Mathematica Online.

A Wolfram insider, proofing Stephen’s post, probably said: “Oh, shit! Our pricing information is secret! What do you say in the post?”

So Stephen writes:

But get Mathematica Online too (which is easy to do—through Premier Service Plus for individuals, or a site license add-on).

You do that, or at least try to do that. If you manage to hunt down Premier Service, you will find you need an activation key in order to possibly get the pricing information.

If you don’t have a copy of Mathematica, you aren’t going to be ordering Mathematica Online today.

Sad that such remarkable software has such poor marketing.

Shout out to Stephen: Lots of people are interested in using Mathematica, Online or off. Byzantine marketing excludes waiting, would-be paying customers.

I first saw this in a tweet by Alex Popescu.

A $23 million venture fund for the government tech set

September 16th, 2014

A $23 million venture fund for the government tech set by Nancy Scola.

Nancy tells a compelling story of a new VC firm, GovTech, which is looking for startups focused on providing governments with better technology infrastructure.

Three facts from the story stand out:

“The U.S. government buys 10 eBays’ worth of stuff just to operate,” from software to heavy-duty trucking equipment.

…working with government might be a tortuous slog, but Bouganim says that he saw that behind that red tape lay a market that could be worth in the neighborhood of $500 billion a year.

What most people don’t realize is government spends nearly $74 billion on technology annually. As a point of comparison, the video game market is a $15 billion annual market.

See Nancy’s post for the full flavor of the story but it sounds like there is gold buried in government IT.

Another way to look at it is the government is already spending $74 billion a year on technology that is largely an object of mockery and mirth. Effective software may be sufficiently novel and threatening to either attract business or a buy-out.

While you are pondering possible opportunities, existing systems, their structures and data are “subjects” in topic map terminology. Which means topic maps can protect existing contracts and relationships, while delivering improved capabilities and data.

Promote topic maps as “in addition to” existing IT systems and you will encounter less resistance both from within and without the government.

Don’t be squeamish about associating with governments, of whatever side. Their money spends just like everyone else’s. You can ask AT&T and IBM about supporting both sides in a conflict.

I first saw this in a tweet by Mike Bracken.

User Onboarding

September 16th, 2014

User Onboarding by Samuel Hulick.

From the webpage:

Want to see how popular web apps handle their signup experiences? Here’s every one I’ve ever reviewed, in one handy list.

I have substantially altered Samuel’s presentation to fit the list onto one screen and to open new tabs, enabling quick comparison of onboarding experiences.

Asana iOS Instagram OkCupid Slingshot
Basecamp InVision Optimizely Snapchat
Buffer LessAccounting Pinterest Trello
Evernote LiveChat Pocket Tumblr
Foursquare Mailbox for Mac Quora Twitter
GetResponse Meetup Shopify Vimeo
Gmail Netflix Slack WhatsApp

Writers become better by reading good writers.

Non-random good onboarding comes from studying previous good onboarding.


I first saw this in a tweet by Jason Ziccardi.

Shanghai Library adds 2 million records to WorldCat…

September 16th, 2014

Shanghai Library adds 2 million records to WorldCat to share its collection with the world.

From the post:

Shanghai Library, the largest public library in China and one of the largest libraries in the world, has contributed 2 million holdings to WorldCat, including some 770,000 unique bibliographic records, to share its collection worldwide.

These records, which represent books and journals published between 1911 and 2013, were loaded in WorldCat earlier this year. The contribution from Shanghai Library, an OCLC member since 1996, enhances the richness and depth of Chinese materials in WorldCat as well as the discoverability of these collections around the world.

“We are pleased to add Shanghai Library’s holdings to WorldCat, which is the global union catalog of library collections,” said Dr. Jianzhong Wu, Director, Shanghai Library. “Shanghai is a renowned, global city, and the library should be as well. With WorldCat, we not only raise the visibility of our collection to a global level but we also share our national heritage and identity with other libraries and their users through the OCLC WorldShare Interlibrary Loan service.”

“The leadership of Shanghai Library has a bold global vision,” says Andrew H. Wang, Vice President, OCLC Asia Pacific. “The addition of Shanghai Library’s holdings and unique records enriches coverage of the Chinese collection in WorldCat for researchers everywhere.”

I don’t have a feel for how many unique Chinese bibliographic records are online but 770,000 sounds like a healthy addition.

You may also be interested in: Online Resources for Chinese Studies in North American Libraries, compiled by Ming POON, Josephine SCHE, and Mi Chu WIENS (November 2004).

Given the compilation date, 2004, I ran the W3C Link Checker on

You can review the results at:

Summary of results:

  • (N/A), 6 occurrences: The link was not checked due to robots exclusion rules. Check the link manually, and see also the link checker documentation on robots exclusion.
  • (N/A), 2 occurrences: The hostname could not be resolved. Check the link for typos.
  • 403, 1 occurrence: The link is forbidden! This needs fixing. Usual suspects: a missing index.html or Overview.html, or a missing ACL.
  • 404, 61 occurrences: The link is broken. Double-check that you have not made a typo or a mistake in copy-pasting. If the link points to a resource that no longer exists, you may want to remove or fix the link.
  • 500, 5 occurrences: This is a server side problem. Check the URI.

(emphasis added)

At a minimum, the broken links need to be corrected, but updating the listing to include new resources would make a nice graduate student project.
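If you take up the project, a first pass over the links can be automated along the lines of the checker’s summary categories. A minimal sketch of my own (the real W3C checker also honors robots exclusion rules, which this does not):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_link(url, opener=urlopen):
    """Classify a link roughly the way the W3C checker's summary does."""
    try:
        opener(Request(url, method="HEAD"), timeout=10)
        return "ok"
    except HTTPError as e:
        if e.code == 404:
            return "broken: remove or fix the link"
        if e.code == 403:
            return "forbidden: check index files and ACLs"
        return f"server-side problem ({e.code})"
    except URLError:
        return "hostname could not be resolved: check for typos"

# Example: check_link("http://example.com/") against the live web,
# or pass a stub opener to triage results offline.
```

The `opener` parameter lets you test the triage logic without network access, then run it for real with the default `urlopen`.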

I don’t have the background or language skills with Chinese resources to embark on such a project but would be happy to assist anyone who undertakes the task.

New Directions in Vector Space Models of Meaning

September 16th, 2014

New Directions in Vector Space Models of Meaning by Edward Grefenstette, Karl Moritz Hermann, Georgiana Dinu, and Phil Blunsom. (video)

From the description:

This is the video footage, aligned with slides, of the ACL 2014 Tutorial on New Directions in Vector Space Models of Meaning, by Edward Grefenstette (Oxford), Karl Moritz Hermann (Oxford), Georgiana Dinu (Trento) and Phil Blunsom (Oxford).

This tutorial was presented at ACL 2014 in Baltimore by Ed, Karl and Phil.

The slides can be found at

Running time is 2:45:12 so you had better get a cup of coffee before you start.

Includes a review of distributional models of semantics.

The sound isn’t bad but the acoustics are, so you will have to listen closely. Having the slides in front of you helps as well.

The semantics part starts to echo topic map theory with the realization that having a single token isn’t going to help you with semantics. Tokens don’t stand alone but in a context of other tokens. Each of which has some contribution to make to the meaning of a token in question.
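If distributional models are new to you, the core move can be sketched in a few lines (toy data of my own, not the tutorial’s): represent each word by its co-occurrence counts with context words, then compare words by the cosine of the angle between those count vectors.

```python
from collections import Counter
from math import sqrt

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
    "the dog ate the bone",
]

def cooccurrence_vector(target, sentences):
    """Count the words that appear in the same sentence as `target`."""
    vec = Counter()
    for s in sentences:
        words = s.split()
        if target in words:
            vec.update(w for w in words if w != target)
    return vec

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

cat, dog, cheese = (cooccurrence_vector(w, corpus) for w in ("cat", "dog", "cheese"))
print(cosine(cat, dog), cosine(cat, cheese))
# "cat" and "dog" share contexts (chased, the), so they come out more similar
```

The tutorial’s subject matter starts here and moves on to composing such vectors into meanings for phrases and sentences.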

Topic maps function in a similar way with the realization that identifying any subject of necessity involves other subjects, which have their own identifications. For some purposes, we may assume some subjects are sufficiently identified without specifying the subjects that in our view identify it, but that is merely a design choice that others may choose to make differently.

Working through this tutorial and the cited references (one advantage to the online version) will leave you with a background in vector space models and the contours of the latest research.

I first saw this in a tweet by Kevin Safford.

KaTeX
September 16th, 2014


From the webpage:

  • Fast: KaTeX renders its math synchronously and doesn’t need to reflow the page.
  • Print quality: KaTeX’s layout is based on Donald Knuth’s TeX, the gold standard for math typesetting.
  • Self contained: KaTeX has no dependencies and can easily be bundled with your website resources.
  • Server side rendering: KaTeX produces the same output regardless of browser or environment, so you can pre-render expressions using Node.js and send them as plain HTML.

Is it just a matter of time before someone implements TeX in JS and we have a typographic solution for the Web?

I first saw this in a tweet by John Resig.

GraphX: Graph Processing in a Distributed Dataflow Framework

September 15th, 2014

GraphX: Graph Processing in a Distributed Dataflow Framework by Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica.

Abstract:
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.

GraphX: Graph Processing in a Distributed Dataflow Framework (as PDF file)

The “other” systems for comparison were GraphLab and Giraph. Those systems were tuned in cooperation with experts in their use. These are some of the “fairest” benchmarks you are likely to see this year. Quite different from “shiny graph engine” versus lame or misconfigured system benchmarks.
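The paper’s central claim, that graph computation reduces to ordinary dataflow operators, can be sketched in plain Python. This is a toy single-machine analogue of the join/map/group-by pipeline, not GraphX’s actual API:

```python
# One PageRank iteration expressed as the dataflow steps GraphX recasts it into:
# join vertices with edges, map out contributions, group-by destination and sum.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = {v: 1.0 for v in {v for e in edges for v in e}}

def pagerank_step(ranks, edges, damping=0.85):
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    # "join": pair each edge with its source vertex's current rank;
    # "map": emit a contribution to the destination vertex.
    contribs = [(dst, ranks[src] / out_degree[src]) for src, dst in edges]
    # "group-by + reduce": sum the contributions arriving at each vertex.
    sums = {}
    for dst, c in contribs:
        sums[dst] = sums.get(dst, 0.0) + c
    return {v: (1 - damping) + damping * sums.get(v, 0.0) for v in ranks}

for _ in range(20):
    ranks = pagerank_step(ranks, edges)
print(ranks)
```

GraphX’s contribution is making exactly this decomposition fast in a distributed setting, via join optimizations and materialized view maintenance.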

Definitely the slow-read paper for this week!

I first saw this in a tweet by Arnon Rotem-Gal-Oz.

A Guide To Who Hates Whom In The Middle East

September 15th, 2014

A Guide To Who Hates Whom In The Middle East by John Brownlee.

John reviews an interactive visualization of players with an interest in the Middle East by David McCandless of Information is Beautiful.

The full interactive version of The Middle East Key players & notable relationships.

I would use this graphic with caution, mostly because if you select Jordan, it shows no relationship to Israel. As you know, Jordan signed a peace agreement with Israel twenty years ago and Israel recently agreed to sell gas to Jordan’s state-owned National Electric Power Co.

Nor does it show any relationship between Turkey and the United States. At the very least, the United States and Turkey have a complicated relationship. Would you include the reported pettiness of Senator John McCain towards Turkey in an enhanced map?

Not to take anything away from a useful way to explore the web of relationships in the Middle East but more in the nature of a request for a fuller story.

Uncovering Hidden Text on a 500-Year-Old Map That Guided Columbus

September 15th, 2014

Uncovering Hidden Text on a 500-Year-Old Map That Guided Columbus by Greg Miller.

Martellus map

Christopher Columbus probably used the map above as he planned his first voyage across the Atlantic in 1492. It represents much of what Europeans knew about geography on the verge of discovering the New World, and it’s packed with text historians would love to read—if only the faded paint and five centuries of wear and tear hadn’t rendered most of it illegible.

But that’s about to change. A team of researchers is using a technique called multispectral imaging to uncover the hidden text. They scanned the map last month at Yale University and expect to start extracting readable text in the next few months, says Chet Van Duzer, an independent map scholar who’s leading the project, which was funded by the National Endowment for the Humanities.

The map was made in or around 1491 by Henricus Martellus, a German cartographer working in Florence. It’s not known how many were made, but Yale owns the only surviving copy. It’s a big map, especially for its time: about 4 by 6.5 feet. “It’s a substantial map, meant to be hung on a wall,” Van Duzer said.

Extracting the text is going to take some effort but expectations are that high resolution images will appear at the Beinecke Digital Library at Yale in 2015.

Greg covers a number of differences between the Martellus map (1491) and the Waldseemüller map (1507), as well as their places in historical context.

You should pass this post on to any friends who think Columbus “discovered” that the world was round. I don’t see any end-of-the-world markers on the Martellus map.

Do you?

Norwegian Ethnological Research [The Early Years]

September 15th, 2014

Norwegian Ethnological Research [The Early Years] by Lars Marius Garshol.

From the post:

The definitive book on Norwegian farmhouse ale is Odd Nordland’s “Brewing and beer traditions in Norway,” published in 1969. That book is now sadly totally unavailable, except from libraries. In the foreword Nordland writes that the book is based on a questionnaire issued by Norwegian Ethnological Research in 1952 and 1957. After digging a little I discovered that this material is actually still available at the institute. The questionnaire is number 35, running to 103 questions.

Because the questionnaire responses in general often contain descriptions of quite personal matters, access to the answers is restricted. However, by paying a quite stiff fee, describing the research I wanted to use the material for, and signing a legal agreement, I was sent a CD with all the answers to questionnaire 35. The contents are quite daunting: 1264 numbered JPEG files, with no metadata of any kind. The files are scans of individual pages of responses, plus one cover page for each Norwegian province. Most of the responses are handwritten, and legibility varies dramatically. Some, happily, are typewritten.

I appended “[The Early Years]” to the title because Lars has embarked on an adventure that can last as long as he remains interested.

Sixty-two year old survey results leave Lars wondering exactly what was meant in some cases. Keep that in mind the next time you search for word usage across centuries. Matching exact strings isn’t the same thing as matching the meanings attached to those strings.

You can imagine what gaps and ambiguities might exist when the time period stretches to centuries, if not millennia, and our knowledge of the languages is learned in a modern context.

The understanding we capture is our own, which hopefully has some connection to earlier witnesses. Recording that process is a uniquely human activity and one that I am glad Lars is sharing with a larger audience.

Looking forward to hearing about more results!

PS: Do you have a similar “data mining” story to share? Command line tool stories are welcome, but so are stories of working with non-electronic resources.

Open source datacenter computing with Apache Mesos

September 15th, 2014

Open source datacenter computing with Apache Mesos by Sachin P. Bappalige.

From the post:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments more efficiently. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel—”cgroups” in Linux, “zones” in Solaris—to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to treat a large collection of heterogeneous resources as a single pool. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various web servers, databases and application servers.
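The two-level “resource offers” mechanism described above can be sketched as a toy simulation (my own illustration, not the real Mesos API): the master decides which frameworks see which resources, and each framework decides what to accept.

```python
# Toy model of Mesos-style two-level scheduling (not the real Mesos API):
# level 1, the master offers each agent's free resources to frameworks;
# level 2, each framework inspects the offer and accepts or declines.

def master_offers(agents, frameworks):
    """Run one round of offers; return the tasks launched."""
    launched = []
    for agent, free_cpus in agents.items():
        for fw in frameworks:
            accepted = fw["accept"](agent, free_cpus)
            if accepted:
                task, cpus = accepted
                free_cpus -= cpus  # the master deducts what was accepted
                launched.append((fw["name"], agent, task, cpus))
        agents[agent] = free_cpus
    return launched

agents = {"agent-1": 8, "agent-2": 4}
frameworks = [
    {"name": "spark", "accept": lambda a, cpus: ("spark-task", 4) if cpus >= 4 else None},
    {"name": "web",   "accept": lambda a, cpus: ("web-task", 1) if cpus >= 1 else None},
]

launched = master_offers(agents, frameworks)
print(launched)
```

The point of the split is that the master never needs to understand any framework’s scheduling logic; it only bookkeeps the pool while frameworks keep their own placement policies.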

This introduction to Apache Mesos gives you a quick overview of what Mesos has to offer without getting bogged down in details. Details will come later, whether you want to run a datacenter using Mesos or to map a datacenter already being run with it.