Archive for September, 2014

Know Your Algorithms and Data!

Sunday, September 21st, 2014

average of legs

If you let me pick the algorithm or the data, I can produce any result you want.
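As a small illustration of the point, here is a sketch in Python. The population counts are invented for the example (they are not real statistics); the only things that change are the choice of "average" and the choice of data, and the answer changes with them.

```python
from statistics import mean, median, mode

# Hypothetical population: 998 people with two legs, one with one leg,
# one with none. These counts are made up for illustration.
legs = [2] * 998 + [1, 0]

# Pick the algorithm: three "averages" of the same data.
print(mean(legs))    # arithmetic mean
print(median(legs))  # middle value
print(mode(legs))    # most common value

# Pick the data: quietly drop the rows you don't like.
two_legged_only = [n for n in legs if n == 2]
print(mean(two_legged_only))
```

With the mean, nearly everyone in this made-up population has more than the average number of legs; with the median or the mode, nobody does.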

Something to keep in mind when listening to reports of “facts.”

Or as Nietzsche would say:

There are no facts, only interpretations.

There are people who are so naive that they don’t realize interpretations other than theirs are possible. Avoid them unless you have need of followers for some reason.

I first saw this in a tweet by Chris Arnold.

Fixing Pentagon Intelligence [‘data glut but an information deficit’]

Sunday, September 21st, 2014

Fixing Pentagon Intelligence by John R. Schindler.

From the post:

The U.S. Intelligence Community (IC), that vast agglomeration of seventeen different hush-hush agencies, is an espionage behemoth without peer anywhere on earth in terms of budget and capabilities. Fully eight of those spy agencies, plus the lion’s share of the IC’s budget, belong to the Department of Defense (DoD), making the Pentagon’s intelligence arm something special. It includes the intelligence agencies of all the armed services, but the jewel in the crown is the National Security Agency (NSA), America’s “big ears,” with the National Geospatial-Intelligence Agency (NGA), which produces amazing imagery, following close behind.

None can question the technical capabilities of DoD intelligence, but do the Pentagon’s spies actually know what they are talking about? This is an important, and too infrequently asked, question. Yet it was more or less asked this week, in a public forum, by a top military intelligence leader. The venue was an annual Washington, DC, intelligence conference that hosts IC higher-ups while defense contractors attempt a feeding frenzy, and the speaker was Rear Admiral Paul Becker, who serves as the Director of Intelligence (J2) on the Joint Chiefs of Staff (JCS). A career Navy intelligence officer, Becker’s job is keeping the Pentagon’s military bosses in the know on hot-button issues: it’s a firehose-drinking position, made bureaucratically complicated because JCS intelligence support comes from the Defense Intelligence Agency (DIA), which is an all-source shop that has never been a top-tier IC agency, and which happens to have some serious leadership churn at present.

Admiral Becker’s comments on the state of DoD intelligence, which were rather direct, merit attention. Not surprisingly for a Navy guy, he focused on China. He correctly noted that we have no trouble collecting the “dots” of (alleged) 9/11 infamy, but can the Pentagon’s big battalions of intel folks actually derive the necessary knowledge from all those tasty SIGINT, HUMINT, and IMINT morsels? Becker observed — accurately — that DoD intelligence possesses a “data glut but an information deficit” about China, adding that “We need to understand their strategy better.” In addition, he rued the absence of top-notch intelligence analysts of the sort the IC used to possess, asking pointedly: “Where are those people for China? We need them.”

Admiral Becker’s:

“data glut but an information deficit” (emphasis added)

captures the essence of phone record subpoenas, mass collection of emails, etc., all designed to give the impression of frenzied activity, with no proof of effectiveness. That is an “information deficit.”

Be reassured: you can host a data glut in a topic map, so topic maps per se are not a threat to current data gluts. It is possible, however, to use topic maps over existing data gluts to create information and actionable intelligence without disturbing the underlying data gluts and their contractors.

I tried to find a video of Adm. Becker’s presentation but apparently the Intelligence and National Security Summit 2014 does not provide video recording of presentations. Whether that is to prevent any contemporaneous record of remarks being kept or just a matter of being low-tech kinda folks isn’t clear.

I can point out the meeting did have a known liar, “The Honorable James Clapper,” on the agenda. Hard to know if having perjured himself in front of Congress has made him gun shy of recorded speeches or not. (For Clapper’s latest “spin,” on “the least untruthful,” see: James Clapper says he misspoke, didn’t lie about NSA surveillance.) One hopes by next year’s conference Clapper will appear as: James Clapper, former DNI, convicted felon, Federal Prison Register #….

If you are interested in intelligence issues, you should be following John R. Schindler. His is a U.S. perspective, but handling intelligence issues with topic maps will vary in the details, not in the underlying principles, from one intelligence service to another.

Disclosure: I rag on the intelligence services of the United States due to greater access to public information on those services. Don’t take that as greater interest in how their operations could be improved by topic maps than in those of other intelligence services.

I am happy to discuss how your intelligence services can (or can’t) be improved by topic maps. There are problems, such as those discussed by Admiral Becker, that can’t be fixed by using topic maps. I will be as quick to point those out as I will problems where topic maps are relevant. My goal is your satisfaction that topic maps made a difference for you, not having a government entity in a billing database.

A Closed Future for Mathematics?

Sunday, September 21st, 2014

A Closed Future for Mathematics? by Eric Raymond.

From the post:

In a blog post on Computational Knowledge and the Future of Pure Mathematics Stephen Wolfram lays out a vision that is in many ways exciting and challenging. What if all of mathematics could be expressed in a common formal notation, stored in computers so it is searchable and amenable to computer-assisted discovery and proof of new theorems?

… to be trusted, the entire system will need to be transparent top to bottom. The design, the data representations, and the implementation code for its software must all be freely auditable by third-party mathematical topic experts and mathematically literate software engineers.

Eric identifies three (3) types of errors that may exist inside the proposed closed system from Wolfram.

Is transparency of a Wolfram solution the only way to trust a Wolfram solution?

For any operation or series of operations performed with Wolfram software, you could perform the same operation in one or more open or closed source systems and see if the results agree. The more often they agree for some set of operations the greater your confidence in those operations with Wolfram software.

That doesn’t mean that the next operation or a change in the order of operations is going to produce a trustworthy result. Just that for some specified set of operations in a particular order with specified data that you obtained the same result from multiple software solutions.

It could be that all the software solutions implement the same incorrect algorithm or implement the same valid algorithm incorrectly, or that there are errors in the search engines searching a mathematical database (which could only be evaluated against the data being searched).

Where N is the number of non-Wolfram software packages you are using to check the Wolfram-based solution and W represents the amount of work to obtain a solution, the total checking work required is N x W, on top of the work for the Wolfram solution itself.

In addition to not resulting in the trust Eric is describing, it is an increase in your workload.
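A minimal sketch of that kind of cross-checking, in Python. The function names and the choice of computation (a harmonic sum) are mine, picked only for illustration; the "reference" stands in for the system under test and the "checkers" for the N independent packages.

```python
import math
from fractions import Fraction

def harmonic_naive(n):
    """Reference implementation: plain floating-point summation."""
    return sum(1.0 / k for k in range(1, n + 1))

def harmonic_fsum(n):
    """Independent check 1: accurate floating-point summation via math.fsum."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

def harmonic_exact(n):
    """Independent check 2: exact rational arithmetic, converted at the end."""
    return float(sum(Fraction(1, k) for k in range(1, n + 1)))

def cross_check(reference, checkers, arg, rel_tol=1e-9):
    """Run the reference once and every checker once; report agreement."""
    expected = reference(arg)
    results = [check(arg) for check in checkers]
    agreed = all(math.isclose(expected, r, rel_tol=rel_tol) for r in results)
    extra_runs = len(checkers)  # N additional solutions, each costing about W
    return agreed, extra_runs

agreed, extra_runs = cross_check(harmonic_naive, [harmonic_fsum, harmonic_exact], 200)
print(agreed, extra_runs)
```

Agreement here says something about this one operation on this one input, nothing more, and each checker added is another full run of the work.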

I first saw this in a tweet by Michael Nielsen.

Medical Heritage Library (MHL)

Sunday, September 21st, 2014

Medical Heritage Library (MHL)

From the post:

The Medical Heritage Library (MHL) and DPLA are pleased to announce that MHL content can now be discovered through DPLA.

The MHL, a specialized research collection stored in the Internet Archive, currently includes nearly 60,000 digital rare books, serials, audio and video recordings, and ephemera in the history of medicine, public health, biomedical sciences, and popular medicine from the medical special collections of 22 academic, special, and public libraries. MHL materials have been selected through a rigorous process of curation by subject specialist librarians and archivists and through consultation with an advisory committee of scholars in the history of medicine, public health, gender studies, digital humanities, and related fields. Items, selected for their educational and research value, extend from 1235 (Liber Aristotil[is] de nat[u]r[a] a[nima]li[u]m ag[res]tium [et] marino[rum]), to 2014 (The Grog Issue 40 2014) with the bulk of the materials dating from the 19th century.

“The rich history of medicine content curated by the MHL is available for the first time alongside collections like those from the Biodiversity Heritage Library and the Smithsonian, and offers users a single access point to hundreds of thousands of scientific and history of science resources,” said DPLA Assistant Director for Content Amy Rudersdorf.

The collection is particularly deep in American and Western European medical publications in English, although more than a dozen languages are represented. Subjects include anatomy, dental medicine, surgery, public health, infectious diseases, forensics and legal medicine, gynecology, psychology, anatomy, therapeutics, obstetrics, neuroscience, alternative medicine, spirituality and demonology, diet and dress reform, tobacco, and homeopathy. The breadth of the collection is illustrated by these popular items: the United States Naval Bureau of Medical History’s audio oral history with Doctor Walter Burwell (1994) who served in the Pacific theatre during World War II and witnessed the first Japanese kamikaze attacks; History and medical description of the two-headed girl : sold by her agents for her special benefit, at 25 cents (1869), the first edition of Gray’s Anatomy (1858) (the single most-downloaded MHL text at more than 2,000 downloads annually), and a video collection of Hanna – Barbera Production Flintstones (1960) commercials for Winston cigarettes.

“As is clear from today’s headlines, science, health, and medicine have an impact on the daily lives of Americans,” said Scott H. Podolsky, chair of the MHL’s Scholarly Advisory Committee. “Vaccination, epidemics, antibiotics, and access to health care are only a few of the ongoing issues the history of which are well documented in the MHL. Partnering with the DPLA offers us unparalleled opportunities to reach new and underserved audiences, including scholars and students who don’t have access to special collections in their home institutions and the broader interested public.”

Quick links:

Digital Public Library of America

Internet Archive

Medical Heritage Library website

I remember the Flintstone commercials for Winston cigarettes. Not all that effective a campaign: I smoked Marlboros (reds in a box) for almost forty-five (45) years. 😉

As old vices die out, new ones, like texting and driving, take their place. On behalf of current and former smokers, I am confident that smoking was not a factor in 1,600,000 accidents per year and 11 teen deaths every day.

Apache Lucene and Solr 4.10

Sunday, September 21st, 2014

Apache Lucene and Solr 4.10

From the post:

Today Apache Lucene and Solr PMC announced another version of Apache Lucene library and Apache Solr search server numbered 4.10. This is a next release continuing the 4th version of both Apache Lucene and Apache Solr.

Here are some of the changes that were made comparing to the 4.9:


Lucene:

  • Simplified Version handling for analyzers
  • TermAutomatonQuery was added
  • Optimizations and bug fixes


Solr:

  • Ability to automatically add replicas in SolrCloud mode in HDFS
  • Ability to export full results set
  • Distributed support for facet.pivot
  • Optimizations and bugfixes from Lucene 4.9

Full changes list for Lucene can be found at Full list of changes in Solr 4.10 can be found at:

Apache Lucene 4.10 library can be downloaded from the following address: Apache Solr 4.10 can be downloaded at the following URL address: Please remember that the mirrors are just starting to update so not all of them will contain the 4.10 version of Lucene and Solr.

A belated note about Apache Lucene and Solr 4.10.
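For a feel of what one of the listed changes looks like from the client side, here is a hedged sketch in Python of a request using the `facet.pivot` parameter. The host, the core name `collection1`, and the field names `category` and `in_stock` are assumptions for illustration; the code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr with a core named "collection1"
# and documents that have "category" and "in_stock" fields.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def pivot_facet_url(fields, query="*:*"):
    """Build (but do not send) a select URL that asks for a pivot facet."""
    params = {
        "q": query,
        "rows": 0,                        # only facet counts, no documents
        "wt": "json",
        "facet": "true",
        "facet.pivot": ",".join(fields),  # nested counts: first field, then second
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = pivot_facet_url(["category", "in_stock"])
print(url)
```

Per the release notes quoted above, 4.10 is the release in which this kind of pivot request gained distributed support.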

I must have been distracted by the continued fumbling with the Ebola crisis. I no longer wonder how the international community would respond to an actual worldwide threat. In a word, ineffectively.

WWW 2015 Call for Research Papers

Saturday, September 20th, 2014

WWW 2015 Call for Research Papers

From the webpage:

Important Dates:

  • Research track abstract registration:
    Monday, November 3, 2014 (23:59 Hawaii Standard Time)
  • Research track full paper submission:
    Monday, November 10, 2014 (23:59 Hawaii Standard Time)
  • Notifications of acceptance:
    Saturday, January 17, 2015
  • Final Submission Deadline for Camera-ready Version:
    Sunday, March 8, 2015
  • Conference dates:
    May 18 – 22, 2015

Research papers should be submitted through EasyChair at:

For more than two decades, the International World Wide Web (WWW) Conference has been the premier venue for researchers, academics, businesses, and standard bodies to come together and discuss latest updates on the state and evolutionary path of the Web. The main conference program of WWW 2015 will have 11 areas (or themes) for refereed paper presentations, and we invite you to submit your cutting-edge, exciting, new breakthrough work to the relevant area. In addition to the main conference, WWW 2015 will also have a series of co-located workshops, keynote speeches, tutorials, panels, a developer track, and poster and demo sessions.

The list of areas for this year is as follows:

  • Behavioral Analysis and Personalization
  • Crowdsourcing Systems and Social Media
  • Content Analysis
  • Internet Economics and Monetization
  • Pervasive Web and Mobility
  • Security and Privacy
  • Semantic Web
  • Social Networks and Graph Analysis
  • Web Infrastructure: Datacenters, Content Delivery Networks, and Cloud Computing
  • Web Mining
  • Web Search Systems and Applications

Great conference, great weather (weather for Florence in May) and it is in Florence, Italy. What other reasons do you need to attend? 😉

Why news organizations need to invest in better linking and tagging

Saturday, September 20th, 2014

Why news organizations need to invest in better linking and tagging by Frédéric Filloux.

From the post:

Most media organizations are still stuck in version 1.0 of linking. When they produce content, they assign tags and links mostly to other internal content. This is done out of fear that readers would escape for good if doors were opened too wide. Assigning tags is not exact science: I recently spotted a story about the new pregnancy in the British royal family; it was tagged “demography,” as if it was some piece about Germany’s weak fertility rate.

But there is much more to come in that field. Two factors are at work: APIs and semantic improvements. APIs (Application Programming Interfaces) act like the receptors of a cell that exchanges chemical signals with other cells. It’s the way to connect a wide variety of content to the outside world. A story, a video, a graph can “talk” to and be read by other publications, databases, and other “organisms.” But first, it has to pass through semantic filters. From a text, the most basic tools extract sets of words and expressions such as named entities, patronyms, places.

Another higher level involves extracting meanings like “X acquired Y for Z million dollars” or “X has been appointed finance minister.” But what about a video? Some go with granular tagging systems; others, such as Ted Talks, come with multilingual transcripts that provide valuable raw material for semantic analysis. But the bulk of content remains stuck in a dumb form: minimal and most often unstructured tagging. These require complex treatments to make them “readable” by the outside world. For instance, an untranscribed video seen as interesting (say a Charlie Rose interview) will have to undergo a speech-to-text analysis to become usable. This process requires both human curation (finding out what content is worth processing) and sophisticated technology (transcribing a speech by someone speaking super-fast or with a strong accent).

Great piece on the value of more robust tagging by news organizations.

Rather than being an after-the-fact activity that follows publication, tagging needs to be part of the work flow that produces content. Tagging as a step in the process of content production avoids creating a mountain of untagged content.

To what end? Well, imagine simple tagging that associates a reporter with named sources in a report. When the subject of that report comes up in the future, wouldn’t it be a time saver to whistle up all the reporters on that subject with a list of their named contacts?

Never having worked at a newspaper, I can’t say for sure, but to an outsider that sounds like an advantage.
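The reporter-and-sources idea can be sketched in a few lines of Python. Everything named here (the `SourceIndex` class, its methods, the reporters and sources) is hypothetical and only meant to show the shape of the data: a subject tag leads to reporters, and each reporter to the named sources they used.

```python
from collections import defaultdict

class SourceIndex:
    """Maps a subject tag to the reporters who covered it and their named sources."""

    def __init__(self):
        # subject -> reporter -> set of named sources
        self._by_subject = defaultdict(lambda: defaultdict(set))

    def tag_story(self, subject, reporter, named_sources):
        """Record the tags at publication time, as part of the work flow."""
        self._by_subject[subject][reporter].update(named_sources)

    def contacts_for(self, subject):
        """Return {reporter: sorted list of named sources} for a subject."""
        reporters = self._by_subject.get(subject, {})
        return {r: sorted(s) for r, s in reporters.items()}

index = SourceIndex()
index.tag_story("city budget", "Reporter A", ["Source X", "Source Y"])
index.tag_story("city budget", "Reporter B", ["Source Y"])
index.tag_story("school board", "Reporter A", ["Source Z"])
print(index.contacts_for("city budget"))
```

Because the tags are recorded when the story is filed, the lookup later costs nothing extra.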

That lesson can be broadened to any company producing content. The data in the content had a point of origin: it was delivered by someone, reported by someone else, etc. Capture those relationships and track the ebb and flow of your data and not just the values it represents.

I first saw this in a tweet by Marin Dimitrov.

Growing a Language

Saturday, September 20th, 2014

Growing a Language by Guy L. Steele, Jr.

The first paper in a new series of posts from the Hacker School blog, “Paper of the Week.”

I haven’t found a good way to summarize Steele’s paper but can observe that a central theme is the growth of programming languages.

While enjoying the Steele paper, ask yourself how you would capture the changing nuances of a language, natural or artificial.


ApacheCon EU 2014

Saturday, September 20th, 2014

ApacheCon EU 2014

ApacheCon Europe 2014 – November 17-21 in Budapest, Hungary.

November is going to be here sooner than you think. You need to register now and start making travel arrangements.

A quick scroll down the schedule page will give you an idea of the breadth of the Apache Foundation’s activities.

219 million stars: a detailed catalogue of the visible Milky Way

Saturday, September 20th, 2014

219 million stars: a detailed catalogue of the visible Milky Way

From the post:

A new catalogue of the visible part of the northern part of our home Galaxy, the Milky Way, includes no fewer than 219 million stars. Geert Barentsen of the University of Hertfordshire led a team who assembled the catalogue in a ten year programme using the Isaac Newton Telescope (INT) on La Palma in the Canary Islands. Their work appears today in the journal Monthly Notices of the Royal Astronomical Society.

The production of the catalogue, IPHAS DR2 (the second data release from the survey programme The INT Photometric H-alpha Survey of the Northern Galactic Plane, IPHAS), is an example of modern astronomy’s exploitation of ‘big data’. It contains information on 219 million detected objects, each of which is summarised in 99 different attributes.

The new work appears in Barentsen et al, “The second data release of the INT Photometric Hα Survey of the Northern Galactic Plane (IPHAS DR2)”, Monthly Notices of the Royal Astronomical Society, vol. 444, pp. 3230-3257, 2014, published by Oxford University Press. A preprint version is available on the arXiv server.

The catalogue is accessible in queryable form via the VizieR service at the Centre de Données astronomiques de Strasbourg. The processed IPHAS images it is derived from are also publicly available.

At 219 million detected objects, each with 99 different attributes, that sounds like “big data” to me. 😉
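A quick back-of-the-envelope check of the scale, in Python. The object and attribute counts come from the post; the 8 bytes per value is purely my assumption (one 64-bit number per attribute), and the real catalogue format may store things quite differently.

```python
OBJECTS = 219_000_000      # detected objects, from the post
ATTRIBUTES = 99            # attributes per object, from the post
BYTES_PER_VALUE = 8        # assumption: one 64-bit number per attribute

values = OBJECTS * ATTRIBUTES
size_bytes = values * BYTES_PER_VALUE

print(f"{values:,} values")
print(f"{size_bytes / 1e9:.1f} GB under the 8-byte assumption")
```

That is over 21 billion individual values before any indexing or images are counted.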



Saturday, September 20th, 2014

Transducers by Rich Hickey. (Strange Loop, 2014)

Rich has another go at explaining transducers at Strange Loop 2014.

You may want to look at Felienne Hermans’s live blog on the presentation: Transducers – Rich Hickey
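If the term is new to you, here is a rough Python sketch of the core idea: a transducer takes a reducing function and returns a new reducing function, so the transformation is described independently of where the results end up. This is not Clojure's implementation, the names `mapping`, `filtering` and `compose` are my own, and early termination and completion steps are left out.

```python
from functools import reduce

def mapping(f):
    """Transducer: apply f to each item before passing it to the next step."""
    def transducer(step):
        def new_step(acc, x):
            return step(acc, f(x))
        return new_step
    return transducer

def filtering(pred):
    """Transducer: pass an item on only if pred(item) is true."""
    def transducer(step):
        def new_step(acc, x):
            return step(acc, x) if pred(x) else acc
        return new_step
    return transducer

def compose(*xforms):
    """Compose transducers so that items flow through them left to right."""
    def composed(step):
        for xf in reversed(xforms):
            step = xf(step)
        return step
    return composed

def append(acc, x):
    acc.append(x)
    return acc

# One transformation: keep even numbers, then square them.
xform = compose(filtering(lambda n: n % 2 == 0), mapping(lambda n: n * n))

result = reduce(xform(append), range(10), [])               # build a list
total = reduce(xform(lambda acc, x: acc + x), range(10), 0)  # or sum directly
print(result, total)
```

The same `xform` feeds a list in one call and a running sum in the other, with no intermediate collection built in between.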

I first saw Rich’s video in a tweet by Michael Klishin.

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

Friday, September 19th, 2014

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

From the post:

Gov.UK Verify is designed to overcome concerns about government setting up a central database of citizens’ identities to enable access to online public services – similar criticism led to the demise of the hugely unpopular identity card scheme set up under the Labour government.

Instead, users will register their details with one of several independent identity assurance providers – certified companies which will establish and verify a user’s identity outside government systems. When the user then logs in to a digital public service, the Verify system will electronically “ask” the external third-party provider to confirm the person is who they claim to be.
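The flow in that passage can be modeled as a toy in Python. This is only a sketch of the description quoted above, not of how Gov.UK Verify is actually built; the class names, method names, and the example user are all hypothetical.

```python
class IdentityProvider:
    """A certified third-party provider holding the identity records (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._verified_users = {}   # user_id -> verified full name

    def register(self, user_id, full_name):
        self._verified_users[user_id] = full_name

    def confirm(self, user_id, claimed_name):
        """Answer yes/no without handing over the stored record."""
        return (user_id in self._verified_users
                and self._verified_users[user_id] == claimed_name)


class VerifyHub:
    """Stands between a public service and the providers; stores no identities."""

    def __init__(self, providers):
        self._providers = {p.name: p for p in providers}

    def check_login(self, provider_name, user_id, claimed_name):
        provider = self._providers.get(provider_name)
        if provider is None:
            return False
        return provider.confirm(user_id, claimed_name)


provider = IdentityProvider("Provider One")
provider.register("u123", "Alex Example")
hub = VerifyHub([provider])
print(hub.check_login("Provider One", "u123", "Alex Example"))
print(hub.check_login("Provider One", "u123", "Someone Else"))
```

Note where the identity records live in this model: in the provider's dictionary, not in the hub.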


Help me make sure I am reading this story of citizen identity correctly.

Citizens are fearful of their government having a central database of citizens’ identities but are comfortable with commercial firms, regulated by the same government, managing those identities?

Do you think citizens of the UK are aware that commercial firms betray their customers to the U.S. government at the drop of a secret subpoena every day?

To say nothing of the failures of commercial firms to protect their customers’ data, when they aren’t using that data to directly manipulate their customers.

Strikes me as damned odd that anyone would trust commercial firms more than they would trust the government. Neither one is actually trustworthy.

Am I reading this story correctly?

I first saw this in a tweet by Richard Copley.

Named Entity Recognition: A Literature Survey

Friday, September 19th, 2014

Named Entity Recognition: A Literature Survey by Rahul Sharnagat.


In this report, we explore various methods that are applied to solve NER. In section 1, we introduce the named entity problem. In section 2, various named entity recognition methods are discussed in three broad categories of machine learning paradigm and explore few learning techniques in them. In the first part, we discuss various supervised techniques. Subsequently we move to semi-supervised and unsupervised techniques. In the end we discuss about the method from deep learning to solve NER.

If you are new to the named entity recognition issue or want to pass on an introduction, this may be the paper for you. It covers all the high points with a three-page bibliography to get you started in the literature.

I first saw this in a tweet by Christopher.

You can be a kernel hacker!

Friday, September 19th, 2014

You can be a kernel hacker! by Julia Evans.

From the post:

When I started Hacker School, I wanted to learn how the Linux kernel works. I’d been using Linux for ten years, but I still didn’t understand very well what my kernel did. While there, I found out that:

  • the Linux kernel source code isn’t all totally impossible to understand
  • kernel programming is not just for wizards, it can also be for me!
  • systems programming is REALLY INTERESTING
  • I could write toy kernel modules, for fun!
  • and, most surprisingly of all, all of this stuff was useful.

I hadn’t been doing low level programming at all – I’d written a little bit of C in university, and otherwise had been doing web development and machine learning. But it turned out that my newfound operating systems knowledge helped me solve regular programming tasks more easily.

Post by the same name as her presentation at Strange Loop 2014.

Another reason to study the Linux kernel: The closer to the metal your understanding, the more power you have over the results.

That’s true for the Linux kernel, machine learning algorithms, NLP, etc.

You can have a canned result prepared by someone else, which may be good enough, or you can bake something more to your liking.

I first saw this in a tweet by Felienne Hermans.

Update: Video of You can be a kernel hacker!

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples

Friday, September 19th, 2014

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples by Avinash Kaushik.

From the post:

The Core Problem: The Failure of Just Summarizing Performance.

I humbly believe the challenge is that in a world of too much data, with lots more on the way, there is a deep desire amongst executives to get “summarize data,” to get “just a snapshot,” or to get the “top-line view.” This is understandable of course.

But this summarization, snapshoting and toplining on your part does not actually change the business because of one foundational problem:

People who are closest to the data, the complexity, who’ve actually done lots of great analysis, are only providing data. They don’t provide insights and recommendations.

People who are receiving the summarized snapshot top-lined have zero capacity to understand the complexity, will never actually do analysis and hence are in no position to know what to do with the summarized snapshot they see.

The end result? Nothing.

Standstill. Gut based decision making. No real appreciation of the delicious opportunity in front of every single company on the planet right now to have a huger impact with data.

So what’s missing from this picture that will transform numbers into action?

I believe the solution is multi-fold (and when is it not? : )). We need to stop calling everything a dashboard. We need to create two categories of dashboards. For both categories, especially the valuable second kind of dashboards, we need words – lots of words and way fewer numbers.

Be aware that the implication of that last part I’m recommending is that you are going to become a lot more influential, and indispensable, to your organization. Not everyone is ready for that, but if you are this is going to be a fun ride!

A long post on “dashboards” but I find it relevant to the design of interfaces.

In particular the advice:

This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!

Hence: List the next steps. Assign responsibility for action items to people. Prioritize, prioritize, prioritize. Never forget to compute business impact.

Curious how exploration using a topic map could feed into an action process? Would you represent actors in the map and enable the creation of associations that represent assigned tasks? Other ideas?
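One way to picture the question, as a toy in Python: actors, findings and actions are topics, and an assigned task is an association that ties them together through named roles. This is a simplified sketch of my own, not an implementation of the Topic Maps data model, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Topic:
    """A subject in the map, e.g. a person, a finding, or an action."""
    identifier: str
    kind: str

@dataclass
class Association:
    """A typed relationship; roles map a role name to the topic playing it."""
    kind: str
    roles: dict

@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_topic(self, identifier, kind):
        topic = Topic(identifier, kind)
        self.topics[identifier] = topic
        return topic

    def assign_task(self, actor, finding, action):
        """Turn an explored finding into an action item owned by an actor."""
        assoc = Association("assigned-task",
                            {"assignee": actor, "about": finding, "action": action})
        self.associations.append(assoc)
        return assoc

    def tasks_for(self, actor):
        return [a for a in self.associations
                if a.kind == "assigned-task" and a.roles["assignee"] == actor]

tm = TopicMap()
analyst = tm.add_topic("analyst-1", "actor")
finding = tm.add_topic("checkout-drop-off", "finding")
action = tm.add_topic("test-shorter-form", "action")
tm.assign_task(analyst, finding, action)
print(len(tm.tasks_for(analyst)))
```

Since the task is just another association, it can be explored, merged and queried like everything else in the map.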

I found this in a post, Don’t data puke, says Avinash Kaushik by Kaiser Fung and followed it to the original post.

Tokenizing and Named Entity Recognition with Stanford CoreNLP

Friday, September 19th, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP by Sujit Pal.

From the post:

I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Tool Kit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK’s, primarily as a way to learn NLP and Scala but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java’s rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK’s. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP’s NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

Truly a hard core way to learn NLP and Scala!


Looking forward to hearing more about this project.

Libraries may digitize books without permission, EU top court rules [Nation-wide Site Licenses?]

Friday, September 19th, 2014

Libraries may digitize books without permission, EU top court rules by Loek Essers.

From the post:

European libraries may digitize books and make them available at electronic reading points without first gaining consent of the copyright holder, the highest European Union court ruled Thursday.

The Court of Justice of the European Union (CJEU) ruled in a case in which the Technical University of Darmstadt digitized a book published by German publishing house Eugen Ulmer in order to make it available at its electronic reading posts, but refused to license the publisher’s electronic textbooks.

A spot of good news to remember on the next 9/11 anniversary: a Member State may authorise libraries to digitise, without the consent of the rightholders, books they hold in their collection so as to make them available at electronic reading points.

Users can’t make copies onto a USB stick, but under the contemporary fictions about property rights represented in copyright statutes, that isn’t surprising.

What is surprising is that nations have not yet stumbled upon the idea of nation-wide site licenses for digital materials.

A nation acquiring a site license to the ACM Digital Library, IEEE, Springer and a dozen or so other resources/collections would have these positive impacts:

  1. Access to core computer science publications for everyone located in that nation
  2. Publishers would have one payor and could reduce/eliminate the staff that manage digital access subscriptions
  3. Universities and colleges would not require subscriptions nor the staff to manage those subscriptions (integration of those materials into collections would remain a library task)
  4. Simpler access software based on geographic IP location (fewer user/password issues)
  5. Universities and colleges could spend funds now dedicated to subscriptions for other materials
  6. Digitization of both periodical and monograph literature would be encouraged
  7. Avoids tiresome and not-likely-to-succeed arguments about balancing the public interest in IP rights discussions.

For me, #7 is the most important advantage of nation-wide licensing of digital materials. As you can tell by my reference to “contemporary fictions about property rights” I fall quite firmly on a particular side of the digital rights debate. However, I am more interested in gaining access to published materials for everyone than trying to convince others of the correctness of my position. Therefore, let’s adopt a new strategy: “Pay the man.”

As I outline above, there are obvious financial advantages to publishers from nation-wide site licenses, in the form of reduced internal costs, reduced infrastructure costs and a greater certainty in cash flow. There are advantages for the public as well as universities and colleges, so I would call that a win-win solution.

The Developing World Initiatives by Taylor & Francis is described as:

Taylor & Francis Group is committed to the widest distribution of its journals to non-profit institutions in developing countries. Through agreements with worldwide organisations, academics and researchers in more than 110 countries can access vital scholarly material, at greatly reduced or no cost.

Why limit access to materials to “non-profit institutions in developing countries?” Granted, the site-license fees for the United States would be higher than those for Liberia, but the underlying principle is the same. The less you regulate access, the simpler the delivery model and the higher the profit to the publisher. What publisher would object to that?

There are armies of clerks currently invested in the maintenance of one-off subscription models but the greater public interest in access to materials consistent with publisher IP rights should carry the day.

If Tim O’Reilly and friends are serious about changing access models to information, let’s use nation-wide site licenses to eliminate firewalls and make effective linking and transclusion a present day reality.

Publishers get paid, readers get access. It’s really that simple. Just on a larger scale than is usually discussed.

PS: Before anyone raises the issue of cost for nation-wide site licenses, remember that the United States has spent more than $1 trillion in a “war” on terrorism that has made no progress in making the United States or its citizens more secure.

If the United States had decided to pay Springer Science+Business Media the €866m ($1113.31m) total revenue it made in 2012, then for the cost of its “war” on terrorism it could have purchased a site license to all Springer Science+Business Media content for the entire United States for 898.47 years. (Check my math: 1,000,000,000,000 / 1,113,000,000 = 898.472.)
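For anyone who does want to check the math, here is a minimal sketch. It assumes the same rounded figures used above ($1 trillion for the “war” and $1,113 million for Springer's 2012 revenue); the function name is mine, not anything official.

```python
# Figures as rounded in the post; both are rough estimates, not audited numbers.
WAR_COST_USD = 1_000_000_000_000        # "more than $1 trillion"
SPRINGER_REVENUE_USD = 1_113_000_000    # 2012 revenue, rounded to $1,113m


def years_of_license(budget: float, annual_fee: float) -> float:
    """Return how many years a fixed budget covers a fixed annual fee."""
    return budget / annual_fee


years = years_of_license(WAR_COST_USD, SPRINGER_REVENUE_USD)
print(f"{years:.2f} years of nation-wide site license")
```

Using the unrounded $1,113.31m figure instead shifts the answer by only a fraction of a year, so the point stands either way.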

I first saw this in Nat Torkington’s Four short links: 15 September 2014.

Common Sense and Statistics

Thursday, September 18th, 2014

Common Sense and Statistics by John D. Cook.

From the post:

…, common sense is vitally important in statistics. Attempts to minimize the need for common sense can lead to nonsense. You need common sense to formulate a statistical model and to interpret inferences from that model. Statistics is a layer of exact calculation sandwiched between necessarily subjective formulation and interpretation. Even though common sense can go badly wrong with probability, it can also do quite well in some contexts. Common sense is necessary to map probability theory to applications and to evaluate how well that map works.

No matter how technical or complex an analysis may appear, do not hesitate to ask for explanations if the data or results seem “off” to you. Several years ago I witnessed a presentation in which the manual for a statistics package was cited for the proposition that a result was significant.

I know you have never encountered that situation but you may know others who have.

Never fear asking questions about methods or results. Your colleagues are wondering the same things but are too afraid of appearing ignorant to ask questions.

Ignorance is curable. Willful ignorance is not.

If you aren’t already following John D. Cook, you should.

Learn Datalog Today

Thursday, September 18th, 2014

Learn Datalog Today by Jonas Enlund.

From the homepage:

Learn Datalog Today is an interactive tutorial designed to teach you the Datomic dialect of Datalog. Datalog is a declarative database query language with roots in logic programming. Datalog has similar expressive power as SQL.

Datomic is a new database with an interesting and novel architecture, giving its users a unique set of features. You can read more about Datomic at and the architecture is described in some detail in this InfoQ article.

Table of Contents

You have been meaning to learn Datalog but it just hasn’t happened.

Now is the time to break that cycle and do the deed!

This interactive tutorial should ease you on your way to learning Datalog.

It can’t learn Datalog for you but it can make the journey a little easier.
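If “declarative query language with roots in logic programming” sounds abstract, here is a rough sketch of the idea in Python. It is not Datomic and not how Datomic is implemented; the facts, the `query` helper and the attribute names are all hypothetical, chosen only to show how a query is a list of patterns whose variables (names starting with `?`) must bind consistently across facts.

```python
# Hypothetical facts as (entity, attribute, value) triples.
FACTS = [
    (1, "movie/title", "Predator"),
    (1, "movie/year", 1987),
    (2, "movie/title", "RoboCop"),
    (2, "movie/year", 1987),
    (3, "movie/title", "Die Hard"),
    (3, "movie/year", 1988),
]


def is_var(term) -> bool:
    """Variables are strings that start with '?', as in Datalog."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, binding):
    """Unify one pattern with one fact; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if term in new and new[term] != value:
                return None  # variable already bound to something else
            new[term] = value
        elif term != value:
            return None  # constant in the pattern does not match the fact
    return new


def query(find, where, facts):
    """Join all patterns in `where` and project the `find` variables."""
    bindings = [{}]
    for pattern in where:
        bindings = [
            b2
            for b in bindings
            for fact in facts
            if (b2 := match(pattern, fact, b)) is not None
        ]
    return sorted({tuple(b[v] for v in find) for b in bindings})


# Roughly what a Datomic-style query such as
#   [:find ?title :where [?e :movie/year 1987] [?e :movie/title ?title]]
# is asking for: titles of entities whose year is 1987.
titles = query(
    find=["?title"],
    where=[("?e", "movie/year", 1987), ("?e", "movie/title", "?title")],
    facts=FACTS,
)
print(titles)
```

The shared variable `?e` is what performs the join: both patterns must hold for the same entity. The tutorial itself is the place to learn the real syntax.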


From Frequency to Meaning: Vector Space Models of Semantics

Thursday, September 18th, 2014

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.


Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010 but mastery of this paper will certainly give you the foundation to do so. Impressive work.

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.
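To make “blindly apply that model” concrete, here is a minimal sketch of the simplest kind of VSM, a term-document model compared by cosine similarity. The three documents are made up, and whitespace tokenization with raw counts is an assumption on my part, far cruder than anything in the survey.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Represent a document as raw term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, chosen by a human who already "sees" the semantics.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat lay on the rug",
    "d3": "stock prices fell sharply today",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
print(f"d1 vs d2: {sim_12:.2f}")
print(f"d1 vs d3: {sim_13:.2f}")
```

The program counts strings and divides numbers. That d1 and d2 come out “closer” than d1 and d3 matches my reading of the sentences, but the judgment that this closeness reflects meaning is mine, not the computer's.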

I don’t dispute that VSMs have been incredibly useful and like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)


I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which has no preordained metric space. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.

2015 Medical Subject Headings (MeSH) Now Available

Thursday, September 18th, 2014

2015 Medical Subject Headings (MeSH) Now Available

From the post:

Introduction to MeSH 2015
The Introduction to MeSH 2015 is now available, including information on its use and structure, as well as recent updates and availability of data.

MeSH Browser
The default year in the MeSH Browser remains 2014 MeSH for now, but the alternate link provides access to 2015 MeSH. The MeSH Section will continue to provide access via the MeSH Browser for two years of the vocabulary: the current year and an alternate year. Sometime in November or December, the default year will change to 2015 MeSH and the alternate link will provide access to the 2014 MeSH.

Download MeSH
Download 2015 MeSH in XML and ASCII formats. Also available for 2015 from the same MeSH download page are:

  • Pharmacologic Actions (Forthcoming)
  • New Headings with Scope Notes
  • MeSH Replaced Headings
  • MeSH MN (tree number) changes
  • 2015 MeSH in MARC format


Convince your boss to use Clojure

Wednesday, September 17th, 2014

Convince your boss to use Clojure by Eric Normand.

From the post:

Do you want to get paid to write Clojure? Let’s face it. Clojure is fun, productive, and more concise than many languages. And probably more concise than the one you’re using at work, especially if you are working in a large company. You might code on Clojure at home. Or maybe you want to get started in Clojure but don’t have time if it’s not for work.

One way to get paid for doing Clojure is to introduce Clojure into your current job. I’ve compiled a bunch of resources for getting Clojure into your company.

Take these resources and do your homework. Bringing a new language into an existing company is not easy. I’ve summarized some of the points that stood out to me, but the resources are excellent so please have a look yourself.

Great strategy and list of resources for Clojure folks.

How would you adapt this strategy to topic maps and what resources are we missing?

I first saw this in a tweet by Christophe Lalanne.

Elementary Applied Topology

Wednesday, September 17th, 2014

Elementary Applied Topology by Robert Ghrist.

From the introduction:

What topology can do

Topology was built to distinguish qualitative features of spaces and mappings. It is good for, inter alia:

  1. Characterization: Topological properties encapsulate qualitative signatures. For example, the genus of a surface, or the number of connected components of an object, give global characteristics important to classification.
  2. Continuation: Topological features are robust. The number of components or holes is not something that should change with a small error in measurement. This is vital to applications in scientific disciplines, where data is never not noisy.
  3. Integration: Topology is the premiere tool for converting local data into global properties. As such, it is rife with principles and tools (Mayer-Vietoris, Excision, spectral sequences, sheaves) for integrating from local to global.
  4. Obstruction: Topology often provides tools for answering feasibility of certain problems, even when the answers to the problems themselves are hard to compute. These characteristics, classes, degrees, indices, or obstructions take the form of algebraic-topological entities.

What topology cannot do

Topology is fickle. There is no recourse to tweaking epsilons should desiderata fail to be found. If the reader is a scientist or applied mathematician hoping that topological tools are a quick fix, take this text with caution. The reward of reading this book with care may be limited to the realization of new questions as opposed to new answers. It is not uncommon that a new mathematical tool contributes to applications not by answering a pressing question-of-the-day but by revealing a different (and perhaps more significant) underlying principle.

The text will require more than casual interest but what a tool to add to your toolbox!
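As a taste of the “continuation” point, here is a small sketch that counts connected components of a point cloud before and after a small measurement error. This is a crude stand-in of my own, not a method from the book: the points, the perturbations and especially the linking radius are all assumptions picked for illustration.

```python
import math
from itertools import combinations


def count_components(points, radius):
    """Count connected components of the graph joining points within `radius`."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= radius:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Two well-separated clusters (hypothetical data).
clean = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]

# The same points with small, fixed measurement errors added.
errors = [(0.05, -0.02), (-0.04, 0.03), (0.02, 0.05), (-0.03, -0.04), (0.04, 0.01)]
noisy = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(clean, errors)]

clean_count = count_components(clean, radius=1.5)
noisy_count = count_components(noisy, radius=1.5)
print(clean_count, noisy_count)
```

Every coordinate changed, yet the number of components did not. Note that the answer does depend on the radius I chose, which is exactly the sort of epsilon the text warns you not to lean on.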

I first saw this in a tweet from Topology Fact.

Stuff Goes Bad: Erlang in Anger

Wednesday, September 17th, 2014

Stuff Goes Bad: Erlang in Anger by Fred Hébert.

From the webpage:

This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

From the introduction:

This book is not for beginners. There is a gap left between most tutorials, books, training sessions, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production. There’s a fumbling phase implicit to a programmer’s learning of a new language and environment where they just have to figure how to get out of the guidelines and step into the real world, with the community that goes with it.

This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit — usually when I consider them tricky — and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary.

What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices about deploying Erlang in a production environment. (footnote numbers omitted)

With exercises no less.

Reminds me of a book I had some years ago on causing and then debugging Solr core dumps. 😉 I don’t think it was ever a best seller but it was a fun read.

Great title by the way.

I first saw this in a tweet by Chris Meiklejohn.

Understanding weak isolation is a serious problem

Wednesday, September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.
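To see why the details matter, here is a toy sketch of one well-known anomaly, write skew, that Snapshot Isolation permits and serializability does not. The `SnapshotStore` and `Txn` classes are hypothetical and model no real database; the only rule implemented is that a commit aborts when a key it wrote was changed by someone else after its snapshot was taken.

```python
class SnapshotStore:
    """Toy versioned key-value store (hypothetical, not any real database)."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

    def begin(self):
        return Txn(self)


class Txn:
    """Transaction that reads from a private snapshot taken at begin()."""

    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)
        self.seen = dict(store.version)
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort only on write-write conflicts: a key we wrote changed under us.
        for key in self.writes:
            if self.store.version[key] != self.seen[key]:
                return False
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] += 1
        return True


def go_off_call(txn, doctor):
    """Leave the rota only if the snapshot shows someone else still on call."""
    on_call = sum(txn.read(d) for d in ("alice", "bob"))
    if on_call >= 2:
        txn.write(doctor, False)


# Invariant the application wants: at least one doctor is on call.
store = SnapshotStore({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()
go_off_call(t1, "alice")
go_off_call(t2, "bob")
ok1, ok2 = t1.commit(), t2.commit()
print(ok1, ok2, store.data)
```

Each transaction checked the invariant against its own snapshot and wrote a different key, so neither sees a conflict and both commit, leaving nobody on call. Run one after the other, the second would have refused.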

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

Call for Review: HTML5 Proposed Recommendation Published

Wednesday, September 17th, 2014

Call for Review: HTML5 Proposed Recommendation Published

From the post:

The HTML Working Group has published a Proposed Recommendation of HTML5. This specification defines the 5th major revision of the core language of the World Wide Web: the Hypertext Markup Language (HTML). In this version, new features are introduced to help Web application authors, new elements are introduced based on research into prevailing authoring practices, and special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability. Comments are welcome through 14 October. Learn more about the HTML Activity.

Now would be the time to submit comments, corrections, etc.

Deadline: 14 October 2014.

International Conference on Machine Learning 2014 Videos!

Wednesday, September 17th, 2014

International Conference on Machine Learning 2014 Videos!

You may recall my post on the ICML 2014 papers.

Speaking just for myself, I would prefer a resource with both the videos and relevant papers listed together.

Do you know of such a resource?

If not, when time permits I may conjure one up.

As with disclosed semantic mappings, it is more efficient if one person creates a mapping that is reused by many, as opposed to many separate and partial repetitions of the same mapping.

You may remember that one mapping, many reuses, is a central principle of indexes, library catalogs, filing systems, case citations, etc.

I first saw this in a tweet by EyeWire.

ODNI and the U.S. DOJ Commemorate 9/11

Wednesday, September 17th, 2014

Statement by the ODNI and the U.S. DOJ on the Declassification of Documents Related to the Protect America Act Litigation September 11, 2014

What better way to mark the anniversary of 9/11 than with a fuller account of another attack on the United States of America and its citizens? This attack came not from a small band of criminals but as a betrayal of the United States by those sworn to protect the rights of its citizens.

From the post:

On January 15, 2009, the U.S. Foreign Intelligence Surveillance Court of Review (FISC-R) published an unclassified version of its opinion in In Re: Directives Pursuant to Section 105B of the Foreign Intelligence Surveillance Act, 551 F.3d 1004 (Foreign Intel. Surv. Ct. Rev. 2008). The classified version of the opinion was issued on August 22, 2008, following a challenge by Yahoo! Inc. (Yahoo!) to directives issued under the Protect America Act of 2007 (PAA). Today, following a renewed declassification review, the Executive Branch is publicly releasing various documents from this litigation, including legal briefs and additional sections of the 2008 FISC-R opinion, with appropriate redactions to protect national security information. These documents are available at the website of the Office of the Director of National Intelligence (ODNI) and at ODNI’s public website dedicated to fostering greater public visibility into the intelligence activities of the U.S. Government. A summary of the underlying litigation follows.

In case you haven’t been following along, the crux of the case was Yahoo’s refusal on Fourth Amendment grounds to comply with a fishing expedition by the Director of National Intelligence and the Attorney General for information on one or more alleged foreign nationals. Motion to Compel Compliance with Directives of the Director of National Intelligence and Attorney General.

Not satisfied with violating their duties to uphold the Constitution, the DNI and AG decided to add strong-arming/extortion to their list of crimes. The government sought civil contempt fines that started at $250,000 per day and then doubled each week thereafter that Yahoo! failed to comply with the court’s judgment. Government’s Motion for an Order of Civil Contempt.
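To get a feel for how quickly such a fine grows, here is a back-of-the-envelope sketch. It assumes one reading of the terms, namely that the daily rate doubles every seven days; the motion itself governs the actual schedule, and the function name is mine.

```python
def cumulative_fine(days: int, initial_daily: int = 250_000, period_days: int = 7) -> int:
    """Total fine after `days` of non-compliance, with the daily rate
    doubling every `period_days` days (an assumed reading of the terms)."""
    total = 0
    for day in range(days):
        total += initial_daily * 2 ** (day // period_days)
    return total


for weeks in (1, 2, 4, 8):
    print(f"after {weeks} week(s): ${cumulative_fine(7 * weeks):,}")
```

On this reading the first week costs $1.75 million and the total roughly doubles with every further week, which is the sense in which I call it strong-arming.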

Take care to note that all of this occurred in absolute secrecy. It would not do to have other corporations or the American public aware that rogue elements in the government were deciding which rights citizens of the United States enjoy and which ones they don’t.

You may also want to read Highlights from the Newly Declassified FISCR Documents by Marc Zwillinger and Jacob Sommer. They are the lawyers who represented Yahoo in the challenge covered by the released documents.

We all owe them a debt of gratitude for their hard work but we also have to acknowledge that Yahoo, Zwillinger and Sommer were complicit in enabling the Foreign Intelligence Surveillance Court (FISC) and the Foreign Intelligence Surveillance Court of Review (FISCR) to continue their secret work.

Yes, Yahoo, Zwillinger and Sommer would have faced life-changing consequences had they gone public with what they did know, but everyone has a choice when faced with oppressive government action. You can do as the parties did in this case and further the popular fiction that mining user metadata is an effective (rather than convenient) tool against terrorism.

Or you can decide to “blow the whistle” on wasteful and illegal activities by the government in question.

Had Yahoo, Zwillinger or Sommer known of any data in their possession that was direct evidence of a terrorist attack or plot, they would have called the Office of the Director of National Intelligence, or at least their local FBI office. Yes? Wouldn’t any sane person do the same?

Ah, but you see, that’s the point. There wasn’t any such data. Not then. Not now. Read the affidavits, at least the parts that aren’t blacked out and you get the distinct impression that the government is not only fishing, but it is hopeful fishing. “There might be something, somewhere that somehow might be useful to somebody but we don’t know.” is a fair summary of the government’s position in the Yahoo case.

A better way to commemorate 9/11 next year would be with numerous brave souls taking the moral responsibility to denounce those who have betrayed their constitutional duties in the cause of fighting terrorism. I prefer occasional terrorism over the destruction of the Constitution of the United States.


I started the trail that led to this post from a tweet by Ben Gilbert.

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

Tuesday, September 16th, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. 😉

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see in most mailing list discussions.

I first saw this in a tweet by John Horgan.

Getting Started with S4, The Self-Service Semantic Suite

Tuesday, September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies with the text content but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP but it is a free account.
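If your own texts run denser or lighter than my working average, the estimate is easy to redo. A minimal sketch, where the 500 pages per megabyte figure is only my assumption and the function name is hypothetical:

```python
FREE_TIER_MB_PER_MONTH = 250  # free-tier text analytics quota stated in the post


def pages_per_month(pages_per_mb: int = 500, quota_mb: int = FREE_TIER_MB_PER_MONTH) -> int:
    """Estimate pages analyzable per month for an assumed pages-per-megabyte rate."""
    return pages_per_mb * quota_mb


print(pages_per_month())        # working average of 500 pages per MB
print(pages_per_month(300))     # denser pages, fewer per megabyte
```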

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.

I first saw this in a tweet by Tony Agresta.