Archive for October, 2014

Total Anonymity

Friday, October 31st, 2014

Hard-Nosed Advice From Veteran Lobbyist: ‘Win Ugly or Lose Pretty’ by Eric Lipton.

By way of explanation, Richard Berman was secretly recorded making a pitch to energy companies, which is described as follows in the story:

Mr. Berman repeatedly boasted about how he could take checks from the oil and gas industry executives — he said he had already collected six-figure contributions from some of the executives in the room — and then hide their role in funding his campaigns.

“People always ask me one question all the time: ‘How do I know that I won’t be found out as a supporter of what you’re doing?’ ” Mr. Berman told the crowd. “We run all of this stuff through nonprofit organizations that are insulated from having to disclose donors. There is total anonymity. People don’t know who supports us.”

Given all the data leaks from the NSA and others, I doubt Berman’s claim of “…total anonymity….” It would take more than searching the WWW, but unravel one campaign or donor and the seeds of doubt would be placed for all the others.

If you need incentive, Berman has been described as “…a despicable human being.”

It would make a nice demonstration of topic maps as a tool for breaking “…total anonymity….”

COMMON LISP: An Interactive Approach

Friday, October 31st, 2014

COMMON LISP: An Interactive Approach by Stuart C. Shapiro.

From the preface:

Lisp is the language of choice for work in artificial intelligence and in symbolic algebra. It is also important in the study of programming languages, because, since its inception over thirty years ago, it has had full recursion, the conditional expression, the equivalence of program and data structure, its own evaluator available to the programmer, and extensibility—the syntactic indistinguishability of programmer-defined functions and “built-in” operators. It is also the paradigm of “functional,” or “applicative,” programming. Because of the varied interests in Lisp, I have tried to present it in a general and neutral setting, rather than specifically in the context of any of the special fields in which it is used.

Although published in 1992, this book remains quite relevant today. How many languages can you name that have equivalence of program and data structure? Yes, that’s what I thought.
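For readers without a Lisp at hand, here is a minimal sketch of what "equivalence of program and data structure" buys you. It is written in Python, not Lisp, and it assumes a toy expression format (nested lists with the operator name first). The names `OPS`, `evaluate` and `swap_ops` are mine for illustration and do not come from Shapiro's book.

```python
import operator

# Toy operator table; the names are illustrative, not from any Lisp standard.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(expr):
    """Evaluate a nested-list expression such as ["+", 1, ["*", 2, 3]]."""
    if not isinstance(expr, list):
        return expr  # a number evaluates to itself
    op, *args = expr
    values = [evaluate(a) for a in args]
    result = values[0]
    for v in values[1:]:
        result = OPS[op](result, v)
    return result

def swap_ops(expr, old, new):
    """Return a copy of expr with operator `old` replaced by `new`."""
    if not isinstance(expr, list):
        return expr
    head = new if expr[0] == old else expr[0]
    return [head] + [swap_ops(e, old, new) for e in expr[1:]]

# The "program" is an ordinary list, so other code can inspect and rewrite it.
program = ["+", 1, ["*", 2, 3]]
rewritten = swap_ops(program, "*", "+")
```

Here `evaluate(program)` runs the list as a program, while `swap_ops` treats the same list as plain data. In Lisp that dual role is built into the language rather than simulated.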

Notes from CSE 202 (2002) are available along with Stuart C. Shapiro & David R. Pierce, A Short Course in Common Lisp (2004).

DBpedia now available as triple pattern fragments

Friday, October 31st, 2014

DBpedia now available as triple pattern fragments by Ruben Verborgh.

From the post:

DBpedia is perhaps the most widely known Linked Data source on the Web. You can use DBpedia in a variety of ways: by querying the SPARQL endpoint, by browsing Linked Data documents, or by downloading one of the data dumps. Access to all of these data sources is offered free of charge.

Last week, a fourth way of accessing DBpedia became publicly available: DBpedia’s triple pattern fragments at This interface offers a different balance of trade-offs: it maximizes the availability of DBpedia by offering a simple server and thus moving SPARQL query execution to the client side. Queries will execute slower than on the public SPARQL endpoint, but their execution should be possible close to 100% of the time.

Here are some fun things to try:
– browse the new interface:
– make your browser execute a SPARQL query:
– add live queries to your application:

Learn all about triple pattern fragments at the Linked Data Fragments website, the ISWC2014 paper, and ISWC2014 slides.

A new effort to achieve robust processing of triples.
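To see why moving query execution to the client trades speed for availability, consider this rough sketch in Python. It assumes a "server" that can only answer single triple patterns and a client that performs the join itself. The triples, the `ex:` names and both functions are hypothetical; this is not the Linked Data Fragments client or its API.

```python
# Hypothetical in-memory triples standing in for a remote dataset.
TRIPLES = [
    ("ex:Alice", "ex:bornIn", "ex:Paris"),
    ("ex:Bob", "ex:bornIn", "ex:Berlin"),
    ("ex:Paris", "ex:country", "ex:France"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
]

def fragment(s=None, p=None, o=None):
    """'Server' side: answer one triple pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def born_in_country(country):
    """'Client' side: join two patterns using only single-pattern requests.

    Equivalent in spirit to:
        ?person ex:bornIn ?city . ?city ex:country <country>
    """
    people = []
    for city, _, _ in fragment(p="ex:country", o=country):
        for person, _, _ in fragment(p="ex:bornIn", o=city):
            people.append(person)
    return people
```

The server side stays trivial, which is what makes high availability plausible. The cost is that the client issues one request per intermediate binding, which is consistent with the post's warning that queries will run slower than on the SPARQL endpoint.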


Enhancing open data with identifiers

Friday, October 31st, 2014

Enhancing open data with identifiers

From the webpage:

The Open Data Institute and Thomson Reuters have published a new white paper, explaining how to use identifiers to create extra value in open data.

Identifiers are at the heart of how data becomes linked. It’s a subject that is fundamentally important to the open data community, and to the evolution of the web itself. However, identifiers are also in relatively early stages of adoption, and not many are aware of what they are.
Put simply, identifiers are labels used to refer to an object being discussed or exchanged, such as products, companies or people. The foundation of the web is formed by connections that hold pieces of information together. Identifiers are the anchors that facilitate those links.

This white paper, ‘Creating value with identifiers in an open data world’ is a joint effort between Thomson Reuters and the Open Data Institute. It is written as a guide to identifier schemes:

  • why identity can be difficult to manage;
  • why it is important for open data;
  • what challenges there are today and recommendations for the community to address these in the future.

Illustrative examples of identifier schemes are used to explain these points.

The recommendations are based on specific issues found to occur across different datasets, and should be relevant for anyone using, publishing or handling open data, closed data and/or their own proprietary data sets.

Are you a data consumer?
Learn how identifiers can help you create value from discovering and connecting to other sources of data that add relevant context.

Are you a data publisher?
Learn how understanding and engaging with identifier schemes can reduce your costs, and help you manage complexity.

Are you an identifier publisher?
Learn how open licensing can grow the open data commons and bring you extra value by increasing the use of your identifier scheme.

The design and use of successful identifier schemes requires a mix of social, data and technical engineering. We hope that this white paper will act as a starting point for discussion about how identifiers can and will create value by empowering linked data.

Read the blog post on Linked data and the future of the web, from Chief Enterprise Architect for Thomson Reuters, Dave Weller.

When citing this white paper, please use the following text: Open Data Institute and Thomson Reuters, 2014, Creating Value with Identifiers in an Open Data World, retrieved from

Creating Value with Identifiers in an Open Data World [full paper]

Creating Value with Identifiers in an Open Data World [management summary]

From the paper:

The coordination of identity is thus not just an inherent component of dataset design, but should be acknowledged as a distinct discipline in its own right.

A great presentation on identity and management of identifiers, echoing many of the themes discussed in topic maps.

A must read!

Next week I will begin a series of posts on the individual issues identified in this white paper.

I first saw this in a tweet by Bob DuCharme.

Books: Inquiry-Based Learning Guides

Thursday, October 30th, 2014

Books: Inquiry-Based Learning Guides

From the webpage:

The DAoM library includes 11 inquiry-based books freely available for classroom use. These texts can be used as semester-long content for themed courses (e.g. geometry, music and dance, the infinite, games and puzzles), or individual chapters can be used as modules to experiment with inquiry-based learning and to help supplement typical topics with classroom tested, inquiry based approaches (e.g. rules for exponents, large numbers, proof). The topic index provides an overview of all our book chapters by topic.

From the about page:

Discovering the Art of Mathematics (DAoM), is an innovative approach to teaching mathematics to liberal arts and humanities students, that offers the following vision:

Mathematics for Liberal Arts students will be actively involved in authentic mathematical experiences that

  • are both challenging and intellectually stimulating,
  • provide meaningful cognitive and metacognitive gains, and,
  • nurture healthy and informed perceptions of mathematics, mathematical ways of thinking, and the ongoing impact of mathematics not only on STEM fields but also on the liberal arts and humanities.

DAoM provides a wealth of resources for mathematics faculty to help realize this vision in their Mathematics for Liberal Arts (MLA) courses: a library of 11 inquiry-based learning guides, extensive teacher resources and many professional development opportunities. These tools enable faculty to transform their classrooms to be responsive to current research on learning (e.g. National Academy Press’s How People Learn) and the needs and interests of MLA students without enormous start-up costs or major restructuring.

All of these books are concerned with mathematics from a variety of perspectives but I didn’t see anything in How People Learn: Brain, Mind, Experience, and School: Expanded Edition (2000) that suggested such techniques are limited to the teaching of mathematics.

Easy to envision teaching of CS or semantic technologies using the same methods.

What inquiries would you construct for the exploration of semantic diversity? Roles? Contexts? Or the lack of a solution to semantic diversity? What are its costs?

Thinking semantic integration could become a higher priority if the costs of semantic diversity or the savings of semantic integration could be demonstrated.

For example, most Americans nod along with public service energy conservation messages. Just like people do with semantic integration pitches.

But if it was demonstrated for a particular home that 1/8 of the energy for heat or cooling was being wasted and that $X investment would lower utility bills by $N, there would be a much different reaction.

There are broad numbers on the losses from semantic diversity but broad numbers are not “in our budget” line items. It’s time to develop strategies that can expose the hidden costs of semantic diversity. Perhaps inquiry-based learning could be that tool.

I first saw this in a tweet by Steven Strogatz.

Pinned Tabs: myNoSQL

Thursday, October 30th, 2014

Alex Popescu & Ana-Maria Bacalu have added a new feature at myNoSQL called “Pinned Tabs.”

The feature started on 28 Oct. 2014 and consists of very short (2-3 sentence) descriptions with links on NoSQL, BigData and related topics.

Today’s “pinned tabs” included:

03: If you don’t test for the possible failures, you might be in for a surprise. Stripe has tried a more organized chaos monkey attack and discovered a scenario in which their Redis cluster is losing all the data. They’ll move to Amazon RDS PostgreSQL. From an in-memory smart key-value engine to a relational database.

Game Day Exercises at Stripe: Learning from kill -9

04: How a distributed database should really behave in front of massive failures. Netflix recounts their recent experience of having 218 Cassandra nodes rebooted without losing availability. At all.

How Netflix Handled the Reboot of 218 Cassandra Nodes

Curated news saves time and attention span!


How to run the Caffe deep learning vision library…

Wednesday, October 29th, 2014

How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board by Pete Warden.

From the post:

[Image: Jetson board. Photo by Gareth Halfacree]

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of the Jetson, it’s a small development board that includes Nvidia’s TK1 mobile GPU chip. The TK1 is starting to appear in high-end tablets, and has 192 cores so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘Alexnet’ model, a version of the Imagenet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-intensive for a mobile device but not too crazy.

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Hardware fun for the middle of your week!

192 cores for under $200, plus GPU experience.
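The figures quoted in the post (34 ms per image, an estimated 10 to 11 watts) imply a throughput and an energy cost per image. A quick back-of-envelope check in Python, taking those estimates at face value and assuming the board processes images back to back:

```python
# Figures taken from the quoted post; the power draw is the author's estimate.
latency_s = 0.034            # 34 ms per image
power_w = (10.0, 11.0)       # estimated range, in watts

# Throughput if images are processed one after another.
images_per_second = 1.0 / latency_s

# Energy per image = power * time, for both ends of the power estimate.
energy_j = tuple(p * latency_s for p in power_w)
```

That works out to roughly 29 images per second at about a third of a joule per image.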

Introducing osquery

Wednesday, October 29th, 2014

Introducing osquery by Mike Arpaia.

From the post:

Maintaining real-time insight into the current state of your infrastructure is important. At Facebook, we’ve been working on a framework called osquery which attempts to approach the concept of low-level operating system monitoring a little differently.

Osquery exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems. With osquery, SQL tables represent the current state of operating system attributes, such as:

  • running processes
  • loaded kernel modules
  • open network connections

SQL tables are implemented via an easily extendable API. Several tables already exist and more are being written. To best understand the expressiveness that is afforded to you by osquery, consider the following examples….

I haven’t installed osquery, yet, but suspect that most of the data it collects is available now through a variety of admin tools. But not through a single tool that enables you to query across tables to combine that data. That is the part that intrigues me.
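The kind of cross-table query I have in mind can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea: the table names, columns and rows below are made up, and osquery's real schema may well differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and columns; osquery's real schema may differ.
    CREATE TABLE processes (pid INTEGER, name TEXT);
    CREATE TABLE open_sockets (pid INTEGER, remote_port INTEGER);
    INSERT INTO processes VALUES (101, 'sshd'), (202, 'nginx'), (303, 'cron');
    INSERT INTO open_sockets VALUES (101, 22), (202, 443), (202, 80);
""")

# Combine two views of system state: which processes hold network connections?
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS connections
    FROM processes AS p
    JOIN open_sockets AS s ON s.pid = p.pid
    GROUP BY p.name
    ORDER BY connections DESC, p.name
""").fetchall()
```

With separate admin tools you would collect the process list and the socket list independently and stitch them together by hand. Here the join on `pid` does that in one statement.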

Code and documentation on Github.

AsterixDB: Better than Hadoop? Interview with Mike Carey

Wednesday, October 29th, 2014

AsterixDB: Better than Hadoop? Interview with Mike Carey by Roberto V. Zicari.

The first two questions should be enough incentive to read the full interview and get your blood pumping in the middle of the week:

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”

We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.

Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… :) )
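For anyone who has not written a MapReduce job, the "two instructions" Carey describes can be sketched in a few lines of single-process Python. This is an illustration of the programming model only, not Hadoop or its API. The function names are mine and the job (a word count) is the usual textbook example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """map: look at one piece of data (here, a line of text) at a time."""
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(key, values):
    """reduce: look at one key-sharing group of data at a time."""
    return key, sum(values)

def run_job(records):
    """Run map, then sort by key, then reduce each group."""
    pairs = [kv for record in records for kv in map_phase(record)]
    pairs.sort(key=itemgetter(0))  # the sort between map and reduce
    return [
        reduce_phase(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

counts = run_job(["big data", "Big Data platform"])
```

Everything has to be expressed as these two steps with a sort in between, even when the query at hand needs neither the grouping nor the ordering. That is the limitation the interview is pointing at.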

I knew words would fail me if I tried to describe the AsterixDB logo so I simply reproduce the logo:

[Image: AsterixDB logo]

Read the interview in full and then grab a copy of AsterixDB.

The latest beta release is 0.8.6. The software is available under the Apache License 2.0.

Microsoft Garage

Wednesday, October 29th, 2014

Microsoft Garage

From the webpage:

Hackers, makers, artists, tinkerers, musicians, inventors — on any given day you’ll find them in The Microsoft Garage.

We are a community of interns, employees, and teams from everywhere in the company who come together to turn our wild ideas into real projects. This site gives you early access to projects as they come to life.

Tell us what rocks, and what doesn’t.

Welcome to The Microsoft Garage.

Two projects (out of several) that I thought were interesting:


Host or join collaboration sessions on canvases that hold text cards and images. Ink on the canvas to organize your content, or manipulate the text and images using pinch, drag, and rotate gestures.


Floatz, a Microsoft Garage project, lets you float an idea out to the people around you, and see what they think. Join in on any nearby Floatz conversation, or start a new one with a question, idea, or image that you share anonymously with people nearby.

Share your team spirit at a sporting event, or your awesome picture of the band at a rock concert. Ask the locals where to get a good meal when visiting an unfamiliar neighborhood. Speak your mind, express your feelings, and find out if there are others around you who feel the same way—all from the safety of an anonymous screen name in Floatz.

I understand the theory of asking for advice anonymously, but I assume that also means the person answering is anonymous as well. Yes? I don’t have a cellphone so I can’t test that theory. Comments?

On the other hand, if you are sharing data with known and unknown others, and you know which “anonymous” screen names to trust (for example, don’t trust names with FBI, CIA or NSA preceded or followed by hyphens), then Floatz could be very useful.

I first saw this in Nat Torkington’s Four short links: 23 October 2014.

UX Directory

Wednesday, October 29th, 2014

UX Directory

Two hundred and six (206) resources listed under the following categories:

  • A/B Testing
  • Blogroll
  • Design Evaluation Tools
  • Dummy Text Generators
  • Find Users to Test
  • Gamification Companies
  • Heatmaps / Mouse Tracking Tools
  • Information Architecture Creation Tools
  • Information Architecture Evaluation Tools
  • Live Chat Support Tools
  • Marketing Automation Tools
  • Mobile Prototyping
  • Mockup User Testing
  • Multi-Use UX Tools
  • Screen Capture Tools
  • Synthetic Eye-Tracking Tools
  • User Testing Companies
  • UX Agencies / Consultants
  • UX Survey Tools
  • Web Analytics Tools
  • Webinar / Web Conference Platforms
  • Wireframe/Mockup Tools

If you have a new resource that should be on this list, contact

I first saw this in Nat Torkington’s Four short links: 28 October 2014.

Datomic Pull API

Tuesday, October 28th, 2014

Datomic Pull API by Stuart Holloway.

From the post:

Datomic’s new Pull API is a declarative way to make hierarchical selections of information about entities. You supply a pattern to specify which attributes of the entity (and nested entities) you want to pull, and db.pull returns a map for each entity.

Pull API vs. Entity API

The Pull API has two important advantages over the existing Entity API:

Pull uses a declarative, data-driven spec, whereas Entity encourages building results via code. Data-driven specs are easier to build, compose, transmit and store. Pull patterns are smaller than entity code that does the same job, and can be easier to understand and maintain.

Pull API results match standard collection interfaces (e.g. Java maps) in programming languages, where Entity results do not. This eliminates the need for an additional allocation/transformation step per entity.

A sign that it is time to catch up on what has been happening with Datomic!
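If "declarative, data-driven spec" sounds abstract, here is a rough Python sketch of the general idea: a pattern, which is plain data, selects attributes from a nested entity and returns a plain map. It is loosely modeled on the description above and is not Datomic's API, pattern grammar or semantics. The `pull` function, the pattern format and the sample entity are all assumptions of mine.

```python
def pull(pattern, entity):
    """Select attributes from a nested dict according to a pattern.

    A pattern is a list whose items are either an attribute name or a
    dict {attribute: sub_pattern} describing a nested selection.
    """
    result = {}
    for item in pattern:
        if isinstance(item, dict):
            for attr, sub_pattern in item.items():
                if attr not in entity:
                    continue
                value = entity[attr]
                if isinstance(value, list):
                    result[attr] = [pull(sub_pattern, v) for v in value]
                else:
                    result[attr] = pull(sub_pattern, value)
        elif item in entity:
            result[item] = entity[item]
    return result

# Hypothetical entity; every attribute name and value is made up.
album = {
    "title": "Example Album",
    "year": 2014,
    "artist": {"name": "Example Artist", "country": "Nowhere"},
    "tracks": [
        {"name": "Track One", "seconds": 200},
        {"name": "Track Two", "seconds": 185},
    ],
}

pattern = ["title", {"artist": ["name"]}, {"tracks": ["name"]}]
selected = pull(pattern, album)
```

Because the pattern is data, it can be stored, sent over the wire or merged with another pattern before use, which is the advantage the post claims over building results in code.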

HTML5 is a W3C Recommendation

Tuesday, October 28th, 2014

HTML5 is a W3C Recommendation

From the post:

(graphic omitted) The HTML Working Group today published HTML5 as W3C Recommendation. This specification defines the fifth major revision of the Hypertext Markup Language (HTML), the format used to build Web pages and applications, and the cornerstone of the Open Web Platform.

“Today we think nothing of watching video and audio natively in the browser, and nothing of running a browser on a phone,” said Tim Berners-Lee, W3C Director. “We expect to be able to share photos, shop, read the news, and look up information anywhere, on any device. Though they remain invisible to most users, HTML5 and the Open Web Platform are driving these growing user expectations.”

HTML5 brings to the Web video and audio tracks without needing plugins; programmatic access to a resolution-dependent bitmap canvas, which is useful for rendering graphs, game graphics, or other visual images on the fly; native support for scalable vector graphics (SVG) and math (MathML); annotations important for East Asian typography (Ruby); features to enable accessibility of rich applications; and much more.

The HTML5 test suite, which includes over 100,000 tests and continues to grow, is strengthening browser interoperability. Learn more about the Test the Web Forward community effort.

With today’s publication of the Recommendation, software implementers benefit from Royalty-Free licensing commitments from over sixty companies under W3C’s Patent Policy. Enabling implementers to use Web technology without payment of royalties is critical to making the Web a platform for innovation.

Read the Press Release, testimonials from W3C Members, and acknowledgments. For news on what’s next after HTML5, see W3C CEO Jeff Jaffe’s blog post: Application Foundations for the Open Web Platform. We also invite you to check out our video Web standards for the future.

Just in case you have been holding off on HTML5 until it became a W3C Recommendation. 😉


Category Theory for Programmers: The Preface

Tuesday, October 28th, 2014

Category Theory for Programmers: The Preface by Bartosz Milewski.

From the post:

For some time now I’ve been floating the idea of writing a book about category theory that would be targeted at programmers. Mind you, not computer scientists but programmers — engineers rather than scientists. I know this sounds crazy and I am properly scared. I can’t deny that there is a huge gap between science and engineering because I have worked on both sides of the divide. But I’ve always felt a very strong compulsion to explain things. I have tremendous admiration for Richard Feynman who was the master of simple explanations. I know I’m no Feynman, but I will try my best. I’m starting by publishing this preface — which is supposed to motivate the reader to learn category theory — in hopes of starting a discussion and soliciting feedback.

I will attempt, in the space of a few paragraphs, to convince you that this book is written for you, and whatever objections you might have to learning one of the most abstract branches of mathematics in your “copious spare time” are totally unfounded.

My optimism is based on several observations. First, category theory is a treasure trove of extremely useful programming ideas. Haskell programmers have been tapping this resource for a long time, and the ideas are slowly percolating into other languages, but this process is too slow. We need to speed it up.

Second, there are many different kinds of math, and they appeal to different audiences. You might be allergic to calculus or algebra, but it doesn’t mean you won’t enjoy category theory. I would go as far as to argue that category theory is the kind of math that is particularly well suited for the minds of programmers. That’s because category theory — rather than dealing with particulars — deals with structure. It deals with the kind of structure that makes programs composable.

Composition is at the very root of category theory — it’s part of the definition of the category itself. And I will argue strongly that composition is the essence of programming. We’ve been composing things forever, long before some great engineer came up with the idea of a subroutine. Some time ago the principles of structural programming revolutionized programming because they made blocks of code composable. Then came object oriented programming, which is all about composing objects. Functional programming is not only about composing functions and algebraic data structures — it makes concurrency composable — something that’s virtually impossible with other programming paradigms.

See the rest of the preface and the promise to provide examples in code for most major concepts.
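As a small taste of composition being "part of the definition of the category itself," here is a sketch in Python of function composition together with the two laws a category requires of it: associativity and an identity. The helper names are mine, and the example is not taken from Milewski's preface.

```python
from functools import reduce

def compose(*functions):
    """Compose right to left: compose(f, g)(x) == f(g(x))."""
    def composed(x):
        return reduce(lambda acc, fn: fn(acc), reversed(functions), x)
    return composed

def identity(x):
    """The identity arrow: composing with it changes nothing."""
    return x

def increment(n):
    return n + 1

def double(n):
    return n * 2

def square(n):
    return n * n

# Associativity: how the compositions are grouped does not matter.
left = compose(compose(increment, double), square)
right = compose(increment, compose(double, square))
```

Both `left(3)` and `right(3)` compute `increment(double(square(3)))`, and `compose(identity, double)` behaves exactly like `double`. Python checks none of this for you, of course; the laws are a discipline the programmer keeps.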

Are you ready for discussion and feedback?

On Excess: Susan Sontag’s Born-Digital Archive

Tuesday, October 28th, 2014

On Excess: Susan Sontag’s Born-Digital Archive by Jeremy Schmidt & Jacquelyn Ardam.

From the post:

In the case of the Sontag materials, the end result of Deep Freeze and a series of other processing procedures is a single IBM laptop, which researchers can request at the Special Collections desk at UCLA’s Research Library. That laptop has some funky features. You can’t read its content from home, even with a VPN, because the files aren’t online. You can’t live-Tweet your research progress from the laptop — or access the internet at all — because the machine’s connectivity features have been disabled. You can’t copy Annie Leibovitz’s first-ever email — “Mat and I just wanted to let you know we really are working at this. See you at dinner. xxxxxannie” (subject line: “My first Email”) — onto your thumb drive because the USB port is locked. And, clearly, you can’t save a new document, even if your desire to type yourself into recent intellectual history is formidable. Every time it logs out or reboots, the laptop goes back to ground zero. The folders you’ve opened slam shut. The files you’ve explored don’t change their “Last Accessed” dates. The notes you’ve typed disappear. It’s like you were never there.

Despite these measures, real limitations to our ability to harness digital archives remain. The born-digital portion of the Sontag collection was donated as a pair of external hard drives, and that portion is composed of documents that began their lives electronically and in most cases exist only in digital form. While preparing those digital files for use, UCLA archivists accidentally allowed certain dates to refresh while the materials were in “thaw” mode; the metadata then had to be painstakingly un-revised. More problematically, a significant number of files open as unreadable strings of symbols because the software with which they were created is long out of date. Even the fully accessible materials, meanwhile, exist in so many versions that the hapless researcher not trained in computer forensics is quickly overwhelmed.

No one would dispute the need for an authoritative copy of Sontag’s archive, or at least as close to authoritative as humanly possible. The heavily protected laptop makes sense to me, assuming that the archive considers that to be the authoritative copy.

What has me puzzled, particularly since there are binary formats not recognized in the archive, is why a non-authoritative copy of the archive isn’t online. Any number of people may still possess the software necessary to read the files and/or be able to decrypt the file formats. Practicing recovery on a non-authoritative copy would be a net gain to the archive, since its archivists may well encounter such files in the future.

After searching the Online Archive of California, I did encounter Finding Aid for the Susan Sontag papers, ca. 1939-2004 which reports:

Restrictions Property rights to the physical object belong to the UCLA Library, Department of Special Collections. Literary rights, including copyright, are retained by the creators and their heirs. It is the responsibility of the researcher to determine who holds the copyright and pursue the copyright owner or his or her heir for permission to publish where The UC Regents do not hold the copyright.

Availability Open for research, with following exceptions: Boxes 136 and 137 of journals are restricted until 25 years after Susan Sontag’s death (December 28, 2029), though the journals may become available once they are published.

Unfortunately, this finding aid does not mention Sontag’s computer or the transfer of the files to a laptop. A search of Melvyl (library catalog) finds only one archival collection and that is the one mentioned above.

I have written to the special collections library for clarification and will update this post when an answer arrives.

I mention this collection because of Sontag’s importance for a generation and because digital archives will soon be the majority of cases. One hopes the standard practice will be to donate all rights to an archival repository to ensure its availability to future generations of scholars.

Text Visualization Browser [100 Techniques]

Tuesday, October 28th, 2014

Text Visualization Browser: A Visual Survey of Text Visualization Techniques by Kostiantyn Kucher and Andreas Kerren.

From the abstract:

Text visualization has become a growing and increasingly important subfield of information visualization. Thus, it is getting harder for researchers to look for related work with specific tasks or visual metaphors in mind. In this poster, we present an interactive visual survey of text visualization techniques that can be used for the purposes of search for related work, introduction to the subfield and gaining insight into research trends.

Even better is the Text Visualization Browser webpage, where one hundred (100) different techniques have thumbnails and links to the original papers.

Quite remarkable. I don’t think I can name anywhere close to all the techniques.


Announcing Clasp

Tuesday, October 28th, 2014

Announcing Clasp by Christian Schafmeister.

From the post:

Click here for up to date build instructions

Today I am happy to make the first release of the Common Lisp implementation “Clasp”. Clasp uses LLVM as its back-end and generates native code. Clasp is a super-set of Common Lisp that interoperates smoothly with C++. The goal is to integrate these two very different languages together as seamlessly as possible to provide the best of both worlds. The C++ interoperation allows Common Lisp programmers to easily expose powerful C++ libraries to Common Lisp and solve complex programming challenges using the expressive power of Common Lisp. Clasp is licensed under the LGPL.

Common Lisp is considered by many to be one of the most expressive programming languages in existence. Individuals and small teams of programmers have created fantastic applications and operating systems within Common Lisp that require much larger effort when written in other languages. Common Lisp has many language features that have not yet made it into the C++ standard. Common Lisp has first-class functions, dynamic variables, true macros for meta-programming, generic functions, multiple return values, first-class symbols, exact arithmetic, conditions and restarts, optional type declarations, a programmable reader, a programmable printer and a configurable compiler. Common Lisp is the ultimate programmable programming language.

Clojure is a dialect of Lisp, which means you may spot situations where Lisp would be the better solution. Especially if you can draw upon C++ libraries.

The project is “actively looking” for new developers. Could be your opportunity to get in on the ground floor!

Madison: Semantic Listening Through Crowdsourcing

Tuesday, October 28th, 2014

Madison: Semantic Listening Through Crowdsourcing by Jane Friedhoff.

From the post:

Our recent work at the Labs has focused on semantic listening: systems that obtain meaning from the streams of data surrounding them. Chronicle and Curriculum are recent examples of tools designed to extract semantic information (from our corpus of news coverage and our group web browsing history, respectively). However, not every data source is suitable for algorithmic analysis–and, in fact, many times it is easier for humans to extract meaning from a stream. Our new projects, Madison and Hive, are explorations of how to best design crowdsourcing projects for gathering data on cultural artifacts, as well as provocations for the design of broader, more modular kinds of crowdsourcing tools.

(image omitted)

Madison is a crowdsourcing project designed to engage the public with an under-viewed but rich portion of The New York Times’s archives: the historical ads neighboring the articles. News events and reporting give us one perspective on our past, but the advertisements running alongside these articles provide a different view, giving us a sense of the culture surrounding these events. Alternately fascinating, funny and poignant, they act as commentary on the technology, economics, gender relations and more of that time period. However, the digitization of our archives has primarily focused on news, leaving the ads with no metadata–making them very hard to find and impossible to search for them. Complicating the process further is that these ads often have complex layouts and elaborate typefaces, making them difficult to differentiate algorithmically from photographic content, and much more difficult to scan for text. This combination of fascinating cultural information with little structured data seemed like the perfect opportunity to explore how crowdsourcing could form a source of semantic signals.

From the project’s homepage:

Help preserve history with just one click.

The New York Times archives are full of advertisements that give glimpses into daily life and cultural history. Help us digitize our historic ads by answering simple questions. You’ll be creating a unique resource for historians, advertisers and the public — and leaving your mark on history.

Get started with our collection of ads from the 1960s (additional decades will be opened later)!

I would like to see a Bible transcription project that was that user friendly!

But then, the goal of the New York Times is to include as many people as possible.

Looking forward to more news on Madison!

Guide to Law Online

Tuesday, October 28th, 2014

Guide to Law Online

From the post:

The Guide to Law Online, prepared by the Law Library of Congress Public Services Division, is an annotated guide to sources of information on government and law available online. It includes selected links to useful and reliable sites for legal information.

Select a Link:

The Guide to Law Online is an annotated compendium of Internet links; a portal of Internet sources of interest to legal researchers. Although the Guide is selective, inclusion of a site by no means constitutes endorsement by the Law Library of Congress.

In compiling this list, emphasis wherever possible has been on sites offering the full texts of laws, regulations, and court decisions, along with commentary from lawyers writing primarily for other lawyers. Materials related to law and government that were written by or for lay persons also have been included, as have government sites that provide even quite general information about themselves or their agencies.

Every direct source listed here was successfully tested before being added to the list. Users, however, should be aware that changes of Internet addresses and file names are frequent, and even sites that usually function well do not always do so. Thus a successful connection may sometimes require several attempts. If such an attempt to access a file indicates an error, the information can sometimes still be accessed by truncating the URL address to access a directory at the site.

Last Updated: 07/10/2014

While I was at the Library of Congress site today I encountered this set of law guides and thought they might be of interest. Updated in July of this year so most of the links should still work.

Officially Out of Beta

Tuesday, October 28th, 2014

Officially Out of Beta

From the post:

The free legislative information website,, is officially out of beta form, and beginning today includes several new features and enhancements. URLs that include will be redirected to The site now includes the following:

New Feature: Resources

  • A new resources section providing an A to Z list of hundreds of links related to Congress
  • An expanded list of “most viewed” bills each day, archived to July 20, 2014

New Feature: House Committee Hearing Videos

  • Live streams of House Committee hearings and meetings, and an accompanying archive to January, 2012

Improvement: Advanced Search

  • Support for 30 new fields, including nominations, Congressional Record and name of member

Improvement: Browse

  • Days in session calendar view
  • Roll Call votes
  • Bill by sponsor/co-sponsor

When the Library of Congress, in collaboration with the U.S. Senate, U.S. House of Representatives and the Government Printing Office (GPO) released as a beta site in the fall of 2012, it included bill status and summary, member profiles and bill text from the two most recent congresses at that time – the 111th and 112th.

Since that time, has expanded with the additions of the Congressional Record, committee reports, direct links from bills to cost estimates from the Congressional Budget Office, legislative process videos, committee profile pages, nominations, historic access reaching back to the 103rd Congress and user accounts enabling saved personal searches. Users have been invited to provide feedback on the site’s functionality, which has been incorporated along with the data updates.

Plans are in place for ongoing enhancements in the coming year, including addition of treaties, House and Senate Executive Communications and the Congressional Record Index.

Field Value Lists:

Use search fields in the main search box (available on most pages), or via the advanced and command line search pages. Use terms or codes from the Field Value Lists with corresponding search fields: Congress [congressId], Action – Words and Phrases [billAction], Subject – Policy Area [billSubject], or Subject (All) [allBillSubjects].

Congresses (44, stops with 70th Congress (1927-1929))

Legislative Subject Terms, Subject Terms (541), Geographic Entities (279), Organizational Names (173). (total 993)

Major Action Codes (98)

Policy Area (33)

Search options:

Search Form: “Choose collections and fields from dropdown menus. Add more rows as needed. Use Major Action Codes and Legislative Subject Terms for more precise results.”

Command Line: “Combine fields with operators. Refine searches with field values: Congresses, Major Action Codes, Policy Areas, and Legislative Subject Terms. To use facets in search results, copy your command line query and paste it into the home page search box.”

Search Tips Overview: “You can search using the quick search available on most pages or via the advanced search page. Advanced search gives you the option of using a guided search form or a command line entry box.” (includes examples)


You can follow this project @congressdotgov.

Orientation to Legal Research & is available both as a seminar (in-person) and webinar (online).


I first saw this at is Out of Beta with New Features by Africa S. Hands.

Qatar Digital Library

Tuesday, October 28th, 2014

New Qatar Digital Library Offers Readers Unrivalled Collection of Precious Heritage Material

From the post:

The Qatar Digital Library which provides new public access to over half a million pages of precious historic archive and manuscript material has been launched today thanks to the British Library-Qatar Foundation Partnership project. This incredible resource makes documents and other items relating to the modern history of Qatar, the Gulf region and beyond, fully accessible and free of charge to researchers and the general public through a state-of-the-art online portal.

In line with the principles of the Qatar National Vision 2030, which aims to preserve the nation’s heritage and enhance Arab and Islamic values and identity, the launch of the Qatar Digital Library supports QF’s aim of unlocking human potential for the benefit of Qatar and the world.

Qatar National Library, a member of Qatar Foundation, has a firm commitment to preserving and showcasing Qatar’s heritage and promoting education and community development by sharing knowledge and providing resources to students, researchers, and the wider community.

With Qatar Foundation’s support, an expert, technical team has been preserving and digitising materials from the UK’s India Office Records archives over the past two years in order to be shared publicly on the portal owned and managed by Qatar National Library.

The Qatar Digital Library provides online access to over 475,000 pages from the India Office Records that date from the mid-18th century to 1951, and relate to modern historic events in Qatar, the Gulf and the Middle East region.

In addition, the Qatar Digital Library shares 25,000 pages of medieval Arab Islamic sciences manuscripts, historical maps, photographs and sound recordings.

These precious materials are being made available online for the first time. The Qatar Digital Library provides clear descriptions of the digitised materials in Arabic and English, and can be accessed for personal and research use from anywhere free of charge.

The Qatar Digital Library (homepage).

Simply awesome!

A great step towards unlocking the riches of Arab scholarship.

I first saw this in British Library Launches Qatar Digital Library by Africa S. Hands.

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are usually based on only one language, usually english. However, it would be better to be able to process document in any language. For example, a case in a recommender system would be a user that speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the french term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would also be able to merge easily into the same keyword to build a language-independent user profile that will be used for accurate French and English recommendations.

This articles shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.

The NSA can hire translators so I would not bother sharing this technique for harnessing the thousands of expert hours in Wikipedia with them.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.
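
A minimal sketch of the merging idea in the quoted post, assuming a tiny hand-made table that maps (language, term) pairs to a shared concept id. The table, the concept ids and the function names are hypothetical stand-ins for illustration; this is not the Wikipedia Miner API, and in practice the mapping would be derived from Wikipedia rather than typed by hand.

```python
from collections import Counter

# Hypothetical lookup: (language, lowercase term) -> language-independent concept id.
# In a real system this table would be derived from Wikipedia, not typed by hand.
CONCEPTS = {
    ("en", "airplane"): "concept:airplane",
    ("fr", "avion"): "concept:airplane",
    ("en", "train"): "concept:train",
    ("fr", "train"): "concept:train",
}


def concepts_in(text, lang):
    """Return the concept ids for every known term in a whitespace-split text."""
    found = []
    for word in text.lower().split():
        concept = CONCEPTS.get((lang, word.strip(".,;:!?")))
        if concept is not None:
            found.append(concept)
    return found


def build_profile(rated_pages):
    """Sum ratings per concept; rated_pages holds (text, lang, rating) triples."""
    profile = Counter()
    for text, lang, rating in rated_pages:
        # Count each concept once per page, whatever language the page is in.
        for concept in set(concepts_in(text, lang)):
            profile[concept] += rating
    return profile


# Made-up browsing history: one English page and one French page, both rated +1.
history = [
    ("The airplane landed safely.", "en", 1),
    ("Un avion a décollé.", "fr", 1),
]
profile = build_profile(history)
```

Both pages land on the same key in `profile`, which is the language-independent user profile the post describes. The caveat above still applies: a term with no entry in the table simply contributes nothing.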

On the Computational Complexity of MapReduce

Monday, October 27th, 2014

On the Computational Complexity of MapReduce by Jeremy Kun.

From the post:

I recently wrapped up a fun paper with my coauthors Ben Fish, Adam Lelkes, Lev Reyzin, and Gyorgy Turan in which we analyzed the computational complexity of a model of the popular MapReduce framework. Check out the preprint on the arXiv.

As usual I’ll give a less formal discussion of the research here, and because the paper is a bit more technically involved than my previous work I’ll be omitting some of the more pedantic details. Our project started after Ben Moseley gave an excellent talk at UI Chicago. He presented a theoretical model of MapReduce introduced by Howard Karloff et al. in 2010, and discussed his own results on solving graph problems in this model, such as graph connectivity. You can read Karloff’s original paper here, but we’ll outline his model below.

Basically, the vast majority of the work on MapReduce has been algorithmic. What I mean by that is researchers have been finding more and cleverer algorithms to solve problems in MapReduce. They have covered a huge amount of work, implementing machine learning algorithms, algorithms for graph problems, and many others. In Moseley’s talk, he posed a question that caught our eye:

Is there a constant-round MapReduce algorithm which determines whether a graph is connected?

After we describe the model below it’ll be clear what we mean by “solve” and what we mean by “constant-round,” but the conjecture is that this is impossible, particularly for the case of sparse graphs. We know we can solve it in a logarithmic number of rounds, but anything better is open.

In any case, we started thinking about this problem and didn’t make much progress. To the best of my knowledge it’s still wide open. But along the way we got into a whole nest of more general questions about the power of MapReduce. Specifically, Karloff proved a theorem relating MapReduce to a very particular class of circuits. What I mean is he proved a theorem that says “anything that can be solved in MapReduce with so many rounds and so much space can be solved by circuits that are yae big and yae complicated, and vice versa.

But this question is so specific! We wanted to know: is MapReduce as powerful as polynomial time, our classical notion of efficiency (does it equal P)? Can it capture all computations requiring logarithmic space (does it contain L)? MapReduce seems to be somewhere in between, but it’s exact relationship to these classes is unknown. And as we’ll see in a moment the theoretical model uses a novel communication model, and processors that never get to see the entire input. So this led us to a host of natural complexity questions:

  1. What computations are possible in a model of parallel computation where no processor has enough space to store even one thousandth of the input?
  2. What computations are possible in a model of parallel computation where processor’s can’t request or send specific information from/to other processors?
  3. How the hell do you prove that something can’t be done under constraints of this kind?
  4. How do you measure the increase of power provided by giving MapReduce additional rounds or additional time?

These questions are in the domain of complexity theory, and so it makes sense to try to apply the standard tools of complexity theory to answer them. Our paper does this, laying some brick for future efforts to study MapReduce from a complexity perspective.

Given the prevalence of MapReduce, progress on understanding what is or is not possible is an important topic.

The first two complexity questions strike me as the ones most relevant to topic map processing with MapReduce, depending upon the nature of your merging algorithm.
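
To make the round-counting concrete, here is a rough in-process simulation of graph connectivity by minimum-label propagation. It is plain Python, not a real MapReduce job, and it is neither the logarithmic-round algorithm mentioned in the quote nor anything taken from the paper; all function names are made up for illustration. Each reducer sees only the values for its own key, which is the restriction the first two questions are about.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (vertex, label) pairs: each vertex keeps its label and hears its neighbors'."""
    for vertex, label in labels.items():
        yield vertex, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Each reducer sees only the values for one key and keeps the smallest label."""
    return {vertex: min(values) for vertex, values in groups.items()}


def connected_components(vertices, edges):
    """Repeat map/shuffle/reduce rounds until no label changes."""
    labels = {v: v for v in vertices}
    rounds = 0
    while True:
        new_labels = reduce_phase(shuffle(map_phase(labels, edges)))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


# Made-up graph: a path 1-2-3-4 and a separate edge 5-6.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
labels, rounds = connected_components(vertices, edges)
is_connected = len(set(labels.values())) == 1
```

The number of rounds here grows with the longest path a label has to travel, so this naive approach is not constant-round. If topics are treated as vertices and a shared identifier as an edge, the final labels would mark which topics merge; that mapping is an assumption made for illustration.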


Data Modelling: The Thin Model [Entities with only identifiers]

Monday, October 27th, 2014

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users‘.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the meta data or attributes almost as an after thought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

Apache Flink (formerly Stratosphere) Competitor to Spark

Monday, October 27th, 2014

From the Apache Flink 0.6 release page:

What is Flink?

Apache Flink is a general-purpose data processing engine for clusters. It runs on YARN clusters on top of data stored in Hadoop, as well as stand-alone. Flink currently has programming APIs in Java and Scala. Jobs are executed via Flink's own runtime engine. Flink features:

Robust in-memory and out-of-core processing: once read, data stays in memory as much as possible, and is gracefully de-staged to disk in the presence of memory pressure from limited memory or other applications. The runtime is designed to perform very well both in setups with abundant memory and in setups where memory is scarce.

POJO-based APIs: when programming, you do not have to pack your data into key-value pairs or some other framework-specific data model. Rather, you can use arbitrary Java and Scala types to model your data.

Efficient iterative processing: Flink contains explicit "iterate" operators that enable very efficient loops over data sets, e.g., for machine learning and graph applications.

A modular system stack: Flink is not a direct implementation of its APIs but a layered system. All programming APIs are translated to an intermediate program representation that is compiled and optimized via a cost-based optimizer. Lower-level layers of Flink also expose programming APIs for extending the system.

Data pipelining/streaming: Flink's runtime is designed as a pipelined data processing engine rather than a batch processing engine. Operators do not wait for their predecessors to finish in order to start processing data. This results to very efficient handling of large data sets.

The latest version is Apache Flink 0.6.1

See more information at the incubator homepage. Or consult the Apache Flink mailing lists.

The Quickstart is…, wait for it: word count on Hamlet. Nothing against the Bard, but you do know that everyone dies at the end. Yes? Seems like a depressing example.
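
For readers who have not seen a word count written as a pipeline, here is a minimal sketch in plain Python generators. It does not use the Flink API; the function names are made up, and the only thing it shares with the description above is the idea that a downstream operator starts work as soon as the upstream one yields a record.

```python
from collections import Counter


def read_lines(text):
    """Source operator: yield one line at a time."""
    for line in text.splitlines():
        yield line


def tokenize(lines):
    """Yield lowercase words as soon as each line arrives."""
    for line in lines:
        for word in line.lower().split():
            cleaned = word.strip(".,;:!?'\"")
            if cleaned:
                yield cleaned


def count_words(words):
    """Sink operator: consume the word stream and tally occurrences."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


# One line from Hamlet as sample input.
sample = "To be, or not to be, that is the question."
counts = count_words(tokenize(read_lines(sample)))
```

Nothing is materialized between stages: `tokenize` pulls a line only when `count_words` asks for the next word.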

What would you suggest as example application(s) for this type of software?

I first saw this on Danny Bickson’s blog as Apache flink.

Extended Artificial Memory:…

Monday, October 27th, 2014

Extended Artificial Memory: Toward an Integral Cognitive Theory of Memory and Technology by Lars Ludwig. (PDF) (Or you can contribute to the cause by purchasing a printed or Kindle copy of: Information Technology Rethought as Memory Extension: Toward an integral cognitive theory of memory and technology.)

Conventional book selling wisdom is that a title should provoke people to pick up the book. First step towards a sale. Must be the thinking behind this title. Just screams “Read ME!”


Seriously, I have read some of the PDF version and this is going on my holiday wish list as a hard copy request.


This thesis introduces extended artificial memory, an integral cognitive theory of memory and technology. It combines cross-scientific analysis and synthesis for the design of a general system of essential knowledge-technological processes on a sound theoretical basis. The elaboration of this theory was accompanied by a long-term experiment for understanding [Erkenntnisexperiment]. This experiment included the agile development of a software prototype (Artificial Memory) for personal knowledge management.

In the introductory chapter 1.1 (Scientific Challenges of Memory Research), the negative effects of terminological ambiguity and isolated theorizing to memory research are discussed.

Chapter 2 focuses on technology. The traditional idea of technology is questioned. Technology is reinterpreted as a cognitive actuation process structured in correspondence with a substitution process. The origin of technological capacities is found in the evolution of eusociality. In chapter 2.2, a cognitive-technological model is sketched. In this thesis, the focus is on content technology rather than functional technology. Chapter 2.3 deals with different types of media. Chapter 2.4 introduces the technological role of language-artifacts from different perspectives, combining numerous philosophical and historical considerations. The ideas of chapter 2.5 go beyond traditional linguistics and knowledge management, stressing individual constraints of language and limits of artificial intelligence. Chapter 2.6 develops an improved semantic network model, considering closely associated theories.

Chapter 3 gives a detailed description of the universal memory process enabling all cognitive technological processes. The memory theory of Richard Semon is revitalized, elaborated and revised, taking into account important newer results of memory research.

Chapter 4 combines the insights on the technology process and the memory process into a coherent theoretical framework. Chapter 4.3.5 describes four fundamental computer-assisted memory technologies for personally and socially extended artificial memory. They all tackle basic problems of the memory-process (4.3.3). In chapter 4.3.7, the findings are summarized and, in chapter 4.4, extended into a philosophical consideration of knowledge.

Chapter 5 provides insight into the relevant system landscape (5.1) and the software prototype (5.2). After an introduction into basic system functionality, three exemplary, closely interrelated technological innovations are introduced: virtual synsets, semantic tagging, and Linear Unit tagging.

The common memory capture (of two or more speakers) imagery is quite powerful. It highlights a critical aspect of topic maps.

Be forewarned this is European style scholarship, where the reader is assumed to be comfortable with philosophy, linguistics, etc., in addition to the more narrow aspects of computer science.

To see these ideas in practice:

Slides on What is Artificial Memory.

I first saw this in a note from Jack Park, the source of many interesting and useful links, papers and projects.

Think Big Challenge 2014 [Census Data – Anonymized]

Monday, October 27th, 2014

Think Big Challenge 2014 [Census Data – Anonymized]

The Think Big Challenge 2014 closed October 19, 2014, but the data sets for that challenge remain available.

From the data download page:

This subdirectory contains a small extract of the data set (1,000 records). There are two data sets provided:

A complete set of records from after the year 1820 is available for download from Amazon S3 at The full data set is available for download from Amazon S3 at as a 127MB gzip file.

A sample of records pre-1820 for use in the data science “Learning of Common Ancestors” challenge. This can be downloaded at as a 4MB gzip file.

The records have been pre-processed:

The contest data set includes both publicly availabl[e] records (e.g., census data) and user-contributed submissions on To preserve user privacy, all surnames present in the data have been obscured with a hash function. The hash is constructed such that all occurrences of the same string will result in the same hash code.

Reader exercise: You can find multiple ancestors of yours in these records with different surnames and compare those against the hash function results. How many will you need to reverse the hash function and recover all the surnames? Use other ancestors of yours to check your results.
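
A minimal sketch of why a deterministic hash is weak protection for a small value space such as surnames. The contest's actual hash function is not stated in the material quoted here, so `obscure` below is a hypothetical stand-in (unsalted SHA-256 of the lowercased name), and the records and candidate list are made up. If the real function is salted or keyed, this lookup will not work as written, and known name and hash pairs (your own ancestors) become the way to test guesses about the function.

```python
import hashlib


def obscure(surname):
    """Hypothetical stand-in for the contest's hash: unsalted SHA-256 of the lowercased name."""
    return hashlib.sha256(surname.strip().lower().encode("utf-8")).hexdigest()


def build_lookup(candidate_surnames):
    """Hash every candidate once so observed hash codes can be looked up directly."""
    return {obscure(name): name for name in candidate_surnames}


def recover(hashed_values, lookup):
    """Map each observed hash to a surname, or None when no candidate matches."""
    return {h: lookup.get(h) for h in hashed_values}


# Made-up data: three "records" whose surnames were obscured, and a guess list.
observed = [obscure("Smith"), obscure("Jones"), obscure("Smith")]
candidates = ["smith", "jones", "brown"]
recovered = recover(observed, build_lookup(candidates))
```

Because the same string always produces the same hash code, one pass over a list of common surnames is enough to label every matching record.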

Take a look at the original contest tasks for inspiration. What other online records would you want to merge with these? Thinking local newspapers? What about law reporters?


I first saw this mentioned on Danny Bickson’s blog as: Interesting dataset from

Update: I meant to mention Risks of Not Understanding a One-Way Function by Bruce Schneier, to get you started on the deanonymization task. Apologies for the omission.

If you are interested in cryptography issues, following Bruce Schneier’s blog should be on your regular reading list.

Nothing to Hide

Sunday, October 26th, 2014

Nothing to Hide: Look out for yourself by Nicky Case.

Greg Linden describes it as:

Brilliantly done, free, open source, web-based puzzle game with wonderfully dark humor about ubiquitous surveillance

First and foremost, I sense there is real potential for this to develop into an enjoyable online game.

Second, this could be a way to educate users about security/surveillance threats.


I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

Death of Yahoo Directory

Sunday, October 26th, 2014

Progress Report: Continued Product Focus by Jay Rossiter, SVP, Cloud Platform Group.

From the post:

At Yahoo, focus is an important part of accomplishing our mission: to make the world’s daily habits more entertaining and inspiring. To achieve this focus, we have sunset more than 60 products and services over the past two years, and redirected those resources toward products that our users care most about and are aligned with our vision. With even more smart, innovative Yahoos focused on our core products – search, communications, digital magazines, and video – we can deliver the best for our users.

Directory: Yahoo was started nearly 20 years ago as a directory of websites that helped users explore the Internet. While we are still committed to connecting users with the information they’re passionate about, our business has evolved and at the end of 2014 (December 31), we will retire the Yahoo Directory. Advertisers will be upgraded to a new service; more details to be communicated directly.

Understandable but sad. Think of indexing a book that expanded as rapidly as the Internet over the last twenty (20) years. Especially if the content might or might not have any resemblance to already existing content.

The Internet remains in serious need of a curated means to access quality information. Almost any search returns links ranging from high to questionable quality.

Imagine if Yahoo segregated the top 500 computer science publishers, archives, societies, departments and blogs into a block of searchable content. (The 500 number is wholly arbitrary; it could be some other number.) Users would pre-qualify themselves as interested in computer science materials and create a market segment for advertising purposes.

Users would get less trash in their results and advertisers would have pre-qualified targets.

A pre-curated search set might mean you would miss an important link, but realistically, few people read beyond the first twenty (20) links anyway. An analysis of search logs at PubMed shows that 80% of users choose a link from the first twenty results.

In theory you may have > 10,000 “hits” but queuing all of those up for serving to a user is a waste of time.

Suspect it varies by domain but twenty (20) high quality “hits” from curated content would be a far cry from average search results now.

I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

The Chapman University Survey on American Fears

Sunday, October 26th, 2014

The Chapman University Survey on American Fears

From the webpage:

Chapman University has initiated a nationwide poll on what strikes fear in Americans. The Chapman University Survey on American Fears included 1,500 participants from across the nation and all walks of life. The research team leading this effort pared the information down into four basic categories: personal fears, crime, natural disasters and fear factors. According to the Chapman poll, the number one fear in America today is walking alone at night.

A multi-disciplinary team of Chapman faculty and students wanted to capture this information on a year-over-year basis to draw comparisons regarding what items are increasing in fear as well as decreasing. The fears are presented according to fears vs. concerns because that was the necessary phrasing to capture the information correctly.

Your marketing department will find this of interest.

If you are not talking about power, fear or sex, then you aren’t talking about marketing.

IT is no different from any other product or service. Perhaps that’s why the kumbaya approach to selling semantic solutions has done so poorly.

You will need far deeper research than this to integrate fear into your marketing program but at least it is a starting point for discussion.

I first saw this at Full Text Reports as: The Chapman Survey on American Fears