Archive for April, 2014

DHS Warning on Internet Explorer

Wednesday, April 30th, 2014

DHS warns against using Internet Explorer until bug is patched by Mark Hachman.

From the post:

A vulnerability discovered in Internet Explorer over the weekend is serious—serious enough that the Department of Homeland Security is advising users to stop using it until it’s been patched.

On Monday, the United States Computer Emergency Readiness Team (US-CERT), part of the U.S. Department of Homeland Security, weighed in.

“US-CERT is aware of active exploitation of a use-after-free vulnerability in Microsoft Internet Explorer,” it said in a bulletin. “This vulnerability affects IE versions 6 through 11 and could lead to the complete compromise of an affected system.”

Two questions that need answering:

First, how long has the NSA known about this vulnerability? I think the government should be helping the public and software vendors.

Second, is this really a zero-day bug? I ask because the source of the announcement was Microsoft itself. I thought “zero-day” referred to the advance notice given to the vendor before a bug is publicly identified. Yes?

Language is a Map

Wednesday, April 30th, 2014

Language is a Map by Tim O’Reilly.

From the post:

I’ve twice given an Ignite talk entitled Language is a Map, but I’ve never written up the fundamental concepts underlying that talk. Here I do that.

When I first moved to Sebastopol, before I raised horses, I’d look out at a meadow and I’d see grass. But over time, I learned to distinguish between oats, rye, orchard grass, and alfalfa. Having a language to make distinctions between different types of grass helped me to see what I was looking at.

I first learned this notion, that language is a map that reflects reality, and helps us to see it more deeply – or if wrong, blinds us to it – from George Simon, whom I first met in 1969. Later, George went on to teach workshops at the Esalen Institute, which was to the human potential movement of the 1970s as the Googleplex or Apple’s Infinite Loop is to the Silicon Valley of today. I taught at Esalen with George when I was barely out of high school, and his ideas have deeply influenced my thinking ever since.

If you accept Tim’s premise that “language is a map,” the next question that comes to mind is how faithfully can an information system represent your map?

Your map, not the map of an IT developer or a software vendor but your map?

Does your information system capture the shades and nuances of your map?


Question-answering system and method based on semantic labeling…

Tuesday, April 29th, 2014

Question-answering system and method based on semantic labeling of text documents and user questions

From the patent:

A question-answering system for searching exact answers in text documents provided in the electronic or digital form to questions formulated by user in the natural language is based on automatic semantic labeling of text documents and user questions. The system performs semantic labeling with the help of markers in terms of basic knowledge types, their components and attributes, in terms of question types from the predefined classifier for target words, and in terms of components of possible answers. A matching procedure makes use of mentioned types of semantic labels to determine exact answers to questions and present them to the user in the form of fragments of sentences or a newly synthesized phrase in the natural language. Users can independently add new types of questions to the system classifier and develop required linguistic patterns for the system linguistic knowledge base.

Another reason to hope the United States Supreme Court goes nuclear on processes and algorithms.

That’s not an opinion on this patent but on the cloud that all process/algorithm patents cast on innovation.

I first saw this at: IHS Granted Key Patent for Proprietary, Next-Generation Search Technology by Angela Guess.

Introduction to Process Maturity

Tuesday, April 29th, 2014

Introduction to Process Maturity by Michael Edson.

From the description:

Museum Web and New Media software projects offer tantalizing rewards, but the road to success can be paved with uncertainty and risk. To small organizations these risks can be overwhelming, and even large organizations with seemingly limitless resources can flounder in ways that profoundly affect staff morale, public impact, the health and fitness of our partners in the vendor community, and our own bottom lines. Something seems to happen between the inception of projects, when optimism and beneficial outcomes seem clear and attainable, and somewhere down the road when schedules, budgets, and outcomes go off course. What is it? And what can we do to gain control?

This paper, created for the 2008 annual conference of the American Association of Museums, describes some common ways that technology projects get into trouble. It examines a proven project-process framework called the Capability Maturity Model and how that model can provide insight and guidance to museum leaders and project participants, and it tells how to improve real-world processes that contribute to project success. The paper includes three brief case studies and a call-to-action which argues that museum leaders should make technology stewardship an urgent priority.

The intended audience is people who are interested in understanding and improving how museum-technology gets done. The paper’s primary focus is Web and New Media software projects, but the core ideas are applicable to projects of all kinds.

In web time, it may seem that process advice from 2008 must be dated.

Not really. Consider the following description of the federal government’s inability, at the time, to complete technology projects:

As systems become increasingly complex, successful software development becomes increasingly difficult. Most major system developments are fraught with cost, schedule, and performance shortfalls. We have repeatedly reported on costs rising by millions of dollars, schedule delays of not months but years, and multibillion-dollar systems that don’t perform as envisioned.

The problem wasn’t just that the government couldn’t complete software projects on time or on budget, or that it couldn’t predict which projects it was currently working on would succeed or fail—though these were both significant and severe problems—but most worrisome from my perspective is that it couldn’t figure out which new projects it was capable of doing in the future. If a business case or museum mission justifies an investment in technology that justification is based on the assumption that the technology can be competently implemented. If instead the assumption is that project execution is a crap shoot, the business case and benefit-to-mission arguments crumble and managers are stuck, unable to move forward (because of the risk of failure) and unable to not move forward because business and mission needs still call.

There is no shortage of process/project management advice but I think Edson captures the essence needed for process/project success:

  • Honestly assess your current processes and capabilities
  • Improve processes and capabilities one level at a time

Very much worth your time.

European Computational Linguistics

Tuesday, April 29th, 2014

From the ACL Anthology:

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

A snapshot of the current state of computational linguistics and perhaps inspiration for the next advance.


Physical Manifestation of a Topic Map

Tuesday, April 29th, 2014

I saw a tweet today by The O.C.R. referencing Cartographies of Time: A Visual History of the Timeline by Maria Popova. I have posted about it before (Cartographies of Time:…), but re-reading material can produce a different take on it. Today is an example of that.

Today when I read the post, I recognized that the Discus chronologicus (which has no Wikipedia entry) could be the physical manifestation of a topic map. Or at least of one with undisclosed reasons for mapping between domains.

discus chronologicus - Christoph Weigel

Granted, it does not provide you with the properties of each subject, save possibly a name (you need something to recognize). Each ring represents what Steve Newcomb calls a “universe of discourse,” and the movable arm represents warp holes between those universes of discourse at particular subjects.

This could be a useful prop for marketing topic maps.

First, it introduces the notion of different vocabularies (universes of discourse) in a very concrete way and demonstrates the advantage of being able to move from one to another. (Assuming here you have chosen universes of discourse of interest to the prospect.)

Second, the lack of space means that it is missing the properties that enabled the mapping, a nice analogy to the construction of most information systems. You can assure the prospect that digital topic maps include that information.

Third, unlike this fixed mapping (another analogy to current data systems), more universes of discourse and subjects can be added to a digital topic map while you retain all the previous mappings. “Recycling prior work” and “not paying 2, 3 or more times for mappings” are just some of the phrases that come to mind.
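To make the second and third points concrete, here is a minimal sketch in Python. It is not an implementation of any topic map standard: the class names, the `formula` property and the three vocabularies are all hypothetical, chosen only to show names being added in new universes of discourse while earlier mappings survive and the reason for each mapping stays disclosed.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """One subject, known by different names in different universes of discourse."""
    # The disclosed properties that justify treating the names as one subject.
    identifying_properties: dict
    # Universe of discourse -> name used there.
    names: dict = field(default_factory=dict)


class TopicMapSketch:
    """Hypothetical toy structure, not a standard topic map API."""

    def __init__(self):
        self.subjects = []

    def add_name(self, properties, universe, name):
        """Attach a name to the subject with matching properties, creating it if needed."""
        for subject in self.subjects:
            if subject.identifying_properties == properties:
                subject.names[universe] = name  # names added earlier are kept
                return subject
        subject = Subject(dict(properties), {universe: name})
        self.subjects.append(subject)
        return subject

    def translate(self, name, source, target):
        """Move from one universe of discourse to another at a given subject."""
        for subject in self.subjects:
            if subject.names.get(source) == name:
                return subject.names.get(target)
        return None


tm = TopicMapSketch()
salt = {"formula": "NaCl"}  # the property that enables the mapping
tm.add_name(salt, "kitchen", "table salt")
tm.add_name(salt, "mineralogy", "halite")
before = tm.translate("table salt", "kitchen", "mineralogy")

# Adding a third universe of discourse later does not disturb the earlier mapping.
tm.add_name(salt, "chemistry", "sodium chloride")
after = tm.translate("table salt", "kitchen", "mineralogy")
print(before, after, tm.translate("halite", "mineralogy", "chemistry"))
```

Unlike the paper disc, the sketch keeps the property that enabled the mapping alongside the names, so a later reader can check why two names were treated as the same subject.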

I am assuming that composing the map in GIMP or another graphics program is doable. The printing and assembly would be more problematic. I will be looking around. Suggestions welcome!

Apache Lucene/Solr 4.8.0 Available!

Monday, April 28th, 2014

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.8.0 and Apache Solr 4.8.0.

Lucene can be downloaded from and Solr can be downloaded from

Both releases now require Java 7 or greater (recommended is Oracle Java 7 or OpenJDK 7, minimum update 55; earlier versions have known JVM bugs affecting Lucene and Solr). In addition, both are fully compatible with Java 8.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • All index files now store end-to-end checksums, which are now validated during merging and reading. This ensures that corruptions caused by any bit-flipping hardware problems or bugs in the JVM can be detected earlier. For full detection be sure to enable all checksums during merging (it’s disabled by default).
  • Lucene has a new Rescorer/QueryRescorer API to perform second-pass rescoring or reranking of search results using more expensive scoring functions after first-pass hit collection.
  • AnalyzingInfixSuggester now supports near-real-time autosuggest.
  • Simplified impact-sorted postings (using SortingMergePolicy and EarlyTerminatingCollector) to use Lucene’s Sort class to express the sort order.
  • Bulk scoring and normal iterator-based scoring were separated, so some queries can do bulk scoring more effectively.
  • Switched to MurmurHash3 to hash terms during indexing.
  • IndexWriter now supports updating of binary doc value fields.
  • HunspellStemFilter now uses 10 to 100x less RAM. It also loads all known OpenOffice dictionaries without error.
  • Lucene now also fsyncs the directory metadata on commits, if the operating system and file system allow it (Linux, MacOSX are known to work).
  • Lucene now uses Java 7 file system functions under the hood, so index files can be deleted on Windows, even when readers are still open.
  • A serious bug in NativeFSLockFactory was fixed, which could allow multiple IndexWriters to acquire the same lock. The lock file is no longer deleted from the index directory even when the lock is not held.
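The end-to-end checksum idea in the first highlight is easy to illustrate. The sketch below is plain Python, not Lucene code, and the four-byte CRC32 footer layout is an assumption made for the example rather than a description of Lucene's index file format. It shows a checksum written with the data and validated on read, so a single flipped bit is detected.

```python
import struct
import zlib


def write_with_checksum(payload: bytes) -> bytes:
    """Return the payload followed by a 4-byte big-endian CRC32 footer."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def verify_checksum(blob: bytes) -> bytes:
    """Return the payload if the footer matches, otherwise raise ValueError."""
    if len(blob) < 4:
        raise ValueError("too short to contain a checksum footer")
    payload, footer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack(">I", footer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: data is corrupt")
    return payload


stored = write_with_checksum(b"postings data")
roundtrip = verify_checksum(stored)

# Simulate a bit-flipping hardware fault in the first byte.
corrupted = bytearray(stored)
corrupted[0] ^= 0x01
try:
    verify_checksum(bytes(corrupted))
    detected = False
except ValueError:
    detected = True
print(roundtrip, detected)
```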

Highlights of the Solr release include:

  • <fields> and <types> tags have been deprecated from schema.xml. There is no longer any reason to keep them in the schema file, they may be safely removed. This allows intermixing of <fieldType>, <field> and <copyField> definitions if desired.
  • The new {!complexphrase} query parser supports wildcards, ORs etc. inside Phrase Queries.
  • New Collections API CLUSTERSTATUS action reports the status of collections, shards, and replicas, and also lists collection aliases and cluster properties.
  • Added managed synonym and stopword filter factories, which enable synonym and stopword lists to be dynamically managed via REST API.
  • JSON updates now support nested child documents, enabling {!child} and {!parent} block join queries.
  • Added ExpandComponent to expand results collapsed by the CollapsingQParserPlugin, as well as the parent/child relationship of nested child documents.
  • Long-running Collections API tasks can now be executed asynchronously; the new REQUESTSTATUS action provides status.
  • Added a hl.qparser parameter to allow you to define a query parser for hl.q highlight queries.
  • In Solr single-node mode, cores can now be created using named configsets.
  • New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the “TTL” expression, as well as automatically deleting expired documents on a periodic basis.
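The last highlight describes two steps: turning a relative TTL into an absolute expiration date at index time, and periodically deleting whatever has expired. A rough Python sketch of that idea follows. It does not use Solr or its configuration. The field names `ttl_seconds` and `expire_at` and the in-memory dict standing in for an index are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def add_expiration(doc, now):
    """Compute an absolute expiration time from a relative TTL, if one is present."""
    ttl_seconds = doc.pop("ttl_seconds", None)  # hypothetical field name
    if ttl_seconds is not None:
        doc["expire_at"] = now + timedelta(seconds=ttl_seconds)
    return doc


def delete_expired(index, now):
    """Drop documents whose expiration time has passed; return how many were removed."""
    expired = [doc_id for doc_id, doc in index.items()
               if "expire_at" in doc and doc["expire_at"] <= now]
    for doc_id in expired:
        del index[doc_id]
    return len(expired)


t0 = datetime(2014, 4, 28, tzinfo=timezone.utc)
index = {
    "a": add_expiration({"title": "short-lived", "ttl_seconds": 60}, t0),
    "b": add_expiration({"title": "long-lived", "ttl_seconds": 3600}, t0),
    "c": add_expiration({"title": "permanent"}, t0),
}

# A periodic sweep five minutes later removes only the short-lived document.
removed = delete_expired(index, t0 + timedelta(minutes=5))
print(removed, sorted(index))
```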

All exciting additions, except that today I finished configuring Tomcat7/Solr/Nutch, where Solr = 4.7.2.

Sigh, well, I suppose that was just a trial run. 😉

No Silver Lining For U.S. Cloud Providers

Monday, April 28th, 2014

Judge’s ruling spells bad news for U.S. cloud providers by Barb Darrow.

From the post:

A court ruling on Friday over search warrants means continued trouble for U.S. cloud providers eager to build their businesses abroad.

In his ruling, U.S. Magistrate Judge James Francis found that big ISPs — including name brands Microsoft and Google — must comply with valid warrants to turn over customer information, including emails, even if that material resides in data centers outside the U.S., according to several reports.

Microsoft challenged such a warrant a few months back and this ruling was the response.

See the post for details but the bottom line is that you can’t rely on the loyalty of a cloud provider to its customers. U.S. cloud providers or non-U.S. cloud providers. When push comes to shove, you are going to find cloud providers siding with their local governments.

My suggestion is that you handle critical data security tasks yourself and off of any cloud provider.

That’s not a foolproof solution but otherwise you may as well cc: the NSA with your data.

BTW, I would not trust privacy or due process assurances from a government that admits it would target its own citizens for execution.


Humanitarian Data Exchange

Monday, April 28th, 2014

Humanitarian Data Exchange

From the webpage:

A project by the United Nations Office for the Coordination of Humanitarian Affairs to make humanitarian data easy to find and use for analysis.

HDX will include a dataset repository, based on open-source software, where partners can share their data spreadsheets and make it easy for others to find and use that data.

HDX brings together a Common Humanitarian Dataset that can be compared across countries and crises, with tools for analysis and visualization.

HDX promotes community data standards (e.g. the Humanitarian Exchange Language) for sharing operational data across a network of actors.

Data from diverse sources always creates opportunities to use topic maps.

The pilot countries include Colombia, Kenya and Yemen so semantic diversity is a reasonable expectation.

BTW, they are looking for volunteers. Opportunities range from data science, development, visualization to the creation of data standards.

Evaluating Entity Linking with Wikipedia

Monday, April 28th, 2014

Evaluating Entity Linking with Wikipedia by Ben Hachey, et al.

From the abstract:


Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets.

We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.
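The search-then-disambiguate pipeline in the abstract can be sketched in a few lines of Python. This is a toy, not one of the three systems the authors reimplement. The knowledge base, the acronym expansion heuristic and the word-overlap scoring are all assumptions made for illustration.

```python
import re

# Hypothetical toy knowledge base: entity id -> (aliases, descriptive words).
KB = {
    "Apple_Inc.": ({"apple", "apple inc"}, {"iphone", "company", "computer"}),
    "Apple_(fruit)": ({"apple"}, {"fruit", "tree", "orchard"}),
    "National_Security_Agency": ({"national security agency"},
                                 {"intelligence", "agency"}),
}


def expand_acronym(mention, tokens):
    """If the mention is an acronym, look for a run of words whose initials spell it."""
    if not (mention.isupper() and len(mention) > 1):
        return mention
    n = len(mention)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if "".join(word[0] for word in window) == mention:
            return " ".join(window)
    return mention


def link(mention, text):
    """Search for candidate entities, then pick the best one or return NIL."""
    tokens = re.findall(r"[A-Za-z]+", text)
    query = expand_acronym(mention, tokens).lower()
    # Search step: candidates are entities that list the query as an alias.
    candidates = [eid for eid, (aliases, _) in KB.items() if query in aliases]
    if not candidates:
        return "NIL"
    # Disambiguation step: prefer the candidate sharing the most words with the context.
    context = {token.lower() for token in tokens}
    return max(candidates, key=lambda eid: len(KB[eid][1] & context))


print(link("Apple", "Apple unveiled a new iPhone at the company event"))
print(link("Apple", "She picked an apple from the tree in the orchard"))
print(link("NSA", "The National Security Agency issued a statement. The NSA declined comment."))
print(link("Zorblax", "Nobody has heard of Zorblax"))
```

Even in a toy like this, the acronym mention finds no candidate at all unless the search step expands it first, which is the kind of effect the authors report for search strategies.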

A very deep survey of entity linking literature (including record linkage) and implementation of three complete entity linking systems for comparison.

At forty-eight (48) pages it isn’t a quick read but should be your starting point for pushing the boundaries on entity linking research.

I first saw this in a tweet by Alyona Medelyan.

Death to bad search results:… [Marketing Topic Maps]

Monday, April 28th, 2014

Death to bad search results: Elicit fixes website search with some context and a human touch by Michael Carney.

From the post:

Most major brand websites fail to satisfy their customers’ needs. It’s not because the right content isn’t available, but rather because users routinely struggle to find what they’re looking for and leave disappointed. Menu-based navigation systems are confusing and ineffective, while traditional search solutions are more likely to turn up corporate press releases than actual product- or service-related content.

This doesn’t have to be the case.

Elicit is a Chicago-based startup that has been solving this search and discovery problem for major brands like Motorola (previous), Blackberry, Xerox, Time Warner Cable, Bank of America, GoodYear, Whirlpool, and others. The SaaS company was founded in 2011 by a pair of former ad agency execs out of first-hand frustrations.

“We saw that customers and users increasingly start interacting with new sites via the search box,” Elicit co-founder and President Adam Heneghan says. “You spend so much money getting people to your site, but then do a bad job of satisfying them at that point. It makes absolutely no sense. More than 80 percent of site abandonment happens at search box.” (emphasis added)

From a bit further in the post:

“People typically assume that this is a huge, impossible problem to solve. But the reality is, when you look at the data, you can typically solve nearly 100 percent of search queries with just 100 or so keywords, once the data has been properly organized,” Eric Heneghan says.

I first saw this in a post on Facebook lamenting topic maps being ahead of their times.

Perhaps but I think the real difference is that Elicit is marketing a solution to a known problem. One that their customers suffer from and when relieved, the results are visible.

Think of it as being the difference between Walmart selling DIY condom kits versus condoms.

Which one would you drop by Walmart to buy?

Parsing English with 500 lines of Python

Monday, April 28th, 2014

Parsing English with 500 lines of Python by Matthew Honnibal.

From the post:

A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:

Definitely a post to savor if you have any interest in natural language processing.

I first saw this in a tweet by Jim Salmons.

Using NLTK for Named Entity Extraction

Sunday, April 27th, 2014

Using NLTK for Named Entity Extraction by Emily Daniels.

From the post:

Continuing on from the previous project, I was able to augment the functions that extract character names using NLTK’s named entity module and an example I found online, building my own custom stopwords list to run against the returned names to filter out frequently used words like “Come”, “Chapter”, and “Tell” which were caught by the named entity functions as potential characters but are in fact just terms in the story.
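The filtering step Emily describes can be sketched without NLTK. In the Python below, a simple capitalized-word heuristic stands in for NLTK's named entity module (an assumption for the sake of a self-contained example, not her code), and the sample sentence is made up rather than taken from the novel. The custom stopword list uses the three words mentioned in the quote.

```python
import re
from collections import Counter

# Custom stopwords taken from the examples in the quoted post.
CUSTOM_STOPWORDS = {"come", "chapter", "tell"}


def candidate_names(text):
    """Stand-in for a named entity step: every capitalized word is a candidate."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def filter_names(candidates, stopwords=CUSTOM_STOPWORDS):
    """Drop candidates that are really just ordinary words in the story."""
    return [name for name in candidates if name.lower() not in stopwords]


# Made-up sample text, not a quotation.
text = "Chapter 2. Come here, Oliver! Tell Fagin that Oliver has gone."
raw = candidate_names(text)
names = Counter(filter_names(raw))
print(raw)
print(names)
```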

Whether you are trusting your software or using human proofing, named entity extraction is a key task in mining data.

Having extracted named entities, the harder task is uncovering relationships between them that may not be otherwise identified.

That is challenging with the text of Oliver Twist but even more difficult when mining donation records and the Congressional Record.

The Deadly Data Science Sin of Confirmation Bias

Sunday, April 27th, 2014

The Deadly Data Science Sin of Confirmation Bias by Michael Walker.

From the post:


Confirmation bias occurs when people actively search for and favor information or evidence that confirms their preconceptions or hypotheses while ignoring or slighting adverse or mitigating evidence. It is a type of cognitive bias (pattern of deviation in judgment that occurs in particular situations – leading to perceptual distortion, inaccurate judgment, or illogical interpretation) and represents an error of inductive inference toward confirmation of the hypothesis under study.

Data scientists exhibit confirmation bias when they actively seek out and assign more weight to evidence that confirms their hypothesis, and ignore or underweigh evidence that could disconfirm their hypothesis. This is a type of selection bias in collecting evidence.

Note that confirmation biases are not limited to the collection of evidence: even if two (2) data scientists have the same evidence, their respective interpretations may be biased. In my experience, many data scientists exhibit a hidden yet deadly form of confirmation bias when they interpret ambiguous evidence as supporting their existing position. This is difficult and sometimes impossible to detect yet occurs frequently.

Isn’t that a great graphic? Michael goes on to list several resources that will help in spotting confirmation bias, yours and that of others. Not 100% but you will do better heeding his advice.

Be aware that the confirmation bias isn’t confined to statistical and/or data science methods. Decision makers, topic map authors, fact gatherers, etc. are all subject to confirmation bias.

Michael sees confirmation bias as dangerous to the credibility of data science, writing:

The evidence suggests confirmation bias is rampant and out of control in both the hard and soft sciences. Many academic or research scientists run thousands of computer simulations where all fail to confirm or verify the hypothesis. Then they tweak the data, assumptions or models until confirmatory evidence appears to confirm the hypothesis. They proceed to publish the one successful result without mentioning the failures! This is unethical, may be fraudulent and certainly produces flawed science where a significant majority of results can not be replicated. This has created a loss or confidence and credibility for science by the public and policy makers that has serious consequences for our future.

The danger for professional data science practitioners is providing clients and employers with flawed data science results leading to bad business and policy decisions. We must learn from the academic and research scientists and proactively avoid confirmation bias or data science risks loss of credibility.
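A little arithmetic shows why publishing the one successful run is so misleading. The sketch below is my own illustration, not Michael's. It assumes independent runs and a conventional 5% false positive rate per run, neither of which comes from his post, and computes the chance that at least one run "confirms" a hypothesis that is actually false.

```python
def chance_of_spurious_success(runs, alpha=0.05):
    """Probability that at least one of `runs` independent tests passes by chance alone."""
    return 1 - (1 - alpha) ** runs


def runs_until_likely(threshold=0.5, alpha=0.05):
    """Smallest number of runs at which a spurious success becomes more likely than not."""
    runs = 1
    while chance_of_spurious_success(runs, alpha) <= threshold:
        runs += 1
    return runs


for n in (1, 10, 100):
    print(n, round(chance_of_spurious_success(n), 3))
print(runs_until_likely())
```

Under those assumptions, a handful of tweaked reruns is enough to make a chance "confirmation" more likely than not, which is why the failures need to be reported along with the success.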

I don’t think bad business and policy decisions need any help from “flawed data science.” You may recall that “policy makers” not all that many years ago dismissed a failure to find weapons of mass destruction, a key motivation for war, as irrelevant in hindsight.

My suggestion would be to make your data analysis as complete and accurate as possible and always keep digitally signed and encrypted copies of data and communications with your clients.

Net Neutrality – Priority Check

Sunday, April 27th, 2014

I remain puzzled over the “sky is falling” responses to rumors about possible FCC rules on Net Neutrality (NN). (See: New York Times, The Guardian and numerous others.) There are no proposed rules at the moment but a lack of content for comment hasn’t slowed the production of commentary.

Should I be concerned about Netflix being set upon by an even more rapacious predator (Comcast)? (A common NN example.) What priority should NN have among the issues vying for my attention? (Is net neutrality dying? Has the FCC killed it? What comes next? Here’s what you need to know)

Every opinion is from a point of view and mine is from the perspective of a lifetime of privilege, at least when compared to the vast majority of humanity. So what priority does NN have among the world at large? For one answer to that question, I turned to the MyWorld2015 Project.

MY World is a United Nations global survey for citizens. Working with partners, we aim to capture people’s voices, priorities and views, so world leaders can be informed as they begin the process of defining the next set of global goals to end poverty.

world opinion

If I am reading the chart correctly, Phone and internet access come in at #14.

Perhaps being satiated with goods and services for the first thirteen priorities makes NN loom large.

Having 95% of all possible privileges isn’t the same as having 96% of all possible privileges.*

*(Estimate. Actual numbers for some concerned residents of the United States are significantly higher than 96%.)

Just in case you are interested:

FCC Inbox for Open Internet Comments

Tentative Agenda for 15 May 2014 Meeting, which includes Open Internet. The Open Meeting is scheduled to commence at 10:30 a.m. in Room TW-C305, at 445 12th Street, S.W., Washington, D.C. The event will be shown live at

FCC website

Solr 4.8 Features

Saturday, April 26th, 2014

Solr 4.8 Features by Yonik Seeley.

Yonik reviews the coming new features for Solr 4.8:

  • Complex Phrase Queries
  • Indexing Child Documents in JSON
  • Expand Component
  • Named Config Sets
  • Stopwords and Synonyms REST API
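As a taste of the second item, here is a sketch of what a nested JSON update and the matching block join queries might look like. The `_childDocuments_` key and the `{!parent}` and `{!child}` parsers are Solr's. The field names and values are hypothetical, and the Python only builds the request body and query strings; it does not talk to a Solr server.

```python
import json

# Field names and values below are hypothetical.
docs = [
    {
        "id": "book1",
        "type_s": "book",
        "title_t": "An Example Book",
        "_childDocuments_": [
            {"id": "book1_review1", "type_s": "review", "stars_i": 5},
            {"id": "book1_review2", "type_s": "review", "stars_i": 2},
        ],
    }
]

# Body for a POST to the JSON update handler of a Solr core.
payload = json.dumps(docs, indent=2)

# Block join queries, passed as the q parameter of a search request.
# Parents (books) that have at least one five-star review:
parents_with_five_star_review = '{!parent which="type_s:book"}stars_i:5'
# Children (reviews) of books whose title matches "example":
reviews_of_example_books = '{!child of="type_s:book"}title_t:example'

print(payload)
```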

Do you think traditional publishing models work well for open source projects that evolve as rapidly as Solr?

I first saw this in a tweet by Martin Grotzke.

Social Media Mining: An Introduction

Saturday, April 26th, 2014

Social Media Mining: An Introduction by Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu.

From the webpage:

The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. Understanding and processing this new type of data to glean actionable patterns presents challenges and opportunities for interdisciplinary research, novel algorithms, and tool development. Social Media Mining integrates social media, social network analysis, and data mining to provide a convenient and coherent platform for students, practitioners, researchers, and project managers to understand the basics and potentials of social media mining. It introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining. Suitable for use in advanced undergraduate and beginning graduate courses as well as professional short courses, the text contains exercises of different degrees of difficulty that improve understanding and help apply concepts, principles, and methods in various scenarios of social media mining.

Another Cambridge University Press title that is available in pre-publication PDF format.

If you are contemplating writing a textbook, Cambridge University Press access policies should be one of your considerations in seeking a publisher.

You can download the entire book, chapters, and slides from Social Media Mining: An Introduction.

Do remember that only 14% of the U.S. adult population uses Twitter. Whatever “trends” you extract from Twitter may or may not reflect “trends” in the larger population.

I first saw this in a tweet by Stat Fact.

The Feynman Lectures on Physics

Saturday, April 26th, 2014

The Feynman Lectures on Physics Online, Free!

From the webpage:

Caltech and The Feynman Lectures Website are pleased to present this online edition of The Feynman Lectures on Physics. Now, anyone with internet access and a web browser can enjoy reading a high quality up-to-date copy of Feynman’s legendary lectures.

The lectures:

Feynman writes in volume 1, 1-1:

You might ask why we cannot teach physics by just giving the basic laws on page one and then showing how they work in all possible circumstances, as we do in Euclidean geometry, where we state the axioms and then make all sorts of deductions. (So, not satisfied to learn physics in four years, you want to learn it in four minutes?) We cannot do it in this way for two reasons. First, we do not yet know all the basic laws: there is an expanding frontier of ignorance. Second, the correct statement of the laws of physics involves some very unfamiliar ideas which require advanced mathematics for their description. Therefore, one needs a considerable amount of preparatory training even to learn what the words mean. No, it is not possible to do it that way. We can only do it piece by piece. (emphasis added)

A remarkable parallel to the use of “logic” on the WWW.

First, logic is only a small part of human reasoning, as Boole acknowledges in the “Laws of Thought.” Second, a “considerable amount of preparatory training” is required to use it.

Feynman has a real talent for explanation. Enjoy!

PS: A disclosed mapping of Feynman’s terminology to current physics would make an interesting project.

From Greek to Clojure!

Saturday, April 26th, 2014

From Greek to Clojure! by Nada Amin and William Byrd.

From the description:

In his Lambda Jam keynote, “Everything I Have Learned I Have Learned From Someone Else,” David Nolen exposed the joys and benefits of reading academic papers and putting them to work. In this talk, we show how to translate the mathy figures in Computer Science papers into Clojure code using both core.match and core.logic. You’ll gain strategies for understanding concepts in academic papers by implementing them!

Nada Amin is a member of the Scala team at EPFL, where she studies type systems and hacks on programming languages. She has contributed to Clojure’s core.logic and Google’s Closure compiler. She’s loved helping others learn to program ever since tutoring SICP as an undergraduate lab assistant at MIT.

William E. Byrd is a Postdoctoral Researcher in the School of Computing at the University of Utah. He is co-author of The Reasoned Schemer, and co-designer of several declarative languages: miniKanren (logic programing), Harlan (GPU programming), and Kanor (cluster programming). His StarCraft 2 handle is ‘Rojex’ (character code 715).

An alternative title for this paper would be: How To Read An Academic CS Paper. Seriously.
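To give a flavor of what translating “mathy figures” into code means, here is a small sketch. The talk uses Clojure with core.match and core.logic. This stand-in is plain Python with ordinary conditionals, and the three inference rules are a made-up toy language, not taken from the talk or from any paper it discusses. Each branch of the function mirrors one rule.

```python
def evaluate(expr):
    """Big-step evaluation of a toy language: each branch mirrors one inference rule."""
    # Rule NUM:   a number n evaluates to itself.
    if isinstance(expr, int):
        return expr
    tag = expr[0]
    # Rule ADD:   if e1 evaluates to v1 and e2 evaluates to v2,
    #             then ("add", e1, e2) evaluates to v1 + v2.
    if tag == "add":
        _, e1, e2 = expr
        return evaluate(e1) + evaluate(e2)
    # Rules IF-NONZERO / IF-ZERO:
    #             if c evaluates to a nonzero value, ("if", c, t, e) evaluates as t;
    #             if c evaluates to zero, it evaluates as e.
    if tag == "if":
        _, cond, then_branch, else_branch = expr
        if evaluate(cond) != 0:
            return evaluate(then_branch)
        return evaluate(else_branch)
    raise ValueError(f"no rule applies to {expr!r}")


program = ("if", ("add", 1, -1), 10, ("add", 20, 22))
result = evaluate(program)
print(result)
```

The premises above the line in a rule become recursive calls, and the conclusion below the line becomes the returned value, which is most of the trick to reading such figures.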

From Greek to Clojure at Github has the slides and “Logical types for untyped languages” (mentioned near the end of the paper).

I don’t think you need a login at the ACM Digital Library to see who cites “Logical types for untyped languages.”

Some other resources of interest:

Logical Types for Untyped Languages by Sam Tobin-Hochstadt (speaker deck)

Logical Types for Untyped Languages by Sam Tobin-Hochstadt and Matthias Felleisen (video)

A series of videos by Nada Amin and William Byrd that makes fewer assumptions about the audience on reading CS papers would really rock!

7 First Public Working Drafts of XQuery and XPath 3.1

Friday, April 25th, 2014

7 First Public Working Drafts of XQuery and XPath 3.1

From the post:

Today the XML Query Working Group and the XSLT Working Group have published seven First Public Working Drafts, four of which are jointly developed and three are from the XQuery Working Group.

The joint documents are:

  • XML Path Language (XPath) 3.1. XPath is a powerful expression language that allows the processing of values conforming to the data model defined in the XQuery and XPath Data Model. The main features of XPath 3.1 are maps and arrays.
  • XPath and XQuery Functions and Operators 3.1. This specification defines a library of functions available for use in XPath, XQuery, XSLT and other languages.
  • XQuery and XPath Data Model 3.1. This specification defines the data model on which all operations of XPath 3.1, XQuery 3.1, and XSLT 3.1 operate.
  • XSLT and XQuery Serialization 3.1. This document defines serialization of an instance of the XQuery and XPath Data Model into a sequence of octets, such as into XML, text, HTML, JSON.

The three XML Query Working Group documents are:

  • XQuery 3.1 Requirements and Use Cases, which describes the reasons for producing XQuery 3.1, and gives examples.
  • XQuery 3.1: An XML Query Language. XQuery is a versatile query and application development language, capable of processing the information content of diverse data sources including structured and semi-structured documents, relational databases and tree-based databases. The XQuery language is designed to support powerful optimizations and pre-compilation leading to very efficient searches over large amounts of data, including over so-called XML-native databases that read and write XML but have an efficient internal storage. The 3.1 version adds support for features such as arrays and maps primarily to facilitate processing of JSON and other structures.
  • XQueryX 3.1, which defines an XML syntax for XQuery 3.1.

Learn more about the XML Activity.

To show you how far behind I am on my reading, I haven’t even ordered Michael Kay’s XSLT 3.0 and XPath 3.0 book and the W3C is already working on 3.1 for both. 😉

I am hopeful that Michael will duplicate his success with XSLT 2.0 and XPath 2.0. This time though, I am going to get the Kindle edition. 😉

Solr In Action – Bug/Feature

Friday, April 25th, 2014

Solr In Action has recently appeared from Manning. I bought it on MEAP and am working from the latest version.

There is a bug/feature that you should be aware of if you are using the source code for Solr in Action.

The data-config.xml file (solrpedia/conf/data-config.xml) has the line:


Which works, if and only if you are using Jetty, which resolves the path relative to the solrpedia core.

However, if you are running Solr under Tomcat7, indexing will fail with the following log message:

Could not find file: solrpedia.xml (resolved to: /var/lib/tomcat/./solrpedia.xml)

If you change:


in solrpedia/conf/data-config.xml to:


it works like a charm.
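
Why the relative path lands in the wrong place can be shown with a minimal sketch. This is not Solr's code, just an illustration of relative-path resolution under an assumed "join to a base directory" rule. The helper `resolve` and the `/opt/solr/solrpedia` paths are hypothetical; only `/var/lib/tomcat` and `solrpedia.xml` come from the log message above.

```python
import posixpath


def resolve(base_dir, configured_path):
    """Join a configured path to a base directory.

    Relative paths are appended to base_dir; absolute paths are
    returned unchanged (posixpath.join discards the base).
    """
    return posixpath.join(base_dir, configured_path)


# Relative path resolved against Tomcat's working directory,
# matching the "resolved to" value in the log message.
tomcat_result = resolve("/var/lib/tomcat", "./solrpedia.xml")

# The same relative path resolved against a hypothetical core directory.
core_result = resolve("/opt/solr/solrpedia", "./solrpedia.xml")

# A hypothetical absolute path does not depend on the base directory.
absolute_result = resolve("/var/lib/tomcat", "/opt/solr/solrpedia/solrpedia.xml")

print(tomcat_result)
print(core_result)
print(absolute_result)
```

The point is only that a relative path means whatever the container's base directory makes it mean, while an absolute path means the same thing under any container.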

Good to know before I started on a much larger data import. 😉

SlideRule [Online Course Collection]

Friday, April 25th, 2014


From the about page:

Education is changing, with great educators from around the world increasingly putting their amazing courses online. We believe we are in the early days of a revolution that will not only increase access to great education, but also transform the way people learn.

SlideRule is our way of contributing to the movement. We help you discover the world’s best online courses in every subject – courses that your friends and thousands of other learners have loved.

I counted thirty-three (33) content providers who are supplying content. Some of it free, some not.

It looks extensive enough to be worth mentioning.


Friday, April 25th, 2014


From the webpage:

PourOver is a library for simple, fast filtering and sorting of large collections – think 100,000s of items – in the browser. It allows you to build data-exploration apps and archives that run at 60fps, that don’t have to wait for a database call to render query results.

PourOver is built around the ideal of simple queries that can be arbitrarily composed with each other, without having to recalculate their results. You can union, intersect, and difference queries. PourOver will remember how your queries were constructed and can smartly update them when items are added or modified. You also get useful features like collections that buffer their information periodically, views that page and cache, fast sorting, and much, much more.

If you just want to get started using PourOver, I would skip to “Preface – The Best Way to Learn PourOver”. There you will find extensive examples. If you are curious about why we made PourOver or what it might offer to you, I encourage you to skip down to “Chp 1. – The Philosophy of PourOver”.
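
The idea of queries that compose without being recalculated can be sketched with plain sets. To be clear, this is not the PourOver API (PourOver is a JavaScript library). The sample `items` and the helper `make_filter` are made up for illustration.

```python
# Hypothetical sample collection; the field names are invented.
items = [
    {"id": 0, "section": "sports", "year": 2013},
    {"id": 1, "section": "sports", "year": 2014},
    {"id": 2, "section": "arts", "year": 2014},
    {"id": 3, "section": "news", "year": 2014},
]


def make_filter(collection, key, value):
    """Evaluate one simple equality query and keep the matching ids."""
    return frozenset(item["id"] for item in collection if item[key] == value)


# Each simple query scans the collection exactly once.
sports = make_filter(items, "section", "sports")
arts = make_filter(items, "section", "arts")
y2014 = make_filter(items, "year", 2014)

# Composed queries reuse the cached id sets; no rescan of the items.
sports_or_arts = sports | arts      # union
sports_in_2014 = sports & y2014     # intersection
sports_not_2014 = sports - y2014    # difference

print(sorted(sports_or_arts), sorted(sports_in_2014), sorted(sports_not_2014))
```

Keeping results as id sets is what makes composition cheap: the expensive scan happens per simple query, and every combination afterwards is a set operation.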

This looks very cool!

Imagine doing client side merging of content from multiple topic map servers.

This type of software development and open release is making me consider a subscription to the New York Times.


I first saw this at Nathan Yau’s PourOver Allows Filtering of Large Datasets In Your Browser. If you are interested in data visualization and aren’t following Nathan’s blog, you should be.

CSI Buffs

Friday, April 25th, 2014

Crime Scene Investigation, a Guide for Law Enforcement Source: National Forensic Science Technology Center.

If you are a fan of CSI: Crime Scene Investigation, CSI: Miami, CSI: NY, or any of the other police/crime/terrorism, etc. shows, this will provide hours of entertainment.

If your purposes are more serious, either detecting or avoiding detection of criminal activity (varies by jurisdiction and your appetite for risk), this is one window into the data collection process for criminal investigations. Local terminology and practices will vary.

The handbook bills itself as:

This handbook is intended as a guide to recommended practices for crime scene investigation.

Jurisdictional, logistical, or legal conditions may preclude the use of particular procedures contained herein.

For potentially devastating situations, such as biological weapons or radiological or chemical threats, the appropriate agencies should be contacted. The user should refer to the National Institute of Justice’s publications for fire and arson investigation, bomb and explosives investigation, electronic crime investigation, and death investigation where applicable. (page xi)

Other resources you may find interesting:

Death Investigation (pdf, 64 pages)

Electronic Crime Scene Investigation (Second Edition)

Fire and Arson Scene Evidence (pdf, 73 pages)

Guide for Explosion and Bombing Scene Investigation (pdf, 64 pages)

Eliminating the overlap between these documents and supplementing them with local and “live” case examples would greatly increase their value.


PS: I have local copies just in case these should disappear.

We have no “yellow curved fruit” today

Thursday, April 24th, 2014


Tweeted by Olivier Croisier with this comment:

Looks like naming things is hard not only in computer science…

Naming (read identity) problems are everywhere.

Our intellectual cocoons very often prevent us from noticing such problems.

At least until something goes terribly wrong. Then the hunt is on for a scapegoat, not an explanation.

Decompiling Clojure

Thursday, April 24th, 2014

Guillermo Winkler has started a series of posts on decompiling Clojure.

Thus far, Decompiling Clojure I and Decompiling Clojure II, The Compiler have been posted.

From the first post:

This is the first in a series of articles about decompiling Clojure, that is, going from JVM bytecode created by the Clojure compiler, to some kind of higher level language, not necessarily Clojure.

This article was written in the scope of a larger project, building a better Clojure debugger, which I’ll probably blog about in the future.

These articles are going to build from the ground up, so you may skip forward if you find some of the stuff obvious.

Just in case you want to read something more challenging than the current FCC and/or security news.

BBC: An Honest Ontology

Thursday, April 24th, 2014

British Broadcasting Corporation launches an Ontology page

From the post:

The British Broadcasting Corporation (BBC) has launched a new page detailing their internal data models. The page provides access to the ontologies the BBC is using to support its audience facing applications such as BBC Sport, BBC Education, BBC Music, News projects and more. These ontologies form the basis of their Linked Data Platform. The listed ontologies include the following:

I think my favorite is:

Core Concepts Ontology – The generic BBC ontology for people, places, events, organisations, themes which represent things that make sense across the BBC. (emphasis added)

I don’t think you can ask for a fairer statement from an ontology than: “which represent things that make sense across the BBC.”

And that’s all any ontology can do. Represent things that make sense in a particular context.

What I wish the BBC ontology (along with other ontologies) did more of is to specify what is required to recognize one of its “things.”

For example, person has these properties: “dateOfBirth, dateOfDeath, gender, occupation, placeOfBirth, placeOfDeath.”

We can ignore “dateOfBirth, dateOfDeath, … placeOfBirth, placeOfDeath” because those would not distinguish a person from a zoo animal, for instance. Ditto for gender.

So, is “occupation” the sole property by which I can distinguish a person from other entities that can have “dateOfBirth, dateOfDeath, gender, …, placeOfBirth, placeOfDeath” properties?
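
The question can be posed as a set difference. In this rough sketch the `person` properties are the ones quoted above, while the `zoo_animal` property set is my assumption for the sake of argument; the BBC ontology defines no such class that I know of.

```python
# Properties the BBC ontology lists for a person (quoted above).
person = {
    "dateOfBirth", "dateOfDeath", "gender",
    "occupation", "placeOfBirth", "placeOfDeath",
}

# Hypothetical: properties a zoo animal could equally well carry.
zoo_animal = {
    "dateOfBirth", "dateOfDeath", "gender",
    "placeOfBirth", "placeOfDeath",
}

# Properties that could tell a person apart from a zoo animal.
distinguishing = person - zoo_animal

print(distinguishing)
```

Under that assumption, the only property left to do the distinguishing is “occupation.”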

Noting that “occupation” is described as:

This property associates a person with a thematic area he or she worked in, for example Annie Lennox with Music.

BTW, the only property of “theme” is “occupation” and “thematic area” is undefined.

Works if you share an understanding with the BBC about “occupation” and/or don’t want to talk about the relationship between Annie Lennox and Music.

Of course, without more properties, it is hard to know exactly what the BBC means by “thematic area.” That’s ok if you are only using the BBC ontology or if the ambiguity of what is meant is tolerable for your application. Not so ok if you want to map precisely to what the BBC may or may not have meant.

But I do appreciate the BBC being honest about its ontology “…mak[ing] sense across the BBC.”

FoundationDB: Developer Recipes

Thursday, April 24th, 2014

FoundationDB: Developer Recipes

From the webpage:

Learn how to build new data models, indexes, and more on top of the FoundationDB key-value store API.

I was musing the other day about how to denormalize a data structure for indexing.

This is the reverse of that process but still should be instructive.
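
For a feel of what “indexes on top of a key-value store” can mean, here is a toy sketch of a secondary index over ordered tuple keys. It does not use the FoundationDB API. The `OrderedKV` class, the key layout, and the sample users are all hypothetical, and the store is an in-memory stand-in.

```python
import bisect


class OrderedKV:
    """Toy ordered key-value store: sorted tuple keys, reads by key prefix."""

    def __init__(self):
        self._keys = []     # sorted list of tuple keys
        self._values = {}   # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            found.append((key, self._values[key]))
        return found


def add_user(db, user_id, name, city):
    # Primary record, keyed by id.
    db.set(("user", user_id), {"name": name, "city": city})
    # Index entry: the searchable value lives in the key itself.
    db.set(("idx", "city", city, user_id), "")


def users_in_city(db, city):
    # A prefix scan over the index returns the matching ids.
    return [key[3] for key, _ in db.get_prefix(("idx", "city", city))]


db = OrderedKV()
add_user(db, 1, "Ann", "Atlanta")
add_user(db, 2, "Bob", "Boston")
add_user(db, 3, "Cy", "Atlanta")

print(users_in_city(db, "Atlanta"))
```

The design choice worth noticing is that the index stores nothing in its values; ordering of the keys does all the work.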

Graphistas should note that FoundationDB also implements the Blueprints API (blueprints-foundationdb-graph).

Tools for ideation and problem solving: Part 1

Thursday, April 24th, 2014

Tools for ideation and problem solving: Part 1 by Dan Lockton.

From the post:

Back in the darkest days of my PhD, I started blogging extracts from the thesis as it was being written, particularly the literature review. It helped keep me motivated when I was at a very low point, and seemed to be of interest to readers who were unlikely to read the whole 300-page PDF or indeed the publications. Possibly because of the amount of useful terms in the text making them very Google-able, these remain extremely popular posts on this blog. So I thought I would continue, not quite where I left off, but with a few extracts that might actually be of practical use to people working on design, new ideas, and understanding people’s behaviour.

The first article (to be split over two parts) is about toolkits (and similar things, starting with an exploration of idea generation methods), prompted by much recent interest in the subject via projects such as Lucy Kimbell, Guy Julier, Jocelyn Bailey and Leah Armstrong’s Mapping Social Design Research & Practice and Nesta’s Development Impact & You toolkit, and some of our discussions at the Helen Hamlyn Centre for the Creative Citizens project about different formats for summarising information effectively. (On this last point, I should mention the Sustainable Cultures Engagement Toolkit developed in 2012-13 by my colleagues Catherine Greene and Lottie Crumbleholme, with Johnson Controls, which is now available online (12.5MB PDF).)

The article below is not intended to be a comprehensive review of the field, but was focused specifically on aspects which I felt were relevant for a ‘design for behaviour change’ toolkit, which became Design with Intent. I should also note that since the below was written, mostly in 2010-11, a number of very useful articles have collected together toolkits, card decks and similar things. I recommend: Venessa Miemis’s 21 Card Decks, Hanna Zoon’s Depository of Design Toolboxes, Joanna Choukeir’s Design Methods Resources, Stephen Anderson’s answer on this Quora thread, and Ola Möller’s 40 Decks of Method Cards for Creativity. I’m sure there are others.

Great post but best read when you have time to follow links and to muse about what you are reading.

I think the bicycle with square wheels was the best example in part 1. Which example do you like best? (Yes, I am teasing you into reading the post.)

Having a variety of problem solving/design skills will enable you to work with groups that respond to different problem solving strategies.

Important in eliciting designs for topic maps as users don’t ever talk about implied semantics known by everyone.

Unfortunately our machines, not being people, don’t know what everyone else knows; they know only what they are told.

I first saw this in Nat Torkington’s Four short links: 23 April 2014.

Verizon 2014 Data Breach Investigations Report

Wednesday, April 23rd, 2014

Kelly Jackson Higgins summarizes the most important point of the Verizon 2014 Data Breach Investigations Report, in Stolen Passwords Used In Most Data Breaches, when she says:

Cyber criminals and cyberspies mostly log in to steal data: Findings from the new and much-anticipated 2014 Verizon Data Breach Investigations Report (DBIR) show that two out of three breaches involved attackers using stolen or misused credentials.

“Two out of three [attacks] focus on credentials at some point in the attack. Trying to get valid credentials is part of many styles of attacks and patterns,” says Jay Jacobs, senior analyst with Verizon and co-author of the report. “To go in with an authenticated credential opens a lot more avenues, obviously. You don’t have to compromise every machine. You just log in.”

When reviewing security solutions, remember 2/3 of all security breaches involve stolen credentials.

You can spend a lot of time and effort on attempts to prevent some future NSA quantum computer from reading your email or you can focus on better credential practices and reduce your present security risk by two-thirds (2/3).

If I were advising an enterprise or government agency on security, other than the obligatory hires/expenses to justify the department budget, I know where my first emphasis would be, subject to local special requirements and risks.