Archive for March, 2014

Microsoft Outlook Users Face Zero-Day Attack

Tuesday, March 25th, 2014

Microsoft Outlook Users Face Zero-Day Attack by Mathew J. Schwartz.

From the post:

Simply previewing maliciously crafted RTF documents in Outlook triggers exploit of bug present in Windows and Mac versions of Word, Microsoft warns

There is a new zero-day attack campaign that’s using malicious RTF documents to exploit vulnerable Outlook users on Windows and Mac OS X systems, even if the emailed documents are only previewed.

That warning was sounded Monday by Microsoft, which said that it’s seen “limited, targeted attacks” in the wild that exploit a newly discovered Microsoft Word RTF file format parser flaw, which can be used to corrupt system memory and execute arbitrary attack code.

“An attacker who successfully exploited this vulnerability could gain the same user rights as the current user,” said Microsoft’s security advisory. “If the current user is logged on with administrative user rights, an attacker who successfully exploited this vulnerability could take complete control of an affected system. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights.”

It’s only Snowden Year One (SY1) and with every new zero-day attack that makes the news I wonder: “Did this escape from the NSA?”

The other lesson: Only by building securely can there be any realistic computer security.

One good place to start would be building software that reads (if not also writes) popular office formats securely.

…[S]uffix array construction algorithms

Tuesday, March 25th, 2014

A bioinformatician’s guide to the forefront of suffix array construction algorithms by Anish Man Singh Shrestha, Martin C. Frith, and Paul Horton.


The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

If this doesn’t sound like a real page turner, consider the authors’ concluding paragraph:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

Reminds me that computer science departments need to start offering courses in “string theory,” to capitalize on the popularity of that phrase. 😉
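Joking aside, a suffix array is just the text’s suffix start positions in sorted order. A naive construction and lookup in Python (nowhere near SA-IS’s linear-time induced sorting, but enough to see what the paper is about):

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions, sorted by suffix.

    This is O(n^2 log n) in the worst case; SA-IS gets the same result
    in O(n) via induced sorting, which is the point of the paper.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """All positions where pattern occurs, via binary search on the
    suffix array: suffixes sharing the prefix `pattern` are contiguous."""
    def key(i):
        return text[i:i + len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with key >= pattern
        mid = (lo + hi) // 2
        if key(sa[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # leftmost suffix with key > pattern
        mid = (lo + hi) // 2
        if key(sa[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

With text = "banana" the array is [5, 3, 1, 0, 4, 2], and occurrences(text, sa, "ana") finds [1, 3] without scanning the whole text.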

Shadow DOM

Tuesday, March 25th, 2014

Shadow DOM by Steven Wittens.

From the post:

For a while now I’ve been working on MathBox 2. I want to have an environment where you take a bunch of mathematical legos, bind them to data models, draw them, and modify them interactively at scale. Preferably in a web browser.

Unfortunately HTML is crufty, CSS is annoying and the DOM’s unwieldy. Hence we now have libraries like React. It creates its own virtual DOM just to be able to manipulate the real one—the Agile Bureaucracy design pattern.

The more we can avoid the DOM, the better. But why? And can we fix it?
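The “virtual DOM just to manipulate the real one” idea is simple to sketch. This is the general shape, not React’s actual reconciliation algorithm: represent elements as cheap tuples, diff old against new, and emit only the patches that must touch the real DOM.

```python
def diff(old, new, path=""):
    """Diff two virtual-DOM trees into a list of patches.

    A node is (tag, props, children). This shows the general idea behind
    libraries like React, not React's actual reconciler.
    """
    if old is None:
        return [("create", path, new)]
    if new is None:
        return [("remove", path)]
    if old[0] != new[0]:                 # different tag: replace wholesale
        return [("replace", path, new)]
    patches = []
    if old[1] != new[1]:                 # same tag, changed props
        patches.append(("props", path, new[1]))
    old_kids, new_kids = old[2], new[2]
    for i in range(max(len(old_kids), len(new_kids))):
        o = old_kids[i] if i < len(old_kids) else None
        n = new_kids[i] if i < len(new_kids) else None
        patches.extend(diff(o, n, f"{path}/{i}"))
    return patches
```

Applying the patch list to the real DOM is the only expensive part; computing it stays in cheap data structures, which is why the indirection can pay off.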

One of the better posts on markup that I have read in a very long time.

Also of interest, Steven’s heavy interest in graphics and visualization.

His MathBox project for example.

Codex Sinaiticus Added to Digitised Manuscripts

Tuesday, March 25th, 2014

Codex Sinaiticus Added to Digitised Manuscripts by Julian Harrison.

From the post (I have omitted the images, see the original post for those):

Codex Sinaiticus is one of the great treasures of the British Library. Written in the mid-4th century in the Eastern Mediterranean (possibly at Caesarea), it is one of the two oldest surviving copies of the Greek Bible, along with Codex Vaticanus, in Rome. Written in four narrow columns to the page (aside from in the Poetic books, in two columns), its visual appearance is particularly striking.

The significance of Codex Sinaiticus for the text of the New Testament is incalculable, not least because of the many thousands of corrections made to the manuscript between the 4th and 12th centuries.

The manuscript itself is now distributed between four institutions: the British Library, the Universitäts-Bibliothek at Leipzig, the National Library of Russia in St Petersburg, and the Monastery of St Catherine at Mt Sinai. Several years ago, these four institutions came together to collaborate on the Codex Sinaiticus Project, which resulted in full digital coverage and transcription of all extant parts of the manuscript. The fruits of these labours, along with many additional essays and scholarly resources, can be found on the Codex Sinaiticus website.

The British Library owns the vast majority of Codex Sinaiticus and only the British Library portion is being released as part of the Digitised Manuscripts project.

The world in which biblical scholarship is done has changed radically over the last 20 years.

This effort by the British Library should be applauded and supported.

US Government Content Processing: A Case Study

Monday, March 24th, 2014

US Government Content Processing: A Case Study by Stephen E Arnold.

From the post:

I know that the article “Sinkhole of Bureaucracy” is an example of a single case example. Nevertheless, the write up tickled my funny bone. With fancy technology, and the hyper modern content processing systems used in many Federal agencies, reality is stranger than science fiction.

This passage snagged my attention:

inside the caverns of an old Pennsylvania limestone mine, there are 600 employees of the Office of Personnel Management. Their task is nothing top-secret. It is to process the retirement papers of the government’s own workers. But that system has a spectacular flaw. It still must be done entirely by hand, and almost entirely on paper.

One of President Obama’s advisors is quoted as describing the manual operation as “that crazy cave.”

Further in the post Stephen makes a good point when he suggests that in order to replace this operation you would first have to understand it.

But having said that, holding IT contractors accountable for failure would go a long way towards encouraging such understanding.

So far as I know, there have been no consequences for the IT contractors responsible for the meltdown.

Perhaps that is the first sign of IT management incompetence: no consequences for IT failures.


Cosmology, Computers and the VisIVO package

Monday, March 24th, 2014

Cosmology, Computers and the VisIVO package by Bruce Berriman.

From the post:


See Bruce’s post for details and resources on the VisIVO software package.

When some people talk about “big data,” they mean large amounts of repetitious log data. Big, but not complex.

Other “big data” is not only larger but also more complex. 😉

Google Search Appliance and Libraries

Monday, March 24th, 2014

Using Google Search Appliance (GSA) to Search Digital Library Collections: A Case Study of the INIS Collection Search by Dobrica Savic.

From the post:

In February 2014, I gave a presentation at the conference on Faster, Smarter and Richer: Reshaping the library catalogue (FSR 2014), which was organized by the Associazione Italiana Biblioteche (AIB) and Biblioteca Apostolica Vaticana in Rome, Italy. My presentation focused on the experience of the International Nuclear Information System (INIS) in using Google Search Appliance (GSA) to search digital library collections at the International Atomic Energy Agency (IAEA). 

Libraries are facing many challenges today. In addition to diminished funding and increased user expectations, the use of classic library catalogues is becoming an additional challenge. Library users require fast and easy access to information resources, regardless of whether the format is paper or electronic. Google Search, with its speed and simplicity, has established a new standard for information retrieval which did not exist with previous generations of library search facilities. Put in a position of David versus Goliath, many small, and even larger libraries, are losing the battle to Google, letting many of its users utilize it rather than library catalogues.

The International Nuclear Information System (INIS)

The International Nuclear Information System (INIS) hosts one of the world's largest collections of published information on the peaceful uses of nuclear science and technology. It offers on-line access to a unique collection of 3.6 million bibliographic records and 483,000 full texts of non-conventional (grey) literature. This large digital library collection suffered from most of the well-known shortcomings of the classic library catalogue. Searching was complex and complicated, it required training in Boolean logic, full-text searching was not an option, and response time was slow. An opportune moment to improve the system came with the retirement of the previous catalogue software and the adoption of Google Search Appliance (GSA) as an organization-wide search engine standard.

To be completely honest, my first reaction wasn’t a favorable one.

But even the complete blog post does not do justice to the project in question.

Take a look at the slides, which include screen shots of the new interface before reaching an opinion.

Take this as a lesson on what your search interface should be offering by default.

There are always other screens you can fill with advanced features.

Understanding Clojure’s Persistent Vectors, pt. 1

Monday, March 24th, 2014

Understanding Clojure’s Persistent Vectors, pt. 1 by Jean Niklas L’orange.

From the post:

You may or may not have heard about Clojure’s persistent vectors. It is a data structure invented by Rich Hickey (influenced by Phil Bagwell’s paper on Ideal Hash Trees) for Clojure, which gives practically O(1) runtime for insert, update, lookups and subvec. As they are persistent, every modification creates a new vector instead of changing the old one.

So, how do they work? I’ll try to explain them through a series of blogposts, in which we look at manageable parts each time. It will be a detailed explanation, with all the different oddities around the implementation as well. Consequently, this blog series may not be the perfect fit for people who want a “summary” on how persistent vectors work.

For today, we’ll have a look at a naive first attempt, and will cover updates, insertion and popping (removal at the end).

Note that this blogpost does not represent how PersistentVector is implemented: There are some speed optimizations, solutions for transients and other details which we will cover later. However, this serves as a basis for understanding how they work, and the general idea behind the vector implementation.
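The “naive first attempt” the series starts from can be sketched in a few lines of Python (a deliberately inefficient stand-in for the trie the posts build up to): every operation returns a new vector and leaves old versions intact, at the cost of an O(n) copy.

```python
class NaiveVec:
    """Persistent vector, naive version: full copy per operation.

    Old versions survive every update; the trie Clojure actually uses
    gives the same guarantee with structural sharing instead of copying.
    """

    def __init__(self, items=()):
        self._items = tuple(items)

    def conj(self, x):                  # append at the end
        return NaiveVec(self._items + (x,))

    def assoc(self, i, x):              # replace element i
        t = self._items
        return NaiveVec(t[:i] + (x,) + t[i + 1:])

    def pop(self):                      # remove the last element
        return NaiveVec(self._items[:-1])

    def nth(self, i):
        return self._items[i]

    def __len__(self):
        return len(self._items)
```

Note that `v2 = v1.assoc(0, 9)` leaves `v1` observably unchanged, which is the whole point of persistence.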

The sort of post that makes you start wondering: why don’t we have a persistent data model for XTM-based topic maps?

With persistence, we get to drop creating new identifiers on merges, creating sets of identifiers, and determining whether sets of identifiers intersect, to say nothing of gaining persistent identifiers for interchange of data with other topic maps. A topic’s identifier is its identifier today, tomorrow, and at any time to which it is persisted.

To say nothing of having an audit trail for additions/deletions plus “merges.”

While you are considering those possibilities, see: Understanding Clojure’s Persistent Vectors, pt. 2

The GATE Crowdsourcing Plugin:…

Monday, March 24th, 2014

The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy by Kalina Bontcheva, Ian Roberts, Leon Derczynski, and Dominic Rout.


Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowd-sourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowd-sourcing interfaces for NLP classification and selection tasks. The entire work-flow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.

From the introduction:

A big outstanding challenge for crowdsourcing projects is that the cost to define a single annotation task remains quite substantial. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units, as well as automatically generated, reusable user interfaces [1] for NLP classification and selection tasks. Their use will be demonstrated on annotating named entities (selection task), disambiguating words and named entities with respect to DBpedia URIs (classification task), annotation of opinion holders and targets (selection task), as well as sentiment (classification task).


Are the difficulties associated with annotation UIs a matter of creating the UI or the choices that underlie the UI?

This plugin may shed light on possible answers to that question.

How to Quickly Add Nodes and Edges…

Sunday, March 23rd, 2014

How to Quickly Add Nodes and Edges to Graphs

From the webpage:

The existing interfaces for graph manipulation all suffer from the same problem: it’s very difficult to quickly enter the nodes and edges. One has to create a node, then another node, then make an edge between them. This takes a long time and is cumbersome. Besides, such an approach is not really as fast as our thinking is.

We, at Nodus Labs, decided to tackle this problem using what we already do well: #hashtagging the @mentions. The basic idea is that you create the nodes and edges in something that we call a “statement”. Within this #statement you can mark the #concepts with #hashtags, which will become nodes and then mark the @contexts or @lists where you want them to appear with @mentions. This way you can create huge graphs in a matter of seconds and if you do not believe us, watch this screencast of our application below.

You can also try it online on or even install it on your local machine using our free open-source repository on

+1! for using “…what we already do well….” for an authoring interface.

Getting any ideas for a topic map authoring interface?
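The parsing side of that idea is easy to prototype. A sketch (my own guess at the linking rule, not Nodus Labs’ code): pull the #hashtags and @mentions out of a statement and connect every concept to every context.

```python
import re


def parse_statement(statement):
    """One statement -> (concept nodes, context nodes, edges).

    Linking rule assumed here: every #concept in a statement is
    attached to every @context mentioned in the same statement.
    """
    concepts = re.findall(r"#(\w+)", statement)
    contexts = re.findall(r"@(\w+)", statement)
    edges = [(c, ctx) for c in concepts for ctx in contexts]
    return concepts, contexts, edges
```

Typing one line of prose per statement and letting the parser emit nodes and edges is exactly the “use what we already do well” move.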


Sunday, March 23rd, 2014


From the about page:

The DARPA Open Catalog is a list of DARPA-sponsored open source software products and related publications. Each resource link shown on this site links back to information about each project, including links to the code repository and software license information.

This site reorganizes the resources of the Open Catalog (specifically the XDATA program) in a way that is easily sortable based on language, project or team. More information about XDATA’s open source software toolkits and peer-reviewed publications can be found on the DARPA Open Catalog, located at

For more information about this site, e-mail us at

A great public service for anyone interested in DARPA XDATA projects.

You could view this as encouragement to donate time to government hackathons.

I disagree.

Donating services to an organization that pays for IT and then accepts crap results encourages poor IT management.

New Book on Data and Power

Sunday, March 23rd, 2014

New Book on Data and Power by Bruce Schneier.

From the post:

I’m writing a new book, with the tentative title of Data and Power.

While it’s obvious that the proliferation of data affects power, it’s less clear how it does so. Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear.

Understanding these trends involves understanding data. Data is generated by all computing processes. Most of it used to be thrown away, but declines in the prices of both storage and processing mean that more and more of it is now saved and used. Who saves the data, and how they use it, is a matter of extreme consequence, and will continue to be for the coming decades.

Data and Power examines these trends and more. The book looks at the proliferation and accessibility of data, and how it has enabled constant surveillance of our entire society. It examines how governments and corporations use that surveillance data, as well as how they control data for censorship and propaganda. The book then explores how data has empowered individuals and less-traditional power blocs, and how the interplay among all of these types of power will evolve in the future. It discusses technical controls on power, and the limitations of those controls. And finally, the book describes solutions to balance power in the future — both general principles for society as a whole, and specific near-term changes in technology, business, laws, and social norms.

Bruce says a table of contents should appear in “a couple of months” and he is going to be asking “for volunteers to read and comment on a draft version.”

I assume from the description that Bruce is going to try to connect a fairly large number of dots.

Such as who benefits from the Code of Federal Regulations (CFRs) not having an index? The elimination of easier access to the CFRs is a power move. Someone with a great deal of power wants to eliminate the chance of someone gaining power from following information in the CFRs.

I am not a conspiracy theorist but there are only two classes of people in any society, people with more power than you and people with less. Every sentient person wants to have more and no one will voluntarily take less. Among chickens they call it the “pecking order.”

In human society, the “pecking order” is enforced by uncoordinated and largely unconscious following of cultural norms. No conspiracy, just the way we are. But there are cases, the CFR indexes being one of them, where someone is clearly trying to disadvantage others. Who and for what reasons remains unknown.

Data enhancing the Royal Society of…

Sunday, March 23rd, 2014

Data enhancing the Royal Society of Chemistry publication archive by Antony Williams.


The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry related data – compounds, reactions, property data, spectral data etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion and chemical validation and standardization approaches. The outcome of this project will result in new chemistry related data being added to our chemical and reaction databases and in the ability to more tightly couple web-based versions of the articles with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.

The data mining Antony details on the Royal Society of Chemistry is impressive!

But as Antony notes at slide #30, it isn’t a long term solution:

We should NOT be mining data out of future publications (emphasis added)

I would say the same thing for metadata/subject identities in data. For some data and some subjects, we can, after the fact, reconstruct properties to identify the subjects they represent.

Data/text mining would be more accurate and easier if subjects were identified at the time of authoring. Perhaps even automatically or at least subject to a user’s approval.

More accurate than researchers removed from an author by time, distance and even profession, trying to guess what subject an author may have meant.

Better semantic authoring support now, will reduce the cost and improve the accuracy of data mining in the future.

Quickly create a 100k Neo4j graph data model…

Sunday, March 23rd, 2014

Quickly create a 100k Neo4j graph data model with Cypher only by Michael Hunger.

From the post:

We want to run some test queries on an existing graph model but have no sample data at hand and also no input files (CSV,GraphML) that would provide it.

Why not create it quickly on our own just using Cypher? First I thought about using Cypher to generate CSV files and loading them back, but generating it directly is much easier.

The domain is simple (:User)-[:OWN]->(:Product) but good enough for collaborative filtering or demographic analysis.

Admittedly a “simple” domain, but I’m curious: how would you rank sample data?

We can all probably recognize “simple” domains but what criteria should we use to rank more complex sample data?
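As a point of comparison, here is what generating that (:User)-[:OWN]->(:Product) sample set looks like outside Cypher, in plain Python (the sizes and the ownership rule are illustrative, not Michael’s numbers):

```python
import random


def sample_graph(n_users=1000, n_products=100, owns_per_user=3, seed=42):
    """Generate an edge list for (:User)-[:OWN]->(:Product).

    Each user is given up to `owns_per_user` random products; duplicate
    (user, product) pairs collapse, much as a MERGE would in Cypher.
    """
    rng = random.Random(seed)
    edges = set()
    for user in range(n_users):
        for _ in range(owns_per_user):
            edges.add((user, rng.randrange(n_products)))
    return sorted(edges)
```

Good enough to feed a toy collaborative-filtering query, and seeded so test runs are repeatable.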


Use Parquet with Impala, Hive, Pig, and MapReduce

Saturday, March 22nd, 2014

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce by John Russell.

From the post:

The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
  • Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
  • Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.

Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.

That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.

In this blog post, you will learn the most important principles involved.

Since I mentioned ROOT files yesterday, I am curious: what do you make of the use of Thrift metadata definitions to read Parquet files?

It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.

A wide variety of products can read data; I am less certain they can document data semantics.
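The first two bullets (columnar layout and compression-friendly encodings) are easy to see in miniature. A toy sketch, not Parquet’s actual encodings, which also include dictionary encoding and bit-packing:

```python
def to_columns(rows, fields):
    """Pivot row-major records into column-major lists, Parquet's layout:
    a query touching one column reads one list, not every record."""
    return {f: [row[f] for row in rows] for f in fields}


def rle_encode(column):
    """Run-length encode a column. Repetitive columns (sorted keys,
    low-cardinality values) collapse dramatically; a toy stand-in for
    Parquet's real encoding schemes."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs
```

The point of the layout shows up even here: to scan one field you touch one list, and a repetitive column compresses to a handful of runs.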


I first saw this in a tweet by Patrick Hunt.

Opening data: Have you checked your pipes?

Saturday, March 22nd, 2014

Opening data: Have you checked your pipes? by Bob Lannon.

From the post:

Code for America alum Dave Guarino had a post recently entitled “ETL for America”. In it, he highlights something that open data practitioners face with every new project: the problem of Extracting data from old databases, Transforming it to suit a new application or analysis and Loading it into the new datastore that will support that new application or analysis. Almost every technical project (and every idea for one) has this process as an initial cost. This cost is so pervasive that it’s rarely discussed by anyone except for the wretched “data plumber” (Dave’s term) who has no choice but to figure out how to move the important resources from one place to another.

Why aren’t we talking about it?

The up-front costs of ETL don’t come up very often in the open data and civic hacking community. At hackathons, in funding pitches, and in our definitions of success, we tend to focus on outputs (apps, APIs, visualizations) and treat the data preparation as a collateral task, unavoidable and necessary but not worth “getting into the weeds” about. Quoting Dave:

The fact that I can go months hearing about “open data” without a single
mention of ETL is a problem. ETL is the pipes of your house: it’s how you
open data.

It’s difficult to point to evidence that this is really the case, but I personally share Dave’s experience. To me, it’s still the elephant in the room during the proceedings of any given hackathon or open data challenge. I worry that the open data community is somehow under the false impression that, eventually in the sunny future, data will be released in a more clean way and that this cost will decrease over time.

It won’t. Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. Regardless of how easy it is to read, the data published by government probably wasn’t prepared with your new app idea in mind.

Data transformation will always be necessary, and it’s worth considering apart from the development of the next cool interface. It’s a permanent cost of developing new things in our space, so why aren’t we putting more resources toward addressing it as a problem in its own right? Why not spend some quality time (and money) focused on data preparation itself, and then let a thousand apps bloom?

If you only take away this line:

Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. (emphasis added)

From Bob’s entire post, reading it has been time well spent.

Your “clean data” will at times be my “dirty data” and vice versa.

Documenting the semantics we “see” in data, the semantics that drive our transformations into “clean” data, stands a chance of helping the next person in line to use that data.

Think of it as an accumulation of experience with a data sets and the results obtained from it.

Or you can just “wing it” with every data set you encounter, and so shall we all.

Your call.
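For anyone who has never met the “pipes,” a minimal ETL run fits in a few lines of Python (the field names and cleanup rules here are invented for illustration):

```python
import csv
import io
import sqlite3


def etl(csv_text, conn):
    """Extract a raw CSV export, Transform it for our use, Load it
    into the datastore the new application will query."""
    # Extract: parse whatever the source system handed us.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Transform: the cleanup the publisher didn't do for *our* needs.
    for row in rows:
        row["name"] = row["name"].strip().title()
        row["amount"] = float(row["amount"])
    # Load: land it where the new application can reach it.
    conn.execute("CREATE TABLE IF NOT EXISTS grants (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO grants VALUES (:name, :amount)", rows)
    return len(rows)
```

Every project re-derives some version of this; Bob’s point is that the Transform step never goes away, because “clean” is relative to the consumer.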

I first saw this in a tweet by Dave Guarino.

Working Drafts available in EPUB3

Saturday, March 22nd, 2014

Working Drafts available in EPUB3 by Ivan Herman.

From the post:

As reported elsewhere, the Digital Publishing Interest Group has published its first two public Working Drafts. Beyond the content of those documents, the publication has another aspect worth mentioning. For the first time, “alternate” versions of the two documents have been published, alongside the canonical HTML versions, in EPUB3 format. Because EPUB3 is based on the Open Web Platform, it is a much more faithful alternative to the original content than, for example, a PDF version (which has also been used, from time to time, as alternate versions of W3C documents). The EPUB3 versions (of the “Requirements for Latin Text Layout and Pagination” and the “Annotation Use Cases” Drafts, both linked from the respective documents’ front matter) can be used, for example, for off-line reading, relying on different EPUB readers, available either as standalone applications or as browser extensions.

(The EPUB3 versions were produced using a Python program, also available on github.)
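For the curious, the container format itself is pleasantly simple. A sketch of the OCF shell every EPUB3 starts from (the real W3C tooling adds the package document and content, of course): a ZIP whose first entry is an uncompressed `mimetype` file, plus `META-INF/container.xml` pointing at the package.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def epub_shell_bytes():
    """Build the OCF shell of an EPUB3 in memory: `mimetype` must be the
    first entry and stored uncompressed so readers can sniff it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER_XML,
                   compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def check_shell(data):
    """Return (first entry name, mimetype bytes) for a candidate EPUB."""
    z = zipfile.ZipFile(io.BytesIO(data))
    return z.namelist()[0], z.read("mimetype")
```

The rest of an EPUB3 is Open Web Platform content inside that shell, which is why it travels so much better than PDF.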

Interesting work but also a reminder that digital formats will continue to evolve as long as they are used.

How well will your metadata transfer to a new system or application?

Or are you suffering from vendor lock?

Possible Elimination of FR and CFR indexes (Pls Read, Forward, Act)

Saturday, March 22nd, 2014

Possible Elimination of FR and CFR indexes

I don’t think I have ever posted with (Pls Read, Forward, Act) in the headline, but this merits it.

From the post:

Please see the following message from Emily Feltren, Director of Government Relations for AALL, and contact her if you have any examples to share.

Hi Advocates—

Last week, the House Oversight and Government Reform Committee reported out the Federal Register Modernization Act (HR 4195). The bill, introduced the night before the mark up, changes the requirement to print the Federal Register and Code of Federal Regulations to “publish” them, eliminates the statutory requirement that the CFR be printed and bound, and eliminates the requirement to produce an index to the Federal Register and CFR. The Administrative Committee of the Federal Register governs how the FR and CFR are published and distributed to the public, and will continue to do so.

While the entire bill is troubling, I most urgently need examples of why the Federal Register and CFR indexes are useful and how you use them. Stories in the next week would be of the most benefit, but later examples will help, too. I already have a few excellent examples from our Print Usage Resource Log – thanks to all of you who submitted entries! But the more cases I can point to, the better.

Interestingly, the Office of the Federal Register itself touted the usefulness of its index when it announced the retooled index last year.

Thanks in advance for your help!

Emily Feltren
Director of Government Relations

American Association of Law Libraries

25 Massachusetts Avenue, NW, Suite 500

Washington, D.C. 20001


This is seriously bad news so I decided to look up the details.

Federal Register

Title 44, Section 1504 Federal Register, currently reads in part:

Documents required or authorized to be published by section 1505 of this title shall be printed and distributed immediately by the Government Printing Office in a serial publication designated the “Federal Register.” The Public Printer shall make available the facilities of the Government Printing Office for the prompt printing and distribution of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. The contents of the daily issues shall be indexed and shall comprise all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of distribution fixed by regulations under this chapter. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

The Public Printer shall make available the facilities of the Government Printing Office for the prompt publication of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. (Missing index language here.) The contents of the daily issues shall constitute all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of publication fixed by regulations under this chapter.

Code of Federal Regulations (CFRs)

Title 44, Section 1510 Code of Federal Regulations, currently reads in part:

(b) A codification published under subsection (a) of this section shall be printed and bound in permanent form and shall be designated as the ”Code of Federal Regulations.” The Administrative Committee shall regulate the binding of the printed codifications into separate books with a view to practical usefulness and economical manufacture. Each book shall contain an explanation of its coverage and other aids to users that the Administrative Committee may require. A general index to the entire Code of Federal Regulations shall be separately printed and bound. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

(b) Code of Federal Regulations.–A codification prepared under subsection (a) of this section shall be published and shall be designated as the `Code of Federal Regulations’. The Administrative Committee shall regulate the manner and forms of publishing this codification. (Missing index language here.)

I would say that indexes for the Federal Register and the Code of Federal Regulations are history should this bill pass as written.

Is this a problem?

Consider the task of tracking the number of pages in the Federal Register versus the pages in the Code of Federal Regulations that may be impacted:

Federal Register – > 70,000 pages per year.

The page count for final general and permanent rules in the 50-title CFR seems less dramatic than that of the oft-cited Federal Register, which now tops 70,000 pages each year (it stood at 79,311 pages at year-end 2013, the fourth-highest level ever). The Federal Register contains lots of material besides final rules. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

Code of Federal Regulations – 175,496 pages (2013) plus 1,170 page index.

Now, new data from the National Archives shows that the CFR stands at 175,496 at year-end 2013, including the 1,170-page index. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

The bottom line: 175,496 pages of regulations are impacted by more than 70,000 pages per year of Federal Register material, published every weekday.

We don’t need indexes to access that material?

Congress, I don’t think “access” means what you think it means.

PS: As a research guide, you are unlikely to do better than: A Research Guide to the Federal Register and the Code of Federal Regulations by Richard J. McKinney at the Law Librarians’ Society of Washington, DC website.

I first saw this in a tweet by Aaron Kirschenfeld.

Building a Language for Spreadsheet Refactoring

Saturday, March 22nd, 2014

Building a Language for Spreadsheet Refactoring by Felienne Hermans.


Felienne Hermans introduces BumbleBee, a refactoring and metaprogramming spreadsheets tool based on a DSL that can perform transformations against spreadsheet formulas.

Argues that spreadsheets are code, rather convincingly. (Answer to the everybody must code argument?)

Uses code smell based analysis.

Has authored a refactoring tool for Excel.

Covers functional programming in F# to create the refactoring for Excel.
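
BumbleBee itself is written in F# with its own transformation DSL, but the core idea of rule-based formula rewriting is easy to sketch. Below is a hypothetical Python illustration (not BumbleBee’s actual rules or syntax) of one classic transformation: collapsing an explicit SUM over consecutive cells into a range.

```python
import re

def collapse_sum(formula):
    """Rewrite SUM(A1,A2,A3) into SUM(A1:A3) when the arguments are
    consecutive cells in a single column. A toy sketch of rule-based
    formula refactoring, not BumbleBee's real transformation language."""
    m = re.fullmatch(r"SUM\(([^)]*)\)", formula.replace(" ", ""))
    if not m:
        return formula
    parsed = [re.fullmatch(r"([A-Z]+)(\d+)", c) for c in m.group(1).split(",")]
    if not all(parsed):
        return formula
    cols = [p.group(1) for p in parsed]
    rows = [int(p.group(2)) for p in parsed]
    # Only rewrite when there are multiple consecutive rows in one column.
    if len(rows) > 1 and len(set(cols)) == 1 and \
            rows == list(range(rows[0], rows[0] + len(rows))):
        return f"SUM({cols[0]}{rows[0]}:{cols[0]}{rows[-1]})"
    return formula
```

A real tool would of course parse formulas properly and detect the “smell” first; the point is only that spreadsheet formulas are code and can be refactored like code.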

Analyzing and Visualizing Spreadsheets is Felienne’s dissertation.


Spreadsheets are used extensively in industry: they are the number one tool for financial analysis and are also prevalent in other domains, such as logistics and planning. Their flexibility and immediate feedback make them easy to use for non-programmers.

But as easy as spreadsheets are to build, so difficult can they be to analyze and adapt. This dissertation aims at developing methods to support spreadsheet users to understand, update and improve spreadsheets. We took our inspiration for such methods from software engineering, as this field is specialized in the analysis of data and calculations. In this dissertation, we have looked at four different aspects of spreadsheets: metadata, structure, formulas and data. We found that methods from software engineering can be applied to spreadsheets very well, and that these methods support end-users in working with spreadsheets.

If you agree that spreadsheets are programming, how often do you think user/programmers are capturing the semantics of their spreadsheets?

That’s what I thought as well.

PS: Felienne’s website; Twitter: @felienne

Institute of Historical Research (Podcasts)

Saturday, March 22nd, 2014

Institute of Historical Research (Podcasts)

From the webpage:

Since 2009 the IHR has produced over 500 podcasts, encompassing not only its acclaimed and unique seminar series, but also one-off talks and conferences. All of these recordings are freely available here to stream or download, and can be searched, or browsed by date, event, or subject. In many cases abstracts and other material accompanying the talks can also be found.

These recordings, particularly those taken from seminars where historians are showcasing their current research, provide a great opportunity to listen to experts in all fields of history discuss their work in progress. If you have any questions relating to the podcasts found here, please contact us.

I don’t know what you like writing topic maps about but I suspect you can find some audio podcast resources here.

Disappointed that “ancient” history has so few entries, but more recent history, from the 16th century onward, has much better coverage.

The offerings range from the expected:

Goethe’s Erotic Poetry and the Libertine Spectre

Big Flame 1970-1984. A history of a revolutionary socialist organisation

to the obscure:

Chinese and British Gift Giving in the Macartney Embassy of 1793

Learning from the Experience of a Town in Peru’s Central Andes, 1931-1948

Makes me wonder if there is linked data that cover the subjects in these podcasts?

Illustrates one problem with “universal” solutions: it is fairly trivial to cover all the “facts” in Wikipedia, but those are a small portion of all available facts. Useful, but still a small set.


ROOT Files

Friday, March 21st, 2014

ROOT Files

From the webpage:

Today, a huge amount of data is stored into files present on our PC and on the Internet. To achieve the maximum compression, binary formats are used, hence they cannot simply be opened with a text editor to fetch their content. Rather, one needs to use a program to decode the binary files. Quite often, the very same program is used both to save and to fetch the data from those files, but it is also possible (and advisable) that other programs are able to do the same. This happens when the binary format is public and well documented, but may happen also with proprietary formats that became a standard de facto. One of the most important problems of the information era is that programs evolve very rapidly, and may also disappear, so that it is not always trivial to correctly decode a binary file. This is often the case for old files written in binary formats that are not publicly documented, and is a really serious risk for the formats implemented in custom applications.

As a solution to these issues ROOT provides a file format that is a machine-independent compressed binary format, including both the data and its description, and provides an open-source automated tool to generate the data description (or “dictionary“) when saving data, and to generate C++ classes corresponding to this description when reading back the data. The dictionary is used to build and load the C++ code to load the binary objects saved in the ROOT file and to store them into instances of the automatically generated C++ classes.

ROOT files can be structured into “directories“, exactly in the same way as your operative system organizes the files into folders. ROOT directories may contain other directories, so that a ROOT file is more similar to a file system than to an ordinary file.

Amit Kapadia mentions ROOT files in his presentation at CERN on citizen science.

I have only just begun to read the documentation but wanted to pass this starting place along to you.

I don’t find the “machine-independent compressed binary format” argument all that convincing but apparently it has in fact worked for quite some time.

Of particular interest will be the data dictionary aspects of ROOT.
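
To make the “data plus its description” idea concrete, here is a toy Python sketch (nothing to do with ROOT’s actual binary layout) that stores a JSON “dictionary” describing the record fields alongside the packed data, so a reader that has never seen the writing program can still decode the file.

```python
import json
import struct

# A toy self-describing binary format: a JSON description of the record
# layout travels inside the file with the data. Illustrates the idea
# only; ROOT's real format and C++ dictionary machinery are far richer.

def write(path, dictionary, records):
    fmt = "<" + "".join(f["code"] for f in dictionary["fields"])
    desc = json.dumps(dictionary).encode("utf-8")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<I", len(desc)))  # length of the description
        fh.write(desc)                          # the description itself
        for rec in records:
            fh.write(struct.pack(fmt, *rec))

def read(path):
    with open(path, "rb") as fh:
        (n,) = struct.unpack("<I", fh.read(4))
        dictionary = json.loads(fh.read(n))
        fmt = "<" + "".join(f["code"] for f in dictionary["fields"])
        size = struct.calcsize(fmt)
        records = []
        while chunk := fh.read(size):
            records.append(struct.unpack(fmt, chunk))
    return dictionary, records
```

The reader never needs the writer’s source code, only the convention for reading the embedded description — which is the property that has kept ROOT files decodable for so long.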

Other data and description capturing file formats?

Citizen Science and the Modern Web…

Friday, March 21st, 2014

Citizen Science and the Modern Web – Talk by Amit Kapadia by Bruce Berriman.

From the post:

Amit Kapadia gave this excellent talk at CERN on Citizen Science and The Modern Web. From Amit’s abstract: “Beginning as a research project to help scientists communicate, the Web has transformed into a ubiquitous medium. As the sciences continue to transform, new techniques are needed to analyze the vast amounts of data being produced by large experiments. The advent of the Sloan Digital Sky Survey increased throughput of astronomical data, giving rise to Citizen Science projects such as Galaxy Zoo. The Web is no longer exclusively used by researchers, but rather, a place where anyone can share information, or even, partake in citizen science projects.

As the Web continues to evolve, new and open technologies enable web applications to become more sophisticated. Scientific toolsets may now target the Web as a platform, opening an application to a wider audience, and potentially citizen scientists. With the latest browser technologies, scientific data may be consumed and visualized, opening the browser as a new platform for scientific analysis.”

Bruce points to the original presentation here.

The emphasis is on astronomy but many good points on citizen science.

I am curious whether citizen involvement in the sciences and humanities could lead to greater awareness of, and support for, both.

Elasticsearch: The Definitive Guide

Friday, March 21st, 2014

Elasticsearch: The Definitive Guide (Draft)

From the Preface, who should read this book:

This book is for anybody who wants to put their data to work. It doesn’t matter whether you are starting a new project and have the flexibility to design the system from the ground up, or whether you need to give new life to a legacy system. Elasticsearch will help you to solve existing problems and open the way to new features that you haven’t yet considered.

This book is suitable for novices and experienced users alike. We expect you to have some programming background and, although not required, it would help to have used SQL and a relational database. We explain concepts from first principles, helping novices to gain a sure footing in the complex world of search.

The reader with a search background will also benefit from this book. Elasticsearch is a new technology which has some familiar concepts. The more experienced user will gain an understanding of how those concepts have been implemented and how they interact in the context of Elasticsearch. Even in the early chapters, there are nuggets of information that will be useful to the more advanced user.

Finally, maybe you are in DevOps. While the other departments are stuffing data into Elasticsearch as fast as they can, you’re the one charged with stopping their servers from bursting into flames. Elasticsearch scales effortlessly, as long as your users play within the rules. You need to know how to setup a stable cluster before going into production, then be able to recognise the warning signs at 3am in the morning in order to prevent catastrophe. The earlier chapters may be of less interest to you but the last part of the book is essential reading — all you need to know to avoid meltdown.

I fully understand the need, nay, compulsion for an author to say that everyone who is literate needs to read their book. And, if you are not literate, their book is a compelling reason to become literate! 😉

As the author of a book (two editions) and more than one standard, I can assure you an author’s need to reach everyone serves no one very well.

Potential readers range from novices to intermediate users to experts.

A book that targets all three will “waste” space on material already known to experts but not to novices and/or intermediate users.

At the same time, space in a physical book being limited, some material relevant to the expert will be left out altogether.

I had that experience quite recently when the details of LukeRequestHandler (Solr) were described as:

Reports meta-information about a Solr index, including information about the number of terms, which fields are used, top terms in the index, and distributions of terms across the index. You may also request information on a per-document basis.

That’s it. Out of more than 600 pages of text, that is all the information you will find on the LukeRequestHandler.

Fortunately I did find:

I don’t fault the author because several entire books could be written with the material they left out.

That is the hardest part of authoring, knowing what to leave out.
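
Since the LukeRequestHandler got only one sentence, here is a hedged sketch of querying it yourself with the Python standard library. The handler lives at a core’s `admin/luke` endpoint; the host and core name below are assumptions, so substitute your own deployment’s values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed if you actually issue the call

def luke_url(host="http://localhost:8983", core="collection1",
             numTerms=10, fl=None, doc_id=None):
    """Build a LukeRequestHandler URL for a Solr core.

    numTerms : how many top terms to report per field
    fl       : restrict the report to specific fields
    doc_id   : ask about a single document instead of the whole index
    """
    params = {"wt": "json", "numTerms": numTerms}
    if fl:
        params["fl"] = ",".join(fl)
    if doc_id is not None:
        params["docId"] = doc_id
    return f"{host}/solr/{core}/admin/luke?{urlencode(params)}"

# Against a running Solr instance you could then do:
# import json
# report = json.load(urlopen(luke_url(core="mycore")))
```

The JSON response reports field usage, term counts and top terms — the index meta-information the one-sentence description alludes to.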

PS: Having said all that, I am looking forward to reading Elasticsearch: The Definitive Guide as it develops.

Linux Performance Analysis and Tools:…

Friday, March 21st, 2014

Linux Performance Analysis and Tools: Brendan Gregg’s Talk at SCaLE 11x by Deirdré Straughan.

From the post:

The talk is about Linux Performance Analysis and Tools: specifically, observability tools and the methodologies to use them. Brendan gave a quick tour of over 20 Linux performance analysis tools, including advanced perf and DTrace for Linux, showing the reasons for using them. He also covered key methodologies, including a summary of the USE Method, to demonstrate best practices in using them effectively. There are many areas of the system that people don’t know to check, which are worthwhile to at least know about, especially for serious performance issues where you don’t want to leave any stone unturned. These methodologies – and exposure to the toolset – will help you understand why and how to do this. Brendan also introduced a few new of these key performance tuning tools during his presentation.

Be sure to watch the recorded presentation. You will also find this very cool graphic by Brendan Gregg.

[Graphic: Linux Performance Analysis and Tools]

It’s suitable for printing and hanging on the wall as a quick reference.

No doubt you recognize some of these commands, but how many switches can you name for each one, and how, if at all, does the information from one relate to that from another?
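
Much of what those tools report comes straight from /proc. As a small taste of the “utilization” leg of Gregg’s USE Method, here is a hedged Python sketch that computes CPU busy time from two `/proc/stat` “cpu” lines (field layout per proc(5); written as a pure function so it works on any two samples, not just a live system).

```python
def cpu_utilization(stat_line_a, stat_line_b):
    """Percent of CPU time spent non-idle between two /proc/stat 'cpu'
    lines (columns: user nice system idle iowait irq softirq ...).
    Idle time here is the idle + iowait columns, a common convention."""
    def totals(line):
        fields = [int(x) for x in line.split()[1:]]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        return sum(fields), idle
    total_a, idle_a = totals(stat_line_a)
    total_b, idle_b = totals(stat_line_b)
    dt = total_b - total_a
    return 100.0 * (dt - (idle_b - idle_a)) / dt if dt else 0.0

# On a live Linux system you would sample twice:
# with open("/proc/stat") as f: first = f.readline()
# ...sleep a second, read again, then compare the two lines.
```

Tools like mpstat and vmstat are doing essentially this arithmetic for you, with many more columns.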

I first saw this in a tweet by nixCraft Linux Blog.

D3.js, Three.js and CSS 3D Transforms

Friday, March 21st, 2014

D3.js, Three.js and CSS 3D Transforms by Steve Hall.

From the post:

This week I have been having some fun thinking about how you could use D3.js and Three.js together to do some data visualization work. We’ll have to put this one in the experimental column since there is a lot more work to be done, but I was pretty pleased with the results and thought I would blog about what I have done up to this point. While there are plenty of dramatic examples of three.js used to generate 3D globes with lines shooting everywhere, I was interested in a more subtle approach to complement work in D3. I would be curious to hear about other experiments going on along the same lines. A Google search didn’t turn up much.

The following example is using D3 to generate HTML elements and SVG charts and also to store coordinate information for transitions inside data properties. The objects created using D3 are then passed into a three.js scene and animated using CSS 3D transforms (no WebGL here, this is pure DOM).

You really need to run the full demo on a large, high-res monitor.

Wicked cool!

Definitely raises the bar for data visualization!

The only downside is that you will now be expected to find clever 3D ways to visualize data, which is far more complicated than building the visualization itself.

Clojure Cookbook has arrived!

Thursday, March 20th, 2014

Clojure Cookbook has arrived


From the webpage:

Over the past year, the Clojure community came together to write a wondrous tome chock full of their collective knowledge. At over 70 contributors, 1600 commits and nearly 200 recipes, this is something special, folks.

Clojure’s very own crowd-sourced cookbook, Clojure Cookbook, is available now.

The authors are accepting pull requests at:

Must be already planning on a second edition! 😉

Don’t be shy!


iOS Reverse Engineering Toolkit

Thursday, March 20th, 2014

Introducing the iOS Reverse Engineering Toolkit by Stephen Jensen.

From the post:

It should be the goal of every worker to expend less time and energy to achieve a task, while still maintaining, or even increasing, productivity. As an iOS penetration tester, I find myself repeating the same manual tasks for each test. Typing out the same commands to run various tools that are required to help me do my job. And to be honest, it’s completely monotonous. Every time I fat-finger a key, I lose productivity, forcing me to expend more time and energy to achieve the task. I’m a fan of automation. I’m a fan of streamlined innovation that saves me time and still accomplishes, for the most part, the same results. It was this desire to save time, and reduce my likelihood of suffering from carpal tunnel, that I created the iOS Reverse Engineering Toolkit.

It’s close enough to the weekend to start looking for interesting diversions.

Does anybody know if NSA staff use iPhones or not? 😉

They can hardly complain about the ethics of surveillance. Yes?

Nosy Americans?

Thursday, March 20th, 2014

You have heard the phrase “ugly Americans,” but have you heard “nosy Americans?”

Lee Munson reports in NSA can record 100% of another country’s telephone calls that:

The National Security Agency (NSA) has the ability to record every single one of a foreign country’s telephone calls and then play the conversations back up to a month after recording, according to a report by The Washington Post.

The NSA program, which begun in 2009, is known as MYSTIC.

MYSTIC, according to the Post, is used to intercept conversations in just one (undisclosed) country, but planning documents show that the NSA intends to use the system in other countries in the future.

The really sad part is when you read:

The Washington Post says, at the request of US officials, it will not reveal the country in question, or any other nation where the system has been planned to be put to use. It is, however, quite likely that calls made to or from that nation will include American citizens.

I must have missed the ballot when United States citizens elected the Washington Post to decide on our behalf what facts we need to hear, and those we don’t.

And I don’t have a lot of sympathy for the argument that surveillance may include American citizens.

It’s an easy argument to make constitutionally, but if you denigrate the rights of others based on citizenship, the soccer fields aren’t as far away as you think. (But they will be in non-U.S. territory.)


PLUS

Thursday, March 20th, 2014

PLUS

From the webpage:

PLUS is a system for capturing and managing provenance information, originally created at the MITRE Corporation.

Data provenance is “information that helps determine the derivation history of a data product…[It includes] the ancestral data product(s) from which this data product evolved, and the process of transformation of these ancestral data product(s).”

Uses Neo4j for storage.

Includes an academic bibliography of related papers.

Provenance answers the questions: Where has your data been? What has happened to your data, and with whom?
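
That ancestry question maps naturally onto a graph, which is presumably why PLUS uses Neo4j. A minimal Python sketch (independent of PLUS’s actual data model) of recording derivation edges and answering “where did this data product come from?”:

```python
from collections import defaultdict

class Provenance:
    """A toy provenance store: each derived product records the products
    it came from and the transformation applied. A sketch of the idea
    only, not PLUS's real schema or API."""

    def __init__(self):
        self.parents = defaultdict(list)  # product -> [(parent, process)]

    def derive(self, product, sources, process):
        for src in sources:
            self.parents[product].append((src, process))

    def lineage(self, product):
        """All ancestral products, with the processes linking them."""
        seen, stack, edges = set(), [product], []
        while stack:
            node = stack.pop()
            for parent, process in self.parents.get(node, []):
                edges.append((parent, process, node))
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return edges
```

For example, deriving `clean.csv` from `raw.csv` by deduplication and `report.pdf` from `clean.csv` by rendering lets `lineage("report.pdf")` walk the whole derivation history back to the raw data — exactly the question a provenance system exists to answer, and a natural fit for a graph store like Neo4j.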