Archive for the ‘Documentation’ Category

TLDR pages (man pages by example)

Thursday, January 18th, 2018

TLDR pages

From the webpage:

The TLDR pages are a community effort to simplify the beloved man pages with practical examples.

The TLDR Pages Book (pdf), has 274 pages!

If you have ever hunted through a man page for an example, you will appreciate TLDR pages!

I first saw this in a tweet by Christophe Lalanne.

Hadoop® v3.0.0, Pre-1990 Documentation Practice

Saturday, December 16th, 2017

Apache® Hadoop® v3.0.0 General Availability

From the post:

Ubiquitous Open Source enterprise framework maintains decade-long leading role in $100B annual Big Data market

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, today announced Apache® Hadoop® v3.0.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

"This latest release unlocks several years of development from the Apache community," said Chris Douglas, Vice President of Apache Hadoop. "The platform continues to evolve with hardware trends and to accommodate new workloads beyond batch analytics, particularly real-time queries and long-running services. At the same time, our Open Source contributors have adapted Apache Hadoop to a wide range of deployment environments, including the Cloud."

"Hadoop 3 is a major milestone for the project, and our biggest release ever," said Andrew Wang, Apache Hadoop 3 release manager. "It represents the combined efforts of hundreds of contributors over the five years since Hadoop 2. I'm looking forward to how our users will benefit from new features in the release that improve the efficiency, scalability, and reliability of the platform."

Apache Hadoop 3.0.0 highlights include:

  • HDFS erasure coding —halves the storage cost of HDFS while also improving data durability;
  • YARN Timeline Service v.2 (preview) —improves the scalability, reliability, and usability of the Timeline Service;
  • YARN resource types —enables scheduling of additional resources, such as disks and GPUs, for better integration with machine learning and container workloads;
  • Federation of YARN and HDFS subclusters transparently scales Hadoop to tens of thousands of machines;
  • Opportunistic container execution improves resource utilization and increases task throughput for short-lived containers. In addition to its traditional, central scheduler, YARN also supports distributed scheduling of opportunistic containers; and 
  • Improved capabilities and performance improvements for cloud storage systems such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System.

… (emphasis in original)

Ah, the Hadoop link.

Do you find it odd use of the leader in the “$100B annual Big Data market” is documented by string comments in scripts and code?

Do you think non-technical management benefits from the documentation so captured?

Or that documentation for field names, routines, etc., can be easily extracted?

If software is maturing in a $100B market, shouldn’t it have mature documentation capabilities as well?

Flight rules for git – How to Distinguish Between Astronauts and Programmers

Thursday, November 9th, 2017

Flight rules for git by Kate Hudson.

From the post:

What are “flight rules”?

A guide for astronauts (now, programmers using git) about what to do when things go wrong.

Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. […]

NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering “lessons learned” into a compendium that now lists thousands of problematic situations, from engine failure to busted hatch handles to computer glitches, and their solutions.

— Chris Hadfield, An Astronaut’s Guide to Life.

Hudson devises an easy test to distinguish between astronauts and programmers:

Astronauts – missteps, disasters and solutions are written down.

Programmers – missteps, disasters and solutions are programmer/sysadmin lore.

With Usenet and Stackover, you can argue improvement by programmers but it’s hardly been systematic. Even so it depends on a “good” query returning few enough “hits” to be useful.

Hudson is capturing “flight rules” for git.

Act like an astronaut and write down your missteps, disasters and solutions.

NASA made it to the moon and beyond by writing things down.

Who knows?

Writing down software missteps, disasters and solutions may help render all systems transparent, willingly or not.

C Reference Manual (D.M. Richie, 1974)

Tuesday, May 23rd, 2017

C Reference Manual (D.M. Richie, 1974)

I mention the C Reference Manual, now forty-three (43) years old, as encouragement to write good documentation.

It may have a longer life than you ever expected!

For example, in 1974 Richie writes:

2.2 Identifier (Names)

An identifier is a sequence of letters and digits: the first character must be alphabetic.

Which we find replicated years later in ISO/IEC 8879 : 1986 (SGML):

4.198 name: A name token whose first character is a name start character.

4.201 name start character: A character that can begin a name: letters and others designated by the concrete syntax.

And in production [53]:

name start character =
LC Letter \
UC Letter \

Where Figure 1 of 9.2.1 SGML Character defines LC Letter as a-z, UC Letter as A-Z, LCNMSTRT as (none), UCNMSTRT as (none), in the concrete syntax.

And in 1997, the letter vs. digit distinction, finds its way into Extensible Markup Language (XML) 1.0.

[4] NameChar ::= Letter | Digit | ‘.’ | ‘-‘ | ‘_’ | ‘:’ | CombiningChar | Extender
[5] Name ::= (Letter | ‘_’ | ‘:’) (NameChar)*

“Letter” is a link to a production referencing all the qualifying Unicode characters which is too long to include here.

What started off as an arbitrary choice, “alphabetic” characters as name start characters in 1974, is picked up some 12 years later (1986) in ISO/IEC 8879 (SGML), both of which were bound by a restricted character set.

When the opportunity came to abandon the letter versus digit distinction in name start characters (XML 1.0), the result is a larger character repertoire for name start characters, but digits continue as second-class citizens.

Can you point to an explanation why Richie preferred alphabetic characters over digits for name start characters?

The Secrets of Technical Writing

Monday, May 22nd, 2017

The Secrets of Technical Writing by Matthew Johnston.

From the post:

The process of writing code, building apps, or developing websites is always evolving, with improvements in coding tools and practices constantly arriving. But one aspect hasn’t really been brought along for the journey, passed-by in the democratisation of learning that the internet has brought about, and that’s the idea of writing about code.

Technical writing is one of the darkest of dark arts in the domain of code development: you won’t find too many people talking about it, you won’t find too many great examples of it, and even hugely successful tech companies have no idea how to handle it.

So, in an effort to change that, I’m going to share with you what I’ve learnt about technical writing from building Facebook’s Platform docs, providing documentation assistance to their Open Source projects, and creating a large, multi-part tutorial for Facebook’s F8 conference in 2016. When I talk about the struggles of writing docs, I’ve seen it happen at the biggest and best of tech companies, and I’ve experienced how difficult it can be to get it right.

These tips aren’t perfect, they aren’t applicable to everything, and I’m not at an expert-level of technical writing, but I think it’s important to share thoughts on this, and help bring technical writing up to par with the rest of code development.

Note that this is from the perspective of writing technical docs, it can just as easily apply to shorter tutorials, blog posts, presentations, or talks.

The best tip of the lot: start early! Don’t wait until just before launch to hack some documentation together.

If you don’t have the cycles, I know someone, who might. 😉

ARM Releases Machine Readable Architecture Specification (Intel?)

Saturday, April 22nd, 2017

ARM Releases Machine Readable Architecture Specification by Alastair Reid.

From the post:

Today ARM released version 8.2 of the ARM v8-A processor specification in machine readable form. This specification describes almost all of the architecture: instructions, page table walks, taking interrupts, taking synchronous exceptions such as page faults, taking asynchronous exceptions such as bus faults, user mode, system mode, hypervisor mode, secure mode, debug mode. It details all the instruction formats and system register formats. The semantics is written in ARM’s ASL Specification Language so it is all executable and has been tested very thoroughly using the same architecture conformance tests that ARM uses to test its processors (See my paper “Trustworthy Specifications of ARM v8-A and v8-M System Level Architecture”.)

The specification is being released in three sets of XML files:

  • The System Register Specification consists of an XML file for each system register in the architecture. For each register, the XML details all the fields within the register, how to access the register and which privilege levels can access the register.
  • The AArch64 Specification consists of an XML file for each instruction in the 64-bit architecture. For each instruction, there is the encoding diagram for the instruction, ASL code for decoding the instruction, ASL code for executing the instruction and any supporting code needed to execute the instruction and the decode tree for finding the instruction corresponding to a given bit-pattern. This also contains the ASL code for the system architecture: page table walks, exceptions, debug, etc.
  • The AArch32 Specification is similar to the AArch64 specification: it contains encoding diagrams, decode trees, decode/execute ASL code and supporting ASL code.

Alastair provides starting points for use of this material by outlining his prior uses of the same.

Raises the question why an equivalent machine readable data set isn’t available for Intel® 64 and IA-32 Architectures? (PDF manuals)

The data is there, but not in a machine readable format.

Anyone know why Intel doesn’t provide the same convenience?

Top considerations for creating bioinformatics software documentation

Wednesday, January 18th, 2017

Top considerations for creating bioinformatics software documentation by Mehran Karimzadeh and Michael M. Hoffman.


Investing in documenting your bioinformatics software well can increase its impact and save your time. To maximize the effectiveness of your documentation, we suggest following a few guidelines we propose here. We recommend providing multiple avenues for users to use your research software, including a navigable HTML interface with a quick start, useful help messages with detailed explanation and thorough examples for each feature of your software. By following these guidelines, you can assure that your hard work maximally benefits yourself and others.


You have written a new software package far superior to any existing method. You submit a paper describing it to a prestigious journal, but it is rejected after Reviewer 3 complains they cannot get it to work. Eventually, a less exacting journal publishes the paper, but you never get as many citations as you expected. Meanwhile, there is not even a single day when you are not inundated by emails asking very simple questions about using your software. Your years of work on this method have not only failed to reap the dividends you expected, but have become an active irritation. And you could have avoided all of this by writing effective documentation in the first place.

Academic bioinformatics curricula rarely train students in documentation. Many bioinformatics software packages lack sufficient documentation. Developers often prefer spending their time elsewhere. In practice, this time is often borrowed, and by ducking work to document their software now, developers accumulate ‘documentation debt’. Later, they must pay off this debt, spending even more time answering user questions than they might have by creating good documentation in the first place. Of course, when confronted with inadequate documentation, some users will simply give up, reducing the impact of the developer’s work.
… (emphasis in original)

Take to heart the authors’ observation on automatic generation of documentation:

The main disadvantage of automatically generated documentation is that you have less control of how to organize the documentation effectively. Whether you used a documentation generator or not, however, there are several advantages to an HTML web site compared with a PDF document. Search engines will more reliably index HTML web pages. In addition, users can more easily navigate the structure of a web page, jumping directly to the information they need.

I would replace “…less control…” with “…virtually no meaningful control…” over the organization of the documentation.

Think about it for a second. You write short comments, sometimes even incomplete sentences as thoughts occur to you in a code or data context.

An automated tool gathers those comments, even incomplete sentences, rips them out of their original context and strings them one after the other.

Do you think that provides a meaningful narrative flow for any reader? Including yourself?

Your documentation doesn’t have to be great literature but as Karimzadeh and Hoffman point out, good documentation can make the difference between use and adoption and your hard work being ignored.

Ping me if you want to take your documentation to the next level.

The Next Generation R Documentation System [Dynamic R Documentation?]

Wednesday, August 31st, 2016

The R Documentation Task Force: The Next Generation R Documentation System by Joseph Rickert and Hadley Wickham.

From the post:

Andrew Redd received $10,000 to lead a new ISC working group, The R Documentation Task Force, which has a mission to design and build the next generation R documentation system. The task force will identify issues with documentation that currently exist, abstract the current Rd system into an R compatible structure, and extend this structure to include new considerations that were not concerns when the Rd system was first implemented. The goal of the project is to create a system that allows for documentation to exist as objects that can be manipulated inside R. This will make the process of creating R documentation much more flexible enabling new capabilities such as porting documentation from other languages or creating inline comments. The new capabilities will add rigor to the documentation process and enable the the system to operate more efficiently than any current methods allow. For more detail have a look at the R Documentation Task Force proposal (Full Text).

The task force team hopes to complete the new documentation system in time for the International R Users Conference, UseR! 2017, which begins July 4th 2017. If you are interested in participating in this task force, please contact Andrew Redd directly via email ( Outline your interest in the project, you experience with documentation any special skills you may have. The task force team is particularly interested in experience with documentation systems for languages other than R and C/C++.

OK, I have a weakness for documentation projects!

See the full proposal for all the details but:

There are two aspects of R documentation I intend to address which will make R an exemplary system for documentation.

The first aspect is storage. The mechanism of storing documentation in separate Rd files hinders the development process and ties documentation to the packaging system, and this need not be so. Life does not always follow the ideal; code and data are not always distributed via nice packages. Decoupling the documentation from the packaging system will allow for more dynamic and flexible documentation strategies, while also simplifying the process of transitioning to packages distributed through CRAN or other outlets.

The second aspect is flexibility of defining documentation. R is a language of flexibility and preference. There are many paths to the same outcome in R. While this has often been a source of confusion to new users of R, however it is also one of R’s greatest strengths. With packages flexibility has allowed for many contributions, some have fallen in favor while others have proven superior. Adding flexibility in documentation methods will allow for newer, and ideally improved, methods to be developed.

Have you seen the timeline?

  • Mid-August 2016 notification of approval.
  • September 1, 2016 Kickoff for the R Documentation Task Force with final members.
  • September 16, 2016 Deadline for submitting posts to the R-consortium blog, the R-announce, Rpackage-devel, and R-devel mailing lists, announcing the project.
  • September 1 through November 27th 2016 The task force conducts bi-weekly meetings via Lync to address issues in documentation.
  • November 27th, 2016 Deadline for preliminary recommendations of documentation extensions. Recommendations and conflicts written up and submitted to the R journal to be published in the December 2016 issue.
  • December 2016 Posts made to the R Consortium blog, and R mailing lists to coincide with the R Journal article to call for public participation.
  • January 27, 2017 Deadline for general comments on recommendations. Work begins to finalize new documentation system.
  • February 2017 Task force meets to finalize decisions after public input.
  • February-May 2017 Task force meets monthly as necessary to monitor progress on code development.
  • May 2017 Article is submitted outlining final recommendations and the subsequent tools developed to the R Journal for review targeting the June 2017 issue.
  • July 4-7, 2017 Developments will be presented at the International R users conference in Brussels, Belgium.

A very ambitious schedule and one that leaves me wondering if December of 2016 is the first opportunity for public participation, will notes/discussions from the bi-weekly meetings be published before then?

Probably incorrect but I have the impression from the proposal that documentation is regarded as a contiguous mass of text. Yes?

I ask because the “…contiguous mass of text…” model for documentation is a very poor one.

Documentation can present to a user as though it were a “…contiguous mass of text…,” but as I said, a very poor model for documentation itself.

Imagine R documentation that automatically updates itself from R-Bloggers, for example, to include the latest tutorials on a package.

Or that updates to include new data sets, issued since the last documentation update.

Treating documentation as though it must be episodically static should have been abandoned years ago.

The use of R and R development are not static, why should its documentation be?

TLDR pages [Explanation and Example Practice]

Friday, August 19th, 2016

TLDR pages

From the webpage:

The TLDR pages are a community effort to simplify the beloved man pages with practical examples.

Try the live demo below, have a look at the pdf version, or follow the installing instructions.

Be sure to read the Contributing guidelines.

I checked and ngrep isn’t there. 🙁

Well, ngrep only has thirty (30) options and switches before you reach <match expression> and <bpf filter>, so how much demand could there be for examples?


Great opportunity to practice your skills at explanation and creating examples.

Data as “First Class Citizens”

Tuesday, February 10th, 2015

Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title but not all data is going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. Great essay on using multiple sources to annotate entities once disambiguation had occurred. But some entities are going to have more annotations than others and some entities may not be recognized at all. Not to mention it is rather doubtful that the markup containing those entities was annotated at all?

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As well as the instructions for processing that data? Perhaps not in every case but shouldn’t data and the formats that hold date be equally treated as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built but we have not one accurate report on the process.

Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.

The Big Book of PostgreSQL

Sunday, November 30th, 2014

The Big Book of PostgreSQL by Thom Brown.

From the post:

Documentation is crucial to the success of any software program, particularly open source software (OSS), where new features and functionality are added by many contributors. Like any OSS, PostgreSQL needs to produce accurate, consistent and reliable documentation to guide contributors’ work and reflect the functionality of every new contribution. Documentation also an important source of information for developers, administrators and other end users as they will take actions or base their work on the functionality described in the documentation. Typically, the author of a new feature provides the relevant documentation changes to the project, and that person can be anyone in any role in IT. So it can really come from anywhere.

Postgres documentation is extensive (you can check out the latest 9.4 documentation here). In fact, the U.S. community PDF document is 2,700 pages long. It would be a mighty volume and pretty unwieldy if published as a physical book. The Postgres community is keenly aware that the quality of documentation can make or break an open source project, and thus regularly updates and improves our documentation, a process I’ve appreciated being able to take part in.

A recent podcast, Solr Usability with Steve Rowe & Tim Potter goes to some lengths to describe efforts to improve Solr documentation.

If you know anyone in the Solr community, consider this a shout out that PostgreSQL documentation isn’t a bad example to follow.

Results of 2014 State of Clojure and ClojureScript Survey

Thursday, October 23rd, 2014

Results of 2014 State of Clojure and ClojureScript Survey by Alex Miller.

From the post:

The 2014 State of Clojure and ClojureScript Survey was open from Oct. 8-17th. The State of Clojure survey (which was applicable to all users of Clojure, ClojureScript, and ClojureCLR) had 1339 respondents. The more targeted State of ClojureScript survey had 642 respondents.

The responses to “What has been most frustrating for you in your use of Clojure/CLJS? put “Availability of comprehensive / approachable documentation, tutorials, etc” at #2 and #3 respectively.

Improved technical capabilities is important for existing users but increasing mind share is an issue of “onboarding” new users of Clojure. If you have ever experienced or “read about” the welcoming given even casual visitors in some churches, you will have a good idea of some effective ideas at building membership.

If you try to build a community using techniques not found in churches, you need better techniques. Remember churches have had centuries to practice their membership building techniques.

Let me put it this way: When was the last time you saw a church passing out information as poorly written, organized and incomplete as that for most computer languages? Guess who is winning the membership race by any measure?

Are you up for studying and emulating building church membership techniques? (as appropriate or adapted)

Bioinformatics tools extracted from a typical mammalian genome project

Monday, October 6th, 2014

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must read if you are interested in useful preservation of research and data. The paper focuses on needed improvements in bioinformatics but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

Introducing Hadoop FlipBooks

Saturday, May 31st, 2014

Introducing Hadoop FlipBooks

From the post:

In line with the learning theme that HadoopSphere has been evangelizing, we are pleased to introduce a new feature named FlipBooks. A Hadoop flipbook is a quick reference guide for any topic giving a short summary of key concepts in form of Q&A. Typically with a set of 4 questions, it tries to test your knowledge on the concept.

Curious what you think of this concept?

I looked at a couple of them but four (4) questions seems a bit short.

With the caution that it was probably twenty (20) years ago, I remember the drill software for the Novell Netware CNE program. Organized by subject/class as I recall and certainly a lot more than four (4) questions.

What software would you suggest for authoring similar drill material now?

Open government:….

Saturday, May 31st, 2014

Open government: getting beyond impenetrable online data by Jed Miller.

From the post:

Mathematician Blaise Pascal famously closed a long letter by apologising that he hadn’t had time to make it shorter. Unfortunately, his pithy point about “download time” is regularly attributed to Mark Twain and Henry David Thoreau, probably because the public loves writers more than it loves statisticians. Scientists may make things provable, but writers make them memorable.

The World Bank confronted a similar reality of data journalism earlier this month when it revealed that, of the 1,600 bank reports posted online on from 2008 to 2012, 32% had never been downloaded at all and another 40% were downloaded under 100 times each.

Taken together, these cobwebbed documents represent millions of dollars in World Bank funds and hundreds of thousands of person-hours, spent by professionals who themselves represent millions of dollars in university degrees. It’s difficult to see the return on investment in producing expert research and organising it into searchable web libraries when almost three quarters of the output goes largely unseen.

You won’t find any ways to make documents less impenetrable in Jed’s post but it is a source for quotes on the issue.

For example:

For nonprofits and governments that still publish 100-page pdfs on their websites and do not optimise the content to share in other channels such as social: it is a huge waste of time and ineffective. Stop it now.

OK, so that’s easy: “Stop it now.”

The harder question: “What should we put in its place?”

Shouting “stop it” without offering examples of better documents or approaches, is like a car horn in New York City. It’s just noise pollution.

Do you have any examples of documents, standards, etc. that are “good” and non impenetrable?

Let’s make this more concrete: Suggest an “impenetrable” document*, hopefully not a one hundred (100) page one and I will take a shot at revising it to make it less “impenetrable.” I will post a revised version here with notes as to why revisions were made. We won’t all agree but it might result in a example document that isn’t “impenetrable.”

*Please omit tax statutes or regulations, laws, etc. I could probably make them less impenetrable but only with a great deal of effort. That sort of text is “impenetrable” by design.

Hello Again

Friday, May 30th, 2014

We Are Now In Command of the ISEE-3 Spacecraft by Keith Cowing.

From the post:

The ISEE-3 Reboot Project is pleased to announce that our team has established two-way communication with the ISEE-3 spacecraft and has begun commanding it to perform specific functions. Over the coming days and weeks our team will make an assessment of the spacecraft’s overall health and refine the techniques required to fire its engines and bring it back to an orbit near Earth.

First Contact with ISEE-3 was achieved at the Arecibo Radio Observatory in Puerto Rico. We would not have been able to achieve this effort without the gracious assistance provided by the entire staff at Arecibo. In addition to the staff at Arecibo, our team included simultaneous listening and analysis support by AMSAT-DL at the Bochum Observatory in Germany, the Space Science Center at Morehead State University in Kentucky, and the SETI Institute’s Allen Telescope Array in California.

How’s that for engineering and documentation?

So, maybe good documentation isn’t such a weird thing after all. 😉

Nomad and Historic Information

Thursday, May 22nd, 2014

You may remember Nomad from the Star Trek episode The Changeling. Not quite on that scale but NASA has signed an agreement to allow citizen scientists to “wake up” a thirty-five (35) year old spacecraft this next August.

NASA has given a green light to a group of citizen scientists attempting to breathe new scientific life into a more than 35-year old agency spacecraft.

The agency has signed a Non-Reimbursable Space Act Agreement (NRSAA) with Skycorp, Inc., in Los Gatos, California, allowing the company to attempt to contact, and possibly command and control, NASA’s International Sun-Earth Explorer-3 (ISEE-3) spacecraft as part of the company’s ISEE-3 Reboot Project. This is the first time NASA has worked such an agreement for use of a spacecraft the agency is no longer using or ever planned to use again.

The NRSAA details the technical, safety, legal and proprietary issues that will be addressed before any attempts are made to communicate with or control the 1970’s-era spacecraft as it nears the Earth in August.

“The intrepid ISEE-3 spacecraft was sent away from its primary mission to study the physics of the solar wind extending its mission of discovery to study two comets.” said John Grunsfeld, astronaut and associate administrator for the Science Mission Directorate at NASA headquarters in Washington. “We have a chance to engage a new generation of citizen scientists through this creative effort to recapture the ISEE-3 spacecraft as it zips by the Earth this summer.” NASA Signs Agreement with Citizen Scientists Attempting to Communicate with Old Spacecraft

Do you have any thirty-five (35) year old software you would like to start re-using? 😉

What information should you have captured for that software?

The crowdfunding is in “stretch mode,” working towards $150,000. Support at: ISEE-3 Reboot Project by Space College, Skycorp, and SpaceRef.

Light Table is open source

Thursday, January 9th, 2014

Light Table is open source by Chris Granger.

From the post:

Today Light Table is taking a huge step forward – every bit of its code is now on Github and along side of that, we’re releasing Light Table 0.6.0, which includes all the infrastructure to write and use plugins. If you haven’t been following the 0.5.* releases, this latest update also brings a tremendous amount of stability, performance, and clean up to the party. All of this together means that Light Table is now the open source developer tool platform that we’ve been working towards. Go download it and if you’re new give our tutorial a shot!

If you aren’t already familiar with Light Table, check out The IDE as a value, also by Chris Granger.

Just a mention in the notes, but start listening for “contextuality.” It comes up in functional approaches to graph algorithms.

Astera Centerprise

Wednesday, January 8th, 2014

Asteria Centerprise

From the post:

The first in our Centerprise Best Practices Webinar Series discusses the features of Centerprise that make it the ideal integration solution for the high volume data warehouse. Topics include data quality (profiling, quality measurements, and validation), translating data to star schema (maintaining foreign key relationships and cardinality with slowly changing dimensions), and performance, including querying data with in-database joins and caching. We’ve posted the Q&A below, which delves into some interesting topics.

You can view the webinar video, as well as all our demo and tutorial videos, at Astera TV.

Very visual approach to data integration.

Be aware that comments on objects in a dataflow are a “planned” feature:

An exteremly useful (and simple) addition to Centerprise would be the ability to pin notes onto a flow to be quickly and easily seen by anyone who opens the flow.

This would work as an object which could be dragged to the flow, and allow the user enter enter a note which would remain on-screen, unlike the existing comments which require you to actually open the object and page to the ‘comments’ pane.

This sort of logging ability will prove very useful to explain to future dataflow maintainers why certain decisions were made in the design, as well as informing them of specific changes/additions and the reasons why they were enacted.

As Centerprise is almost ‘self-documenting’, the note-keeping ability would allow us to avoid maintaining and refering to seperate documentation (which can become lost)

A comment on each data object would be an improvement but a flat comment would be of limited utility.

A structured comment (perhaps extensible comment?) that captures the author, date, data source, target, etc. would make comments usefully searchable.

Including structured comments on the dataflows, transformations, maps and workflows themselves and to query for the presence of structured comments would be very useful.

A query for the existence of structured comments could help enforce local requirements for documenting data objects and operations.

Setting up a Hadoop cluster

Thursday, November 21st, 2013

Setting up a Hadoop cluster – Part 1: Manual Installation by Lars Francke.

From the post:

In the last few months I was tasked several times with setting up Hadoop clusters. Those weren’t huge – two to thirteen machines – but from what I read and hear this is a common use case especially for companies just starting with Hadoop or setting up a first small test cluster. While there is a huge amount of documentation in form of official documentation, blog posts, articles and books most of it stops just where it gets interesting: Dealing with all the stuff you really have to do to set up a cluster, cleaning logs, maintaining the system, knowing what and how to tune etc.

I’ll try to describe all the hoops we had to jump through and all the steps involved to get our Hadoop cluster up and running. Probably trivial stuff for experienced Sysadmins but if you’re a Developer and finding yourself in the “Devops” role all of a sudden I hope it is useful to you.

While working at GBIF I was asked to set up a Hadoop cluster on 15 existing and 3 new machines. So the first interesting thing about this setup is that it is a heterogeneous environment: Three different configurations at the moment. This is where our first goal came from: We wanted some kind of automated configuration management. We needed to try different cluster configurations and we need to be able to shift roles around the cluster without having to do a lot of manual work on each machine. We decided to use a tool called Puppet for this task.

While Hadoop is not currently in production at GBIF there are mid- to long-term plans to switch parts of our infrastructure to various components of the HStack. Namely MapReduce jobs with Hive and perhaps Pig (there is already strong knowledge of SQL here) and also storing of large amounts of raw data in HBase to be processed asynchronously (~500 million records until next year) and indexed in a Lucene/Solr solution possibly using something like Katta to distribute indexes. For good measure we also have fairly complex geographic calculations and map-tile rendering that could be done on Hadoop. So we have those 18 machines and no real clue how they’ll be used and which services we’d need in the end.

Dated, 2011, but illustrates some of the issues I raised in: Hadoop Ecosystem Configuration Woes?

Do you keep this level of documentation on your Hadoop installs?

I first saw this in a tweet by Marko A. Rodriguez.

Spreadsheets:… [95% Usage]

Monday, November 11th, 2013

Spreadsheets: The Ununderstood Dark Matter of IT by Felienne Hermans.


Spreadsheets are used extensively in industry: they are the number one tool for financial analysis and are also prevalent in other domains, such as logistics and planning. Their flexibility and immediate feedback make them easy to use for non-programmers. But they are as easy to build, as they are difficult to analyze, maintain and check. Felienne’s research aims at developing methods to support spreadsheet users to understand, update and improve spreadsheets. Inspiration was taken from classic software engineering, as this field is specialized in the analysis of data and calculations. In this talk Felienne will summarize her recently completed PhD research on the topic of spreadsheet structure visualization, spreadsheet smells and clone detection, as well as presenting a sneak peek into the future of spreadsheet research as Delft University.

Some tidbits to interest you in the video:

“95% of all U.S. corporations still use spreadsheets.”

“Spreadsheet can have a long life, 5 years on average.”

“No docs, errors, long life. It looks like software!”

Designing a tool for software users are using, as opposed to designing tools users ought to be using.

What a marketing concept!

Not a lot of details at the PerfectXL website.

PerfectXL analyzes spreadsheets but doesn’t address the inability of spreadsheets to capture robust metadata about data or its processing in a spreadsheet.

Pay particular attention to how Felienne distinguishes a BI dashboard from a spreadsheet. You have seen that before in this blog. (Hint: Search for “F-16” or “VW.”)

No doubt you will also like Felienne’s blog.

I first saw this in a tweet by Lars Marius Garshol.

Ten Simple Rules for Reproducible Computational Research

Sunday, November 10th, 2013

Ten Simple Rules for Reproducible Computational Research by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig. (Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285)

From the article:

Replication is the cornerstone of a cumulative science [1]. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research [2]. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims [3]. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data [4].

The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction [5], studies showing difficulties with replicating published experimental results [6], an increase in retracted papers [7], and through a high number of failing clinical trials [8], [9]. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a “culture of reproducibility” for computational science, and to require it for published claims [3].

We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run.

The rules:

Rule 1: For Every Result, Keep Track of How It Was Produced

Rule 2: Avoid Manual Data Manipulation Steps

Rule 3: Archive the Exact Versions of All External Programs Used

Rule 4: Version Control All Custom Scripts

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

Rule 7: Always Store Raw Data behind Plots

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

Rule 9: Connect Textual Statements to Underlying Results

Rule 10: Provide Public Access to Scripts, Runs, and Results

To bring this a little closer to home, would another researcher be able to modify your topic map or RDF store with some certainty as to the result?

Or take over the maintenance/modification of a Hadoop ecosystem without hand holding by the current operator?

Being unable to answer either of those questions with “yes,” doesn’t show up as a line item in your current budget.

However, when the need to “reproduce” or modify your system becomes mission critical, it may be a budget (and job) busting event.

What’s your tolerance for job ending risk?

I forgot to mention I first saw this in “Ten Simple Rules for Reproducible Computational Research” – An Excellent Read for Data Scientists by Sean Murphy.

Hadoop Ecosystem Configuration Woes?

Thursday, November 7th, 2013

After listening to Kathleen Ting (Cloudera) describe how 44% of support tickets for the Hadoop ecosystem arise from misconfiguration (Dealing with Data in the Hadoop Ecosystem…), I started to wonder how many opportunities there are for misconfiguration in the Hadoop ecosystem?

That’s probably not an answerable question, but we can look at how configurations are documented in the Hadoop ecosystem:

Comment in the Hadoop ecosystem:

  • Accumulo – XML <!– comment –>
  • Avro – Schemas defined in JSON (no comment facility)
  • Cassandra – “#” comment indicator
  • Chukwa – XML <!– comment –>
  • Falcon – XML <!– comment –>
  • Flume – “#” comment indicator
  • Hadoop – XML <!– comment –>
  • Hama – XML <!– comment –>
  • HBase – XML <!– comment –>
  • Hive – XML <!– comment –>
  • Knox – XML <!– comment –>
  • Mahout – XML <!– comment –>
  • PIG – C style comments
  • Sqoop – “#” comment indicator
  • Tex – XML <!– comment –>
  • ZooKeeper – text but no apparent ability to comment (Zookeeper Administrator’s Guide)

I read that to mean:

1 Component, Pig uses C style comments

2 Components, Avro and ZooKeeper, have no ability for comments at all.

3 Components, Cassandra, Flume and Sqoop use “#” for comments

10 Components, Accumulo, Chukwa, Falcon, Hama, Hadoop, HBase, Hive, Knox, Mahout and Tex presumably support XML comments

A full one third of the Hadoop ecosystem uses a non-XML comments, if comments are permitted at all. The other two-thirds of the ecosystem uses XML comments in some files and not others.

The entire ecosystem lacks a standard way to associate value or settings in one component with values or settings in another component.

To say nothing of associating values or settings with releases of different components.

Without looking at the details of the possible settings for each component, does that seem problematic to you?

Cypher shell with logging

Friday, August 23rd, 2013

Cypher shell with logging by Alex Frieden.

From the post:

For those who don’t know, Neo4j is a graph database built with Java. The internet is abound with examples, so I won’t bore you with any.

Our problem was a data access problem. We built a loader, loaded our data into neo4j, and then queried it. However we ran into a little problem: Neo4j at the time of release logs in the home directory (at least on linux redhat) what query was ran (its there as a hidden file). However, it doesn’t log what time it was run at. One other problem as an administrator point of view is not having a complete log of all queries and data access. So we built a cypher shell that would do the logging the way we needed to log. Future iterations of this shell will have REST cypher queries and not use the embedded mode (which is faster but requires a local connection to the data). We also wanted a way in the future to output results to a file.


Logs are a form of documentation. You may remember that documentation was #1 in the Solr Usability contest.

Documentation is important! Don’t neglect it.

Light Table 0.5.0

Friday, August 23rd, 2013

Light Table 0.5.0 by Chris Granger.

A little later than the first week or two of August, 2013, but not by much!

Chris says Light Table is a next-gen IDE.

He may be right but to evaluate that claim, you will need to download the alpha here.

I must confess I am curious about his claim:

With the work we did to add background processing and a good deal of effort toward ensuring everything ran fast, LightTable is now comparable in speed to Vim and faster than Emacs or Sublime in most things. (emphasis added)

I want to know what “most things” Light Table does faster than Emacs. 😉

Are you downloading a copy yet?

Plan for Light Table 0.5

Tuesday, July 16th, 2013

The plan for 0.5 by Chris Granger.

From the post:

You guys have been waiting very patiently for a while now, so I wanted to give you an idea of what’s coming in 0.5. A fair amount of the work is in simplifying both the UI/workflow as well as refactoring everything to get ready for the beta (plugins!). I’ve been talking with a fair number of people to understand how they use LT or why they don’t and one of the most common pieces of feedback I’ve gotten is that while it is very simple it still seems heavier than something like Sublime. We managed to attribute this to the fact that it does some unfamiliar things, one of the biggest of which is a lack of standard menus. We don’t really gain anything by not doing menus and while there were some technical reasons I didn’t, I’ve modified node-webkit to fix that. So I’m happy to say 0.5 will use standard menus and the ever-present bar on the left will be disappearing. This makes LT about as light as it possibly can be and should alleviate the feeling that you can’t just use it as a text editor.

Looking forward to the first week or two of August, 2013. Chris’ goal for the 0.5 release!

13 Things People Hate about Your Open Source Docs [+ One More]

Saturday, June 22nd, 2013

13 Things People Hate about Your Open Source Docs by Andy Lester.

From the post:

Most open source developers like to think about the quality of the software they build, but the quality of the documentation is often forgotten. Nobody talks about how great a project’s docs are, and yet documentation has a direct impact on your project’s success. Without good documentation, people either do not use your project, or they do not enjoy using it. Happy users are the ones who spread the news about your project – which they do only after they understand how it works, which they learn from the software’s documentation.

Yet, too many open source projects have disappointing documentation. And it can be disappointing in several ways.

The examples I give below are hardly authoritative, and I don’t mean to pick on any particular project. They’re only those that I’ve used recently, and not meant to be exemplars of awfulness. Every project has committed at least a few of these sins. See how many your favorite software is guilty of (whether you are user or developer), and how many you personally can help fix.

Andy’s list:

  1. Lacking a good README or introduction
  2. Docs not available online
  3. Docs only available online
  4. Docs not installed with the package
  5. Lack of screenshots
  6. Lack of realistic examples
  7. Inadequate links and references
  8. Forgetting the new user
  9. Not listening to the users
  10. Not accepting user input
  11. No way to see what the software does without installing it
  12. Relying on technology to do your writing
  13. Arrogance and hostility toward the user

See Andy’s post for the details on his points and the comments that follow.

I do think Andy missed one point:

14. Commercial entity open sources a product, machine generates documentation, expects users to contribute patches to the documentation for free.

What seems odd about that to you?

Developers getting paid to develop poor documentation and their response to user comments on documentation is the “community” should fix it for free.

At least in a true open source project, everyone is contributing and can use the (hopefully) great results equally.

Not so with a, “well…., for that you would need commercial license X” type project.

I first saw this in a tweet by Alexandre.

Nozzle R Package

Sunday, April 14th, 2013

Nozzle R Package

From the webpage:

Nozzle is an R package for generation of reports in high-throughput data analysis pipelines. Nozzle reports are implemented in HTML, JavaScript, and Cascading Style Sheets (CSS), but developers do not need any knowledge of these technologies to work with Nozzle. Instead they can use a simple R API to design and implement powerful reports with advanced features such as foldable sections, zoomable figures, sortable tables, and supplementary information. Please cite our Bioinformatics paper if you are using Nozzle in your work.

I have only looked at the demo reports but this looks quite handy.

It doesn’t hurt to have extensive documentation to justify a conclusion that took you only moments to reach.

“Document Design and Purpose, Not Mechanics”

Friday, February 15th, 2013

“Document Design and Purpose, Not Mechanics” by Stephen Turner.

From the post:

If you ever write code for scientific computing (chances are you do if you’re here), stop what you’re doing and spend 8 minutes reading this open-access paper:

Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).

The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven’t been properly trained in fundamental software development skills.

The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you’d probably guess – using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics.

We all know that good comments and documentation is critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation. (emphasis added)

There is no shortage of advice (largely unread) on good writing practices. 😉

Stephen calling out the advice to “…document design and purpose, not mechanics” struck me as relevant to semantic integration solutions.

In both RDF and XTM topic maps, the same URI as an identifier is taken as identifying the same subject.

But that’s mechanics isn’t it? Just string to string comparison.

Mechanics are important but they are just mechanics.

Documenting the conditions for using a URI will help guide you or your successor to using the same URI the same way.

But that takes more than mechanics.

That takes “…document[ing] the underlying ideas, interface, and reasons, not the implementation.”