Archive for the ‘Github’ Category

The challenge of combining 176 x #otherpeoplesdata…

Wednesday, June 10th, 2015

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database
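To make points 1 and 2 concrete, here is a minimal sketch of rebuilding a unified table from an untouched raw file plus a mapping table stored as data. This is not the BAAD code. The column names, the mapping layout and the function names are all my own assumptions for illustration.

```python
import csv
import io

# Hypothetical raw study file: read by the script, never edited by hand.
RAW_STUDY = """height_cm,mass_g
150,1200
300,4500
"""

# Hypothetical metadata kept as data, not code: how each raw column maps
# onto the unified database (target name and unit conversion factor).
COLUMN_MAP = """raw_name,unified_name,scale
height_cm,height_m,0.01
mass_g,mass_kg,0.001
"""


def load_rows(text):
    """Parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def build_study(raw_text, mapping_text):
    """Rebuild the unified rows from the raw input and the mapping table."""
    mapping = load_rows(mapping_text)
    unified = []
    for row in load_rows(raw_text):
        out = {}
        for m in mapping:
            value = float(row[m["raw_name"]]) * float(m["scale"])
            out[m["unified_name"]] = round(value, 6)
        unified.append(out)
    return unified


rows = build_study(RAW_STUDY, COLUMN_MAP)
```

Because every change lives in the mapping table or the script, rerunning `build_study` reproduces the unified rows from source, and a reviewer can see each modification by reading the mapping.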

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice, but it does mean that we cannot use the names as originally recorded as search terms in literature that existed at the time of the original observations.

Normalization of data often leads to loss of information. Not necessarily, but often.
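One way to normalize without discarding the recorded name is to keep both. The sketch below is only an illustration of that idea. The species names are invented placeholders, and the field name `species_original` is my assumption, not anything found in BAAD.

```python
# Hypothetical synonym table: name as recorded -> currently accepted name.
# Both names are invented placeholders, not real taxa.
SYNONYMS = {"Exampleus vetus": "Exampleus novus"}


def normalize_species(record, synonyms):
    """Return a copy with the accepted name added and the recorded name kept."""
    recorded = record["species"]
    out = dict(record)
    out["species_original"] = recorded  # preserved for literature searches
    out["species"] = synonyms.get(recorded, recorded)
    return out


record = {"species": "Exampleus vetus", "height_m": 1.5}
normalized = normalize_species(record, SYNONYMS)
```

With both fields present, a user can query by the accepted name and still search older literature under the name used at the time of the observation.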

I first saw this in a tweet by Dr. Mike Whitfield.

Improving GitHub for science

Thursday, June 19th, 2014

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A move in the right direction, to be sure, but how much of a move is open to question.

Think of a DOI as the equivalent of an International Standard Book Number (ISBN). Using that as an identifier, you are sure to find a book that I cite.

But if the book is several hundred pages long, you may find my “citing it” by an ISBN identifier alone isn’t quite good enough.

The same will be true for some citations using DOIs for Github repositories. That is better than nothing at all, but it falls short of a robust identifier for material within a Github archive.

I first saw this in a tweet by Peter Kraker.

Improving GitHub for science

Thursday, May 15th, 2014

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A great step forward, but like an http: URL that points to an entire resource, it is of limited utility.

Assume that I am using a DOI for a software archive and I want to point to and identify a code snippet in the archive that implements Fast Fourier Transform (FFT). My first task is to point to that snippet. A second task would be to create an association between the snippet and my annotation that it implements the Fast Fourier Transform. Yet a third task would be to gather up all the pointers that point to implementations of the Fast Fourier Transform (FFT).

For all of those tasks, I need to identify and point to a particular part of the underlying source code.

Unfortunately, a DOI is limited to identifying a single entity.

Each DOI® name is a unique “number”, assigned to identify only one entity. Although the DOI system will assure that the same DOI name is not issued twice, it is a primary responsibility of the Registrant (the company or individual assigning the DOI name) and its Registration Agency to identify uniquely each object within a DOI name prefix. (DOI Handbook)

How would you extend the DOIs being used by GitHub to identify code fragments within source code repositories?
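As one possible starting point, and nothing more than a sketch, a fragment pointer could pair the archive's DOI with a file path, a line range and a hash of the fragment's text. The string syntax, the class and function names, and the DOI value below are all hypothetical. None of this is part of the DOI system or of anything GitHub offers.

```python
import hashlib
from dataclasses import dataclass

# Placeholder DOI for illustration only; it does not identify a real archive.
DOI = "10.5281/zenodo.0000000"

# Hypothetical file contents as stored in the archived repository.
SOURCE = """import cmath

def fft(x):
    ...  # body elided
"""


@dataclass(frozen=True)
class FragmentRef:
    doi: str
    path: str
    start: int  # first line, 1-indexed
    end: int    # last line, inclusive
    sha256: str

    def to_string(self):
        return (f"doi:{self.doi}#path={self.path};"
                f"lines={self.start}-{self.end};sha256={self.sha256}")


def _fragment(source, start, end):
    """Extract lines start..end (1-indexed, inclusive) as one string."""
    return "\n".join(source.splitlines()[start - 1:end])


def make_ref(doi, path, source, start, end):
    """Task 1: point at a snippet inside the archive."""
    text = _fragment(source, start, end)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return FragmentRef(doi, path, start, end, digest)


def resolve(ref, source):
    """Return the snippet, failing if the text no longer matches the hash."""
    text = _fragment(source, ref.start, ref.end)
    if hashlib.sha256(text.encode("utf-8")).hexdigest() != ref.sha256:
        raise ValueError("fragment no longer matches the recorded hash")
    return text


ref = make_ref(DOI, "signal/transform.py", SOURCE, 3, 4)

# Tasks 2 and 3: associate an annotation with the pointer, and gather
# every pointer that carries the same annotation.
index = {"Fast Fourier Transform": [ref]}
```

The hash is there so that a pointer fails loudly instead of silently resolving to different code. Whether line ranges are the right unit of addressing (versus, say, function names) is an open design question.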

I first saw this in a tweet by Peter Desmet.

Restructuring the Web with Git

Friday, November 8th, 2013

Restructuring the Web with Git by Simon St. Laurent.

From the post:

Web designers? Git? Github? Aren’t those for programmers? At Artifact, Christopher Schmitt showed designers how much their peers are already doing with Github, and what more they can do. Github (and the underlying Git toolset) changes the way that all kinds of people work together.

Sharing with Git

As amazing as Linux may be, I keep thinking that Git may prove to be Linus Torvalds’ most important contribution to computing. Most people think of it, if they think of it at all, as a tool for managing source code. It can do far more, though, providing a drastically different (and I think better) set of tools for managing distributed projects, especially those that use text.

Git tackles an unwieldy problem, managing the loosely structured documents that humans produce. Text files are incredibly flexible, letting us store everything from random notes to code of all kinds to tightly structured data. As awesome as text files are—readable, searchable, relatively easy to process—they tend to become a mess when there’s a big pile of them.

Simon makes a good argument for the version control and sharing aspects of Github.

But Github doesn’t offer any features (that I am aware of) to manage the semantics of the data stored at Github.

For example, if I search for “greek,” I am returned results that include the Greek language, Greek mythology, New Testament Greek, etc.

There are only four hundred and sixty-five (465) results as of today, but even if I look at all of them, I have no reason to think I have found all the relevant resources.

For example, a search on Greek Mythology would miss:

Myths-and-myth-makers–Old-Tales-and-Superstitions-Interpreted-by-Comparative-Mythology_1061, which has one hundred and four (104) references to Greek gods/mythology.

Moreover, having discovered that this work should be returned by a search for Greek Mythology, how do I impart that knowledge to the system so that future users will find it?
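A minimal sketch of what imparting that knowledge could look like: a subject index that users add to, consulted alongside plain keyword matching. This is not a Github feature. The class, the repository names and the descriptions below are hypothetical.

```python
from collections import defaultdict


class SubjectIndex:
    """User-contributed subject annotations merged into keyword search."""

    def __init__(self, repos):
        self.repos = repos  # repository name -> description
        self._by_subject = defaultdict(set)

    def annotate(self, subject, repo):
        """Record that a repository is about a subject."""
        self._by_subject[subject.lower()].add(repo)

    def search(self, query):
        """Return keyword hits plus anything annotated with the query."""
        q = query.lower()
        keyword_hits = {
            name for name, desc in self.repos.items()
            if q in name.lower() or q in desc.lower()
        }
        return sorted(keyword_hits | self._by_subject.get(q, set()))


# Hypothetical repositories and descriptions.
REPOS = {
    "greek-mythology-notes": "Notes on Greek mythology",
    "myths-and-myth-makers": "Old tales interpreted by comparative mythology",
}

idx = SubjectIndex(REPOS)
before = idx.search("Greek Mythology")
idx.annotate("Greek Mythology", "myths-and-myth-makers")
after = idx.search("Greek Mythology")
```

Here `before` misses the second work because neither its name nor its description contains the phrase, while `after` includes it once someone has recorded the association.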

Github works quite well, but it has a long way to go before it improves on how documents are found.

The GitHub Data Challenge II

Friday, April 5th, 2013

The GitHub Data Challenge II

From the webpage:

There are millions of projects on GitHub. Every day, people from around the world are working to make these projects better. Opening issues, pushing code, submitting Pull Requests, discussing project details — GitHub activity is a papertrail of progress. Have you ever wondered what all that data looks like? There are millions of stories to tell; you just have to look.

Last year we held our first data challenge. We saw incredible visualizations, interesting timelines and compelling analysis.

What stories will be told this year? It’s up to you!

To Enter

Send a link to a GitHub repository or gist with your graph(s) along with a description to data@github.com before midnight, May 8th, 2013 PST.

With the data approaching 100M rows, how would you visualize it and what questions would you explore?

GitHub Social Graphs with Groovy and GraphViz

Tuesday, May 29th, 2012

GitHub Social Graphs with Groovy and GraphViz

From the post:

Using the GitHub API, Groovy and GraphViz to determine, interpret and render a graph of the relationships between GitHub users based on the watchers of their repositories. The end result can look something like this.

[Image omitted. I started to embed the image but on the narrow scale of my blog, it just didn’t look good. See the post for the full size version.]
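The post does this in Groovy against the live GitHub API. As a rough stand-in, here is a sketch of the rendering step alone, turning a watcher list into GraphViz DOT text. The user names, repository names and function name are hypothetical, and no API call is made.

```python
# Hypothetical data in "owner/repo" -> list of watchers form.
WATCHERS = {
    "alice/graph-tools": ["bob", "carol"],
    "bob/dot-utils": ["alice", "bob"],
}


def watchers_to_dot(repo_watchers):
    """Render watcher -> repository owner edges as Graphviz DOT text."""
    edges = set()
    for repo, watchers in repo_watchers.items():
        owner = repo.split("/")[0]
        for watcher in watchers:
            if watcher != owner:  # skip users watching their own repository
                edges.add((watcher, owner))
    lines = ["digraph github {"]
    for watcher, owner in sorted(edges):
        lines.append(f'  "{watcher}" -> "{owner}";')
    lines.append("}")
    return "\n".join(lines)


dot = watchers_to_dot(WATCHERS)
```

Saving `dot` to a file and running GraphViz's `dot -Tpng` on it should produce an image of the graph.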

A must see for all Groovy fans!

For an alternative, see:

Mining GitHub – Followers in Tinkerpop

Pointers to social graphs for GitHub using other tools appreciated!

Mining GitHub – Followers in Tinkerpop

Monday, May 14th, 2012

Mining GitHub – Followers in Tinkerpop

Patrick Wagstrom writes:

Development of any moderately complex software package is a social process. Even if a project is developed entirely by a single person, there is still a social component that consists of all of the people who use the software, file bugs, and provide recommendations for enhancements. This social aspect is one of the driving forces behind the proliferation of social software development sites such as GitHub, SourceForge, Google Code, and BitBucket.

These sites combine together a variety of tools that are common for software development such as version control, bug trackers, mailing lists, release management, project planning, and wikis. In addition, some of these have more social aspects that allow you to find and follow individual developers or watch particular projects. In this post I’m going to show you how we can use some of this information to gain insight into a software development community, specifically the community around the Tinkerpop stack of tools for graph databases.

GitHub as a social community. Who knew? 😉

Very instructive walk through Gremlin, GraphML, and R with a prepared data set. It doesn’t get much better than this!