Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 9, 2011

Apache Software Foundation Public Mail Archives

Filed under: Cloud Computing,Dataset — Patrick Durusau @ 6:40 pm

Apache Software Foundation Public Mail Archives

From the webpage:

Submitted By: Grant Ingersoll
US Snapshot ID (Linux/Unix): snap-17f7f476
Size: 200 GB
License: Public Domain (See http://apache.org/foundation/public-archives.html)
Source: The Apache Software Foundation (http://www.apache.org)
Created On: August 15, 2011 10:00 PM GMT
Last Updated: August 15, 2011 10:00 PM GMT

A collection of all publicly available mail archives from the Apache Software Foundation (ASF), taken on July 11, 2011.

This collection contains all publicly available email archives from the ASF’s 80+ projects (http://mail-archives.apache.org/mod_mbox/), including mailing lists such as Apache HTTPD Server, Apache Tomcat, Apache Lucene and Solr, Apache Hadoop and many more.

Generally speaking, most projects have at least three lists: user, dev and commits, but some have more, some have less. The user lists are where users of the software ask questions on usage, while the dev list usually contains discussions on the development of the project (code, releases, etc.)

The commit lists usually consist of automated notifications sent by the various ASF version control tools, like Subversion or CVS, and contain information about changes made to the project’s source code.

Both tarballs and per project sets are available in the snapshot. The tarballs are organized according to project name. Thus, a-d.tar.gz contains all ASF projects that begin with the letters a, b, c or d, such as abdera.apache.org. Files within the project are usually gzipped mbox files.

(I split the first paragraph into several paragraphs for readability reasons.)

Rather meager documentation for a 200 GB data set, don’t you think? I think a first step would be to create basic documentation on what projects are present and what their mailing lists are, plus some basic statistical counts to serve as reference points.
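As a rough starting point for that kind of documentation, here is a minimal Python sketch that lists the projects and mailing lists it finds and counts the messages in each. It rests on assumptions the dataset description does not confirm: that a tarball has already been unpacked, and that the files sit in a layout like `<root>/<project>/<list>/<file>.gz`. That layout and the function names `count_messages` and `inventory` are hypothetical, chosen only for illustration. Messages are counted by mbox "From " separator lines, which is an approximation, not a full mbox parser.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_messages(mbox_gz: Path) -> int:
    """Approximate the number of messages in one gzipped mbox file.

    Counts lines that start with "From " (the mbox message separator).
    An unescaped body line starting with "From " would be over-counted.
    """
    count = 0
    with gzip.open(mbox_gz, "rb") as handle:
        for line in handle:
            if line.startswith(b"From "):
                count += 1
    return count


def inventory(root: Path) -> Counter:
    """Map (project, mailing list) to an approximate message count.

    Assumes the hypothetical layout <root>/<project>/<list>/<file>.gz;
    adjust the glob pattern to whatever the snapshot actually contains.
    """
    totals = Counter()
    for path in sorted(root.glob("*/*/*.gz")):
        project, mailing_list = path.parts[-3], path.parts[-2]
        totals[(project, mailing_list)] += count_messages(path)
    return totals
```

Calling `inventory(Path("/some/unpacked/dir"))` and printing the sorted items would give a first table of projects, lists and message counts. At 200 GB, a single-machine pass like this is slow, which is where the cloud computing angle comes in.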

You have been waiting for a motivation to “get into” cloud computing? Well, now you have the motivation and an interesting dataset!

