Archive for the ‘Open Data’ Category

Unpaywall (Access to Academic Publishing)

Wednesday, April 12th, 2017

How a Browser Extension Could Shake Up Academic Publishing by Lindsay McKenzie.

From the post:

Open-access advocates have had several successes in the past few weeks. The Bill & Melinda Gates Foundation started its own open-access publishing platform, which the European Commission may replicate. And librarians attending the Association of College and Research Libraries conference in March were glad to hear that the Open Access Button, a tool that helps researchers gain free access to copies of articles, will be integrated into existing interlibrary-loan arrangements.

Another initiative, called Unpaywall, is a simple browser extension, but its creators, Jason Priem and Heather Piwowar, say it could help alter the status quo of scholarly publishing.

“We’re setting up a lemonade stand right next to the publishers’ lemonade stand,” says Mr. Priem. “They’re charging $30 for a glass of lemonade, and we’re showing up right next to them and saying, ‘Lemonade for free’. It’s such a disruptive, exciting, and interesting idea, I think.”

Like the Open Access Button, Unpaywall is open-source, nonprofit, and dedicated to improving access to scholarly research. The button, devised in 2013, has a searchable database that comes into play when a user hits a paywall.

When an Unpaywall user lands on the page of a research article, the software scours thousands of institutional repositories, preprint servers, and websites like PubMed Central to see if an open-access copy of the article is available. If it is, users can click a small green tab on the side of the screen to view a PDF.
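The core check the extension performs can be sketched in a few lines. This is a hedged illustration, not Unpaywall's actual code: the field names (`best_oa_location`, `url_for_pdf`) follow the public Unpaywall REST API, and the record below is an invented sample.

```python
import json

# Toy sketch of the extension's core check: given an Unpaywall API record
# for a DOI, return a free full-text link if one exists. Field names
# follow the public Unpaywall API; the sample record is invented.
def best_oa_pdf(record):
    location = record.get("best_oa_location")
    return location.get("url_for_pdf") if location else None

sample = json.loads("""
{
  "doi": "10.1234/example",
  "is_oa": true,
  "best_oa_location": {"url_for_pdf": "https://repo.example.edu/paper.pdf"}
}
""")
print(best_oa_pdf(sample))
```

If no open copy is found, `best_oa_pdf` returns `None` and the extension simply shows no green tab.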

Sci-Hub gets an honorable mention as a “…pirate website…,” usage of which carries “…so much fear and uncertainty….” (Disclaimer: the author of those comments, Jason Priem, is one of the creators of Unpaywall.)

Hardly. What was long suspected about academic publishing has become widely known: peer review is a fiction, even at the best-known publishers, to say nothing of lesser lights in the academic universe. The “contribution” of publishers is primarily maintaining lists of editors for padding the odd resume. (Peer Review failure: Science and Nature journals reject papers because they “have to be wrong”.)

I should not overlook publishers as a source of employment for “gatekeepers”: those unable to make a contribution on their own, who seek to prevent others from doing so and, failing that, to prevent still others from learning of those contributions.

Serfdom was abolished centuries ago; academic publishing deserves a similar fate.

PS: For some reason authors are reluctant to post the web address for Sci-Hub:

Leak Publication: Sharing, Crediting, and Re-Using Leaks

Wednesday, March 22nd, 2017

If you substitute “leak” for “data” in this essay by Daniella Lowenberg, does it work for leaks as well?

Data Publication: Sharing, Crediting, and Re-Using Research Data by Daniella Lowenberg.

From the post:

In the most basic terms- Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible- the goal of a Data Publication.

As much as I admire the work of the International Consortium of Investigative Journalists (ICIJ), especially its Panama Papers project, sharing data beyond the confines of their community isn’t a value, much less a goal.

Like all secret keepers, whether in government, industry, or other organizations, the ICIJ has “reasons” for its secrecy, but none that I find any more or less convincing than those offered by other secret keepers.

Every secret keeper has an agenda its secrecy serves, an agenda that doesn’t include a public empowered to make judgments about their secret keeping.

The ICIJ proclaims Leak to Us.

A good place to leak, but include with your leak a demand, an unconditional demand, that your leak be released in its entirety within a year or two of its first publication.

Help enable the public to watch all secrets and secret keepers, not just those some secret keepers choose to expose.

Open Science: Too Much Talk, Too Little Action [Lessons For Political Opposition]

Monday, February 6th, 2017

Open Science: Too Much Talk, Too Little Action by Björn Brembs.

From the post:

Starting this year, I will stop traveling to any speaking engagements on open science (or, more generally, infrastructure reform), as long as these events do not entail a clear goal for action. I have several reasons for this decision, most of them boil down to a cost/benefit estimate. The time spent traveling does not seem worth the hardly noticeable benefits any more.

I got involved in Open Science more than 10 years ago. Trying to document the point when it all started for me, I found posts about funding all over my blog, but the first blog posts on publishing were from 2005/2006, the announcement of me joining the editorial board of newly founded PLoS ONE late 2006 and my first post on the impact factor in 2007. That year also saw my first post on how our funding and publishing system may contribute to scientific misconduct.

In an interview on the occasion of PLoS ONE’s ten-year anniversary, PLoS mentioned that they thought the publishing landscape had changed a lot in these ten years. I replied that, looking back ten years, not a whole lot had actually changed:

  • Publishing is still dominated by the main publishers which keep increasing their profit margins, sucking the public teat dry
  • Most of our work is still behind paywalls
  • You won’t get a job unless you publish in high-ranking journals.
  • Higher ranking journals still publish less reliable science, contributing to potential replication issues
  • The increase in number of journals is still exponential
  • Libraries are still told by their faculty that subscriptions are important
  • The digital functionality of our literature is still laughable
  • There are no institutional solutions to sustainably archive and make accessible our narratives other than text, or our code or our data

The only difference in the last few years really lies in the fraction of available articles, but that remains a small minority, less than 30% total.

So the work that still needs to be done is exactly the same as it was at the time Stevan Harnad published his “Subversive Proposal”, 23 years ago: getting rid of paywalls. This goal won’t be reached until all institutions have stopped renewing their subscriptions. As I don’t know of a single institution without any subscriptions, that task remains just as big now as it was 23 years ago. Noticeable progress has only been on the margins and potentially in people’s heads. Indeed, now only few scholars haven’t heard of “Open Access”, yet, but apparently without grasping the issues, as my librarian colleagues keep reminding me that their faculty believe open access has already been achieved because they can access everything from the computer in their institute.

What needs to be said about our infrastructure has been said, both in person, and online, and in print, and on audio, and on video. Those competent individuals at our institutions who make infrastructure decisions hence know enough to be able to make their rational choices. Obviously, if after 23 years of talking about infrastructure reform, this is the state we’re in, our approach wasn’t very effective and my contribution is clearly completely negligible, if at all existent. There is absolutely no loss if I stop trying to tell people what they already should know. After all, the main content of my talks has barely changed in the last eight or so years. Only more recent evidence has been added and my conclusions have become more radical, i.e., trying to tackle the radix (Latin: root) of the problem, rather than palliatively care for some tangential symptoms.

The line:

What needs to be said about our infrastructure has been said, both in person, and online, and in print, and on audio, and on video.

is especially relevant in light of the 2016 presidential election and the fundraising efforts of organizations that form the “political opposition.”

You have seen the ads in email, on Facebook, Twitter, etc., all pleading for funding to oppose the current US President.

I agree the current US President should be opposed.

But the organizations seeking funding failed to stop his rise to power.

Whether their failure was due to organizational defects or poor strategies is really beside the point. They failed.

Why should I enable them to fail again?

One data point: the Women’s March on Washington was NOT organized by organizations with permanent staffs and offices in Washington or elsewhere.

Is your contribution supporting the staffs and offices of the self-righteous (the primary function of old-line organizations), or investigation, research, reporting and support of boots on the ground?

Government excesses are not stopped by bewailing our losses but by making government agents bewail theirs.

ODI – Access To Legal Data News

Friday, January 13th, 2017

Strengthening our legal data infrastructure by Amanda Smith.

Amanda recounts an effort between the Open Data Institute (ODI) and Thomson Reuters to improve access to legal data.

From the post:

Paving the way for a more open legal sector: discovery workshop

In September 2016, Thomson Reuters and the ODI gathered publishers of legal data, policy makers, law firms, researchers, startups and others working in the sector for a discovery workshop. Its aims were to explore important data types that exist within the sector, and map where they sit on the data spectrum, discuss how they flow between users and explore the opportunities that taking a more open approach could bring.

The notes from the workshop explore current mechanisms for collecting, managing and publishing data, benefits of wider access and barriers to use. There are certain questions that remain unanswered – for example, who owns the copyright for data collected in court. The notes are open for comments, and we invite the community to share their thoughts on these questions, the data types discussed, how to make them more open and what we might have missed.

Strengthening data infrastructure in the legal sector: next steps

Following this workshop we are working in partnership with Thomson Reuters to explore data infrastructure – datasets, technologies and processes and organisations that maintain them – in the legal sector, to inform a paper to be published later in the year. The paper will focus on case law, legislation and existing open data that could be better used by the sector.

The Ministry of Justice have also started their own data discovery project, which the ODI have been contributing to. You can keep up to date on their progress by following the MOJ Digital and Technology blog and we recommend reading their data principles.

Get involved

We are looking to the legal and data communities to contribute opinion pieces and case studies to the paper on data infrastructure for the legal sector. If you would like to get involved, contact us.
…(emphasis in original)

Encouraging news, especially for those interested in building value-added tools on top of data that is made available publicly. At least they can avoid the cost of collecting data already collected by others.

Take the opportunity to comment on the notes and participate as you are able.

If you think you have seen use cases for topic maps before, consider that the Code of Federal Regulations (US), as of December 12, 2016, has 54,938 separate, but not unique, definitions of “person.” The impact of each regulation depends upon its definition of that term.

Other terms have similar semantic difficulties both in the Code of Federal Regulations as well as the US Code.
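The topic-map idea here reduces to scoping: the same term, keyed by the regulation that defines it, so “person” resolves differently under each scope. A minimal sketch, in which the citations and definition texts are invented placeholders, not real CFR language:

```python
# Minimal sketch of scoping a term by its defining regulation, so that
# "person" under one section is never conflated with "person" under
# another. Citations and definition texts are invented placeholders.
definitions = {
    ("person", "NN CFR 123.4"): "an individual or a trust",
    ("person", "NN CFR 567.8"): "an individual, corporation, or partnership",
}

def define(term, scope):
    """Resolve a term only within the scope of one regulation."""
    return definitions.get((term, scope), "no definition in this scope")

print(define("person", "NN CFR 123.4"))
print(define("person", "NN CFR 567.8"))
```

The payoff is that a query for “person” must name its regulatory scope, which is exactly the discipline the flat text of the CFR fails to enforce.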

Guarantees Of Public Access In Trump Administration (A Perfect Data Storm)

Saturday, December 31st, 2016

I read hand wringing over the looming secrecy of the Trump administration on a daily basis.

More truthfully, I skip over daily hand wringing over the looming secrecy of the Trump administration.

For two reasons.

First, as reported in US government subcontractor leaks confidential military personnel data by Charlie Osborne, government data doesn’t require hacking, just a little initiative.

In this particular case, it was rsync without a username or password that made this data leak possible.

Editors should ask their reporters before funding FOIA suits: “Have you tried rsync?”
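The question is less rhetorical than it sounds. A sketch of the probe it implies: asking a host’s rsync daemon for its public module listing, no credentials involved. The host name below is a placeholder, not a real target, and the actual network call is left commented out.

```python
# Build the rsync command that asks a daemon for its public module
# listing; note that no username or password is involved. The host
# below is a placeholder, not a real target.
def rsync_probe_command(host):
    # An empty module path after the host asks the daemon to list the
    # modules it exports anonymously.
    return ["rsync", f"rsync://{host}/"]

cmd = rsync_probe_command("files.example.com")
print(" ".join(cmd))
# To actually run the probe (needs network access and rsync installed):
# import subprocess
# subprocess.run(cmd, capture_output=True, text=True, timeout=30)
```

If the daemon answers with a module list, everything in those modules is readable by anyone who thought to ask.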

Second, the alleged-to-be-Trump-nominees for cabinet and lesser positions, remind me of this character from Dilbert: November 2, 1992:


Trump appointees may have mastered the pointy end of pencils, but their ability to manage cyber-security will be as shown.

When you add up the cyber-security incompetence of Trump appointees, complaints from Inspectors General about agency security, and agencies leaking to protect their positions/turf, you have the conditions for a perfect data storm.

A perfect data storm that may see the US government hemorrhaging data like never before.

PS: You know my preference, post leaks on receipt in their entirety. As for “consequences,” consider those a down payment on what awaits people who betray humanity, their people, colleagues and family. They could have chosen differently and didn’t. What more can one say?

Weakly Weaponized Open Data

Friday, November 4th, 2016

Berners-Lee raises spectre of weaponized open data by Bill Camarda.

From the post:


Practically everybody loves open data, i.e. “data that anyone can access, use or share”. And nobody loves it more than Tim Berners-Lee, creator of the World Wide Web, and co-founder of the Open Data Institute (ODI).

Berners-Lee and his ODI colleagues have spent years passionately evangelizing governments and companies to publicly release their non-personal data for use to improve communities.

So when he recently told the Guardian that hackers could use open data to create societal chaos, it might have been this year’s most surprising “man bites dog” news story.

What’s going on here? The growing fear of data sabotage, that’s what.

Bill focuses on the manipulation and/or planting of false data, which could result in massive traffic jams, changes in market prices, etc.

In fact, Berners-Lee says in the original Guardian story:

“If you falsify government data then there are all kinds of ways that you could get financial gain, so yes,” he said, “it’s important that even though people think about open data as not a big security problem, it is from the point of view of being accurate.”

He added: “I suppose it’s not as exciting as personal data for hackers to get into because it’s public.”

Disruptive to some, profitable to others, but this is what should be called weakly weaponized open data.

Here is one instance of strongly weaponized open data.

Scenario: We Don’t Need No Water, Let The Motherfucker Burn

The United States is experiencing a continuing drought. From the U.S. Drought Monitor:


Keying on the solid red color around Atlanta, GA, Fire Weather, a service of the National Weather Service, estimates the potential impact of fires near Atlanta:


Impacted by a general conflagration around Atlanta:

Population: 2,783,418
Airports: 38
Miles of Interstate: 556
Miles of Rail: 2,399
Parks: 4
Area: 27,707 Sq. Miles

Pipelines are missing from the list of impacts. For that, consult the National Pipeline Mapping System where even a public login reveals:


The red lines are hazardous liquid pipelines, blue lines are gas transmission pipelines, the yellow lines outline Fulton County.

We have located a likely place for a forest fire, have some details on its probable impact and a rough idea of gas and other pipelines in the prospective burn area.

Oh, we need a source of ignition. Road flares anyone?


From the WSDOT, Winter Driving Supply Checklist. Emergency kits with flares are available at box stores and online.

Bottom line:

Intentional forest fires can be planned from public data sources. Governments gratuitously suggest non-suspicious methods of transporting fire-starting materials.

Details I have elided, such as evacuation routes, fire watch stations, drones as fire starters, fire histories, public events, plus greater detail from the resources cited, are all available from public sources.

What are your Weaponized Open Data risks?

Version 2 of the Hubble Source Catalog [Model For Open Access – Attn: Security Researchers]

Friday, September 30th, 2016

Version 2 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Version 2 includes:

  • Four additional years of ACS source lists (i.e., through June 9, 2015). All ACS source lists go deeper than in version 1. See current HLA holdings for details.
  • One additional year of WFC3 source lists (i.e., through June 9, 2015).
  • Cross-matching between HSC sources and spectroscopic COS, FOS, and GHRS observations.
  • Availability of magauto values through the MAST Discovery Portal. The maximum number of sources displayed has increased from 10,000 to 50,000.

The HSC v2 contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists from HLA version DR9.1 (data release 9.1). The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. The relative astrometry is supported by using Pan-STARRS, SDSS, and 2MASS as the astrometric backbone for initial corrections. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).


There are currently three ways to access the HSC as described below. We are working towards having these interfaces consolidated into one primary interface, the MAST Discovery Portal.

  • The MAST Discovery Portal provides a one-stop web access to a wide variety of astronomical data. To access the Hubble Source Catalog v2 through this interface, select Hubble Source Catalog v2 in the Select Collection dropdown, enter your search target, click search and you are on your way. Please try Use Case Using the Discovery Portal to Query the HSC
  • The HSC CasJobs interface permits you to run large and complex queries, phrased in the Structured Query Language (SQL).
  • HSC Home Page

    – The HSC Summary Search Form displays a single row entry for each object, as defined by a set of detections that have been cross-matched and hence are believed to be a single object. Averaged values for magnitudes and other relevant parameters are provided.

    – The HSC Detailed Search Form displays an entry for each separate detection (or nondetection if nothing is found at that position) using all the relevant Hubble observations for a given object (i.e., different filters, detectors, separate visits).

Amazing isn’t it?
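Part of what makes it amazing: the CasJobs route takes raw SQL, so a cone search can be assembled in a few lines. The table and function names below are assumptions patterned on HSC/SDSS CasJobs examples, so verify them against the CasJobs schema browser before submitting a job.

```python
# Assemble a cone-search SQL string for the HSC CasJobs interface.
# The table (SumPropMagAper2Cat) and function (fDistanceArcMinEq) names
# are assumptions patterned on HSC/SDSS CasJobs examples; check them
# against the CasJobs schema browser before running.
def hsc_cone_search_sql(ra_deg, dec_deg, radius_arcmin, max_rows=50000):
    return (
        f"SELECT TOP {max_rows} MatchID, MatchRA, MatchDec, NumImages\n"
        f"FROM SumPropMagAper2Cat\n"
        f"WHERE dbo.fDistanceArcMinEq({ra_deg}, {dec_deg}, MatchRA, MatchDec)"
        f" < {radius_arcmin}"
    )

print(hsc_cone_search_sql(210.802, 54.349, 0.5))
```

The 50,000-row ceiling mirrors the display limit the HSC v2 announcement mentions for the Discovery Portal.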

The astronomy community long ago vanquished data hoarding and constructed tools to avoid moving very large data sets across the network.

All while enabling more and not less access and research using the data.

Contrast that to the sorry state of security research, where example code is condemned, if not actually prohibited by law.

Yet, if you believe current news reports (always an iffy proposition), cybercrime is growing by leaps and bounds. (PwC Study: Biggest Increase in Cyberattacks in Over 10 Years)

How successful is the “data hoarding” strategy of the security research community?

Mapping U.S. wildfire data from public feeds

Monday, August 29th, 2016

Mapping U.S. wildfire data from public feeds by David Clark.

From the post:

With the Mapbox Datasets API, you can create data-based maps that continuously update. As new data arrives, you can push incremental changes to your datasets, then update connected tilesets or use the data directly in a map.

U.S. wildfires have been in the news this summer, as they are every summer, so I set out to create an automatically updating wildfire map.

An excellent example of using public data feeds to create a resource not otherwise available.
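The incremental update David describes boils down to one PUT per feature against the Datasets API. A hedged stdlib sketch follows; the endpoint shape tracks the Mapbox Datasets API docs, while the username, dataset id, token, and fire feature are all placeholders.

```python
import json
from urllib.request import Request

# Build the PUT request that upserts one wildfire feature into a Mapbox
# dataset. Endpoint shape follows the Mapbox Datasets API docs; the
# username, dataset id, token, and feature below are placeholders.
def upsert_feature_request(username, dataset_id, feature, token):
    url = (
        f"https://api.mapbox.com/datasets/v1/{username}/{dataset_id}"
        f"/features/{feature['id']}?access_token={token}"
    )
    return Request(
        url,
        data=json.dumps(feature).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

fire = {
    "id": "fire-001",
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-122.63, 41.72]},
    "properties": {"name": "Example Fire", "acres": 1200},
}
req = upsert_feature_request("demo-user", "abc123", fire, "sk.placeholder")
print(req.get_method(), req.full_url)
```

Sending the prepared request with `urllib.request.urlopen(req)` on each new feed record is what keeps the map continuously updated.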

Historical fire data can be found at: Federal Wildland Fire Occurrence Data, spanning 1980 through 2015.

The Outlooks page of the National Interagency Coordination Center provides four month (from current month) outlook and weekly outlook fire potential reports and maps.

Hunters Bag > 400 Database Catalogs

Monday, August 29th, 2016

Transparency Hunters Capture More than 400 California Database Catalogs by Dave Maass.

The post in its entirety:

A team of over 40 transparency activists aimed their browsers at California this past weekend, collecting more than 400 database catalogs from local government agencies, as required under a new state law. Together, participants in the California Database Hunt shined light on thousands upon thousands of government record systems.

California S.B. 272 requires every local government body, with the exception of educational agencies, to post inventories of their “enterprise systems,” essentially every database that holds records on members of the public or is used as a primary source of information. These database catalogs were required to be posted online (at least by agencies with websites) by July 1, 2016.

EFF, the Data Foundation, the Sunlight Foundation, and Level Zero, combined forces to host volunteers in San Francisco, Washington, D.C., and remotely. More than 40 volunteers scoured as many local agency websites as we could in four hours—cities, counties, regional transportation agencies, water districts, etc. Here are the rough numbers:

680 – The number of unique agencies that supporters searched

970 – The number of searches conducted (Note: agencies found on the first pass not to have catalogs were searched a second time)

430 – Number of agencies with database catalogs online

250 – Number of agencies without database catalogs online, as verified by two people

Download a spreadsheet of the local government database catalogs we found: Excel/TSV

Download a spreadsheet of cities and counties that did not have S.B. 272 catalogs: Excel/TSV

Please note that for each of the cities and counties identified as not posting database catalogs, at least two volunteers searched for the catalogs and could not find them. It is possible that those agencies do in fact have S.B. 272-compliant catalogs posted somewhere, but not in what we would call a “prominent location,” as required by the new law. If you represent an agency that would like its database catalog listed, please send an email to

We owe a debt of gratitude to the dozens of volunteers who sacrificed their Saturday afternoons to help make local government in California a little less opaque. Check out this 360-degree photo of our San Francisco team on Facebook.

In the coming days and weeks, we plan to analyze and share the data further. Stay tuned, and if you find anything interesting perusing these database catalogs, please drop us a line at

Of course, bagging the database catalogs is like having a collection of Christmas catalogs. It’s great, but there are more riches within!

What data products would you look for first?

Updated to mirror changes (clarification) in original.

How-To Safely Protest on the Downtown Connector – #BLM

Wednesday, July 13th, 2016

Atlanta doesn’t have a spotless record on civil rights, but Mayor Kasim Reed agreeing to meet with #BLM leaders on July 18, 2016, is a welcome contrast to the response in the police state of Baton Rouge, for example.

During this “cooling off” period, I want to address Mayor Reed’s concern for the safety of #BLM protesters and motorists should #BLM protests move onto the Downtown Connector.

Being able to protest on the Downtown Connector would be far more effective than blocking random Atlanta surface streets, by day or night. Mayor Reed’s question is how to do so safely.

Here is Google Maps’ representation of a part of the Downtown Connector:


That view isn’t helpful on the issue of safety but consider a smaller portion of the Downtown Connector as seen by Google Earth:


The safety question has two parts: How to transport #BLM protesters to a protest site on the Downtown Connector? How to create a safe protest site on the Downtown Connector?

A nearly constant element of the civil rights movement provides the answer: buses. From the Montgomery Bus Boycott, Freedom Riders, to the long experiment with busing to achieve desegregation in education.

Looking at an enlargement of an image of the Downtown Connector, you will see that ten (10) buses would fill all the lanes, plus the emergency lane and the shoulder, preventing any traffic from going around the buses. That provides safety for protesters. Not to mention transporting all the protesters safely to the protest site.

The Downtown Connector is often described as a “parking lot” so drivers are accustomed to traffic slowing to a full stop. If a group of buses formed a line across all lanes of the Downtown Connector and slowed to a stop, traffic would be safely stopped. That provides safety for drivers.

The safety of both protesters and drivers depends upon coordination between cars and buses to fill all the lanes of the Downtown Connector and then slowing down in unison, plus buses occupying the emergency lane and shoulder. Anything less than full interdiction of the highway would put both protesters and drivers at risk.

Churches and church buses have often played pivotal roles in the civil rights movement so the means for creating safe protest spaces, even on the Downtown Connector, are not out of reach.

There are other logistical and legal issues involved in such a protest but I have limited myself to offering a solution to Mayor Reed’s safety question.

PS: The same observations apply to any limited access motorway, modulo adaptation to your local circumstances.

The No-Value-Add Of Academic Publishers And Peer Review

Tuesday, June 21st, 2016

Comparing Published Scientific Journal Articles to Their Pre-print Versions by Martin Klein, Peter Broadwell, Sharon E. Farb, Todd Grappone.


Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: U.S. academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. We have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers and their final published counterparts. This comparison had two working assumptions: 1) if the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and 2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.

The authors have performed a very detailed analysis of pre-prints, 90%–95% of which were first published as open pre-prints, and conclude that there is no appreciable difference between the pre-prints and the final published versions.
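The “standard similarity measures” step of such a comparison can be sketched with Python’s stdlib. This is a toy version, not the paper’s pipeline, and the two texts below are invented stand-ins for a pre-print and its published counterpart.

```python
from difflib import SequenceMatcher

# Toy version of the study's comparison: a character-level similarity
# ratio between a pre-print and its published counterpart. The two
# texts are invented stand-ins, not drawn from the study's corpus.
preprint = (
    "We measure the rotation curves of nearby galaxies using "
    "publicly archived radio observations."
)
published = (
    "We measured the rotation curves of nearby galaxies using "
    "publicly archived radio observations."
)

ratio = SequenceMatcher(None, preprint, published).ratio()
print(f"similarity: {ratio:.3f}")
```

A ratio near 1.0 across a corpus is precisely the “text contents changed very little” finding the abstract reports.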

I take “…no appreciable difference…” to mean academic publishers and the peer review process, despite claims to the contrary, contribute little or no value to academic publications.

How’s that for a bargaining chip in negotiating subscription prices?

Where Has Sci-Hub Gone?

Saturday, June 18th, 2016

While I was writing about the latest EC idiocy (link tax), I was reminded of Sci-Hub.

Just checking to see if it was still alive, I tried

404 by standard DNS service.

If you are having the same problem, Mike Masnick reports in Sci-Hub, The Repository Of ‘Infringing’ Academic Papers Now Available Via Telegram, you can access Sci-Hub via:

I’m not on Telegram, yet, but that may be changing soon. 😉

BTW, while writing this update, I stumbled across: The New Napster: How Sci-Hub is Blowing Up the Academic Publishing Industry by Jason Shen.

From the post:

This is obviously piracy. And Elsevier, one of the largest academic journal publishers, is furious. In 2015, the company earned $1.1 billion in profits on $2.9 billion in revenue [2] and Sci-hub directly attacks their primary business model: subscription service it sells to academic organizations who pay to get access to its journal articles. Elsevier filed a lawsuit against Sci-Hub in 2015, claiming Sci-hub is causing irreparable injury to the organization and its publishing partners.

But while Elsevier sees Sci-Hub as a major threat, for many scientists and researchers, the site is a gift from the heavens, because they feel unfairly gouged by the pricing of academic publishing. Elsevier is able to boast a lucrative 37% profit margin because of the unusual (and many might call exploitative) business model of academic publishing:

  • Scientists and academics submit their research findings to the most prestigious journal they can hope to land in, without getting any pay.
  • The journal asks leading experts in that field to review papers for quality (this is called peer-review and these experts usually aren’t paid)
  • Finally, the journal turns around and sells access to these articles back to scientists/academics via the organization-wide subscriptions at the academic institution where they work or study

There’s piracy afoot, of that I have no doubt. Consider that Elsevier:

  • Relies on research it does not sponsor
  • Research results are submitted to it for free
  • Research is reviewed for free
  • Research is published in journals of value only because of the free contributions to them
  • Elsevier makes a 37% profit off of that free content

There is piracy, but Jason fails to point to Elsevier as the pirate.

Sci-Hub/Alexandra Elbakyan is re-distributing intellectual property that was stolen by Elsevier from the academic community, for its own gain.

It’s time to bring Elsevier’s reign of terror against the academic community to an end. Support Sci-Hub in any way possible.

Reproducible Research Resources for Research(ing) Parasites

Friday, June 3rd, 2016

Reproducible Research Resources for Research(ing) Parasites by Scott Edmunds.

From the post:

Two new research papers on scabies and tapeworms published today showcase a new collaboration, demonstrating a new way to share scientific methods that allows scientists to better repeat and build upon these complicated studies of difficult-to-study parasites. It also highlights a new means of writing all research papers with citable methods that can be updated over time.

While there has been recent controversy (and hashtags in response) from some of the more conservative sections of the medical community calling those who use or build on previous data “research parasites”, as data publishers we strongly disagree with this. And also feel it is unfair to drag parasites into this when they can teach us a thing or two about good research practice. Parasitology remains a complex field given the often extreme differences between parasites, which all fall under the umbrella definition of an organism that lives in or on another organism (host) and derives nutrients at the host’s expense. Published today in GigaScience are articles on two parasitic organisms, scabies and on the tapeworm Schistocephalus solidus. Not only are both papers in parasitology, but the way in which these studies are presented showcase a new collaboration that provides a unique means for reporting the Methods that serves to improve reproducibility. Here the authors take advantage of an open access repository of scientific methods and a collaborative protocol-centered platform, and we for the first time have integrated this into our submission, review and publication process. We now also have a groups page on the portal where our methods can be stored.

A great example of how sharing data advances research.

Of course, that assumes that one of your goals is to advance research and not solely yourself, your funding and/or your department.

Such self-centered as opposed to research-centered individuals do exist, but I would not malign true parasites by describing them as such, even colloquially.

The days of science data hoarders are numbered and one can only hope that the same is true for the “gatekeepers” of humanities data, manuscripts and artifacts.

The only known contribution of hoarders or “gatekeepers” has been to the retarding of their respective disciplines.

Given the choice of advancing your field along with yourself, or only yourself, which one will you choose?

Open Data Institute – Join From £1 (Süddeutsche Zeitung (SZ), “Nein!”)

Tuesday, April 26th, 2016

A new offer for membership in the Open Data Institute:

Data impacts everybody. It’s the infrastructure that underpins transparency, accountability, public services, business innovation and civil society.

Together we can embrace open data to improve how we access healthcare services, discover cures for diseases, understand our governments, travel around more easily and much, much more.

Are you eager to learn more about it, collaborate with it or meet others who are already making a difference with it? From just £1 join our growing, collaborative global network of individuals, students, businesses, startups and organisations, and receive:

  • invitations to events and open evenings organised by the ODI and beyond
  • opportunities to promote your own news and events across the network
  • updates up to twice a month from the world of data and open innovation
  • 30% discount on all our courses
  • 20% reduction on our annual ODI Summit

Become a member from £1

I’d like to sign my organisation up

If you search for Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, you will come up empty.

SZ is in favor of transparency and accountability, but only for others. Never for SZ.

SZ claims in some venues to be concerned with the privacy of individuals mentioned in the Panama Papers.

How should we weigh the privacy rights of individuals who were parties to the looting of nations against the public’s right to judge reporting on them? How is financial regulation reform possible without the details?

SZ is comfortable with protecting looters of nations and obstructing meaningful financial reform.

You can judge news media by the people they protect.

Academic, Not Industrial Secrecy

Saturday, March 12th, 2016

Data too important to share: do those who control the data control the message? by Peter Doshi (BMJ 2016;352:i1027).

Read Peter’s post for the details but the problem in a nutshell:

“The main concern we had was that Fresenius was involved in the process,” Myburgh explained. He said there was never any question of Krumholz’s independence or credentials. Rather, it was a “concern that this was a way for Fresenius to get the data once they were in the public domain. We want restrictions on who could do the analyses.”

Under the YODA model Krumholz proposed, the data would be reanalysed by independent parties before being made more broadly available.

“We have no issue with the concept of data sharing,” Myburgh said. “The concerns we have come down to the people with ulterior motives which contradict or do not adhere to the scientific principles we adhere to. That’s the danger.”

Myburgh described himself as an impartial scientist, in contrast to those who have challenged his study. “I’ve heard some of the protagonists of starch. Senior figures wanted to make a point. We do research to answer a question. They do analyses to prove a point.” (emphasis added)

You can hear the echoes of Myburgh’s position of:

We want restrictions on who could do the analyses.

in every government claim for not releasing data that supports government conclusions.

If “terrorists” really are the danger the government claims, don’t you think releasing the data on which that claim is based would convince everyone? Or nearly everyone?

Ah, but some of us might not think opposing corrupt, puppet governments in the Middle East is the same thing as “terrorism.”

And still others of us might not think opposing an oppressive theocracy is the same as “terrorism.”

Yes, more data could lead to more informed discussion, but it could also lead to inconvenient questions.

If Myburgh and colleagues were told this would be their last funded study from any source, unless and until they release this and other trial data, they would sing a different tune.

Anyone with a list of the funders for Myburgh and his colleagues?

Email addresses would be a good start.

Sci-Hub Tip: Converting Paywall DOIs to Public Access

Thursday, February 11th, 2016

In a tweet Jon Tenn@nt points out that:

Reminder: add “” after the .com in the URL of pretty much any paywalled paper to gain instant free access.

BTW, I tested Jon’s advice with:****/*******

re-cast as:****/*******

And it works!

With a little scripting, you can convert your paywall DOIs into public access.
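A minimal sketch of such a script, assuming only that the trick is to append a mirror suffix to the publisher’s hostname. The suffix below is a placeholder, since the post elides the actual domain and Sci-Hub mirrors change over time:

```python
from urllib.parse import urlparse

# Placeholder -- the post elides the actual mirror domain, and
# Sci-Hub mirrors change over time.
MIRROR_SUFFIX = "example-mirror.io"

def to_mirror_url(paywall_url: str) -> str:
    """Append the mirror suffix to the publisher's hostname."""
    parts = urlparse(paywall_url)
    return parts._replace(netloc=parts.netloc + "." + MIRROR_SUFFIX).geturl()

print(to_mirror_url("http://journals.example.com/article/10.1000/xyz123"))
# -> http://journals.example.com.example-mirror.io/article/10.1000/xyz123
```

Run over a file of paywalled URLs, this turns a reading list into an open-access reading list in one pass.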

This “worked for me” so if you encounter issues, please ping me so I can update this post.

Happy reading!

Tackling Zika

Thursday, February 11th, 2016

F1000Research launches rapid, open, publishing channel to help scientists tackle Zika

From the post:

ZAO provides a platform for scientists and clinicians to publish their findings and source data on Zika and its mosquito vectors within days of submission, so that research, medical and government personnel can keep abreast of the rapidly evolving outbreak.

The channel provides diamond-access: it is free to access and articles are published free of charge. It also accepts articles on other arboviruses such as Dengue and Yellow Fever.

The need for the channel is clearly evidenced by a recent report on the global response to the Ebola virus by the Harvard-LSHTM (London School of Hygiene & Tropical Medicine) Independent Panel.

The report listed ‘Research: production and sharing of data, knowledge, and technology’ among its 10 recommendations, saying: “Rapid knowledge production and dissemination are essential for outbreak prevention and response, but reliable systems for sharing epidemiological, genomic, and clinical data were not established during the Ebola outbreak.”

Dr Megan Coffee, an infectious disease clinician at the International Rescue Committee in New York, said: “What’s published six months, or maybe a year or two later, won’t help you – or your patients – now. If you’re working on an outbreak, as a clinician, you want to know what you can know – now. It won’t be perfect, but working in an information void is even worse. So, having a way to get information and address new questions rapidly is key to responding to novel diseases.”

Dr. Coffee is also a co-author of an article published in the channel today, calling for rapid mobilisation and adoption of open practices in an important strand of the Zika response: drug discovery –

Sean Ekins, of Collaborative Drug Discovery, and lead author of the article, which is titled ‘Open drug discovery for the Zika virus’, said: “We think that we would see rapid progress if there was some call for an open effort to develop drugs for Zika. This would motivate members of the scientific community to rally around, and centralise open resources and ideas.”

Another co-author of the article, Lucio Freitas-Junior of the Brazilian Biosciences National Laboratory, added: “It is important to have research groups working together and sharing data, so that scarce resources are not wasted in duplication. This should always be the case for neglected diseases research, and even more so in the case of Zika.”

Rebecca Lawrence, Managing Director, F1000, said: “One of the key conclusions of the recent Harvard-LSHTM report into the global response to Ebola was that rapid, open data sharing is essential in disease outbreaks of this kind and sadly it did not happen in the case of Ebola.

“As the world faces its next health crisis in the form of the Zika virus, F1000Research has acted swiftly to create a free, dedicated channel in which scientists from across the globe can share new research and clinical data, quickly and openly. We believe that it will play a valuable role in helping to tackle this health crisis.”


For more information:

Andrew Baud, Tala (on behalf of F1000), +44 (0) 20 3397 3383 or +44 (0) 7775 715775

Excellent news for researchers but a direct link to the new channel would have been helpful as well: Zika & Arbovirus Outbreaks (ZAO).

See this post: The Zika & Arbovirus Outbreaks channel on F1000Research by Thomas Ingraham.

News organizations should note that as of today, 11 February 2016, ZAO offers 9 articles, 16 posters and 1 set of slides. Those numbers are likely to increase rapidly.

Oh, did I mention the ZAO channel is free?

Unlike some journals, payment, prestige, privilege, are not pre-requisites for publication.

Useful research on Zika & Arboviruses is the only requirement.

I know, sounds like a dangerous precedent but defeating a disease like Zika will require taking risks.

Addressing The Concerns Of The Selfish

Monday, January 25th, 2016

A burnt hand didn’t teach any lessons to Dr. Jeffrey M. Drazen of the New England Journal of Medicine (NEJM).

Just last week Jeffrey and a co-conspirator took to the editorial page of the NEJM to denounce as “parasites,” scientists who reuse data developed by others. Especially, if the data developers weren’t included in the new work. See: Parasitic Re-use of Data? Institutionalizing Toadyism.

Overly sensitive, as protectors of greedy people tend to be, Jeffrey returns to the editorial page to say:

In the process of formulating our policy, we spoke to clinical trialists around the world. Many were concerned that data sharing would require them to commit scarce resources with little direct benefit. Some of them spoke pejoratively in describing data scientists who analyze the data of others. To make data sharing successful, it is important to acknowledge and air those concerns. (Data Sharing and The Journal)

On target with concerns about data sharing requiring “…scarce resources with little direct benefit.”

Except Jeffrey forgot to mention that in his editorial about “parasites.”

Not a single word. The “cost free” myth of sharing data persists and the NEJM’s voice could be an important one in dispelling that myth.

But not Jeffrey, he took up his lance to defend the concerns of the selfish.

I will post separately on the issue of the cost of data sharing, etc., which, as I say, is a legitimate concern.

We don’t need to resort to toadyism to satisfy the concerns of scientists over re-use of their data.

Create all the mechanisms needed to compensate for the sharing of data, and if anyone objects to or has “concerns” about re-use of data, cease funding them and/or any project of which they are a member.

There is no right to public funding for research, especially for scientists who have developed a sense of entitlement to public funding, for their own benefit.

You might want to compare the NEJM position to that of the radio astronomy community which shares both raw and processed data with anyone who wants to download it.

It’s a question of “privilege,” and not public safety, etc.

It’s annoying enough that people are selfish with research data, don’t be dishonest as well.

Introducing Kaggle Datasets [No Data Feudalism Here]

Saturday, January 23rd, 2016

Introducing Kaggle Datasets

From the post:

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility to the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Are you interested in publishing one of your datasets on Kaggle Datasets? Submit a sample here.

Unlike some medievalists who publish in the New England Journal of Medicine, Kaggle not only makes the data sets freely available, but offers tools to help you along.

Kaggle will also assist you in making your datasets available as well.

Parasitic Re-use of Data? Institutionalizing Toadyism.

Thursday, January 21st, 2016

Data Sharing by Dan L. Longo, M.D., and Jeffrey M. Drazen, M.D, N Engl J Med 2016; 374:276-277 January 21, 2016 DOI: 10.1056/NEJMe1516564.

This editorial in the New England Journal of Medicine advocates the following for re-use of medical data:

How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

I had to check my calendar to make sure April the 1st hadn’t slipped up on me.

This is one of the most bizarre and malignant proposals on data re-use that I have seen.

If you have an original idea, you have to approach other researchers as a suppliant and ask them to benefit from your idea, possibly using their data in new and innovative ways?

Does that smack of a “good old boys/girls” club to you?

If anyone uses the term parasitic or parasite with regard to data re-use, be sure to respond with the question:

How much do dogs in the manger contribute to science?

That phenomenon is not unknown in the humanities or in biblical studies. There was a wave of very disgusting dissertations that began with “…X entrusted me with this fragment of the Dead Sea Scrolls….”

I suppose those professors knew better than I do whether their ability to attract students rested on merit or on their hoarding of original text fragments. You should judge them by their choices.

rOpenSci (updated tutorials) [Learn Something, Write Something]

Monday, January 4th, 2016

rOpenSci has updated 16 of its tutorials!

More are on the way!

Need a detailed walk through of what our packages allow you to do? Click on a package below, quickly install it and follow along. We’re in the process of updating existing package tutorials and adding several more in the coming weeks. If you find any bugs or have comments, drop a note in the comments section or send us an email. If a tutorial is available in multiple languages we indicate that with badges, e.g., (English) (Português).

  • alm    Article-level metrics
  • antweb    AntWeb data
  • aRxiv    Access to arXiv text
  • bold    Barcode data
  • ecoengine    Biodiversity data
  • ecoretriever    Retrieve ecological datasets
  • elastic    Elasticsearch R client
  • fulltext    Text mining client
  • geojsonio    GeoJSON/TopoJSON I/O
  • gistr    Work w/ GitHub Gists
  • internetarchive    Internet Archive client
  • lawn    Geospatial Analysis
  • musemeta    Scrape museum metadata
  • rAltmetric    Altmetric.com client
  • rbison    Biodiversity data from USGS
  • rcrossref    Crossref client
  • rebird    eBird client
  • rentrez    Entrez client
  • rerddap    ERDDAP client
  • rfisheries    OpenFisheries.org client
  • rgbif    GBIF biodiversity data
  • rinat    Inaturalist data
  • RNeXML    Create/consume NeXML
  • rnoaa    Client for many NOAA datasets
  • rplos    PLOS text mining
  • rsnps    SNP data access
  • rvertnet    VertNet biodiversity data
  • rWBclimate    World Bank Climate data
  • solr    SOLR database client
  • spocc    Biodiversity data one stop shop
  • taxize    Taxonomic toolbelt
  • traits    Trait data
  • treebase    Treebase data
  • wellknown    Well-known text <-> GeoJSON
  • More tutorials on the way.

Good documentation is hard to come by and good tutorials even more so.

Yet, here at rOpenSci you will find thirty-four (34) tutorials, and more on the way.

Let’s answer that moronic security saying: See Something, Say Something, with:

Learn Something, Write Something.

Natural England opens-up seabed datasets

Monday, December 21st, 2015

Natural England opens-up seabed datasets by Hannah Ross.

From the post:

Following the Secretary of State’s announcement in June 2015 that Defra would become an open, data driven organisation we have been working hard at Natural England to start unlocking our rich collection of data. We have opened up 71 data sets, our first contribution to the #OpenDefra challenge to release 8000 sets of data by June 2016.

What is the data?

The data is primarily marine data which we commissioned to help identify marine protected areas (MPAs) and monitor their condition.

We hope that the publication of these data sets will help many people get a better understanding of:

  • marine nature and its conservation and monitoring
  • the location of habitats sensitive to human activities such as oil spills
  • the environmental impact of a range of activities from fishing to the creation of large marinas

The data is available for download on the EMODnet Seabed Habitats website under the Open Government Licence and more information about the data can be found at DATA.GOV.UK.

This is just the start…

Throughout 2016 we will be opening up lots more of our data, from species records to data from aerial surveys.

We’d like to know what you think of our data; please take a look and let us know what you think at

Image: Sea anemone (sunset cup-coral), Copyright (CC by-nc-nd 2.0) Natural England/Roger Mitchell 1978.

Great new data source and looking forward to more.

A welcome layer on this data would be, where possible, identification of activities and people responsible for degradation of sea anemone habitats.

Sea anemones are quite beautiful but lack the ability to defend against human disruption of their environment.

Preventing disruption of sea anemone habitats is a step forward.

Discouraging those who practice disruption of sea anemone habitats is another.

Planet Platform Beta & Open California:…

Friday, October 16th, 2015

Planet Platform Beta & Open California: Our Data, Your Creativity by Will Marshall.

From the post:

At Planet Labs, we believe that broad coverage frequent imagery of the Earth can be a significant tool to address some of the world’s challenges. But this can only happen if we democratise access to it. Put another way, we have to make data easy to access, use, and buy. That’s why I recently announced at the United Nations that Planet Labs will provide imagery in support of projects to advance the Sustainable Development Goals.

Today I am proud to announce that we’re releasing a beta version of the Planet Platform, along with our imagery of the state of California under an open license.

The Planet Platform Beta will enable a pioneering cohort of developers, image analysts, researchers, and humanitarian organizations to get access to our data, web-based tools and APIs. The goal is to provide a “sandbox” for people to start developing and testing their apps on a stack of openly available imagery, with the goal of jump-starting a developer community; and collecting data feedback on Planet’s data, tools, and platform.

Our Open California release includes two years of archival imagery of the whole state of California from our RapidEye satellites and 2 months of data from the Dove satellite archive; and will include new data collected from both constellations on an ongoing basis, with a two-week delay. The data will be under an open license, specifically CC BY-SA 4.0. The spirit of the license is to encourage R&D and experimentation in an “open data” context. Practically, this means you can do anything you want, but you must “open” your work, just as we are opening ours. It will enable the community to discuss their experiments and applications openly, and thus, we hope, establish the early foundation of a new geospatial ecosystem.

California is our first Open Region, but shall not be the last. We will open more of our data in the future. This initial release will inform how we deliver our data set to a global community of customers.

Resolution for the Dove satellites is 3-5 meters and for the RapidEye satellites, 5 meters.

Not quite goldfish bowl or Venice Beach resolution but useful for other purposes.

Now would be a good time to become familiar with managing and annotating satellite imagery. Higher resolutions, public and private are only a matter of time.

Data Portals

Monday, October 12th, 2015

Data Portals

From the webpage:

A Comprehensive List of Open Data Portals from Around the World

Two things spring to mind:

First, the number of portals seems a bit lite given the rate of data accumulation.

Second, take a look at the geographic distribution of data portals. Asia and Northern Africa seem rather sparse don’t you think?

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories [But Who Pays?]

Sunday, September 13th, 2015

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories by Kirk Borne.

From the post:

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

The following seven V’s represent characteristics and challenges of open data:

  1. Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety: the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.
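A provenance record of the kind item 7 describes can be sketched as a small structured document. The field names below are illustrative only, not drawn from any particular standard (a formal vocabulary such as W3C PROV would be used in practice):

```python
import json

# Illustrative provenance record for an open dataset. Field names and
# values are hypothetical; they mirror the elements listed in item 7:
# ownership, origin, chain of custody, transformations, and uses.
provenance = {
    "dataset": "example-survey",
    "ownership": "Example Research Group",
    "origin": "field collection, 2013-2014",
    "chain_of_custody": ["collector", "archivist", "open-data portal"],
    "transformations": [
        {"step": "unit conversion to SI", "software": "convert-tool 1.2.0"},
    ],
    "uses": ["baseline for habitat monitoring"],
}

# A permanent record should survive serialization unchanged.
assert json.loads(json.dumps(provenance)) == provenance
```

Shipping such a record alongside each released dataset costs little and answers most of the lineage questions item 7 raises.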

Open Data has many benefits when the 7 V’s are answered!

Kirk doesn’t address who pays the cost of answering the 7 V’s.

The most obvious one for topic maps:

#5 Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use….

Yes, “…when you provide the data for others to use.” If I can use my data without documenting the semantics and schema (data models), who covers the cost of my creating that documentation and schemas?

In any sufficiently large enterprise, when you ask for assistance, the response will ask for the contract number to which the assistance should be billed.

If you know your Heinlein, then you know the acronym TANSTAAFL (“There ain’t no such thing as a free lunch”) and its application here is obvious.

Or should I say its application is obvious from the repeated calls for better documentation and models and the continued absence of the same?

Who do you think should be paying for better documentation and data models?

Put Your Open Data Where Your Mouth Is (Deadline for Submission: 28 June 2015)

Wednesday, June 17th, 2015

Open Data as Open Educational Resources – Case Studies: Call for Participation

From the call:

The Context:

Open Data is invaluable to support researchers, but we contend that open datasets used as Open Educational Resources (OER) can also be invaluable asset for teaching and learning. The use of real datasets can enable a series of opportunities for students to collaborate across disciplines, to apply quantitative and qualitative methods, to understand good practices in data retrieval, collection and analysis, to participate in research-based learning activities which develop independent research, teamwork, critical and citizenship skills. (For more detail please see:

The Call:

We are inviting individuals and teams to submit case studies describing experiences in the use of open data as open educational resources. Proposals are open to everyone who would like to promote good practices in pedagogical uses of open data in an educational context. The selected case studies will be published in an open e-book (CC_BY_NC_SA) hosted by the Open Knowledge Foundation Open Education Group by mid September 2015.

Participation in the call requires the submission of a short proposal describing the case study (around 500 words). All proposals must be written in English; however, the selected authors will have the opportunity to submit the case in both English and another language, as our aim is to support the adoption of good practices in the use of open data in different countries.

Key dates:

  • Deadline for submission of proposals (approx. 500 words): 28th June
  • Notification to accepted proposals: 5th of July
  • Draft case study submitted for review (1500 – 2000 words): 26th of July
  • Publication-ready deadline: 16th of August
  • Publication date: September 2015

If you have any questions or comments, please contact us by filling in the “contact the editors” box at the end of this form.

Javiera Atenas
Leo Havemann

Use of open data implies a readiness to further the use of open data. One way to honor that implied obligation is to share with others your successes and just as importantly, any failures in the use of open data in an educational context.

All too often we hear only a steady stream of success stories and wonder where others drew such perfect students, assistants, and clean data that underlie their success, never realizing that their students, assistants and data are no better and no worse than ours. The regular mis-steps, false starts, and outright wrong paths are omitted in the story telling. For time’s sake, no doubt.

If you can, do participate in this effort, even if you only have a success story to relate. 😉

Don’t Think Open Access Is Important?…

Thursday, June 11th, 2015

Don’t Think Open Access Is Important? It Might Have Prevented Much Of The Ebola Outbreak by Mike Masnick

From the post:

For years now, we’ve been talking up the importance of open access to scientific research. Big journals like Elsevier have generally fought against this at every point, arguing that its profits are more important than some hippy dippy idea around sharing knowledge. Except, as we’ve been trying to explain, it’s that sharing of knowledge that leads to innovation and big health breakthroughs. Unfortunately, it’s often pretty difficult to come up with a concrete example of what didn’t happen because of locked up knowledge. And yet, it appears we have one new example that’s rather stunning: it looks like the worst of the Ebola outbreak from the past few months might have been avoided if key research had been open access, rather than locked up.

That, at least, appears to be the main takeaway of a recent NY Times article by the team in charge of drafting Liberia’s Ebola recovery plan. What they found was that the original detection of Ebola in Liberia was held up by incorrect “conventional wisdom” that Ebola was not present in that part of Africa:

Mike goes on to point out knowledge about Ebola in Liberia was published in pay-per-view medical journals, which would have been prohibitively expensive for Liberian doctors.

He has a valid point but how often do primary care physicians consult research literature? And would they have the search chops to find research from 1982?

I am very much in favor of open access but open access on its own doesn’t bring about access or meaningful use of information once accessed.

The challenge of combining 176 x #otherpeoplesdata…

Wednesday, June 10th, 2015

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, and Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database ^1.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database
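The pipeline highlights above can be sketched in a few lines of code. This is only an illustration of the pattern (BAAD's actual scripts are written in R, and the file layout, column names, and studies below are hypothetical): raw files are never edited, per-study quirks are encoded as metadata tables rather than code branches, and the unified database is rebuilt from source on every run.

```python
# Sketch of a scripted, rebuild-from-source pipeline.
# All study names, columns, and values here are invented for illustration.
import csv
import io

# "Don't modify raw data files": each study's raw export is kept verbatim.
RAW_STUDIES = {
    "Smith2001": "spp,mass_g\nPinus sylvestris,1500\n",
    "Tanaka1990": "species,mass_kg\nAbies alba,2.1\n",
}

# "Encode meta-data as data, not as code": per-study column mappings and
# unit conversions live in a table, not in if/else branches.
METADATA = {
    "Smith2001": {"species_col": "spp", "mass_col": "mass_g", "to_kg": 0.001},
    "Tanaka1990": {"species_col": "species", "mass_col": "mass_kg", "to_kg": 1.0},
}

def process_study(study_id):
    """Apply one study's metadata to its raw file, yielding unified rows."""
    meta = METADATA[study_id]
    reader = csv.DictReader(io.StringIO(RAW_STUDIES[study_id]))
    for row in reader:
        yield {
            "study": study_id,
            "species": row[meta["species_col"]],
            "mass_kg": float(row[meta["mass_col"]]) * meta["to_kg"],
        }

def build_database():
    """Rebuild the entire unified database from raw inputs every time."""
    return [row for sid in sorted(RAW_STUDIES) for row in process_study(sid)]

db = build_database()
print(db[0])  # {'study': 'Smith2001', 'species': 'Pinus sylvestris', 'mass_kg': 1.5}
```

Because every transformation is a script over unmodified raw inputs, the whole history of modifications is reviewable in version control, which is exactly the transparency the authors contrast with compilations "assembled in the dark."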

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice, but it does mean that we cannot use the originally recorded names as search terms against the literature that existed at the time of the original observations.

Normalization of data often, though not always, leads to loss of information.
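One way to normalize without losing information is to carry both names forward: keep the name exactly as recorded alongside the accepted name it maps to. The sketch below is hypothetical (a one-entry synonym table invented for illustration, not BAAD's actual conversion process), but it shows the shape of the fix.

```python
# Hypothetical sketch: normalize species names without discarding the
# originals, so names as recorded remain usable as search terms.
# The synonym mapping below is a single illustrative example entry.
SYNONYMS = {"Larix leptolepis": "Larix kaempferi"}

def normalize(record):
    """Return a record carrying both the recorded and the accepted name."""
    original = record["species"]
    return {
        "species_original": original,                     # preserved verbatim
        "species_accepted": SYNONYMS.get(original, original),
    }

row = normalize({"species": "Larix leptolepis"})
print(row)
```

With both columns present, a user can search the historical literature under the recorded name while still aggregating by the accepted name.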

I first saw this in a tweet by Dr. Mike Whitfield.

Reputation instead of obligation:…

Thursday, June 4th, 2015

Reputation instead of obligation: forging new policies to motivate academic data sharing by Sascha Friesike, Benedikt Fecher, Marcel Hebing, and Stephanie Linek.

From the post:

Despite strong support from funding agencies and policy makers academic data sharing sees hardly any adoption among researchers. Current policies that try to foster academic data sharing fail, as they try to either motivate researchers to share for the common good or force researchers to publish their data. Instead, Sascha Friesike, Benedikt Fecher, Marcel Hebing, and Stephanie Linek argue that in order to tap into the vast potential that is attributed to academic data sharing we need to forge new policies that follow the guiding principle reputation instead of obligation.

In 1996, leaders of the scientific community met in Bermuda and agreed on a set of rules and standards for the publication of human genome data. What became known as the Bermuda Principles can be considered a milestone for the decoding of our DNA. These principles have been widely acknowledged for their contribution towards an understanding of disease causation and the interplay between the sequence of the human genome. The principles shaped the practice of an entire research field as it established a culture of data sharing. Ever since, the Bermuda Principles are used to showcase how the publication of data can enable scientific progress.

Considering this vast potential, it comes as no surprise that open research data finds prominent support from policy makers, funding agencies, and researchers themselves. However, recent studies show that it is hardly ever practised. We argue that the academic system is a reputation economy in which researchers are best motivated to perform activities if those pay in the form of reputation. Therefore, the hesitant adoption of data sharing practices can mainly be explained by the absence of formal recognition. And we should change this.

(emphasis in the original)

Understanding what motivates researchers to share data is an important step towards encouraging data sharing.

But at the same time, would we say that every researcher is as good as every other at preparing data for sharing? At documenting data for sharing? At doing any number of tasks that aren't really research but are just as important for sharing data?

Rather than focusing exclusively on researchers, funders should fund projects to include data-sharing specialists who have the skills and interests necessary to effectively share data as part of a project's output. Their reputations would be tied directly to the successful sharing of data, and researchers would gain reputation for the high-quality data shared. That is a much better fit for the authors' recommendation.

Or to put it differently, lecturing researchers on how they should spend their limited time and resources to satisfy your goals isn't going to motivate anyone. "Pay the man!" (Richard Pryor in Silver Streak)

How journals could “add value”

Thursday, May 28th, 2015

How journals could “add value” by Mark Watson.

From the post:

I wrote a piece for Genome Biology, you may have read it, about open science. I said a lot of things in there, but one thing I want to focus on is how journals could “add value”. As brief background: I think if you’re going to make money from academic publishing (and I have no problem if that’s what you want to do), then I think you should “add value”. Open science and open access is coming: open access journals are increasingly popular (and cheap!), preprint servers are more popular, green and gold open access policies are being implemented etc etc. Essentially, people are going to stop paying to access research articles pretty soon – think 5-10 year time frame.

So what can journals do to “add value”? What can they do that will make us want to pay to access them? Here are a few ideas, most of which focus on going beyond the PDF:

Humanities journals and their authors should take heed of these suggestions.

Not applicable in every case but certainly better than “journal editorial board as resume padding.”