Archive for the ‘Open Data’ Category

Alexandra Elbakyan (Sci-Hub) As Freedom Fighter

Friday, February 9th, 2018

Recognizing Alexandra Elbakyan:

Alexandra Elbakyan is the freedom fighter behind Sci-Hub, a repository of 64.5 million papers, or “two-thirds of all published research, and it [is] available to anyone.”

Ian Graber-Stiehl, in Science’s Pirate Queen, misses an opportunity to ditch the mis-framing of Elbakyan as a “pirate,” and to properly frame her as a freedom fighter.

To set the background for why you too should see Elbakyan as a freedom fighter, it’s necessary to review, briefly, the notion of “sale” and your intellectual freedom prior to widespread use of electronic texts.

When I started using libraries in the ’60s, you had to physically visit the library to use its books or journals. The library would purchase those items, what is known as first sale, and then either lend them or allow patrons to read them. No separate charge or income for the publisher upon reading. And once purchased, the item remained in the library for use by others.

With the advent of electronic texts, plus oppressive contracts and manipulation of the law, publishers began charging libraries even more than when libraries purchased and maintained access to material for their patrons. Think of it as recurrent extortion: you can’t have access to materials already purchased except by continuing to pay to maintain that access.

Which of course means that both libraries and individuals have lost their right to pay for an item and to maintain it separate and apart from the publisher. That’s a serious theft and it took place in full public view.

There are pirates in this story, people who stole the right of libraries and individuals to purchase items for their own storage and use. Some of the better known ones include: American Chemical Society, Reed-Elsevier (a/k/a RELX Group), Sage Publishing, Springer, Taylor & Francis, and Wiley-Blackwell.

Elbakyan is trying to recover access for everyone, access that was stolen.

That doesn’t sound like the act of a pirate. Pirates steal for their own benefit. That sounds like the pirates I listed above.

Now that you know Elbakyan is fighting to recover a right taken from you, does that make you view her fight differently?

BTW, when publishers trot out the canard of their professional staff/editors/reviewers, remember their retraction rates are silent witnesses refuting their claims of competence.

Read any recent retraction for the listed publishers. Use RetractionWatch for current or past retractions. “Unread” is the best explanation for how most of them got past “staff/editors/reviewers.”

Do you support freedom fighters or publisher/pirates?

If you want to support publisher/pirates, no further action needed.

If you want to support freedom fighters, including Alexandra Elbakyan, the Sci-Hub site has a donate link, contact Elbakyan if you have extra cutting edge equipment to offer, promote Sci-Hub on social media, etc.

For making the lives of publisher/pirates more difficult, use your imagination.

To follow Elbakyan, see her blog and Facebook page.

Introducing Data360R — data to the power of R [On Having an Agenda]

Saturday, December 9th, 2017

Introducing Data360R — data to the power of R

From the post:

Last January 2017, the World Bank launched TCdata360, a new open data platform that features more than 2,000 trade and competitiveness indicators from 40+ data sources inside and outside the World Bank Group. Users of the website can compare countries, download raw data, create and share data visualizations on social media, get country snapshots and thematic reports, read data stories, connect through an application programming interface (API), and more.

The response to the site has been overwhelmingly enthusiastic, and this growing user base continually inspires us to develop better tools to increase data accessibility and usability. After all, open data isn’t useful unless it’s accessed and used for actionable insights.

One such tool we recently developed is data360r, an R package that allows users to interact with the TCdata360 API and query TCdata360 data, metadata, and more using easy, single-line functions.
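The package described above is R-based, but the same single-line convenience can be sketched in Python for readers outside R. Note the endpoint path and the JSON payload shape below are illustrative assumptions based on the post's description of the TCdata360 API, not verified documentation:

```python
from urllib.parse import urlencode

# Base URL of the TCdata360 API; the /data endpoint path below is an
# assumption for illustration, not a documented call.
BASE = "https://tcdata360-backend.worldbank.org/api/v1"

def build_data_query(indicator_ids, country_codes):
    """Build a query URL for a TCdata360-style /data endpoint."""
    params = urlencode({
        "indicators": ",".join(str(i) for i in indicator_ids),
        "countries": ",".join(country_codes),
    })
    return f"{BASE}/data?{params}"

def extract_values(payload):
    """Flatten a TCdata360-style JSON payload into
    (country, indicator, year, value) rows. Payload shape is hypothetical."""
    rows = []
    for entry in payload.get("data", []):
        country = entry["id"]
        for ind in entry.get("indicators", []):
            for year, value in ind.get("values", {}).items():
                rows.append((country, ind["id"], year, value))
    return rows

# Build a query and parse a sample (invented) response offline:
url = build_data_query([944], ["USA", "FRA"])
sample = {"data": [{"id": "USA",
                    "indicators": [{"id": 944, "values": {"2016": 3.2}}]}]}
print(url)
print(extract_values(sample))
```

If you work in R, data360r wraps all of this in a single function call; the sketch above only shows how little machinery sits behind such a wrapper.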

So long as you remember the World Bank has an agenda and all the data it releases serves that agenda, you should suffer no permanent harm.

Don’t take that as meaning other sources of data have less of an agenda, although you may find their agendas differ from that of the World Bank.

The recent “discovery” that machine learning algorithms can conceal social or racist bias was long overdue.

Anyone who studied survey methodology in the social sciences during the last half of the 20th century could report that data collection itself, much less its processing, is fraught with unavoidable bias.

It is certainly possible, in the physical sense, to give students standardized tests, but what test results mean for any given question, such as teacher competence, is far from clear.

Or to put it differently, just because something can be measured is no guarantee the measurement is meaningful. The same applies to the data that results from any measurement process.

Take advantage of data360r certainly, but keep a wary eye on data from any source.

Maintaining Your Access to Sci-Hub

Tuesday, November 21st, 2017

A tweet today by @Sci_Hub advises:

Sci-Hub is working. To get around domain names problem, use custom Sci-Hub DNS servers and How to customize DNS in Windows:

No doubt, Elsevier will continue to attempt to interfere with your access to Sci-Hub.

Already the largest, most bloated, and insecure academic publishing presence on the Internet, Elsevier labors every day to become more of an attractive nuisance.

What corporate strategy is served by painting a flashing target on your Internet presence?


PS: Do update your DNS entries while pondering that question.

Academic Torrents Update

Friday, November 3rd, 2017

When I last mentioned Academic Torrents, in early 2014, it had 1.67TB of research data.

I dropped by Academic Torrents this week to find it now has 25.53TB of research data!

Some arbitrary highlights:

Richard Feynman’s Lectures on Physics (The Messenger Lectures)

A collection of sport activity datasets for data analysis and data mining 2017a

[Coursera] Machine Learning (Stanford University) (ml)

UC Berkeley Computer Science Courses (Full Collection)

[Coursera] Mining Massive Datasets (Stanford University) (mmds)

Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Original Dataset)

Your arbitrary highlights are probably different from mine, so visit Academic Torrents to see what data captures your eye.


(Legal) Office of Personnel Management Data!

Friday, June 9th, 2017

We’re Sharing A Vast Trove Of Federal Payroll Records by Jeremy Singer-Vine.

From the post:

Today, BuzzFeed News is sharing an enormous dataset — one that sheds light on four decades of the United States’ federal payroll.

The dataset contains hundreds of millions of rows and stretches all the way back to 1973. It provides salary, title, and demographic details about millions of U.S. government employees, as well as their migrations into, out of, and through the federal bureaucracy. In many cases, the data also contains employees’ names.

We obtained the information — nearly 30 gigabytes of it — from the U.S. Office of Personnel Management, via the Freedom of Information Act (FOIA). Now, we’re sharing it with the public. You can download it for free on the Internet Archive.

This is the first time, it seems, that such extensive federal payroll data is freely available online, in bulk. (The Asbury Park Press and both publish searchable databases. They’re great for browsing, but don’t let you download the data.)

We hope that policy wonks, sociologists, statisticians, fellow journalists — or anyone else, for that matter — find the data useful.

We obtained the information through two Freedom of Information Act requests to OPM. The first chunk of data, provided in response to a request filed in September 2014, covers late 1973 through mid-2014. The second, provided in response to a request filed in December 2015, covers late 2014 through late 2016. We have submitted a third request, pending with the agency, to update the data further.

Between our first and second requests, OPM announced it had suffered a massive computer hack. As a result, the agency told us, it would no longer release certain information, including the employee “pseudo identifier” that had previously disambiguated employees with common names.

What a great data release! Kudos and thanks to BuzzFeed News!

If you need the “pseudo identifiers” for the second or following releases and/or data for the employees withheld (generally the more interesting ones), consult data from the massive computer hack.

Or obtain the excluded data directly from the Office of Personnel Management without permission.


Open Data = Loss of Bureaucratic Power

Friday, June 9th, 2017

James Comey’s leaked memos about meetings with President Trump illustrate one reason for the lack of progress on open data reported in FOIA This! The Depressing State of Open Data by Toby McIntosh.

From Former DOJ Official on Comey Leak: ‘Standard Operating Procedure’ Among Bureaucrats:

On “Fox & Friends” today, J. Christian Adams said the leak of the memos by Comey was in line with “standard operating procedure” among Beltway bureaucrats.

“[They] were using the media, using confidential information to advance attacks on the President of the United States. That’s what they do,” said Adams, adding he saw it go on at DOJ.

Access to information is one locus of bureaucratic power, which makes the story in FOIA This! The Depressing State of Open Data a non-surprise:

In our latest look at FOIA around the world, we examine the state of open data sets. According to the new report by the World Wide Web Foundation, the news is not good.

“The number of global truly open datasets remains at a standstill,” according to the group’s researchers, who say that only seven percent of government data is fully open.

The findings come in the fourth edition of the Open Data Barometer, an annual assessment which was enlarged this year to include 1,725 datasets from 15 different sectors across 115 countries. The report summarizes:

Only seven governments include a statement on open data by default in their current policies. Furthermore, we found that only 7 percent of the data is fully open, only one of every two datasets is machine readable and only one in four datasets has an open license. While more data has become available in a machine-readable format and under an open license since the first edition of the Barometer, the number of global truly open datasets remains at a standstill.

Based on the detailed country-by-country rankings, the report says some countries continue to be leaders on open data, a few have stepped up their game, but some have slipped backwards.

With open data efforts at a standstill and/or sliding backwards, waiting for bureaucrats to voluntarily relinquish power is a non-starter.

There are other options.

Need I mention the Office of Personnel Management hack? The highly touted but apparently fundamentally vulnerable NSA?

If you need a list of cyber-vulnerable U.S. government agencies, see: A-Z Index of U.S. Government Departments and Agencies.

You can:

  • wait for bureaucrats to abase themselves,
  • post how government “…ought to be transparent and accountable…”,
  • echo opinions of others calling for open data,

or, help yourself to government collected, generated, or produced data.

Which one do you think is more effective?

Open data quality – Subject Identity By Another Name

Thursday, June 8th, 2017

Open data quality – the next shift in open data? by Danny Lämmerhirt and Mor Rubinstein.

From the post:

Some years ago, open data was heralded to unlock information to the public that would otherwise remain closed. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world.

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances? How can we close the information chain loop? The short answer: by providing ‘good quality’ open data.

Congratulations to Open Knowledge International on re-discovering the ‘Tower of Babel’ problem that prevents easy re-use of data.

Contrary to Lämmerhirt and Rubinstein’s claim, the barriers have not “…shifted and multiplied….” It is more accurate to say that Lämmerhirt and Rubinstein have experienced what so many other researchers have found for decades:

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

The record linkage community (think medical epidemiology) has been working on aspects of this problem since the 1950s, at least under that name. It has a rich and deep history, focused in part on mapping diverse data sets to a common representation and then performing analysis upon the resulting set.

A common omission in record linkage is failing to capture, in discoverable form, the basis for mapping the diverse records to a common format. That is, subjects are represented by “…uncommon signs or codes that are in the worst case only understandable to their producer,” as Lämmerhirt and Rubinstein complain, although signs and codes need not be “uncommon” to be misunderstood by others.

To their credit, unlike RDF and the topic maps default, record linkage practitioners have long recognized that identification consists of multiple parts, not single strings.

Topic maps, at least at their inception, were unaware of record linkage and the vast body of research done under that moniker. Topic maps were bitten by the very problem they were seeking to solve: a subject could be identified in many different ways, and information others had discovered about that subject could be nearby yet undiscoverable/unknown.

Rather than building on the experience with record linkage, topic maps, at least in the XML version, defaulted to relying on URLs to identify the location of subjects (resources) and/or to serve as identifiers of subjects. Avoiding the Philosophy 101 mistake of RDF (confusing locators with identifiers, then refusing to correct the confusion) wasn’t enough for topic maps to become widespread. One suspects in part because topic maps were premised on creating more identifiers for subjects which already had them.

Imagine that your company has 1,000 employees and, in order to use a new system, say topic maps, everyone must get a new name. Can’t use the old one. Do you see a problem? Now multiply that by every subject anyone in your company wants to talk about. We won’t run out of identifiers, but your staff will certainly run out of patience.

Robust solutions to the open data ‘Tower of Babel’ issue will include:

  • use of the multi-part identifications already extant in data stores,
  • dynamic creation of multi-part identifications when necessary (note: no change to the existing data store),
  • discoverable documentation of multi-part identifications and their mappings,

where syntax and data models are up to the user of the data.

That sounds like a job for XQuery to me.
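Whatever the query language, the core of multi-part identification matching fits in a few lines. A minimal sketch, in Python here for illustration; the matching threshold and the example property names are assumptions, not part of any record linkage standard:

```python
def match_score(id_a, id_b):
    """Count identification parts (key/value pairs) shared by two
    multi-part identifications."""
    return len(set(id_a.items()) & set(id_b.items()))

def same_subject(id_a, id_b, threshold=2):
    """Treat two records as identifying the same subject when enough
    parts agree. The threshold of 2 is an illustrative assumption."""
    return match_score(id_a, id_b) >= threshold

# The same drug, identified differently by different producers.
# Neither record's name string matches, but other parts do:
rec_fda = {"name": "acetaminophen", "cas": "103-90-2", "atc": "N02BE01"}
rec_ema = {"name": "paracetamol", "cas": "103-90-2", "atc": "N02BE01"}
rec_other = {"name": "ibuprofen", "cas": "15687-27-1", "atc": "M01AE01"}

print(same_subject(rec_fda, rec_ema))    # cas and atc parts agree
print(same_subject(rec_fda, rec_other))  # no parts agree
```

The point of the sketch: no record has to be renamed, and the basis for each match (which parts agreed) is itself discoverable and documentable.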


Unpaywall (Access to Academic Publishing)

Wednesday, April 12th, 2017

How a Browser Extension Could Shake Up Academic Publishing by Lindsay McKenzie.

From the post:

Open-access advocates have had several successes in the past few weeks. The Bill & Melinda Gates Foundation started its own open-access publishing platform, which the European Commission may replicate. And librarians attending the Association of College and Research Libraries conference in March were glad to hear that the Open Access Button, a tool that helps researchers gain free access to copies of articles, will be integrated into existing interlibrary-loan arrangements.

Another initiative, called Unpaywall, is a simple browser extension, but its creators, Jason Priem and Heather Piwowar, say it could help alter the status quo of scholarly publishing.

“We’re setting up a lemonade stand right next to the publishers’ lemonade stand,” says Mr. Priem. “They’re charging $30 for a glass of lemonade, and we’re showing up right next to them and saying, ‘Lemonade for free’. It’s such a disruptive, exciting, and interesting idea, I think.”

Like the Open Access Button, Unpaywall is open-source, nonprofit, and dedicated to improving access to scholarly research. The button, devised in 2013, has a searchable database that comes into play when a user hits a paywall.

When an Unpaywall user lands on the page of a research article, the software scours thousands of institutional repositories, preprint servers, and websites like PubMed Central to see if an open-access copy of the article is available. If it is, users can click a small green tab on the side of the screen to view a PDF.
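The lookup the extension performs can also be done directly against Unpaywall's data service. A hedged sketch in Python: the endpoint path, required email parameter, and response field names below reflect the Unpaywall REST API as I understand it, and should be verified against the current documentation:

```python
from urllib.parse import quote

def unpaywall_url(doi, email):
    """Build an Unpaywall-style lookup URL for a DOI.
    The service requires a contact email on each request."""
    return f"https://api.unpaywall.org/v2/{quote(doi)}?email={email}"

def best_pdf(record):
    """Pull the open-access PDF link, if any, from an Unpaywall-style
    JSON response. Field names are assumptions; check the API docs."""
    loc = record.get("best_oa_location") or {}
    return loc.get("url_for_pdf")

print(unpaywall_url("10.1038/nature12373", "you@example.com"))

# Parsing a sample (invented) response offline:
sample = {"best_oa_location":
          {"url_for_pdf": "https://europepmc.org/articles/pmc123?pdf=render"}}
print(best_pdf(sample))
print(best_pdf({"best_oa_location": None}))  # no OA copy found
```

Fetching the URL and feeding the JSON to `best_pdf` is all the green tab in the browser amounts to.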

Sci-Hub gets an honorable mention as a “…pirate website…,” usage of which carries “…so much fear and uncertainty….” (Disclaimer: the author of those comments is Jason Priem, one of the creators of Unpaywall.)

Hardly. What was long suspected about academic publishing has become widely known: Peer review is a fiction, even at the best known publishers, to say nothing of lesser lights in the academic universe. The “contribution” of publishers is primarily maintaining lists of editors for padding the odd resume. (Peer Review failure: Science and Nature journals reject papers because they “have to be wrong”.)

I should not overlook publishers as a source of employment for “gatekeepers.” “Gatekeepers” being those unable to make a contribution on their own, who seek to prevent others from doing so and failing that, preventing still others from learning of those contributions.

Serfdom was abolished centuries ago, academic publishing deserves a similar fate.

PS: For some reason authors are reluctant to post the web address for Sci-Hub:

Leak Publication: Sharing, Crediting, and Re-Using Leaks

Wednesday, March 22nd, 2017

If you substitute “leak” for “data” in this essay by Daniella Lowenberg, does it work for leaks as well?

Data Publication: Sharing, Crediting, and Re-Using Research Data by Daniella Lowenberg.

From the post:

In the most basic terms- Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible- the goal of a Data Publication.

As much as I admire the work of the International Consortium of Investigative Journalists (ICIJ), especially its Panama Papers project, sharing data beyond the confines of their community isn’t a value for them, much less a goal.

As all secret keepers, government, industry, organizations, ICIJ has “reasons” for its secrecy, but none that I find any more or less convincing than those offered by other secret keepers.

Every secret keeper has an agenda their secrecy serves. Agendas which don’t include a public empowered to make judgments about their secret keeping.

The ICIJ proclaims Leak to Us.

A good place to leak, but include with your leak a demand, an unconditional demand, that it be released in its entirety within a year or two of its first publication.

Help enable the public to watch all secrets and secret keepers, not just those some secret keepers choose to expose.

Open Science: Too Much Talk, Too Little Action [Lessons For Political Opposition]

Monday, February 6th, 2017

Open Science: Too Much Talk, Too Little Action by Björn Brembs.

From the post:

Starting this year, I will stop traveling to any speaking engagements on open science (or, more generally, infrastructure reform), as long as these events do not entail a clear goal for action. I have several reasons for this decision, most of them boil down to a cost/benefit estimate. The time spent traveling does not seem worth the hardly noticeable benefits any more.

I got involved in Open Science more than 10 years ago. Trying to document the point when it all started for me, I found posts about funding all over my blog, but the first blog posts on publishing were from 2005/2006, the announcement of me joining the editorial board of newly founded PLoS ONE late 2006 and my first post on the impact factor in 2007. That year also saw my first post on how our funding and publishing system may contribute to scientific misconduct.

In an interview on the occasion of PLoS ONE’s ten-year anniversary, PLoS mentioned that they thought the publishing landscape had changed a lot in these ten years. I replied that, looking back ten years, not a whole lot had actually changed:

  • Publishing is still dominated by the main publishers which keep increasing their profit margins, sucking the public teat dry
  • Most of our work is still behind paywalls
  • You won’t get a job unless you publish in high-ranking journals.
  • Higher ranking journals still publish less reliable science, contributing to potential replication issues
  • The increase in number of journals is still exponential
  • Libraries are still told by their faculty that subscriptions are important
  • The digital functionality of our literature is still laughable
  • There are no institutional solutions to sustainably archive and make accessible our narratives other than text, or our code or our data

The only difference in the last few years really lies in the fraction of available articles, but that remains a small minority, less than 30% total.

So the work that still needs to be done is exactly the same as it was at the time Stevan Harnad published his “Subversive Proposal” , 23 years ago: getting rid of paywalls. This goal won’t be reached until all institutions have stopped renewing their subscriptions. As I don’t know of a single institution without any subscriptions, that task remains just as big now as it was 23 years ago. Noticeable progress has only been on the margins and potentially in people’s heads. Indeed, now only few scholars haven’t heard of “Open Access”, yet, but apparently without grasping the issues, as my librarian colleagues keep reminding me that their faculty believe open access has already been achieved because they can access everything from the computer in their institute.

What needs to be said about our infrastructure has been said, both in person, and online, and in print, and on audio, and on video. Those competent individuals at our institutions who make infrastructure decisions hence know enough to be able to make their rational choices. Obviously, if after 23 years of talking about infrastructure reform, this is the state we’re in, our approach wasn’t very effective and my contribution is clearly completely negligible, if at all existent. There is absolutely no loss if I stop trying to tell people what they already should know. After all, the main content of my talks has barely changed in the last eight or so years. Only more recent evidence has been added and my conclusions have become more radical, i.e., trying to tackle the radix (Latin: root) of the problem, rather than palliatively care for some tangential symptoms.

The line:

What needs to be said about our infrastructure has been said, both in person, and online, and in print, and on audio, and on video.

is especially relevant in light of the 2016 presidential election and the fund raising efforts of organizations that form the “political opposition.”

You have seen the ads in email, on Facebook, Twitter, etc., all pleading for funding to oppose the current US President.

I agree the current US President should be opposed.

But the organizations seeking funding failed to stop his rise to power.

Whether their failure was due to organizational defects or poor strategies is really beside the point. They failed.

Why should I enable them to fail again?

One data point: the Women’s March on Washington was NOT organized by organizations with permanent staff and offices in Washington or elsewhere.

Is your contribution supporting staffs and offices of the self-righteous (the primary function of old line organizations) or investigation, research, reporting and support of boots on the ground?

Government excesses are not stopped by bewailing our losses but by making government agents bewail theirs.

ODI – Access To Legal Data News

Friday, January 13th, 2017

Strengthening our legal data infrastructure by Amanda Smith.

Amanda recounts an effort between the Open Data Institute (ODI) and Thomson Reuters to improve access to legal data.

From the post:

Paving the way for a more open legal sector: discovery workshop

In September 2016, Thomson Reuters and the ODI gathered publishers of legal data, policy makers, law firms, researchers, startups and others working in the sector for a discovery workshop. Its aims were to explore important data types that exist within the sector, and map where they sit on the data spectrum, discuss how they flow between users and explore the opportunities that taking a more open approach could bring.

The notes from the workshop explore current mechanisms for collecting, managing and publishing data, benefits of wider access and barriers to use. There are certain questions that remain unanswered – for example, who owns the copyright for data collected in court. The notes are open for comments, and we invite the community to share their thoughts on these questions, the data types discussed, how to make them more open and what we might have missed.

Strengthening data infrastructure in the legal sector: next steps

Following this workshop we are working in partnership with Thomson Reuters to explore data infrastructure – datasets, technologies and processes and organisations that maintain them – in the legal sector, to inform a paper to be published later in the year. The paper will focus on case law, legislation and existing open data that could be better used by the sector.

The Ministry of Justice have also started their own data discovery project, which the ODI have been contributing to. You can keep up to date on their progress by following the MOJ Digital and Technology blog and we recommend reading their data principles.

Get involved

We are looking to the legal and data communities to contribute opinion pieces and case studies to the paper on data infrastructure for the legal sector. If you would like to get involved, contact us.
…(emphasis in original)

Encouraging news, especially for those interested in building value-added tools on top of data that is made available publicly. At least they can avoid the cost of collecting data already collected by others.

Take the opportunity to comment on the notes and participate as you are able.

If you think you have seen use cases for topic maps before, consider that the Code of Federal Regulations (US), as of December 12, 2016, has 54,938 separate, but not unique, definitions of “person.” The impact of each regulation depends upon its definition of that term.

Other terms have similar semantic difficulties both in the Code of Federal Regulations as well as the US Code.

Guarantees Of Public Access In Trump Administration (A Perfect Data Storm)

Saturday, December 31st, 2016

I read hand wringing over the looming secrecy of the Trump administration on a daily basis.

More truthfully, I skip over daily hand wringing over the looming secrecy of the Trump administration.

For two reasons.

First, as reported in US government subcontractor leaks confidential military personnel data by Charlie Osborne, government data doesn’t require hacking, just a little initiative.

In this particular case, it was rsync without a username or password, that made this data leak possible.

Editors should ask their reporters before funding FOIA suits: “Have you tried rsync?”

Second, the alleged-to-be-Trump-nominees for cabinet and lesser positions, remind me of this character from Dilbert: November 2, 1992:


Trump appointees may have mastered the pointy end of pencils but their ability to use cyber-security will be as shown.

When you add up the cyber-security incompetence of Trump appointees, complaints from Inspectors General about agency security, and agencies leaking to protect their positions/turf, you have the conditions for a perfect data storm.

A perfect data storm that may see the US government hemorrhaging data like never before.

PS: You know my preference, post leaks on receipt in their entirety. As for “consequences,” consider those a down payment on what awaits people who betray humanity, their people, colleagues and family. They could have chosen differently and didn’t. What more can one say?

Weakly Weaponized Open Data

Friday, November 4th, 2016

Berners-Lee raises spectre of weaponized open data by Bill Camarda.

From the post:


Practically everybody loves open data, ie “data that anyone can access, use or share”. And nobody loves it more than Tim Berners-Lee, creator of the World Wide Web, and co-founder of the Open Data Institute (ODI).

Berners-Lee and his ODI colleagues have spent years passionately evangelizing governments and companies to publicly release their non-personal data for use to improve communities.

So when he recently told the Guardian that hackers could use open data to create societal chaos, it might have been this year’s most surprising “man bites dog” news story.

What’s going on here? The growing fear of data sabotage, that’s what.

Bill focuses on the manipulation and/or planting of false data, which could result in massive traffic jams, changes in market prices, etc.

In fact, Berners-Lee says in the original Guardian story:

“If you falsify government data then there are all kinds of ways that you could get financial gain, so yes,” he said, “it’s important that even though people think about open data as not a big security problem, it is from the point of view of being accurate.”

He added: “I suppose it’s not as exciting as personal data for hackers to get into because it’s public.”

Disruptive to some, profitable to others, but this is what should be called weakly weaponized open data.

Here is one instance of strongly weaponized open data.

Scenario: We Don’t Need No Water, Let The Motherfucker Burn

The United States is experiencing a continuing drought. From the U.S. Drought Monitor:


Keying on the solid red color around Atlanta, GA, Fire Weather, a service of the National Weather Service, estimates the potential impact of fires near Atlanta:


Impacted by a general conflagration around Atlanta:

Population: 2,783,418
Airports: 38
Miles of Interstate: 556
Miles of Rail: 2,399
Parks: 4
Area: 27,707 Sq. Miles

Pipelines are missing from the list of impacts. For that, consult the National Pipeline Mapping System where even a public login reveals:


The red lines are hazardous liquid pipelines, blue lines are gas transmission pipelines, the yellow lines outline Fulton County.

We have located a likely place for a forest fire, have some details on its probable impact and a rough idea of gas and other pipelines in the prospective burn area.

Oh, we need a source of ignition. Road flares anyone?


From the WSDOT, Winter Driving Supply Checklist. Emergency kits with flares are available at box stores and online.

Bottom line:

Intentional forest fires can be planned from public data sources. Governments gratuitously suggest non-suspicious methods of transporting forest fire starting materials.

Details I have elided, such as evacuation routes, fire watch stations, drones as fire starters, fire histories, public events, plus greater detail from the resources cited, are all available from public sources.

What are your Weaponized Open Data risks?

Version 2 of the Hubble Source Catalog [Model For Open Access – Attn: Security Researchers]

Friday, September 30th, 2016

Version 2 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Version 2 includes:

  • Four additional years of ACS source lists (i.e., through June 9, 2015). All ACS source lists go deeper than in version 1. See current HLA holdings for details.
  • One additional year of WFC3 source lists (i.e., through June 9, 2015).
  • Cross-matching between HSC sources and spectroscopic COS, FOS, and GHRS observations.
  • Availability of magauto values through the MAST Discovery Portal. The maximum number of sources displayed has increased from 10,000 to 50,000.

The HSC v2 contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists from HLA version DR9.1 (data release 9.1). The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. The relative astrometry is supported by using Pan-STARRS, SDSS, and 2MASS as the astrometric backbone for initial corrections. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).


There are currently three ways to access the HSC as described below. We are working towards having these interfaces consolidated into one primary interface, the MAST Discovery Portal.

  • The MAST Discovery Portal provides a one-stop web access to a wide variety of astronomical data. To access the Hubble Source Catalog v2 through this interface, select Hubble Source Catalog v2 in the Select Collection dropdown, enter your search target, click search and you are on your way. Please try Use Case Using the Discovery Portal to Query the HSC
  • The HSC CasJobs interface permits you to run large and complex queries, phrased in the Structured Query Language (SQL).
  • HSC Home Page

    – The HSC Summary Search Form displays a single row entry for each object, as defined by a set of detections that have been cross-matched and hence are believed to be a single object. Averaged values for magnitudes and other relevant parameters are provided.

    – The HSC Detailed Search Form displays an entry for each separate detection (or nondetection if nothing is found at that position) using all the relevant Hubble observations for a given object (i.e., different filters, detectors, separate visits).
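The crossmatching the post describes — pairing detections of the same source across overlapping images, with post-correction residuals under 10 mas — can be illustrated with a toy nearest-neighbor match. This is my sketch of the general technique, not the actual HSC pipeline (which also corrects the astrometry itself; see Budavari & Lubow 2012):

```python
# Toy positional crossmatch: pair each source in one image with the
# nearest source in another image, accepting pairs within a tolerance.
# Illustration only -- NOT the HSC algorithm.
import math

def crossmatch(sources_a, sources_b, tol_mas=10.0):
    """Match (ra, dec) tuples in sources_a to the nearest in sources_b.

    Coordinates are in degrees; tol_mas is the match tolerance in
    milliarcseconds. Returns (index_a, index_b) pairs; unmatched
    sources are omitted.
    """
    tol_deg = tol_mas / 3.6e6  # 1 degree = 3.6 million mas
    matches = []
    for i, (ra_a, dec_a) in enumerate(sources_a):
        best_j, best_d = None, tol_deg
        for j, (ra_b, dec_b) in enumerate(sources_b):
            # Small-angle (flat-sky) separation; fine at mas scales
            d = math.hypot((ra_a - ra_b) * math.cos(math.radians(dec_a)),
                           dec_a - dec_b)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches.append((i, best_j))
    return matches
```

A real catalog would use a spatial index rather than this O(n²) loop, but the accept/reject-by-tolerance logic is the core idea.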

Amazing isn’t it?

The astronomy community long ago vanquished data hoarding and constructed tools to avoid moving very large data sets across the network.

All while enabling more and not less access and research using the data.

Contrast that to the sorry state of security research, where example code is condemned, if not actually prohibited by law.

Yet, if you believe current news reports (always an iffy proposition), cybercrime is growing by leaps and bounds. (PwC Study: Biggest Increase in Cyberattacks in Over 10 Years)

How successful is the “data hoarding” strategy of the security research community?

Mapping U.S. wildfire data from public feeds

Monday, August 29th, 2016

Mapping U.S. wildfire data from public feeds by David Clark.

From the post:

With the Mapbox Datasets API, you can create data-based maps that continuously update. As new data arrives, you can push incremental changes to your datasets, then update connected tilesets or use the data directly in a map.

U.S. wildfires have been in the news this summer, as they are every summer, so I set out to create an automatically updating wildfire map.

An excellent example of using public data feeds to create a resource not otherwise available.
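The update loop Clark describes — fetch a public feed, push changed features into a Mapbox dataset — reduces to PUTting GeoJSON features at the Datasets API. A minimal sketch of building such an update (the endpoint pattern follows the Mapbox Datasets API as I understand it; `USER`, `DATASET_ID`, `TOKEN`, and the feature itself are placeholders, and no network call is made here):

```python
# Sketch: construct one feature-update request for a Mapbox dataset.
# Endpoint shape is an assumption based on the Mapbox Datasets API docs;
# USER, DATASET_ID and TOKEN are placeholders, not real credentials.
import json

API_BASE = "https://api.mapbox.com/datasets/v1"

def build_feature_update(user, dataset_id, feature, token):
    """Return (url, body) for a PUT that inserts/updates one GeoJSON feature."""
    url = (f"{API_BASE}/{user}/{dataset_id}/features/"
           f"{feature['id']}?access_token={token}")
    return url, json.dumps(feature)

fire = {
    "id": "fire-001",  # hypothetical feature id
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-84.39, 33.75]},
    "properties": {"name": "Example Fire", "acres": 120},
}
url, body = build_feature_update("USER", "DATASET_ID", fire, "TOKEN")
# An HTTP client, e.g. requests.put(url, data=body), would send the update.
```

Run on a schedule against the wildfire feed, each changed feature becomes one such PUT, which is what keeps the map continuously current.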

Historical fire data can be found at: Federal Wildland Fire Occurrence Data, spanning 1980 through 2015.

The Outlooks page of the National Interagency Coordination Center provides four month (from current month) outlook and weekly outlook fire potential reports and maps.

Hunters Bag > 400 Database Catalogs

Monday, August 29th, 2016

Transparency Hunters Capture More than 400 California Database Catalogs by Dave Maass.

The post in its entirety:

A team of over 40 transparency activists aimed their browsers at California this past weekend, collecting more than 400 database catalogs from local government agencies, as required under a new state law. Together, participants in the California Database Hunt shined light on thousands upon thousands of government record systems.

California S.B. 272 requires every local government body, with the exception of educational agencies, to post inventories of their “enterprise systems,” essentially every database that holds records on members of the public or is used as a primary source of information. These database catalogs were required to be posted online (at least by agencies with websites) by July 1, 2016.

EFF, the Data Foundation, the Sunlight Foundation, and Level Zero, combined forces to host volunteers in San Francisco, Washington, D.C., and remotely. More than 40 volunteers scoured as many local agency websites as we could in four hours—cities, counties, regional transportation agencies, water districts, etc. Here are the rough numbers:

680 – The number of unique agencies that supporters searched

970 – The number of searches conducted (Note: agencies found on the first pass not to have catalogs were searched a second time)

430 – Number of agencies with database catalogs online

250 – Number of agencies without database catalogs online, as verified by two people

Download a spreadsheet of the local government database catalogs we found: Excel/TSV

Download a spreadsheet of cities and counties that did not have S.B. 272 catalogs: Excel/TSV

Please note that for each of the cities and counties identified as not posting database catalogs, at least two volunteers searched for the catalogs and could not find them. It is possible that those agencies do in fact have S.B. 272-compliant catalogs posted somewhere, but not in what we would call a “prominent location,” as required by the new law. If you represent an agency that would like its database catalog listed, please send an email to

We owe a debt of gratitude to the dozens of volunteers who sacrificed their Saturday afternoons to help make local government in California a little less opaque. Check out this 360-degree photo of our San Francisco team on Facebook.

In the coming days and weeks, we plan to analyze and share the data further. Stay tuned, and if you find anything interesting perusing these database catalogs, please drop us a line at

Of course, bagging the database catalogs is like having a collection of Christmas catalogs. It’s great, but there are more riches within!

What data products would you look for first?

Updated to mirror changes (clarification) in original.

How-To Safely Protest on the Downtown Connector – #BLM

Wednesday, July 13th, 2016

Atlanta doesn’t have a spotless record on civil rights but Mayor Kasim Reed agreeing to meet with #BLM leaders on July 18, 2016, is a welcome contrast to response in the police state of Baton Rouge, for example.

During this “cooling off” period, I want to address Mayor Reed’s concern for the safety of #BLM protesters and motorists should #BLM protests move onto the Downtown Connector.

Being able to protest on the Downtown Connector would be far more effective than blocking random Atlanta surface streets, by day or night. Mayor Reed’s question is: how to do so safely?

Here is Google Maps’ representation of a part of the Downtown Connector:


That view isn’t helpful on the issue of safety but consider a smaller portion of the Downtown Connector as seen by Google Earth:


The safety question has two parts: How to transport #BLM protesters to a protest site on the Downtown Connector? How to create a safe protest site on the Downtown Connector?

A nearly constant element of the civil rights movement provides the answer: buses. From the Montgomery Bus Boycott and the Freedom Riders to the long experiment with busing to achieve desegregation in education, buses have carried the movement.

Looking at an enlargement of an image of the Downtown Connector, you will see that ten (10) buses would fill all the lanes, plus the emergency lane and the shoulder, preventing any traffic from going around the buses. That provides safety for protesters. Not to mention transporting all the protesters safely to the protest site.

The Downtown Connector is often described as a “parking lot” so drivers are accustomed to traffic slowing to a full stop. If a group of buses formed a line across all lanes of the Downtown Connector and slowed to a stop, traffic would be safely stopped. That provides safety for drivers.

The safety of both protesters and drivers depends upon coordination between cars and buses to fill all the lanes of the Downtown Connector and then slowing down in unison, plus buses occupying the emergency lane and shoulder. Anything less than full interdiction of the highway would put both protesters and drivers at risk.

Churches and church buses have often played pivotal roles in the civil rights movement so the means for creating safe protest spaces, even on the Downtown Connector, are not out of reach.

There are other logistical and legal issues involved in such a protest but I have limited myself to offering a solution to Mayor Reed’s safety question.

PS: The same observations apply to any limited access motorway, modulo adaptation to your local circumstances.

The No-Value-Add Of Academic Publishers And Peer Review

Tuesday, June 21st, 2016

Comparing Published Scientific Journal Articles to Their Pre-print Versions by Martin Klein, Peter Broadwell, Sharon E. Farb, Todd Grappone.


Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: U.S. academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. We have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers and their final published counterparts. This comparison had two working assumptions: 1) if the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and 2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.

The authors performed a very detailed analysis of pre-prints (90%–95% of the papers studied were first published as open pre-prints) and conclude there is no appreciable difference between the pre-prints and the final published versions.

I take “…no appreciable difference…” to mean academic publishers and the peer review process, despite claims to the contrary, contribute little or no value to academic publications.
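As a toy illustration of the “standard similarity measures” the paper mentions (my sketch, not the authors’ actual method, and the example texts are invented), Python’s `difflib` can quantify how little a text changed between versions:

```python
# Toy version of pre-print vs. published-version text comparison.
# SequenceMatcher.ratio() returns 1.0 for identical texts and 0.0 for
# completely disjoint ones; the study used more rigorous measures,
# this only shows the idea.
from difflib import SequenceMatcher

def similarity(preprint: str, published: str) -> float:
    return SequenceMatcher(None, preprint, published).ratio()

pre = "We measured the expression of gene X in 40 samples."
pub = "We measured the expression of gene X in 40 tissue samples."
score = similarity(pre, pub)
# Texts that changed very little score near 1.0.
```

Applied over a corpus of pre-print/published pairs, consistently high scores are exactly the empirical indicator the authors report.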

How’s that for a bargaining chip in negotiating subscription prices?

Where Has Sci-Hub Gone?

Saturday, June 18th, 2016

While I was writing about the latest EC idiocy (link tax), I was reminded of Sci-Hub.

Just checking to see if it was still alive, I tried

404 by standard DNS service.

If you are having the same problem, Mike Masnick reports in Sci-Hub, The Repository Of ‘Infringing’ Academic Papers Now Available Via Telegram, you can access Sci-Hub via:

I’m not on Telegram, yet, but that may be changing soon. 😉

BTW, while writing this update, I stumbled across: The New Napster: How Sci-Hub is Blowing Up the Academic Publishing Industry by Jason Shen.

From the post:

This is obviously piracy. And Elsevier, one of the largest academic journal publishers, is furious. In 2015, the company earned $1.1 billion in profits on $2.9 billion in revenue [2] and Sci-hub directly attacks their primary business model: subscription service it sells to academic organizations who pay to get access to its journal articles. Elsevier filed a lawsuit against Sci-Hub in 2015, claiming Sci-hub is causing irreparable injury to the organization and its publishing partners.

But while Elsevier sees Sci-Hub as a major threat, for many scientists and researchers, the site is a gift from the heavens, because they feel unfairly gouged by the pricing of academic publishing. Elsevier is able to boast a lucrative 37% profit margin because of the unusual (and many might call exploitative) business model of academic publishing:

  • Scientists and academics submit their research findings to the most prestigious journal they can hope to land in, without getting any pay.
  • The journal asks leading experts in that field to review papers for quality (this is called peer-review and these experts usually aren’t paid)
  • Finally, the journal turns around and sells access to these articles back to scientists/academics via the organization-wide subscriptions at the academic institution where they work or study

There’s piracy afoot, of that I have no doubt. Consider Elsevier, which:

  • Relies on research it does not sponsor
  • Research results are submitted to it for free
  • Research is reviewed for free
  • Research is published in journals of value only because of the free contributions to them
  • Elsevier makes a 37% profit off of that free content

There is piracy but Jason fails to point to Elsevier as the pirate.

Sci-Hub/Alexandra Elbakyan is re-distributing intellectual property that was stolen by Elsevier from the academic community, for its own gain.

It’s time to bring Elsevier’s reign of terror against the academic community to an end. Support Sci-Hub in any way possible.

Reproducible Research Resources for Research(ing) Parasites

Friday, June 3rd, 2016

Reproducible Research Resources for Research(ing) Parasites by Scott Edmunds.

From the post:

Two new research papers on scabies and tapeworms published today showcase a new collaboration with This demonstrates a new way to share scientific methods that allows scientists to better repeat and build upon these complicated studies on difficult-to-study parasites. It also highlights a new means of writing all research papers with citable methods that can be updated over time.

While there has been recent controversy (and hashtags in response) from some of the more conservative sections of the medical community calling those who use or build on previous data “research parasites”, as data publishers we strongly disagree with this. And also feel it is unfair to drag parasites into this when they can teach us a thing or two about good research practice. Parasitology remains a complex field given the often extreme differences between parasites, which all fall under the umbrella definition of an organism that lives in or on another organism (host) and derives nutrients at the host’s expense. Published today in GigaScience are articles on two parasitic organisms, scabies and on the tapeworm Schistocephalus solidus. Not only are both papers in parasitology, but the way in which these studies are presented showcase a new collaboration with that provides a unique means for reporting the Methods that serves to improve reproducibility. Here the authors take advantage of their open access repository of scientific methods and a collaborative protocol-centered platform, and we for the first time have integrated this into our submission, review and publication process. We now also have a groups page on the portal where our methods can be stored.

A great example of how sharing data advances research.

Of course, that assumes that one of your goals is to advance research and not solely yourself, your funding and/or your department.

Such self-centered, as opposed to research-centered, individuals do exist, but I would not malign true parasites by applying the label to such people, even colloquially.

The days of science data hoarders are numbered and one can only hope that the same is true for the “gatekeepers” of humanities data, manuscripts and artifacts.

The only known contribution of hoarders or “gatekeepers” has been to retard their respective disciplines.

Given the choice of advancing your field along with yourself, or only yourself, which one will you choose?

Open Data Institute – Join From £1 (Süddeutsche Zeitung (SZ), “Nein!”)

Tuesday, April 26th, 2016

A new offer for membership in the Open Data Institute:

Data impacts everybody. It’s the infrastructure that underpins transparency, accountability, public services, business innovation and civil society.

Together we can embrace open data to improve how we access healthcare services, discover cures for diseases, understand our governments, travel around more easily and much, much more.

Are you eager to learn more about it, collaborate with it or meet others who are already making a difference with it? From just £1 join our growing, collaborative global network of individuals, students, businesses, startups and organisations, and receive:

  • invitations to events and open evenings organised by the ODI and beyond
  • opportunities to promote your own news and events across the network
  • updates up to twice a month from the world of data and open innovation
  • 30% discount on all our courses
  • 20% reduction on our annual ODI Summit

Become a member from £1

I’d like to sign my organisation up

If you search the ODI membership for Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, you will come up empty.

SZ is in favor of transparency and accountability, but only for others. Never for SZ.

SZ claims in some venues to be concerned with the privacy of individuals mentioned in the Panama Papers.

How to weigh the privacy rights of individuals who were parties to the looting of nations against the public’s right to judge reporting on the same? How is financial regulation reform possible without the details?

SZ is comfortable with protecting looters of nations and obstructing meaningful financial reform.

You can judge news media by the people they protect.

Academic, Not Industrial Secrecy

Saturday, March 12th, 2016

Data too important to share: do those who control the data control the message? by Peter Doshi (BMJ 2016;352:i1027).

Read Peter’s post for the details but the problem in a nutshell:

“The main concern we had was that Fresenius was involved in the process,” Myburgh explained. He said there was never any question of Krumholz’s independence or credentials. Rather, it was a “concern that this was a way for Fresenius to get the data once they were in the public domain. We want restrictions on who could do the analyses.

Under the YODA model Krumholz proposed, the data would be reanalysed by independent parties before being made more broadly available.

“We have no issue with the concept of data sharing,” Myburgh said. “The concerns we have come down to the people with ulterior motives which contradict or do not adhere to the scientific principles we adhere to. That’s the danger.”

Myburgh described himself as an impartial scientist, in contrast to those who have challenged his study. “I’ve heard some of the protagonists of starch. Senior figures wanted to make a point. We do research to answer a question. They do analyses to prove a point.” (emphasis added)

You can hear the echoes of Myburgh’s position of:

We want restrictions on who could do the analyses.

in every government claim for not releasing data that supports government conclusions.

If “terrorists” really are the danger the government claims, don’t you think releasing the data on which that claim is based would convince everyone? Or nearly everyone?

Ah, but some of us might not think opposing corrupt, puppet governments in the Middle East is the same thing as “terrorism.”

And still others of us might not think opposing an oppressive theocracy is the same as “terrorism.”

Yes, more data could lead to more informed discussion, but it could also lead to inconvenient questions.

If Myburgh and colleagues were to find that this was their last funded study from any source, unless and until they released this and other trial data, they would sing a different tune.

Anyone with a list of the funders for Myburgh and his colleagues?

Email addresses would be a good start.

Sci-Hub Tip: Converting Paywall DOIs to Public Access

Thursday, February 11th, 2016

In a tweet, Jon Tennant points out that:

Reminder: add “” after the .com in the URL of pretty much any paywalled paper to gain instant free access.

BTW, I tested Jon’s advice with:****/*******

re-cast as:****/*******

And it works!

With a little scripting, you can convert your paywall DOIs into public access with

This “worked for me” so if you encounter issues, please ping me so I can update this post.
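The “little scripting” amounts to a one-line string transform. A minimal sketch (the actual Sci-Hub mirror domain is elided above and changes over time, so `MIRROR` below is a deliberate placeholder you must fill in yourself; the example URL and DOI are also made up):

```python
# Rewrite a paywalled DOI link so the mirror suffix follows ".com",
# per the tip above. MIRROR is a placeholder, NOT a real mirror;
# substitute the current Sci-Hub domain, which is elided in this post.
MIRROR = ".example"

def to_open_access(url: str, mirror: str = MIRROR) -> str:
    """Insert the mirror suffix after the first '.com' in a paywall URL."""
    head, sep, tail = url.partition(".com")
    if not sep:
        raise ValueError("no '.com' in URL; the tip only applies to .com hosts")
    return head + sep + mirror + tail

print(to_open_access("https://doi.example-publisher.com/10.1000/xyz123"))
```

Loop that over a file of DOIs and you have converted a reading list in one pass.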

Happy reading!

Tackling Zika

Thursday, February 11th, 2016

F1000Research launches rapid, open, publishing channel to help scientists tackle Zika

From the post:

ZAO provides a platform for scientists and clinicians to publish their findings and source data on Zika and its mosquito vectors within days of submission, so that research, medical and government personnel can keep abreast of the rapidly evolving outbreak.

The channel provides diamond-access: it is free to access and articles are published free of charge. It also accepts articles on other arboviruses such as Dengue and Yellow Fever.

The need for the channel is clearly evidenced by a recent report on the global response to the Ebola virus by the Harvard-LSHTM (London School of Hygiene & Tropical Medicine) Independent Panel.

The report listed ‘Research: production and sharing of data, knowledge, and technology’ among its 10 recommendations, saying: “Rapid knowledge production and dissemination are essential for outbreak prevention and response, but reliable systems for sharing epidemiological, genomic, and clinical data were not established during the Ebola outbreak.”

Dr Megan Coffee, an infectious disease clinician at the International Rescue Committee in New York, said: “What’s published six months, or maybe a year or two later, won’t help you – or your patients – now. If you’re working on an outbreak, as a clinician, you want to know what you can know – now. It won’t be perfect, but working in an information void is even worse. So, having a way to get information and address new questions rapidly is key to responding to novel diseases.”

Dr. Coffee is also a co-author of an article published in the channel today, calling for rapid mobilisation and adoption of open practices in an important strand of the Zika response: drug discovery –

Sean Ekins, of Collaborative Drug Discovery, and lead author of the article, which is titled ‘Open drug discovery for the Zika virus’, said: “We think that we would see rapid progress if there was some call for an open effort to develop drugs for Zika. This would motivate members of the scientific community to rally around, and centralise open resources and ideas.”

Another co-author of the article, Lucio Freitas-Junior of the Brazilian Biosciences National Laboratory, added: “It is important to have research groups working together and sharing data, so that scarce resources are not wasted in duplication. This should always be the case for neglected diseases research, and even more so in the case of Zika.”

Rebecca Lawrence, Managing Director, F1000, said: “One of the key conclusions of the recent Harvard-LSHTM report into the global response to Ebola was that rapid, open data sharing is essential in disease outbreaks of this kind and sadly it did not happen in the case of Ebola.

“As the world faces its next health crisis in the form of the Zika virus, F1000Research has acted swiftly to create a free, dedicated channel in which scientists from across the globe can share new research and clinical data, quickly and openly. We believe that it will play a valuable role in helping to tackle this health crisis.”


For more information:

Andrew Baud, Tala (on behalf of F1000), +44 (0) 20 3397 3383 or +44 (0) 7775 715775

Excellent news for researchers but a direct link to the new channel would have been helpful as well: Zika & Arbovirus Outbreaks (ZAO).

See this post: The Zika & Arbovirus Outbreaks channel on F1000Research by Thomas Ingraham.

News organizations should note that as of today, 11 February 2016, ZAO offers 9 articles, 16 posters and 1 set of slides. Those numbers are likely to increase rapidly.

Oh, did I mention the ZAO channel is free?

Unlike some journals, payment, prestige, and privilege are not prerequisites for publication.

Useful research on Zika & Arboviruses is the only requirement.

I know, sounds like a dangerous precedent but defeating a disease like Zika will require taking risks.

Addressing The Concerns Of The Selfish

Monday, January 25th, 2016

A burnt hand didn’t teach any lessons to Dr. Jeffrey M. Drazen of the New England Journal of Medicine (NEJM).

Just last week, Jeffrey and a co-conspirator took to the editorial page of the NEJM to denounce as “parasites” scientists who reuse data developed by others, especially if the data developers weren’t included in the new work. See: Parasitic Re-use of Data? Institutionalizing Toadyism.

Overly sensitive, as protectors of greedy people tend to be, Jeffrey returns to the editorial page to say:

In the process of formulating our policy, we spoke to clinical trialists around the world. Many were concerned that data sharing would require them to commit scarce resources with little direct benefit. Some of them spoke pejoratively in describing data scientists who analyze the data of others.3 To make data sharing successful, it is important to acknowledge and air those concerns.(Data Sharing and The Journal)

On target with concerns about data sharing requiring “…scarce resources with little direct benefit.”

Except Jeffrey forgot to mention that in his editorial about “parasites.”

Not a single word. The “cost free” myth of sharing data persists and the NEJM’s voice could be an important one in dispelling that myth.

But not Jeffrey, he took up his lance to defend the concerns of the selfish.

I will post separately on the issue of the cost of data sharing, etc., which as I say, is a legitimate concern.

We don’t need to resort to toadyism to satisfy the concerns of scientists over re-use of their data.

Create all the needed mechanisms to compensate for the sharing of data and if anyone objects or has “concerns” about re-use of data, cease funding them and/or any project of which they are a member.

There is no right to public funding for research, especially for scientists who have developed a sense of entitlement to public funding, for their own benefit.

You might want to compare the NEJM position to that of the radio astronomy community which shares both raw and processed data with anyone who wants to download it.

It’s a question of “privilege,” and not public safety, etc.

It’s annoying enough that people are selfish with research data, don’t be dishonest as well.

Introducing Kaggle Datasets [No Data Feudalism Here]

Saturday, January 23rd, 2016

Introducing Kaggle Datasets

From the post:

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility to the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Are you interested in publishing one of your datasets? Submit a sample here.

Unlike some medievalists who publish in the New England Journal of Medicine, Kaggle not only makes the data sets freely available, but offers tools to help you along.

Kaggle will also assist you in making your datasets available as well.

Parasitic Re-use of Data? Institutionalizing Toadyism.

Thursday, January 21st, 2016

Data Sharing by Dan L. Longo, M.D., and Jeffrey M. Drazen, M.D, N Engl J Med 2016; 374:276-277 January 21, 2016 DOI: 10.1056/NEJMe1516564.

This editorial in the New England Journal of Medicine advocates the following for re-use of medical data:

How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

I had to check my calendar to make sure April the 1st hadn’t slipped up on me.

This is one of the most bizarre and malignant proposals on data re-use that I have seen.

If you have an original idea, you have to approach other researchers as a suppliant and ask them to benefit from your idea, possibly using their data in new and innovative ways?

Does that smack of a “good old boys/girls” club to you?

If anyone uses the term parasitic or parasite with regard to data re-use, be sure to respond with the question:

How much do dogs in the manger contribute to science?

That phenomenon is not unknown in the humanities, nor in biblical studies. There was a wave of very disgusting dissertations that began: “…X entrusted me with this fragment of the Dead Sea Scrolls….”

I suppose those professors knew better than I did whether they could attract students on merit rather than by hoarding original text fragments. You should judge them by their choices.

rOpenSci (updated tutorials) [Learn Something, Write Something]

Monday, January 4th, 2016

rOpenSci has updated 16 of its tutorials!

More are on the way!

Need a detailed walk through of what our packages allow you to do? Click on a package below, quickly install it and follow along. We’re in the process of updating existing package tutorials and adding several more in the coming weeks. If you find any bugs or have comments, drop a note in the comments section or send us an email. If a tutorial is available in multiple languages we indicate that with badges, e.g., (English) (Português).

  • alm    Article-level metrics
  • antweb    AntWeb data
  • aRxiv    Access to arXiv text
  • bold    Barcode data
  • ecoengine    Biodiversity data
  • ecoretriever    Retrieve ecological datasets
  • elastic    Elasticsearch R client
  • fulltext    Text mining client
  • geojsonio    GeoJSON/TopoJSON I/O
  • gistr    Work w/ GitHub Gists
  • internetarchive    Internet Archive client
  • lawn    Geospatial Analysis
  • musemeta    Scrape museum metadata
  • rAltmetric    Altmetric.com client
  • rbison    Biodiversity data from USGS
  • rcrossref    Crossref client
  • rebird    eBird client
  • rentrez    Entrez client
  • rerddap    ERDDAP client
  • rfisheries    OpenFisheries client
  • rgbif    GBIF biodiversity data
  • rinat    Inaturalist data
  • RNeXML    Create/consume NeXML
  • rnoaa    Client for many NOAA datasets
  • rplos    PLOS text mining
  • rsnps    SNP data access
  • rvertnet    VertNet biodiversity data
  • rWBclimate    World Bank Climate data
  • solr    SOLR database client
  • spocc    Biodiversity data one stop shop
  • taxize    Taxonomic toolbelt
  • traits    Trait data
  • treebase    Treebase data
  • wellknown    Well-known text <-> GeoJSON
  • More tutorials on the way.

Good documentation is hard to come by and good tutorials even more so.

Yet here at rOpenSci you will find thirty-four (34) tutorials, with more on the way.
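Many of the packages above are thin wrappers over public REST APIs, which is why the tutorials can get you productive so quickly. As a rough illustration (in Python rather than R, with a query string chosen only for this example), the Crossref endpoint that rcrossref wraps can be addressed by building a URL like this:

```python
from urllib.parse import urlencode

def crossref_works_url(query, rows=5):
    """Build a query URL for the Crossref REST API /works endpoint,
    the service the rcrossref package wraps."""
    base = "https://api.crossref.org/works"
    return base + "?" + urlencode({"query": query, "rows": rows})

# Fetching this URL (with urllib.request, for instance) returns
# JSON metadata for the matching works.
print(crossref_works_url("open data", rows=3))
```

The rcrossref tutorial does the equivalent in a single R function call; the point here is only that there is no magic behind it.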

Let’s answer that moronic security saying: See Something, Say Something, with:

Learn Something, Write Something.

Natural England opens-up seabed datasets

Monday, December 21st, 2015

Natural England opens-up seabed datasets by Hannah Ross.

From the post:

Following the Secretary of State’s announcement in June 2015 that Defra would become an open, data driven organisation we have been working hard at Natural England to start unlocking our rich collection of data. We have opened up 71 data sets, our first contribution to the #OpenDefra challenge to release 8000 sets of data by June 2016.

What is the data?

The data is primarily marine data which we commissioned to help identify marine protected areas (MPAs) and monitor their condition.

We hope that the publication of these data sets will help many people get a better understanding of:

  • marine nature and its conservation and monitoring
  • the location of habitats sensitive to human activities such as oil spills
  • the environmental impact of a range of activities from fishing to the creation of large marinas

The data is available for download on the EMODnet Seabed Habitats website under the Open Government Licence and more information about the data can be found at DATA.GOV.UK.

This is just the start…

Throughout 2016 we will be opening up lots more of our data, from species records to data from aerial surveys.

We’d like to know what you think of our data; please take a look and let us know what you think at

Image: Sea anemone (sunset cup-coral), Copyright (CC by-nc-nd 2.0) Natural England/Roger Mitchell 1978.

Great new data source and looking forward to more.
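DATA.GOV.UK runs on CKAN, so the catalogue entries for these datasets can also be found programmatically. A minimal sketch, assuming CKAN's standard v3 action API and an illustrative query string:

```python
from urllib.parse import urlencode

def ckan_search_url(base, query, rows=10):
    """Build a package_search URL against a CKAN catalogue's action API."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"

# Fetching this URL returns JSON describing the matching datasets,
# including their download links.
print(ckan_search_url("https://data.gov.uk", "marine protected areas"))
```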

A welcome layer on this data would be, where possible, identification of activities and people responsible for degradation of sea anemone habitats.

Sea anemones are quite beautiful but lack any ability to defend themselves against human disruption of their environment.

Preventing disruption of sea anemone habitats is a step forward.

Discouraging those who practice disruption of sea anemone habitats is another.

Planet Platform Beta & Open California:…

Friday, October 16th, 2015

Planet Platform Beta & Open California: Our Data, Your Creativity by Will Marshall.

From the post:

At Planet Labs, we believe that broad coverage frequent imagery of the Earth can be a significant tool to address some of the world’s challenges. But this can only happen if we democratise access to it. Put another way, we have to make data easy to access, use, and buy. That’s why I recently announced at the United Nations that Planet Labs will provide imagery in support of projects to advance the Sustainable Development Goals.

Today I am proud to announce that we’re releasing a beta version of the Planet Platform, along with our imagery of the state of California under an open license.

The Planet Platform Beta will enable a pioneering cohort of developers, image analysts, researchers, and humanitarian organizations to get access to our data, web-based tools and APIs. The goal is to provide a “sandbox” for people to start developing and testing their apps on a stack of openly available imagery, with the goal of jump-starting a developer community; and collecting data feedback on Planet’s data, tools, and platform.

Our Open California release includes two years of archival imagery of the whole state of California from our RapidEye satellites and 2 months of data from the Dove satellite archive; and will include new data collected from both constellations on an ongoing basis, with a two-week delay. The data will be under an open license, specifically CC BY-SA 4.0. The spirit of the license is to encourage R&D and experimentation in an “open data” context. Practically, this means you can do anything you want, but you must “open” your work, just as we are opening ours. It will enable the community to discuss their experiments and applications openly, and thus, we hope, establish the early foundation of a new geospatial ecosystem.

California is our first Open Region, but shall not be the last. We will open more of our data in the future. This initial release will inform how we deliver our data set to a global community of customers.

Resolution for the Dove satellites is 3-5 meters; for the RapidEye satellites it is 5 meters.

Not quite goldfish bowl or Venice Beach resolution but useful for other purposes.

Now would be a good time to become familiar with managing and annotating satellite imagery. Higher resolutions, public and private, are only a matter of time.
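To put those resolutions in perspective, a quick back-of-the-envelope calculation (using an approximate figure of 423,970 km² for the area of California):

```python
CALIFORNIA_KM2 = 423_970  # approximate area of California

def pixels_to_cover(area_km2, resolution_m):
    """Number of pixels needed to image an area at a given ground
    resolution: one pixel covers resolution_m ** 2 square meters."""
    return area_km2 * 1_000_000 / resolution_m ** 2

for res_m in (3, 5):
    count = pixels_to_cover(CALIFORNIA_KM2, res_m)
    print(f"{res_m} m resolution: {count:.2e} pixels per complete pass")
```

On the order of tens of billions of pixels per statewide pass, before you multiply by two years of RapidEye archive. Tooling for managing imagery at that scale is not optional.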