Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 12, 2015

Data Portals

Filed under: Open Data — Patrick Durusau @ 7:59 pm

Data Portals

From the webpage:

A Comprehensive List of Open Data Portals from Around the World

Two things spring to mind:

First, the number of portals seems a bit light given the rate of data accumulation.

Second, take a look at the geographic distribution of data portals. Asia and Northern Africa seem rather sparse, don’t you think?

September 13, 2015

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories [But Who Pays?]

Filed under: Data Repositories,Open Data — Patrick Durusau @ 9:30 pm

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories by Kirk Borne.

From the post:

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

The following seven V’s represent characteristics and challenges of open data:

  1. Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety: the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.

Open Data has many benefits when the 7 V’s are answered!

Kirk doesn’t address who pays the cost of answering the 7 V’s.

The most obvious one for topic maps:

#5 Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use….

Yes, “…when you provide the data for others to use.” If I can use my data without documenting the semantics and schema (data models), who covers the cost of my creating that documentation and schemas?

In any sufficiently large enterprise, when you ask for assistance, the response will be a request for the contract number to which the assistance should be billed.

If you know your Heinlein, then you know the acronym TANSTAAFL (“There ain’t no such thing as a free lunch”) and its application here is obvious.

Or should I say its application is obvious from the repeated calls for better documentation and models and the continued absence of the same?

Who do you think should be paying for better documentation and data models?

June 17, 2015

Put Your Open Data Where Your Mouth Is (Deadline for Submission: 28 June 2015)

Filed under: Education,Open Access,Open Data — Patrick Durusau @ 4:03 pm

Open Data as Open Educational Resources – Case Studies: Call for Participation

From the call:

The Context:

Open Data is invaluable to support researchers, but we contend that open datasets used as Open Educational Resources (OER) can also be an invaluable asset for teaching and learning. The use of real datasets can enable a series of opportunities for students to collaborate across disciplines, to apply quantitative and qualitative methods, to understand good practices in data retrieval, collection and analysis, and to participate in research-based learning activities which develop independent research, teamwork, critical and citizenship skills. (For more detail please see: http://education.okfn.org/the-21st-centurys-raw-material-using-open-data-as-open-educational-resources)

The Call:

We are inviting individuals and teams to submit case studies describing experiences in the use of open data as open educational resources. Proposals are open to everyone who would like to promote good practices in pedagogical uses of open data in an educational context. The selected case studies will be published in an open e-book (CC_BY_NC_SA) hosted by the Open Knowledge Foundation Open Education Group http://education.okfn.org by mid September 2015.

Participation in the call requires the submission of a short proposal describing the case study (of around 500 words). All proposals must be written in English; however, the selected authors will have the opportunity to submit the case both in English and another language, as our aim is to support the adoption of good practices in the use of open data in different countries.

Key dates:

  • Deadline for submission of proposals (approx. 500 words): 28th June
  • Notification to accepted proposals: 5th of July
  • Draft case study submitted for review (1500 – 2000 words): 26th of July
  • Publication-ready deadline: 16th of August
  • Publication date: September 2015

If you have any questions or comments please contact us by filling the “contact the editors” box at the end of this form

Javiera Atenas https://twitter.com/jatenas
Leo Havemann https://twitter.com/leohavemann

http://www.idea-space.eu/idea/72/info

Use of open data implies a readiness to further the use of open data. One way to honor that implied obligation is to share with others your successes and just as importantly, any failures in the use of open data in an educational context.

All too often we hear only a steady stream of success stories and wonder where others found such perfect students, assistants, and clean data to underlie their success. Never realizing that their students, assistants and data are no better and no worse than ours. The regular mis-steps, false starts and outright wrong paths are omitted in the storytelling. For time’s sake, no doubt.

If you can, do participate in this effort, even if you only have a success story to relate. 😉

June 11, 2015

Don’t Think Open Access Is Important?…

Filed under: Open Access,Open Data — Patrick Durusau @ 2:39 pm

Don’t Think Open Access Is Important? It Might Have Prevented Much Of The Ebola Outbreak by Mike Masnick

From the post:

For years now, we’ve been talking up the importance of open access to scientific research. Big journals like Elsevier have generally fought against this at every point, arguing that its profits are more important than some hippy dippy idea around sharing knowledge. Except, as we’ve been trying to explain, it’s that sharing of knowledge that leads to innovation and big health breakthroughs. Unfortunately, it’s often pretty difficult to come up with a concrete example of what didn’t happen because of locked up knowledge. And yet, it appears we have one new example that’s rather stunning: it looks like the worst of the Ebola outbreak from the past few months might have been avoided if key research had been open access, rather than locked up.

That, at least, appears to be the main takeaway of a recent NY Times article by the team in charge of drafting Liberia’s Ebola recovery plan. What they found was that the original detection of Ebola in Liberia was held up by incorrect “conventional wisdom” that Ebola was not present in that part of Africa:

Mike goes on to point out that knowledge about Ebola in Liberia was published in pay-per-view medical journals, which would have been prohibitively expensive for Liberian doctors.

He has a valid point but how often do primary care physicians consult research literature? And would they have the search chops to find research from 1982?

I am very much in favor of open access but open access on its own doesn’t bring about access or meaningful use of information once accessed.

June 10, 2015

The challenge of combining 176 x #otherpeoplesdata…

Filed under: Biodiversity,Biology,Github,Integration,Open Data — Patrick Durusau @ 10:39 am

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, and Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database ^1.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project (a sketch of the pipeline idea follows the list):

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database
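
To make the “script everything and rebuild from source” idea concrete, here is a minimal, hypothetical sketch of such a pipeline in Python. The directory layout, column names and unit conversion are my own inventions for illustration, not BAAD’s actual code (the real scripts are in the project’s GitHub repository).

# A minimal, hypothetical sketch of a scripted "rebuild from source" pipeline.
# Raw files are never edited; per-study metadata is kept as data (a CSV), not code.
import csv
import pathlib

RAW_DIR = pathlib.Path("data/raw")              # raw study files, never modified
META_FILE = pathlib.Path("data/metadata.csv")   # per-study metadata as data
OUT_FILE = pathlib.Path("output/combined.csv")  # rebuilt on every run

def load_metadata():
    """Read per-study metadata (column mappings, unit conversions) from CSV."""
    with META_FILE.open(newline="") as f:
        return {row["study_id"]: row for row in csv.DictReader(f)}

def process_study(path, meta):
    """Apply a study's documented conversions without touching the raw file."""
    factor = float(meta.get("mass_to_kg") or 1.0)
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "study_id": meta["study_id"],
                "species": row[meta["species_column"]],
                "mass_kg": float(row[meta["mass_column"]]) * factor,
            }

def rebuild():
    """Rebuild the combined database from raw inputs plus metadata."""
    meta = load_metadata()
    OUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with OUT_FILE.open("w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["study_id", "species", "mass_kg"])
        writer.writeheader()
        for path in sorted(RAW_DIR.glob("*.csv")):
            if path.stem in meta:
                writer.writerows(process_study(path, meta[path.stem]))

if __name__ == "__main__":
    rebuild()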

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice, but it does mean that we cannot use the original names, as recorded, as search terms into the literature that existed at the time of the original observations.

Normalization of data often leads to loss of information. Not necessarily, but it often does.
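
One way to limit that loss, sketched below with invented names and a toy synonym table, is to carry the originally recorded name alongside the converted one so that both remain searchable.

# Hypothetical sketch: keep the recorded name next to the converted name,
# so normalization never silently discards what was actually observed.
def normalize_species(name, synonyms):
    """Map a recorded name to an accepted name while preserving the original."""
    return {
        "recorded_name": name,                      # as written in the source study
        "accepted_name": synonyms.get(name, name),  # converted per a chosen reference
    }

synonyms = {"Pinus densiflora Sieb. et Zucc.": "Pinus densiflora"}  # toy mapping
print(normalize_species("Pinus densiflora Sieb. et Zucc.", synonyms))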

I first saw this in a tweet by Dr. Mike Whitfield.

June 4, 2015

Reputation instead of obligation:…

Filed under: Open Access,Open Data,Transparency — Patrick Durusau @ 10:16 am

Reputation instead of obligation: forging new policies to motivate academic data sharing by Sascha Friesike, Benedikt Fecher, Marcel Hebing, and Stephanie Linek.

From the post:

Despite strong support from funding agencies and policy makers academic data sharing sees hardly any adoption among researchers. Current policies that try to foster academic data sharing fail, as they try to either motivate researchers to share for the common good or force researchers to publish their data. Instead, Sascha Friesike, Benedikt Fecher, Marcel Hebing, and Stephanie Linek argue that in order to tap into the vast potential that is attributed to academic data sharing we need to forge new policies that follow the guiding principle reputation instead of obligation.

In 1996, leaders of the scientific community met in Bermuda and agreed on a set of rules and standards for the publication of human genome data. What became known as the Bermuda Principles can be considered a milestone for the decoding of our DNA. These principles have been widely acknowledged for their contribution towards an understanding of disease causation and the interplay between the sequence of the human genome. The principles shaped the practice of an entire research field as it established a culture of data sharing. Ever since, the Bermuda Principles are used to showcase how the publication of data can enable scientific progress.

Considering this vast potential, it comes as no surprise that open research data finds prominent support from policy makers, funding agencies, and researchers themselves. However, recent studies show that it is hardly ever practised. We argue that the academic system is a reputation economy in which researchers are best motivated to perform activities if those pay in the form of reputation. Therefore, the hesitant adoption of data sharing practices can mainly be explained by the absence of formal recognition. And we should change this.

(emphasis in the original)

Understanding what motivates researchers to share data is an important step towards encouraging data sharing.

But at the same time, would we say that every researcher is as good as every other researcher at preparing data for sharing? At documenting data for sharing? At doing any number of tasks that aren’t really research, but just as important in order to share data?

Rather than focusing exclusively on researchers, funders should fund projects to include data sharing specialists who have the skills and interests necessary to share data effectively as part of a project’s output. Their reputations would be more closely tied to the successful sharing of data, and researchers would gain in reputation for the high quality data that is shared. That would be a much better fit for the authors’ recommendation.

Or to put it differently, lecturing researchers on how they should spend their limited time and resources to satisfy your goals isn’t going to motivate anyone. “Pay the man!” (Richard Pryor in Silver Streak)

May 28, 2015

How journals could “add value”

Filed under: Open Access,Open Data,Open Science,Publishing — Patrick Durusau @ 1:57 pm

How journals could “add value” by Mark Watson.

From the post:

I wrote a piece for Genome Biology, you may have read it, about open science. I said a lot of things in there, but one thing I want to focus on is how journals could “add value”. As brief background: I think if you’re going to make money from academic publishing (and I have no problem if that’s what you want to do), then I think you should “add value”. Open science and open access is coming: open access journals are increasingly popular (and cheap!), preprint servers are more popular, green and gold open access policies are being implemented etc etc. Essentially, people are going to stop paying to access research articles pretty soon – think 5-10 year time frame.

So what can journals do to “add value”? What can they do that will make us want to pay to access them? Here are a few ideas, most of which focus on going beyond the PDF:

Humanities journals and their authors should take heed of these suggestions.

Not applicable in every case but certainly better than “journal editorial board as resume padding.”

May 8, 2015

Open Data: Getting Started/Finding

Filed under: Government Data,Open Data — Patrick Durusau @ 8:23 pm

Data Science – Getting Started With Open Data

23 Resources for Finding Open Data

Ryan Swanstrom has put together two posts that will have you using and finding open data.

“Open data” can be a boon to researchers and others, but you should ask the following questions (among others) of any data set:

  1. Who collected the data?
  2. Why was the data collected?
  3. How was the recorded data selected?
  4. How large was the potential data pool?
  5. Was the original data cleaned after collection?
  6. If the original data was cleaned, by what criteria?
  7. How was the accuracy of the data measured?
  8. What instruments were used to collect the data?
  9. How were the instruments used to collect the data developed?
  10. How were the instruments used to collect the data validated?
  11. What publications have relied upon the data?
  12. How did you determine the semantics of the data?

That’s not a complete set but a good starting point.

Just because data is available, open, free, etc. doesn’t mean that it is useful. The best example is the still-in-print Budge translation The Book of the Dead: The Papyrus of Ani in the British Museum. The original was published in 1895, making the current reprints more than a century out of date.

It is a very attractive reproduction (it is rare to see hieroglyphic text with interlinear transliteration and translation in modern editions) of the papyrus of Ani, but it gives a misleading impression of the state of modern knowledge and translation of Middle Egyptian.

Of course, some readers are satisfied with century old encyclopedias as well, but I would not rely upon them or their sources for advice.

March 2, 2015

How To Publish Open Data (in the UK)

Filed under: Humor,Open Data — Patrick Durusau @ 8:49 pm

http://www.owenboswarva.com/opendata/OD_Pub_DecisionTree.jpg

No way this will display properly so I just linked to it.

I don’t know about the UK but a very similar discussion takes place in academic circles before releasing data that less than a dozen people have asked to see, ever.

Enjoy!

I first saw this in a tweet by Irina Bolychevsky.

March 1, 2015

Let Me Get That Data For You (LMGTDFY)

Filed under: Bing,Open Data,Python — Patrick Durusau @ 8:22 pm

Let Me Get That Data For You (LMGTDFY) by U.S. Open Data.

From the post:

LMGTDFY is a web-based utility to catalog all open data file formats found on a given domain name. It finds CSV, XML, JSON, XLS, XLSX, and Shapefiles, and makes the resulting inventory available for download as a CSV file. It does this using Bing’s API.

This is intended for people who need to inventory all data files on a given domain name—these are generally employees of state and municipal government, who are creating an open data repository, and performing the initial step of figuring out what data is already being emitted by their government.

LMGTDFY powers U.S. Open Data’s LMGTDFY site, but anybody can install the software and use it to create their own inventory. You might want to do this if you have more than 300 data files on your site. U.S. Open Data’s LMGTDFY site caps the number of results at 300, in order to avoid winding up with an untenably large invoice for using Bing’s API. (Microsoft allows 5,000 searches/month for free.)

Now there’s a useful utility!
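
If you want a feel for how such an inventory can be built, here is a rough sketch of the approach: loop over file extensions and issue site:/filetype: queries against a web search API, then write the hits out as a CSV. The endpoint, auth header and response shape below are placeholders for illustration only; LMGTDFY’s actual implementation uses Bing’s API and is available from U.S. Open Data.

# Rough sketch of a domain data-file inventory; the search endpoint, auth
# header and response format are hypothetical placeholders, not Bing's API.
import csv
import requests

SEARCH_URL = "https://api.example-search.com/v1/search"  # hypothetical endpoint
API_KEY = "YOUR-KEY"
FORMATS = ["csv", "xml", "json", "xls", "xlsx", "shp"]

def find_data_files(domain):
    rows = []
    for ext in FORMATS:
        resp = requests.get(
            SEARCH_URL,
            params={"q": f"site:{domain} filetype:{ext}", "count": 50},
            headers={"Api-Key": API_KEY},   # hypothetical auth header
            timeout=30,
        )
        resp.raise_for_status()
        for result in resp.json().get("results", []):
            rows.append({"format": ext, "url": result.get("url", "")})
    return rows

def write_inventory(domain, path="inventory.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["format", "url"])
        writer.writeheader()
        writer.writerows(find_data_files(domain))

# write_inventory("example.gov")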

Enjoy!

I first saw this in a tweet by Pycoders Weekly.

February 20, 2015

Academic Karma: a case study in how not to use open data

Filed under: Open Data — Patrick Durusau @ 7:28 pm

Academic Karma: a case study in how not to use open data by Neil Saunders.

From the post:

A news story in Nature last year caused considerable mirth and consternation in my social networks by claiming that ResearchGate, a “Facebook for scientists”, is widely-used and visited by scientists. Since this is true of nobody that we know, we can only assume that there is a whole “other” sub-network of scientists defined by both usage of ResearchGate and willingness to take Nature surveys seriously.

You might be forgiven, however, for assuming that I have a profile at ResearchGate because here it is. Except: it is not. That page was generated automatically by ResearchGate, using what they could glean about me from bits of public data on the Web. Since they have only discovered about one-third of my professional publications, it’s a gross misrepresentation of my achievements and activity. I could claim the profile, log in and improve the data, but I don’t want to expose myself and everyone I know to marketing spam until the end of time.

One issue with providing open data about yourself online is that you can’t predict how it might be used. Which brings me to Academic Karma.

Neil points out that Academic Karma generated an inaccurate profile of Neil’s academic activities, based on partial information from a profile at ResearchGate that Neil did not create.

Neil concludes:

So let me try to spell it out as best I can.

  1. I object to the automated generation of public profiles, without my knowledge or consent, which could be construed as having been generated by me
  2. I especially object when those profiles convey an impression of my character, such as “someone who publishes but does not review”, based on incomplete and misleading data

I’m sure that the Academic Karma team mean well and believe that what they’re doing can improve the research process. However, it seems to me that this is a classic case of enthusiasm for technological solutions without due consideration of the human and social aspects.

To their credit, Academic Karma has stopped listing profiles for people who haven’t requested accounts.

How would you define the “human and social aspects” of open data?

In hindsight, the answer to that question seems to be clear. Or at least is thought to be clear. How do you answer that question before your use of open data goes live?

February 7, 2015

Encouraging open data usage…

Filed under: Government Data,Open Data — Patrick Durusau @ 7:04 pm

Encouraging open data usage by commercial developers: Report

From the post:

The second Share-PSI workshop was very different from the first. Apart from presentations in two short plenary sessions, the majority of the two days was spent in facilitated discussions around specific topics. This followed the success of the bar camp sessions at the first workshop, that is, sessions proposed and organised in an ad hoc fashion, enabling people to discuss whatever subject interests them.

Each session facilitator was asked to focus on three key questions:

  1. What X is the thing that should be done to publish or reuse PSI?
  2. Why does X facilitate the publication or reuse of PSI?
  3. How can one achieve X and how can you measure or test it?

This report summarises the 7 plenary presentations, 17 planned sessions and 7 bar camp sessions. As well as the Share-PSI project itself, the workshop benefited from sessions led by 8 other projects. The agenda for the event includes links to all papers, slides and notes, with many of those notes being available on the project wiki. In addition, the #sharepsi tweets from the event are archived, as are a number of photo albums from Makx Dekkers, Peter Krantz and José Luis Roda. The event received a generous write-up on the host’s Web site (in Portuguese). The spirit of the event is captured in this video by Noël Van Herreweghe of CORVe.

To avoid confusion, PSI in this context means Public Sector Information, not Published Subject Identifier (PSI).

Amazing coincidence that the W3C has smudged yet another name. You may recall the W3C decided to confuse URIs and IRIs in its latest attempt to re-write history, referring to both by the acronym URI:

Within this specification, the term URI refers to a Universal Resource Identifier as defined in [RFC 3986] and extended in [RFC 2987] [RFC 3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as “Base URI” that are defined or referenced across the whole family of XML specifications. (Corrected the RFC listing as shown.) (XQuery and XPath Data Model 3.1 , N. Walsh, J. Snelson, Editors, W3C Candidate Recommendation (work in progress), 18 December 2014, http://www.w3.org/TR/2014/CR-xpath-datamodel-31-20141218/ . Latest version available at http://www.w3.org/TR/xpath-datamodel-31/.)

An interesting discussion, but I would pay very close attention to market demand (perhaps I should say commercial market demand) before planning a start-up based on government data. There is unlimited demand for free data or, even better, free enhanced data, but that should not be confused with enhanced data that can be sold to support a start-up on an ongoing basis.

To give you an idea of the uncertainty of conditions for start-ups relying on open data, let me quote the final bullet points of this article:

  • There is a lack of knowledge of what can be done with open data which is hampering uptake.
  • There is a need for many examples of success to help show what can be done.
  • Any long term re-use of PSI must be based on a business plan.
  • Incubators/accelerators should select projects to support based on the business plan.
  • Feedback from re-users is an important component of the ecosystem and can be used to enhance metadata.
  • The boundary between what the public and private sectors can, should and should not do needs to be better defined to allow the public sector to focus on its core task and businesses to invest with confidence.
  • It is important to build an open data infrastructure, both legal and technical, that supports the sharing of PSI as part of normal activity.
  • Licences and/or rights statements are essential and should be machine readable. This is made easier if the choice of licences is minimised.
  • The most valuable data is the data that the public sector already charges for.
  • Include domain experts who can articulate real problems in hackathons (whether they write code or not).
  • Involvement of the user community and timely response to requests is essential.
  • There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

Just so you know, that last point:

There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

that is not a business model, unless you have renewal financing from some source other than financial gain. That is a charity model where you are the object of the charity.

January 27, 2015

Nature: A recap of a successful year in open access, and introducing CC BY as default

Filed under: Open Access,Open Data,Publishing — Patrick Durusau @ 1:57 pm

A recap of a successful year in open access, and introducing CC BY as default by Carrie Calder, the Director of Strategy for Open Research, Nature Publishing Group/Palgrave Macmillan.

From the post:

We’re pleased to start 2015 with an announcement that we’re now using Creative Commons Attribution license CC BY 4.0 as default. This will apply to all of the 18 fully open access journals Nature Publishing Group owns, and will also apply to any future titles we launch. Two society-owned titles have introduced CC BY as default today and we expect to expand this in the coming months.

This follows a transformative 2014 for open access and open research at Nature Publishing Group. We’ve always been supporters of new technologies and open research (for example, we’ve had a liberal self-archiving policy in place for ten years now. In 2013 we had 65 journals with an open access option) but in 2014 we:

  • Built a dedicated team of over 100 people working on Open Research across journals, books, data and author services
  • Conducted research on whether there is an open access citation benefit, and researched authors’ views on OA
  • Introduced the Nature Partner Journal series of high-quality open access journals and announced our first ten NPJs
  • Launched Scientific Data, our first open access publication for Data Descriptors
  • And last but not least switched Nature Communications to open access, creating the first Nature-branded fully open access journal

We did this not because it was easy (trust us, it wasn’t always) but because we thought it was the right thing to do. And because we don’t just believe in open access; we believe in driving open research forward, and in working with academics, funders and other publishers to do so. It’s obviously making a difference already. In 2013, 38% of our authors chose to publish open access immediately upon publication – in 2014, this percentage rose to 44%. Both Scientific Reports and Nature Communications had record years in terms of submissions for publication.

Open access is on its way to becoming the expected model for publishing. That isn’t to say that there aren’t economies and kinks to be worked out, but the fundamental principles of open access have been widely accepted.

Not everywhere of course. There are areas of scholarship that think self-isolation makes them important. They shun open access as an attack on their traditions of “Doctor Fathers” and access to original materials as a privilege. Strategies that make them all the more irrelevant in the modern world. Pity because there is so much they could contribute to the public conversation. But a public conversation means you are not insulated from questions that don’t accept “because I say so” as an adequate answer.

If you are working in such an area or know of one, press for emulation of the Nature and the many other efforts to provide open access to both primary and secondary materials. There are many areas of the humanities that already follow that model, but not all. Let’s keep pressing until open access is the default for all disciplines.

Kudos to Nature for their ongoing efforts on open access.

I first saw the news about Nature’s post in a tweet by Ethan White.

January 21, 2015

How to share data with a statistician

Filed under: Open Data,Statistics — Patrick Durusau @ 7:38 pm

How to share data with a statistician by Robert M. Horton.

From the webpage:

This is a guide for anyone who needs to share data with a statistician. The target audiences I have in mind are:

  • Scientific collaborators who need statisticians to analyze data for them
  • Students or postdocs in scientific disciplines looking for consulting advice
  • Junior statistics students whose job it is to collate/clean data sets

The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. The Leek group works with a large number of collaborators and the number one source of variation in the speed to results is the status of the data when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.

My strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important to see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of variability in one’s data analysis. On the other hand, for many data types, the processing steps are well documented and standardized. So the work of converting the data from raw form to directly analyzable form can be performed before calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn’t have to work through all the pre-processing steps first.

My favorite part:

The code book

For almost any data set, the measurements you calculate will need to be described in more detail than you will sneak into the spreadsheet. The code book contains this information. At minimum it should contain:

  1. Information about the variables (including units!) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used

Does a codebook exist for the data that goes into or emerges from your data processing?

If someone has to ask you what variables mean, it’s not really “open” data, is it?
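
As a toy illustration (the variables and units are invented), a codebook can be as simple as a small CSV that travels with the tidy data:

# Toy example of writing a codebook that records variables, units and how
# each value was derived; the variables below are hypothetical.
import csv

codebook = [
    {"variable": "site_id", "units": "", "description": "Unique site identifier assigned at collection"},
    {"variable": "temp_mean_c", "units": "degrees Celsius", "description": "Daily mean of hourly sensor readings"},
    {"variable": "rain_mm", "units": "millimetres", "description": "Total daily rainfall from a tipping-bucket gauge"},
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "units", "description"])
    writer.writeheader()
    writer.writerows(codebook)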

I first saw this in a tweet by Christophe Lalanne.

January 16, 2015

Key Court Victory Closer for IRS Open-Records Activist

Filed under: Government Data,Open Data — Patrick Durusau @ 8:12 pm

Key Court Victory Closer for IRS Open-Records Activist by Suzanne Perry.

From the post:

The open-records activist Carl Malamud has moved a step closer to winning his legal battle to give the public greater access to the wealth of information on Form 990 tax returns that nonprofits file.

During a hearing in San Francisco on Wednesday, U.S. District Judge William Orrick said he tentatively planned to rule in favor of Mr. Malamud’s group, Public. Resource. Org, which filed a lawsuit to force the Internal Revenue Service to release nonprofit tax forms in a format that computers can read. That would make it easier to conduct online searches for data about organizations’ finances, governance, and programs.

“It looks like a win for Public. Resource and for the people who care about electronic access to public documents,” said Thomas Burke, the group’s lawyer.

The suit asks the IRS to release Forms 990 in machine-readable format for nine nonprofits that had submitted their forms electronically. Under current practice, the IRS converts all Forms 990 to unsearchable image files, even those that have been filed electronically.

That’s a step in the right direction but not all that will be required.

Suzanne goes on to note that the IRS removes donor lists from the 990 forms.

Any number of organizations will object but I think the donor lists should be public information as well.

Making all donors public may discourage some people from donating to unpopular causes but that’s a hit I would be willing to take to know who owns the political non-profits. And/or who funds the NRA for example.

Data that isn’t open enough to know who is calling the shots at organizations isn’t open data; it’s an open data tease.

January 14, 2015

SODA Developers

Filed under: Government Data,Open Data,Programming — Patrick Durusau @ 7:50 pm

SODA Developers

From the webpage:

The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.

I have mentioned Socrata and their Open Data efforts more than once on this blog but I don’t think I have ever pointed to their developer site.

Very much worth spending time here if you are interested in governmental data.
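
If you want to pull a dataset programmatically, the SODA pattern is straightforward: each dataset is exposed at /resource/<dataset-id>.json and accepts SoQL query parameters such as $limit and $where. A minimal sketch follows; the domain and dataset id are placeholders, not a real dataset.

# Minimal SODA sketch; replace the domain and dataset id with a real
# Socrata-hosted portal and dataset before running.
import requests

DOMAIN = "data.example.gov"     # any Socrata-hosted portal
DATASET_ID = "abcd-1234"        # hypothetical dataset identifier

def fetch_rows(limit=100, where=None):
    url = f"https://{DOMAIN}/resource/{DATASET_ID}.json"
    params = {"$limit": limit}
    if where:
        params["$where"] = where  # SoQL filter, e.g. "amount > 1000"
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()            # list of row dicts

# rows = fetch_rows(limit=5)
# print(len(rows), "rows retrieved")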

Not that I take any data, government or otherwise, at face value. Data is created and released/leaked for reasons that may or may not coincide with your assumptions or goals. Access to data is just the first step in uncovering whose interests the data represents.

January 4, 2015

Project Open Data Dashboard

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 1:54 pm

Project Open Data Dashboard

From the about page:

This website shows how Federal agencies are performing on the latest Open Data Policy (M-13-13) using the guidance provided by Project Open Data. It also provides many other tools and resources to help agencies and other interested parties implement their open data programs. Features include:

  • A dashboard to track the progress of agencies implementing Project Open Data on a quarterly basis
  • Automated analysis of URLs provided within metadata to see if the links work as expected
  • A validator for v1.0 and v1.1 of the Project Open Data Metadata Schema
  • A converter to transform CSV files into JSON as defined by the Project Open Data Metadata Schema (link broken as of 4 January 2015; site notified)
  • An export API to export from the CKAN API and transform the metadata into JSON as defined by the Project Open Data Metadata Schema
  • A changeset viewer to compare a data.json file to the metadata currently available in CKAN (eg catalog.data.gov)

You can learn more by reading the main documentation page.

The main documentation defines the “Number of Datasets” on the dashboard as:

This element accounts for the total number of all datasets listed in the Enterprise Data Inventory. This includes those marked as “Public”, “Non-Public” and “Restricted”.

If you compare the “Milestone – May 31st 2014” to November, the number of data sets increases in most cases, as you would expect. However, both the Department of Commerce and the Department of Health and Human Services had decreases in the number of available data sets.

On May 31st, the Department of Commerce listed 20488 data sets but on November 30th, only 372. A decrease of more than 20,000 data sets.

On May 31st, the Department of Health and Human Services listed 1507 data sets but on November 30th, only 1064, a decrease of 443 data sets.

Looking further, the sudden decrease for both agencies occurred between Milestone 3 and Milestone 4 (August 31st 2014).

Sounds exciting! Yes?

Yes, but this illustrates why you should “drill down” in data whenever possible. And if that is not possible in the interface, check other sources.

I followed the Department of Commerce link (the first column on the left) to the details of the crawl and thence the data link to determine the number of publicly available data sets.

As of today, 04 January 2015, the Department of Commerce has 23,181 datasets and not the 372 reported for Milestone 5 or the 268 reported for Milestone 4.

As of today, 04 January 2015, the Department of Health and Human Services has 1,672 datasets and not the 1064 reported for Milestone 5 or the 1088 reported for Milestone 4.

The reason(s) for the differences are unclear and the dashboard itself offers no explanation for the disparate figures. I suspect there is some glitch in the automatic harvesting of the information and/or in the representation of those results in the dashboard.
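
Drilling down like that is easy to script. The sketch below pulls an agency’s data.json catalog and counts the entries itself rather than relying on the dashboard’s number; the URL follows the standard /data.json convention, so adjust it for the agency you want to check, and note that v1.1 catalogs keep datasets under a “dataset” key while older v1.0 files may be a bare list.

# Count the datasets in an agency's Project Open Data catalog yourself.
# The URL is illustrative; swap in the agency catalog you want to check.
import requests

def count_datasets(catalog_url):
    resp = requests.get(catalog_url, timeout=60)
    resp.raise_for_status()
    catalog = resp.json()
    if isinstance(catalog, dict):              # v1.1: {"dataset": [...]}
        return len(catalog.get("dataset", []))
    return len(catalog)                        # v1.0: bare list of datasets

print(count_datasets("https://www.commerce.gov/data.json"))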

Always remember that just because a representation* claims some “fact,” that doesn’t necessarily make it so.

*Representation: Bear in mind that anything you see on a computer screen is a “representation.” There isn’t anything in data storage that has any resemblance to what you see on the screen. Choices have been made out of your sight as to how information will be represented to you.

As I mentioned yesterday, there is a common and naive assumption that data as represented to us has a reliable correspondence with data held in storage. And that the data held in storage has a reliable correspondence to data as entered or obtained from other sources.

Those assumptions aren’t unreasonable, at least until they are. Can you think of ways to illustrate those principles? I ask because at least one way to illustrate those principles makes an excellent case for open source software. More on that anon.

December 12, 2014

Global Open Data Index

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 3:59 pm

Global Open Data Index

From the about page:

For more information on the Open Data Index, you may contact the team at: index@okfn.org

Each year, governments are making more data available in an open format. The Global Open Data Index tracks whether this data is actually released in a way that is accessible to citizens, media and civil society and is unique in crowd-sourcing its survey of open data releases around the world. Each year the open data community and Open Knowledge produces an annual ranking of countries, peer reviewed by our network of local open data experts.

Crowd-sourcing this data provides a tool for communities around the world to learn more about the open data available locally and by country, and ensures that the results reflect the experience of civil society in finding open information, rather than government claims. It also ensures that those who actually collect the information that builds the Index are the very people who use the data and are in a strong position to advocate for more and higher quality open data.

The Global Open Data Index measures and benchmarks the openness of data around the world, and then presents this information in a way that is easy to understand and use. This increases its usefulness as an advocacy tool and broadens its impact.

In 2014 we are expanding to more countries (from 70 in 2013) with an emphasis on countries of the Global South.

See the blog post launching the 2014 Index. For more information, please see the FAQ and the methodology section. Join the conversation with our Open Data Census discussion list.

It is better to have some data rather than none but look at the data by which countries are ranked for openness:

Transport Timetables, Government Budget, Government Spending, Election Results, Company Register, National Map, National Statistics, Postcodes/Zipcodes, Pollutant Emissions.

A listing of data that earns the United Kingdom a 97% score and first place.

It is hard to imagine a less threatening set of data than those listed. I am sure someone will find a use for them but in the great scheme of things, they are a distraction from the data that isn’t being released.

Off-hand, in the United States at least, public data should include who meets with appointed or elected members of government, along with transcripts of those meetings (including phone calls). It should also include all personal or corporate donations greater than $100.00 made to any organization for any reason. It should include documents prepared and/or submitted to the U.S. government and its agencies. And those are just the ones that come to mind rather quickly.

Current disclosures by the U.S. government are a fiction of openness that conceals a much larger dark data set, waiting to be revealed at some future date.

I first saw this in a tweet by ChemConnector.

December 2, 2014

GiveDirectly (Transparency)

Filed under: Open Access,Open Data,Transparency — Patrick Durusau @ 3:53 pm

GiveDirectly

From the post:

Today we’re launching a new website for GiveDirectly—the first major update since www.givedirectly.org went live in 2011.

Our main goal in reimagining the site was to create radical transparency into what we do and how well we do it. We’ve invested a lot to integrate cutting-edge technology into our field model so that we have real-time data to guide internal management. Why not open up that same data to the public? All we needed were APIs to connect the website and our internal field database (which is powered by our technology partner, Segovia).

Transparency is of course a non-profit buzzword, but I usually see it used in reference to publishing quarterly or annual reports, packaged for marketing purposes—not the kind of unfiltered data and facts I want as a donor. We wanted to use our technology to take transparency to an entirely new level.

Two features of the new site that I’m most excited about:

First, you can track how we’re doing on our most important performance metrics, at the same time we do. For example, the performance chart on the home page mirrors the dashboard we use internally to track performance in the field. If recipients aren’t understanding our program, you’ll learn about it when we do. If the follow-up team falls behind or outperforms, metrics will update accordingly. We want to be honest about our successes and failures alike.

Second, you can verify our claims about performance. We don’t think you should have to trust that we’re giving you accurate information. Each “Verify this” tag downloads a csv file with the underlying raw data (anonymized). Every piece of data is generated by a GiveDirectly staff member’s work in the field and is stored using proprietary software; it’s our end-to-end model in action. Explore the data for yourself and absolutely question us on what you find.
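
That kind of verification can be as simple as recomputing a headline metric from the downloaded CSV yourself. Here is a hedged sketch of what that might look like; the file name and column are hypothetical, not GiveDirectly’s actual export format.

# Recompute a headline metric from an anonymized CSV export; the file name
# and "status" column are hypothetical placeholders.
import csv

def percent_delivered(path="transfers_anonymized.csv"):
    total = delivered = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row.get("status") == "delivered":
                delivered += 1
    return 100.0 * delivered / total if total else 0.0

# print(f"{percent_delivered():.1f}% of recorded transfers marked delivered")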

’Tis the season for soliciting donations, by every known form of media.

Suggestion: Copy and print out this response:

___________________________, I would love to donate to your worthy cause but before I do, please send a weblink to the equivalent of: http://www.givedirectly.org. Wishing you every happiness this holiday season.

___________________________

Where no response or no equivalent website = no donation.

I first saw this in a tweet by Stefano Bertolo.

November 21, 2014

CERN frees LHC data

Filed under: Data,Open Data,Science,Scientific Computing — Patrick Durusau @ 3:55 pm

CERN frees LHC data

From the post:

Today CERN launched its Open Data Portal, which makes data from real collision events produced by LHC experiments available to the public for the first time.

“Data from the LHC program are among the most precious assets of the LHC experiments, that today we start sharing openly with the world,” says CERN Director General Rolf Heuer. “We hope these open data will support and inspire the global research community, including students and citizen scientists.”

The LHC collaborations will continue to release collision data over the coming years.

The first high-level and analyzable collision data openly released come from the CMS experiment and were originally collected in 2010 during the first LHC run. Open source software to read and analyze the data is also available, together with the corresponding documentation. The CMS collaboration is committed to releasing its data three years after collection, after they have been thoroughly studied by the collaboration.

“This is all new and we are curious to see how the data will be re-used,” says CMS data preservation coordinator Kati Lassila-Perini. “We’ve prepared tools and examples of different levels of complexity from simplified analysis to ready-to-use online applications. We hope these examples will stimulate the creativity of external users.”

In parallel, the CERN Open Data Portal gives access to additional event data sets from the ALICE, ATLAS, CMS and LHCb collaborations that have been prepared for educational purposes. These resources are accompanied by visualization tools.

All data on OpenData.cern.ch are shared under a Creative Commons CC0 public domain dedication. Data and software are assigned unique DOI identifiers to make them citable in scientific articles. And software is released under open source licenses. The CERN Open Data Portal is built on the open-source Invenio Digital Library software, which powers other CERN Open Science tools and initiatives.

Awesome is the only term for this data release!

But, when you dig just a little bit further, you discover that embargoes still exist on three (3) out of four (4) experiments, both on data and software.

Disappointing but hopefully a dying practice when it comes to publicly funded data.

I first saw this in a tweet by Ben Evans.

November 12, 2014

Completely open Collections on Europeana

Filed under: Europeana,Open Data — Patrick Durusau @ 8:14 pm

Completely open Collections on Europeana (spreadsheet)

A Google spreadsheet listing collections from Europeana.

The title isn’t completely accurate since it also lists collections that are not completely open.

I count ninety-eight (98) collections that are completely open, another two hundred and thirty-three (233) that use a Creative Commons license, and four hundred and seven (407) that are neither completely open nor under a Creative Commons license.

You will need to check the individual entries to be sure of the licensing rights. I tried MusicMasters, which is listed as closed, to find that one (1) image could be used with attribution and two hundred and forty-seven (247) only with permission.

Europeana is a remarkable site that is marred by a pop-up that takes you to Facebook or to exhibits. For whatever reason, it is a “feature” of this pop-up that it cannot be closed, at least on Firefox and Chrome.

The spreadsheet should be useful as a quick reference for potentially open materials at Europeana.

I first saw this in a tweet by Amanda French.

Preventing Future Rosetta “Tensions”

Filed under: Astroinformatics,Open Access,Open Data — Patrick Durusau @ 2:34 pm

Tensions surround release of new Rosetta comet data by Eric Hand.

From the post:


For the Rosetta mission, there is an explicit tension between satisfying the public with new discoveries and allowing scientists first crack at publishing papers based on their own hard-won data. “There is a tightrope there,” says Taylor, who’s based at ESA’s European Space Research and Technology Centre (ESTEC) in Noordwijk, the Netherlands. But some ESA officials are worried that the principal investigators for the spacecraft’s 11 instruments are not releasing enough information. In particular, the camera team, led by principal investigator Holger Sierks, has come under special criticism for what some say is a stingy release policy. “It’s a family that’s fighting, and Holger is in the middle of it, because he holds the crown jewels,” says Mark McCaughrean, an ESA senior science adviser at ESTEC.

Allowing scientists to withhold data for some period is not uncommon in planetary science. At NASA, a 6-month period is typical for principal investigator–led spacecraft, such as the MESSENGER mission to Mercury, says James Green, the director of NASA’s planetary science division in Washington, D.C. However, Green says, NASA headquarters can insist that the principal investigator release data for key media events. For larger strategic, or “flagship,” missions, NASA has tried to release data even faster. The Mars rovers, such as Curiosity, have put out images almost as immediately as they are gathered.

Sierks, of the Max Planck Institute for Solar System Research in Göttingen, Germany, feels that the OSIRIS team has already been providing a fair amount of data to the public—about one image every week. Each image his team puts out is better than anything that has ever been seen before in comet research, he says. Furthermore, he says other researchers, unaffiliated with the Rosetta team, have submitted papers based on these released images, while his team has been consumed with the daily task of planning the mission. After working on OSIRIS since 1997, Sierks feels that his team should get the first shot at using the data.

“Let’s give us a chance of a half a year or so,” he says. He also feels that his team has been pressured to release more data than other instruments. “Of course there is more of a focus on our instrument,” which he calls “the eyes of the mission.”

What if there were another solution to the Rosetta “tensions” besides 1) privileging researchers with six (6) months of exclusive access to data or 2) releasing data as soon as it is gathered?

I am sure everyone can gather arguments for one or the other of those sides but either gathering or repeating them isn’t going to move the discussion forward.

What if there were an agreed upon registry for data sets (not a repository but a registry) where researchers could register anticipated data and, when acquired, the date the data was deposited to a public repository and a list of researchers entitled to publish using that data?

The set of publications in most subject areas is rather small, and if they agreed not to accept or review papers based upon registered data for six (6) months or some other agreed upon period, that would enable researchers to release data as acquired and yet protect their opportunity for first use of the data for publication purposes.

This simple sketch leaves a host of details to explore and answer but registering data for publication delay could answer the concerns that surround publicly funded data in general.
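
To make the idea slightly more concrete, here is one hypothetical shape a registry entry might take, with just enough fields for a journal to check whether a submission falls inside the agreed window. Every name and value is illustrative.

# Hypothetical registry entry for the publication-delay idea; all fields
# and values are illustrative, not an existing system.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class DatasetRegistration:
    dataset_id: str
    repository_doi: str                  # where the data were deposited
    deposited_on: date
    entitled_authors: list = field(default_factory=list)
    embargo_months: int = 6              # agreed publication-priority window

    def embargo_ends(self):
        return self.deposited_on + timedelta(days=30 * self.embargo_months)

    def others_may_publish(self, on):
        """Journals check this before accepting a paper based on the data."""
        return on >= self.embargo_ends()

entry = DatasetRegistration(
    dataset_id="rosetta-osiris-2014-11",
    repository_doi="10.0000/example",    # placeholder DOI
    deposited_on=date(2014, 11, 12),
    entitled_authors=["OSIRIS team"],
)
print(entry.others_may_publish(date(2015, 6, 1)))   # True: the window has passed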

Thoughts?

November 11, 2014

Crimebot

Filed under: Mapping,Open Data — Patrick Durusau @ 8:41 am

Open Data On the Ground: Jamaica’s Crimebot by Samuel Lee.

From the post:

Some areas of Jamaica, particularly major cities such as Kingston and Montego Bay, experience high levels of crime and violence. If you were to type, “What is Jamaica’s biggest problem” in a Google search, you’ll see that the first five results are about crime.

(image omitted)

Using data to pinpoint high crime areas

CrimeBot (www.crimebot.net) fights crime by providing crime hotspot views and sending out alerts based on locations through mobile devices. By allowing citizens to submit information about suspicious activity in real-time, CrimeBot also serves as a tool to fight back against crime and criminals. As its base of users grow and information expands, CrimeBot can more accurately pinpoint areas of higher crime frequency for informed and improved public safety. Developed by a team in Jamaica, CrimeBot improves the “neighborhood watch” concept by applying mobile technology to information dissemination and real-time data collection. A Google Hangout discussion with CrimeBot team member Dave Oakley can be viewed through this link.

Data collection technology that helps reduce violence and crime

The CrimeBot team – Kashif Hewitt, Dave Oakley, Aldrean Smith, Garth Thompson – came together in the lead up to a Caribbean apps competition in Jamaica called Digital Jam 3, in which CrimeBot was awarded the top prize. Prior to entering the contest, the group researched the most pressing issues in the Caribbean and Jamaica, which turned out to be violence and crime.

(image omitted)

The team decided to help Jamaicans fight and reduce crime by taking a deeper look at international statistics and conducting interviews with potential users of the app among friends and other contacts. In just 19 days of development, the team took CrimeBot from concept to working prototype.

The team discovered that 58% of crimes around the world go unreported. Interviews with potential users of the app revealed that many would-be tipsters feared for their safety, lacked confidence in local authorities, or preferred to take matters into their own hands. To counter some of these barriers, CrimeBot offers an anonymous way to report crime. While this doesn’t directly solve crimes, CrimeBot provides law enforcement officials with better data, intelligence, and affords citizens and tourists greater protection through preventative measures.

Crimebot is of particular interest because it includes unreported crimes, which don’t show up in maps constructed on the basis of arrests.

One can imagine real time crime maps at a concierge desk with offers from local escort (in the traditional sense) services.

Or when merged with other records, the areas with the lowest conviction rates and/or prison sentences.

The project also has a compelling introduction video:

November 6, 2014

EU commits €14.4m to support open data across Europe

Filed under: EU,Open Data — Patrick Durusau @ 2:47 pm

EU commits €14.4m to support open data across Europe by Samuel Gibbs.

From the post:

The European Union has committed €14.4m (£11m) towards open data with projects and institutions led by the Open Data Institute (ODI), Southampton University, the Open University and Telefonica.

The funding, announced today at the ODI Summit being held in London, is the largest direct investment into open data startups globally and will be used to fund three separate schemes covering startups, open data research and a new training academy for data science.

“This is a decisive investment by the EU to create open data skills, build capabilities, and provide fuel for open data startups across Europe,” said Gavin Starks, chief executive of the ODI, a not-for-profit organisation based in London co-founded by the inventor of the world wide web, Sir Tim Berners-Lee. “It combines three key drivers for open adoption: financing startups, deepening our research and evidence, and training the next generation of data scientists, to exploit emerging open data ecosystems.”

Money from the €14.4m will be divided into three sections. Through the EU’s €80 billion Horizon 2020 research and innovation funding, €7.8m will be used to fund the 30-month Open Data Incubator for Europe (ODInE) for open data startups modelled on the ODI’s UK open data startup incubator that has been running since 2012.

Take a look at the Open Data Institute’s Startup page.

BTW, on the list of graduates, the text of the links for Provenance and Mastodon C is correct, but the underlying hyperlinks, http://theodi.org/start-ups/www.provenance.it and http://theodi.org/start-ups/www.mastodonc.com respectively, are incorrect.

With the correct underlying hyperlinks:

Mastodon C

Provenance

I did not check the links for the current startups. I did run the W3C Link Checker on http://theodi.org/start-ups and got some odd results. If you are interested, see what you think.
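
For anyone who wants to repeat the exercise programmatically, here is a rough sketch (assuming the requests and beautifulsoup4 packages are installed; the heuristic is mine, not the W3C Link Checker's) that flags hrefs which look like external "www." addresses accidentally resolved relative to the page:

  import requests
  from bs4 import BeautifulSoup

  def suspicious_links(page_url):
      """Yield (text, href) pairs whose href embeds 'www.' after a path segment,
      e.g. http://theodi.org/start-ups/www.provenance.it."""
      html = requests.get(page_url, timeout=30).text
      for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
          href = a["href"]
          if href.startswith("www.") or "/www." in href.split("//", 1)[-1]:
              yield a.get_text(strip=True), href

  for text, href in suspicious_links("http://theodi.org/start-ups"):
      print(text, "->", href)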

Sorry, I got diverted by the issues with the Open Data Institute site.

Among other highlights from the article:

A further €3.7m will be used to fund 15 researchers into open data posed with the question “how can we answer complex questions with web data?”.

You can puzzle over that one on your own.

November 5, 2014

Good Open Data. . . by design

Filed under: Data Governance,Data Quality,Open Data — Patrick Durusau @ 8:07 pm

Good Open Data. . . by design by Victoria L. Lemieux, Oleg Petrov, and, Roger Burks.

From the post:

An unprecedented number of individuals and organizations are finding ways to explore, interpret and use Open Data. Public agencies are hosting Open Data events such as meetups, hackathons and data dives. The potential of these initiatives is great, including support for economic development (McKinsey, 2013), anti-corruption (European Public Sector Information Platform, 2014) and accountability (Open Government Partnership, 2012). But is Open Data’s full potential being realized?

A news item from Computer Weekly casts doubt. A recent report notes that, in the United Kingdom (UK), poor data quality is hindering the government’s Open Data program. The report goes on to explain that – in an effort to make the public sector more transparent and accountable – UK public bodies have been publishing spending records every month since November 2010. The authors of the report, who conducted an analysis of 50 spending-related data releases by the Cabinet Office since May 2010, found that the data was of such poor quality that using it would require advanced computer skills.

Far from being a one-off problem, research suggests that this issue is ubiquitous and endemic. Some estimates indicate that as much as 80 percent of the time and cost of an analytics project is attributable to the need to clean up “dirty data” (Dasu and Johnson, 2003).

In addition to data quality issues, data provenance can be difficult to determine. Knowing where data originates and by what means it has been disclosed is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes. Establishing data provenance does not “spring full blown from the head of Zeus.” It entails a good deal of effort undertaking such activities as enriching data with metadata – data about data – such as the date of creation, the creator of the data, who has had access to the data over time and ensuring that both data and metadata remain unalterable.
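
A minimal sketch of the kind of provenance record the post describes (creator, creation date, chain of custody), with a digest of the data so later alterations can be detected. The field names are hypothetical, a simplified stand-in for a real standard such as W3C PROV:

  import hashlib
  import json

  def provenance_record(data_bytes, creator, created, custody_events):
      """Bundle basic provenance metadata with a hash of the data itself,
      so any later alteration of the data can be detected."""
      return {
          "creator": creator,
          "created": created,                  # e.g. "2010-11-01"
          "chain_of_custody": custody_events,  # list of (date, actor, action)
          "sha256": hashlib.sha256(data_bytes).hexdigest(),
      }

  # A tiny stand-in for a published spending dataset.
  data = b"date,supplier,amount\n2010-11-01,Acme Ltd,12500\n"
  record = provenance_record(
      data,
      creator="Cabinet Office",
      created="2010-11-01",
      custody_events=[("2010-11-01", "Cabinet Office", "published")],
  )
  print(json.dumps(record, indent=2))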

What is it worth to you to use good open data rather than dirty open data?

Take the cost of your analytics projects for the past year and multiply it by eighty (80) percent: that is a rough estimate of what dirty data cost you. The actual share will vary from project to project, but did that result get your attention?

If so, contact your sources for open data and lobby for clean open data.

PS: You may find the World Bank’s Open Data Readiness Assessment Tool useful.

November 4, 2014

Open Data 500

Filed under: Open Data — Patrick Durusau @ 8:24 pm

Open Data 500

From the webpage:

The Open Data 500, funded by the John S. and James L. Knight Foundation and conducted by the GovLab, is the first comprehensive study of U.S. companies that use open government data to generate new business and develop new products and services.

The full list is likely the most useful resource at this site. You can filter by subject area and/or by the federal agency supplying the data.

A great place to look for gaps in data-based products and/or to see which areas are already being served.

I first saw this in a tweet by Paul Rissen.

October 31, 2014

Enhancing open data with identifiers

Filed under: Linked Data,Open Data — Patrick Durusau @ 12:07 pm

Enhancing open data with identifiers

From the webpage:

The Open Data Institute and Thomson Reuters have published a new white paper, explaining how to use identifiers to create extra value in open data.

Identifiers are at the heart of how data becomes linked. It’s a subject that is fundamentally important to the open data community, and to the evolution of the web itself. However, identifiers are also in relatively early stages of adoption, and not many are aware of what they are.

Put simply, identifiers are labels used to refer to an object being discussed or exchanged, such as products, companies or people. The foundation of the web is formed by connections that hold pieces of information together. Identifiers are the anchors that facilitate those links.

This white paper, ‘Creating value with identifiers in an open data world’ is a joint effort between Thomson Reuters and the Open Data Institute. It is written as a guide to identifier schemes:

  • why identity can be difficult to manage;
  • why it is important for open data;
  • what challenges there are today and recommendations for the community to address these in the future.

Illustrative examples of identifier schemes are used to explain these points.

The recommendations are based on specific issues found to occur across different datasets, and should be relevant for anyone using, publishing or handling open data, closed data and/or their own proprietary data sets.

Are you a data consumer?
Learn how identifiers can help you create value from discovering and connecting to other sources of data that add relevant context.

Are you a data publisher?
Learn how understanding and engaging with identifier schemes can reduce your costs, and help you manage complexity.

Are you an identifier publisher?
Learn how open licensing can grow the open data commons and bring you extra value by increasing the use of your identifier scheme.

The design and use of successful identifier schemes requires a mix of social, data and technical engineering. We hope that this white paper will act as a starting point for discussion about how identifiers can and will create value by empowering linked data.
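
To make the point concrete, here is a minimal sketch of how a shared identifier scheme lets two independently published datasets be joined (the identifiers and figures below are invented for illustration):

  # Two open datasets that share nothing except a common company identifier.
  registrations = {
      "COMP-001": {"name": "Acme Ltd", "incorporated": "1998-04-02"},
      "COMP-002": {"name": "Widget plc", "incorporated": "2003-11-17"},
  }
  spending = [
      {"supplier_id": "COMP-001", "amount_gbp": 12500},
      {"supplier_id": "COMP-002", "amount_gbp": 48000},
      {"supplier_id": "COMP-001", "amount_gbp": 3100},
  ]

  # The identifier is the anchor: join spending records to company details.
  for row in spending:
      company = registrations[row["supplier_id"]]
      print(company["name"], row["amount_gbp"])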

Read the blog post on Linked data and the future of the web, from Chief Enterprise Architect for Thomson Reuters, Dave Weller.

When citing this white paper, please use the following text: Open Data Institute and Thomson Reuters, 2014, Creating Value with Identifiers in an Open Data World, retrieved from thomsonreuters.com/site/data-identifiers/

Creating Value with Identifiers in an Open Data World [full paper]

Creating Value with Identifiers in an Open Data World [management summary]

From the paper:

The coordination of identity is thus not just an inherent component of dataset design, but should be acknowledged as a distinct discipline in its own right.

A great presentation on identity and management of identifiers, echoing many of the themes discussed in topic maps.

A must read!

Next week I will begin a series of posts on the individual issues identified in this white paper.

I first saw this in a tweet by Bob DuCharme.

October 23, 2014

Rich Citations: Open Data about the Network of Research

Filed under: Citation Practices,Open Data,PLOS — Patrick Durusau @ 7:13 pm

Rich Citations: Open Data about the Network of Research by Adam Becker.

From the post:

Why are citations just binary links? There’s a huge difference between the article you cite once in the introduction alongside 15 others, and the data set that you cite eight times in the methods and results sections, and once more in the conclusions for good measure. Yet both appear in the list of references with a single chunk of undifferentiated plain text, and they’re indistinguishable in citation databases — databases that are nearly all behind paywalls. So literature searches are needlessly difficult, and maps of that literature are incomplete.

To address this problem, we need a better form of academic reference. We need citations that carry detailed information about the citing paper, the cited object, and the relationship between the two. And these citations need to be in a format that both humans and computers can read, available under an open license for anyone to use.

This is exactly what we’ve done here at PLOS. We’ve developed an enriched format for citations, called, appropriately enough, rich citations. Rich citations carry a host of information about the citing and cited entities (A and B, respectively), including:

  • Bibliographic information about A and B, including the full list of authors, titles, dates of publication, journal and publisher information, and unique identifiers (e.g. DOIs) for both;
  • The sections and locations in A where a citation to B appears;
  • The license under which B appears;
  • The CrossMark status of B (updated, retracted, etc);
  • How many times B is cited within A, and the context in which it is cited;
  • Whether A and B share any authors (self-citation);
  • Any additional works cited by A at the same location as B (i.e. citation groupings);
  • The data types of A and B (e.g. journal article, book, code, etc.).

As a demonstration of the power of this new citation format, we’ve built a new overlay for PLOS papers, which displays much more information about the references in our papers, and also makes it easier to navigate and search through them. Try it yourself here: http://alpha.richcitations.org.

The suite of open-source tools we’ve built makes it easy to extract and display rich citations for any PLOS paper. The rich citation API is available now for interested developers at http://api.richcitations.org.

If you look at one of the test articles, such as Jealousy in Dogs, the potential of rich citations becomes immediately obvious.

Perhaps I was reading “… the relationship between the two…” a bit too much like an association between two topics. It’s great to know how many times a particular cite occurs in a paper, when it is a self-citation, etc., but that is a long way from attaching properties to an association between two papers.
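
For what it is worth, here is a minimal sketch (hypothetical field names, not PLOS's actual schema) of a rich citation modelled as an association between two papers, with the properties the post lists attached to the relationship rather than to either paper:

  from dataclasses import dataclass, field

  @dataclass
  class RichCitation:
      """One citing/cited pair, carrying properties of the relationship itself."""
      citing_doi: str
      cited_doi: str
      locations: list = field(default_factory=list)  # sections where the cite appears
      count: int = 0                                  # times cited within the paper
      self_citation: bool = False                     # do the papers share authors?
      cited_license: str = ""                         # license of the cited object
      cited_type: str = "journal-article"             # article, book, code, data, ...

  # The DOIs below are made up for illustration.
  cite = RichCitation(
      citing_doi="10.1371/journal.pone.0000001",
      cited_doi="10.1371/journal.pone.0000002",
      locations=["methods", "results", "conclusions"],
      count=9,
      self_citation=False,
  )
  print(cite)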

On the up side, however, PLOS already has 10,000 papers with “smart cites,” with more on the way.

A project to watch!

September 9, 2014

PLOS Resources on Ebola

Filed under: Bioinformatics,Open Access,Open Data — Patrick Durusau @ 7:09 pm

PLOS Resources on Ebola by Virginia Barbour and PLOS Collections.

From the post:

The current Ebola outbreak in West Africa probably began in Guinea in 2013, but it was only recognized properly in early 2014 and shows, at the time of writing, no sign of subsiding. The continuous human-to-human transmission of this new outbreak virus has become increasingly worrisome.

Analyses thus far of this outbreak mark it as the most serious in recent years and the effects are already being felt far beyond those who are infected and dying; whole communities in West Africa are suffering because of its negative effects on health care and other infrastructures. Globally, countries far removed from the outbreak are considering their local responses, were Ebola to be imported; and the ripple effects on the normal movement of trade and people are just becoming apparent.

A great collection of PLOS resources on Ebola.

Even usually closed sources are making Ebola information available for free:

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak (Science DOI: 10.1126/science.1259657). This is the gene sequencing report that establishes that one (1) person who ate infected bush meat is the source of all the following Ebola infections.

So much for needing highly specialized labs to “weaponize” biological agents. One infection is likely to result in > 20,000 deaths. You do the math.

I first saw this in a tweet by Alex Vespignani.

August 28, 2014

Thou Shalt Share!

Filed under: Bioinformatics,Genome,Genomics,Open Data — Patrick Durusau @ 1:46 pm

NIH Tells Genomic Researchers: ‘You Must Share Data’ by Paul Basken.

From the post:

Scientists who use government money to conduct genomic research will now be required to quickly share the data they gather under a policy announced on Wednesday by the National Institutes of Health.

The data-sharing policy, which will take effect with grants awarded in January, will give agency-financed researchers six months to load any genomic data they collect—from human or nonhuman subjects—into a government-established database or a recognized alternative.

NIH officials described the move as the latest in a series of efforts by the federal government to improve the efficiency of taxpayer-financed research by ensuring that scientific findings are shared as widely as possible.

“We’ve gone from a circumstance of saying, ‘Everybody should share data,’ to now saying, in the case of genomic data, ‘You must share data,’” said Eric D. Green, director of the National Human Genome Research Institute at the NIH.

A step in the right direction!

Waiting for other government funding sources and private funders (including in the humanities) to take the same step.

I first saw this in a tweet by Kevin Davies.
