Good Data Curation: The FAIR principles

A recent workshop held at the University of Melbourne brought up the topic of good data curation practices, something I personally can't stress enough. At the time of this writing I am completing an introductory survival guide to ChIP-seq analysis, and one of the first topics I cover is the importance of collecting your experimental metadata. In fact, I dedicated an entire entry to the topic of metadata.

While I gave a few practical reasons for keeping track of your metadata, and suggested what to do if you can't find everything you need, Dr. Wilkinson has published an article on The FAIR Guiding Principles for scientific data management and stewardship that is pertinent here and worth bringing up. The point of the article also fits nicely with the need to obtain good-quality and, more importantly, relevant control data.

To summarize the article: we need a set of common rules to operate by when generating high-throughput data (or any kind, for that matter) and publishing that data in publicly available repositories. The argument is that without good record keeping, deposited data may as well be tossed in the trash: if you can't describe how the data were generated, you can't use them or publish results based on them, because the analysis could be invalidated.

I’m a little surprised that excellence in curation isn’t already standard practice, but the article makes it more or less clear that repositories are willing to sacrifice good curation for participation. In other words, if researchers are allowed to describe only the bare minimum about an experiment, more of them will deposit their data. What alarms me is the apparent general sentiment that publicly or privately funded research can be handled this way at all, without repercussions from the funding bodies. After all, the output of a lab reflects on the choice made to fund that lab.

So, to remedy this problem and help spread the word on better data curation practices, let’s have a look at the FAIR principles put forward by Dr. Wilkinson (1), along with a small illustrative sketch after the list.

The FAIR Guiding Principles

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards
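
To make this concrete, here is a minimal sketch, in R, of what a FAIR-minded metadata record for a single ChIP-seq sample might look like. Every field name and identifier below is hypothetical; a real record would follow a domain standard (R1.3) and use a registered identifier scheme such as a DOI (F1).

# A hypothetical FAIR-style metadata record for one ChIP-seq sample,
# written as a plain R list. All values are invented for illustration.
sample_metadata <- list(
  id         = "doi:10.9999/example.chipseq.001",  # F1: globally unique, persistent identifier
  data_file  = "sample001_H3K4me3.fastq.gz",       # F3: the data this record describes
  organism   = "Homo sapiens",                     # F2/R1: rich, relevant attributes
  cell_line  = "K562",
  antibody   = "anti-H3K4me3",
  control_id = "doi:10.9999/example.chipseq.002",  # I3: qualified reference to the matched input control
  protocol   = "https://example.org/protocols/chipseq-v2",  # R1.2: detailed provenance
  license    = "CC-BY-4.0",                        # R1.1: clear, accessible usage license
  date       = "2016-10-01"
)

# F4: the record itself would then be deposited in a searchable, indexed
# repository (e.g. GEO or ArrayExpress) rather than kept on a local disk.
str(sample_metadata)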


To find the original article and more reading on the topic of research funding, follow the links below.


References and more reading

  1. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018. doi: 10.1038/sdata.2016.18 (2016).
  2. http://www.thenewatlantis.com/publications/the-sources-and-uses-of-us-science-funding
  3. http://undsci.berkeley.edu/article/who_pays


ggplot2 2.2.0 coming soon!

The ggplot2 update brings another round of useful improvements.

RStudio Blog

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version 2.1.0.9001. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# devtools is needed to install packages from GitHub; uncomment if you don't have it:
# install.packages("devtools")
devtools::install_github("hadley/ggplot2")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

install.packages("ggplot2")
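
For reference, a minimal reprex might look something like the sketch below. The plot call is an invented placeholder; the point is simply to include the library call, built-in data, and the smallest amount of code that reproduces the problem.

# A minimal reproducible example (reprex) for a hypothetical plotting bug.
# It uses the built-in mpg dataset so that anyone can run it unchanged.
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
  geom_point()
# Describe the unexpected behaviour here, and include the output of
# sessionInfo() so the ggplot2 version being tested is on record.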

ggplot2 2.2.0 will be a relatively major release; see the original post for the full list of changes.

The majority of this work was carried out by Thomas Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages as well.
