WDS–ECR Data Curation and Management Workshop
A Blog post by Agneta Ghose (2019 WDS–ECR Training Workshop Participant)
If ‘Data is the new gold’, then it certainly must be managed. Science has always valued data: scientific data are not only an output of research but also an input to new hypotheses, enabling scientific insights and driving innovation. Because science depends on accountability, transparency, and verifiability, data preservation and sharing are part of scientific integrity.
The World Data System (WDS) recently organized a training workshop for early career researchers (ECRs) on data curation and management. The workshop was held at the Institut de Physique du Globe de Paris on 6–8 November 2019. Its objective was to familiarize ECRs with the methods and jargon of research data management, and to introduce future challenges and technological solutions. In this blog post, I would like to share the key messages from the workshop's informative presentations.
Members of the WDS Scientific Committee (WDS-SC), Programme and Technology Offices, and ECR Network presented and discussed methods to ensure that research data remain findable, accessible, interoperable, and reusable (i.e., FAIR). This was reinforced by an interactive exercise in which attendees spoke about their personal challenges in accessing, storing, and managing data. The discussion showed us how similar these challenges are across scientific disciplines.
The workshop began with an introduction to what ‘data’ are and what attributes they have. Aude Chambodut and Alice Frémand explained the characteristics of research data, such as their origin, type, size, and format. These characteristics influence data management; specifically, how data are accessed, processed, stored, and reused. We were introduced to the data lifecycle: the sequence of stages that data go through from their initial generation to their eventual archival and/or deletion at the end of their useful life. The longevity of research data can be increased by implementing a Data Management Plan (DMP), which provides guidelines on how data are to be handled throughout their lifecycle (i.e., during and after a research project). In principle, a DMP is a prerequisite when applying for major EU funding, but Isabelle Gärtner-Roer and Alice Frémand showed us the practicalities of developing and implementing these plans in relation to our own research. A robust DMP increases research efficiency, reinforces scientific integrity, and, most importantly, promotes innovation by improving the accessibility of data. Most universities and research institutions have platforms that provide advice and support on research data services; for an ECR, it is worth reaching out to these services to understand the recommended DMP in their research domain.
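The lifecycle described above can be sketched as an ordered sequence of stages. This is a minimal illustration only: the stage names below are commonly used labels, not an official WDS or DMP taxonomy.

```python
# The research data lifecycle as an ordered sequence of stages.
# Stage names are illustrative, not an official taxonomy.
LIFECYCLE = [
    "plan",      # write the Data Management Plan (DMP)
    "collect",   # generate or acquire the data
    "process",   # clean, transform, and document
    "analyse",   # produce results
    "preserve",  # deposit in an archive or repository
    "share",     # publish / make accessible
    "reuse",     # others (or you) build on the data
]

def next_stage(stage):
    """Return the stage that follows `stage`, or None at end of life."""
    i = LIFECYCLE.index(stage)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None
```

A DMP written at the "plan" stage specifies, for each later stage, how the data will be handled; in that sense the plan spans the whole sequence rather than any single step.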
Research Data Life Cycle. Sourced from Massey University: https://www.massey.ac.nz/massey/research/library/library-services/research-services/manage-data/manage-data_home.cfm
We were introduced to the resources available for rigorous data management that will keep our research data FAIR. Sandy Harrison explained the value of Open Data in scientific research, elaborating on how datasets produced in scientific work are increasingly deposited into data repositories. This is a better alternative to including them only as supplementary materials to a journal paper: repositories provide long-term data archiving and ensure high technical standards, with the possibility of updates. Moreover, publishing research data on Open Access platforms adds to their discoverability. It is important to note that not all research data need to be openly available. Data can be kept private, but the fact that the data exist, and the preconditions for accessing them, must be shared. Ensuring data accessibility must not take credit away from those who produce data; to prevent or discourage unauthorized use or commercial exploitation, it is important to disclose knowledge (data) safely. Ioana Popescu discussed the importance of copyright and licensing. Different types of Creative Commons licenses, with different conditions, are available to ensure that data providers receive due credit, to determine whether the data are available for commercial use, and so on.
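To make the licensing and access points above concrete, here is a minimal, hypothetical metadata record for a repository deposit. The field names loosely follow common schemas such as DataCite, but they are invented for illustration and do not match any particular repository's API.

```python
# A hypothetical deposit record; field names are illustrative only.
record = {
    "title": "Example measurement dataset",
    "creators": ["Doe, Jane"],
    "publication_year": 2019,
    "identifier": None,       # a DOI is typically minted by the repository
    "license": "CC-BY-4.0",   # Creative Commons: reuse with attribution
    "access": "open",         # could be "restricted", with stated conditions
}

def is_openly_reusable(rec):
    """Open access plus a Creative Commons license granting reuse with credit."""
    return rec["access"] == "open" and rec["license"].startswith("CC-")
```

Even a record with `"access": "restricted"` still advertises that the data exist and under what conditions they can be obtained, which is the point made above about private data.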
On data interoperability, Elaine Faustman introduced ontologies and knowledge graphs, which define concepts and the relationships between data. Ontologies are useful for turning data into machine-readable formats and thus connecting them to the semantic web: an extension of the World Wide Web that contains machine-readable data. Embedding semantics is advantageous, especially when working with heterogeneous data sources. Karen Payne discussed how data have increased in volume, velocity, and variety over the years (i.e., Big Data). In 2018, the International Data Corporation estimated that the global datasphere had reached 33 zettabytes (1 zettabyte = 10¹² gigabytes). The volume and variety of data influence their management. To address the challenges of Big Data and complex computing, cloud computing resources have been developed that are delivered over the Internet. Cloud computing refers to virtual resources (such as infrastructure, services, and applications) orchestrated by management and automation software so that users can access them on demand through self-service portals, which support automatic scaling and resource allocation.
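The knowledge-graph idea can be sketched with the data model behind the semantic web: subject–predicate–object triples, as in RDF. The identifiers below are invented for illustration; real graphs use URIs and ontology-defined predicates.

```python
# A toy knowledge graph as a set of (subject, predicate, object) triples,
# the data model underlying RDF. All names here are invented examples.
triples = {
    ("dataset:42", "hasFormat", "netCDF"),
    ("dataset:42", "measuredBy", "sensor:7"),
    ("sensor:7", "locatedIn", "Paris"),
}

def objects(subject, predicate):
    """Query the graph: all objects matching (subject, predicate, ?)."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

Because every fact has the same three-part shape, graphs from heterogeneous sources can be merged by simply taking the union of their triple sets, which is one reason semantics help with interoperability.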
Technical barriers to data sharing include incomplete datasets and unguaranteed services, such as datasets that do not contain what they claim to! Moreover, certification standards play an important role in establishing trust, and hence in sustaining opportunities for long-term data sharing. Rorie Edmunds presented the certification procedures and frameworks available for data repositories. Certification standards such as the CoreTrustSeal examine technical, organizational, and financial infrastructure, as well as legal aspects, workflows, and risk management. Depositing data into certified repositories ensures the longevity and discoverability of one's data, in addition to providing access to recognized expertise on technicalities. Conversely, those using data from certified repositories can verify results, know the data's provenance, and even give feedback to the data producer.
With an overview of the various resources available for data management, participants were asked to revisit both the DMPs they had started to create for their respective research projects and the challenges identified at the beginning of the workshop. The workshop definitely helped clarify most of the concerns the attendees had expressed. Personally, it was a great learning experience, and I am grateful to have been selected for this workshop. In the months since the workshop took place, I have become much more aware of data management within the realm of my project, and I have been discussing it with my colleagues. This workshop was the first WDS training event for ECRs; I am glad to have been a part of it and would definitely recommend it to my peers. Finally, I acknowledge the work of everyone involved in organizing the workshop. I hope there are many more such workshops in the future, especially ones aimed at ECRs.
Isabelle Gärtner-Roer and Aude Chambodut ask whether the Workshop addressed the participants’ RDM challenges