We would like to bring to your attention the following report, published by the OECD, which may be of interest to the WDS community:
OECD (2020), “Building digital workforce capacity and skills for data-intensive science”, OECD Science, Technology and Industry Policy Papers, No. 90, OECD Publishing, Paris, https://doi.org/10.1787/e08aa3bb-en
This report was commissioned by the OECD Global Science Forum to identify: the skills needed for data-intensive science, the challenges for building sustainable capacity as these needs evolve, and the policy actions that can be taken by different actors to address these needs. The report includes policy recommendations for various actors and good practice examples to support these recommendations, and also notes the value of international cooperation in skills capacity efforts.
A Blog post by Karen Payne (WDS-ITO Associate Director)
You spoke. We listened.
The WDS International Technology Office (WDS-ITO) was created to support Member Organizations of WDS as they develop their data repositories in the areas of data and metadata management, infrastructure, and interoperability. To respond most effectively to Member needs, last year the WDS-ITO, with the support of the WDS International Program Office, conducted a survey to evaluate your areas of interest and determine what types of projects you would like WDS to support. Our key finding was a list of potential WDS-ITO projects, ranked according to interest. You can read the report of the survey here. We discovered that the top two areas of interest were adding (1) semantic markup to metadata and (2) harvestable metadata services. In response, the WDS-ITO has secured funds from Canada’s national New Digital Research Infrastructure Organization (NDRIO) to hire two full-time staff members to work on these projects. The funding provides dedicated resources to develop collaborative partnerships among the WDS-ITO, its members, and relevant international and Canadian interest groups to increase the availability and interoperability of metadata assets globally.
Over the next year, the WDS-ITO will be working with the Research Data Alliance Research Metadata Schemas Working Group (WG) to help provide repositories with guidance and tools to add Schema.org markup to metadata. As a first step, the WDS-ITO has prototyped an online visualization tool based on a survey of current practices in using schemas to describe research datasets. The tool shows how some communities have crosswalked common metadata terms to Schema.org properties, and can be useful to repositories that are interested in knowing how other repositories are using Schema.org terms. It can also support consensus building in communities of practice that have not yet created a crosswalk between their metadata format of choice and Schema.org properties. We will continue to build on that tool, and provide other guidance to WDS Members to help make their metadata more ‘web friendly’ in the coming months.
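To make the idea of a crosswalk concrete, here is a minimal sketch in Python of how a repository might translate a flat metadata record into Schema.org Dataset markup. The mapping from Dublin Core-style terms, the function name, and the example record are all assumptions made for illustration; they are not the crosswalk of any particular community, nor the one shown in the WDS-ITO tool.

```python
import json

# Illustrative crosswalk from Dublin Core-style terms to Schema.org
# Dataset properties. This mapping is an assumption for the example,
# not a crosswalk published by any community.
DC_TO_SCHEMA_ORG = {
    "title": "name",
    "description": "description",
    "creator": "creator",
    "identifier": "identifier",
    "date": "datePublished",
    "subject": "keywords",
}


def to_schema_org(record):
    """Convert a flat metadata record into a Schema.org Dataset dict.

    Returns the JSON-LD document and the list of source terms that had
    no mapping, so gaps in the crosswalk stay visible.
    """
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}
    unmapped = []
    for term, value in record.items():
        target = DC_TO_SCHEMA_ORG.get(term)
        if target is None:
            unmapped.append(term)
        else:
            doc[target] = value
    return doc, unmapped


# Hypothetical record, invented for the example.
example_record = {
    "title": "Example geomagnetic indices",
    "creator": "Example Data Centre",
    "date": "2020-01-01",
    "coverage": "Global",
}

jsonld, unmapped_terms = to_schema_org(example_record)
print(json.dumps(jsonld, indent=2))
print("Unmapped terms:", unmapped_terms)
```

Reporting the unmapped terms separately is a deliberate choice: terms without an agreed Schema.org equivalent are exactly the ones a community of practice would need to discuss.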
As part of our support for those groups interested in harvestable metadata, the WDS-ITO has created a WG of WDS Members who are interested in standing up harvestable metadata services. This WDS Harvestable Metadata Services (HMetS) WG is co-chaired by two members of the WDS Scientific Committee: Aude Chambodut, Director of the International Service of Geomagnetic Indices in Strasbourg (WDS Regular Member) and Juanle Wang, Director of the WDC for Renewable Resources and Environment in Beijing (WDS Regular Member). The HMetS WG is coordinated by Alicia Urquidi Diaz, the WDS-ITO’s first employee! To date, eight WDS Member Representatives have expressed interest in participating in the WG, and we welcome any other Members who would like to join.
This project is designed around three objectives:
Documenting use cases: the current challenges faced by WDS Members who wish to create harvestable services, including their current infrastructure.
Helping develop implementation plans, written by Members, that define a pathway to creating harvestable metadata services.
Producing a paper that identifies lessons learned, along with guidance materials that can be used by the wider Research Data Management community.
The HMetS WG will convene regular online meetings, and bring in presenters who can speak to some of the pathways and long-term benefits of creating harvestable metadata services.
Both of the above work packages will draw on the expertise of and synchronize with ongoing research data management activities in Canada, with the ultimate goal of opening up more metadata records to the international scientific community.
You can read the NDRIO funding announcement here in English and French.
In this blog post, Varsha Khodiyar (Data Curation Manager, Research Data and New Product Development) describes why Springer Nature has endorsed the TRUST Principles and their importance to data management within the research community.
For more information on the TRUST Principles and how your organization can endorse them, please see our news article.
A Blog post by Juanle Wang (2019 WDS Scientific Committee Member)
Under the dual influences of global climate change and human activities, the frequency and intensity of natural disasters have been growing in recent years, resulting in increasingly serious disaster losses. Disaster Risk Reduction (DRR) is thus a common and urgent global challenge. Driven by the United Nations Educational, Scientific and Cultural Organization’s (UNESCO’s) DRR mission, the DRR Knowledge Service (DRRKS) System was founded under the UNESCO International Knowledge Centre for Engineering Sciences and Technology. The remit of the System is to formulate global disaster metadata standards; build a global disaster metadata database; integrate global or regional disaster data; establish disaster knowledge services; carry out disaster prevention education, training, and technology promotion; and form comprehensive technology and service capabilities.
The DRRKS System has established 16 online knowledge applications, as shown on their homepage, to mine, analyze, and visualize disaster information based on Big Data resources. In this blog post, I would like to briefly introduce two cases that are supported by Big Data technologies in remote sensing and social media mining.
Case 1: Land Degradation and Restoration Monitoring in Mongolia Using Remote Sensing 
Land degradation is an important environmental problem facing the world. ‘Land Degradation Neutrality’ is one of the core indicators of Goal 15 (Life on Land) of the United Nations Sustainable Development Goals. Mongolia is one of the areas of the world that is most affected by desertification. It is therefore of great importance to accurately comprehend the state of desertification in Mongolia to (1) prevent its further advance, (2) control desertification risks, and (3) guarantee ecological security and sustainable social development. To this end, fine-resolution (30 m) land cover datasets of Mongolia were obtained using an object-oriented method, and the land degradation and restoration patterns during 1990–2010 and 2010–2015 were analyzed (Fig. 1). For the past 25 years, the trend of land change in Mongolia has been dominated by land degradation. However, this land degradation was accompanied by ongoing restoration of some land areas in Mongolia, and the capacity for land restoration is gradually improving. The northwestern and northeastern parts of Mongolia, the areas with relatively sufficient water resources, have shown the most significant land restoration.
Figure 1: Typical regions of land degradation and land restoration during 1990–2010 in Mongolia. (a) 1990–2010 (land degradation), (b) 1990–2010 (land restoration)
Case 2: Public Sentiment Analysis of COVID-19 Events in China Using Social Media
Similar to Twitter, SINA Microblog is a social media channel on which Chinese people regularly post their opinions. Social media of this type rapidly and frequently reflect the public’s changing thoughts and emotions during an epidemic (now pandemic) such as the Novel Coronavirus Disease (COVID-19). The DRRKS team analyzed the temporal and spatial changes in microblogs referencing the (then) epidemic, and gathered the main topics being discussed by the public according to data from SINA Microblog. Through the permitted data Application Programming Interface of SINA Microblog, original messages containing the keywords “coronavirus” and “pneumonia” have been collected since 00:00 on 9 January 2020. The following information has been extracted: timestamp (i.e., the time when the message was posted), text (the message posted by a user), and location information. The DRRKS team then analyzed the Microblog messages related to the Coronavirus outbreak in terms of space and time. Temporal changes over one-hour and one-day intervals, and spatial distribution at the provincial level, have been investigated through a kernel density estimation using ArcGIS to identify hotspots of public opinion. The resulting spatial and temporal distribution of public opinion in China during the early stages of the epidemic is available in a DRRKS online application. For example, Figure 2 shows the distribution of help and donation hot spots from 9 January to 10 February.
Figure 2: Distribution of help and donation hot spots according to microblogs in China (9 January to 10 February 2020)
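The analysis steps described in Case 2 (counting messages per hour, per day, and per province, then estimating spatial density) can be sketched roughly as follows. This is only an illustration: the messages are invented, the function names are hypothetical, and a simple Gaussian kernel on raw longitude and latitude stands in for the ArcGIS kernel density tool that the DRRKS team actually used.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical messages: (timestamp, province, longitude, latitude).
messages = [
    ("2020-01-09 00:15", "Hubei", 114.3, 30.6),
    ("2020-01-09 00:40", "Hubei", 114.4, 30.5),
    ("2020-01-09 13:05", "Beijing", 116.4, 39.9),
    ("2020-01-10 08:30", "Hubei", 114.2, 30.7),
]


def count_by_interval(rows, fmt):
    """Count messages per time bin, where fmt is a strftime pattern."""
    bins = Counter()
    for timestamp, _province, _lon, _lat in rows:
        posted = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        bins[posted.strftime(fmt)] += 1
    return bins


hourly = count_by_interval(messages, "%Y-%m-%d %H:00")  # one-hour bins
daily = count_by_interval(messages, "%Y-%m-%d")         # one-day bins
by_province = Counter(province for _ts, province, _lon, _lat in messages)


def kernel_density(x, y, points, bandwidth=1.0):
    """Gaussian kernel density estimate at (x, y).

    Treats longitude/latitude as planar coordinates in degrees, which is
    a simplification that is only reasonable for a rough illustration.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        squared_distance = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
    return norm * total


points = [(lon, lat) for _ts, _province, lon, lat in messages]
density_near_cluster = kernel_density(114.3, 30.6, points)
density_far_away = kernel_density(100.0, 25.0, points)
print(hourly, daily, by_province)
print(density_near_cluster > density_far_away)
```

Evaluating the density on a regular grid of locations, rather than at two points, would give the kind of surface from which hot spots are read.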
If ‘Data is the new gold’ then it certainly must be managed. Science has always valued data. Scientific data are not only an output of research but also an input to new hypotheses, enabling scientific insights and driving innovation. Therefore, accountability, transparency, and verifiability of science make data preservation and sharing part of scientific integrity.
The World Data System (WDS) recently organized a training workshop for early career researchers (ECRs) on data curation and management. The workshop was held at the Institut de Physique du Globe de Paris on 6–8 November 2019. The objective of this workshop was to familiarize ECRs with the methods and jargon used in research data management, and to introduce future challenges and technological solutions in data management. In this blog post, I would like to share the key messages from the informative presentations at this workshop.
Members of the WDS Scientific Committee (WDS-SC), Programme and Technology Offices, and ECR Network presented and discussed methods to ensure that research data remain findable, accessible, interoperable, and reusable (i.e., FAIR). This was reinforced by an interactive exercise, in which the attendees spoke about their personal challenges in accessing, storing, and managing data. This discussion showed us how similar the challenges are across scientific disciplines.
The workshop began with an introduction to what ‘data’ are and to their attributes. Aude Chambodut and Alice Frémand explained the characteristics of research data such as their origin, type, size, and format. These characteristics influence data management; specifically, how to access, process, store, and reuse data. We were introduced to the data lifecycle, which refers to the sequence of stages that data go through from their initial generation to their eventual archival and/or deletion at the end of their useful life. The longevity of research data can be increased by implementing a Data Management Plan (DMP), which provides guidelines on how data are to be handled throughout their lifecycle (i.e., during and after a research project). In principle, a DMP is a prerequisite when applying for major EU funding, but Isabelle Gärtner Roer and Alice Frémand explained to us the practicalities of developing and implementing these plans in relation to our research. A robust DMP increases research efficiency, reinforces scientific integrity, and most importantly promotes innovation by improving the accessibility of data. Most universities and research institutions have platforms that provide advice and support on research data services. For an ECR, it is worth reaching out to these services to understand the recommended DMP in their research domain.
We were introduced to the resources available for rigorous data management that will ensure our research data remain FAIR. Sandy Harrison explained the value of Open Data in scientific research. She elaborated on how datasets produced from scientific work are increasingly deposited into data repositories. This is a better alternative to including them only as supplementary materials to a journal paper. Repositories provide long-term data archiving and ensure high technical standards with the possibility of updates. Moreover, publishing research data on Open Access platforms adds to their discoverability. It is important to note that not all research data need to be openly available. Data can be kept private, but the fact that the data exist, and the preconditions for accessing them, must be shared. Ensuring data accessibility must not take away credit from those who produce data. To prevent or discourage unauthorized use or commercial exploitation, it is important to disclose knowledge (data) safely. Ioana Popescu discussed the importance of copyright and licensing. Different conditions and types of Creative Commons licenses are available to ensure data providers receive due credit, to determine whether the data are available for commercial use, and so on.
On data interoperability, Elaine Faustman introduced ontologies and knowledge graphs, which define the concepts and relationships between data. Ontologies are useful for turning data into machine-readable formats, and thus connecting them to the semantic web: an extension of the World Wide Web that contains machine-readable data. Embedding semantics is advantageous, especially when working with heterogeneous data sources. Karen Payne discussed how data have increased in volume, velocity, and variety over the years (i.e., Big Data). In 2018, the International Data Corporation estimated the global data sphere had reached 33 zettabytes (1 zettabyte = 1 × 10^12 gigabytes). The volume and variety of data influence their management. To address issues with Big Data and complex computing, cloud computing resources have been developed that are delivered over the Internet. Cloud computing refers to virtual resources (such as infrastructure resources, services, and applications) orchestrated by management and automation software so they can be accessed by users on demand through self-service portals. Automatic scaling and resource allocation support these portals.
Technical barriers to data sharing include incomplete datasets or unguaranteed services, such as datasets that do not contain what they claim to! Certification standards play an important role in establishing trust, and hence in sustaining the opportunities for long-term data sharing. Rorie Edmunds presented the certification procedures and framework available for data repositories. Certification standards such as the CoreTrustSeal look at technical, organizational, and financial infrastructure, as well as legal aspects, workflows, and risk management. Depositing data into certified repositories ensures the longevity and discoverability of one’s data, in addition to providing access to recognized expertise to address technicalities. In turn, those using data from certified databases are able to verify results, know the provenance, and even give feedback to the data producer.
With an overview of the various resources available for data management, participants were asked to revisit both the DMPs they had started to create on their respective research projects, as well as the challenges identified at the beginning of the workshop. The workshop definitely helped clarify most of the concerns the attendees had expressed. Personally, it was a great learning experience, and I am grateful to have been selected for this workshop. During the few months since the workshop took place, I have become much more aware of data management within the realm of my project, and have been discussing it with my colleagues. I know that this workshop was the first WDS training event for ECRs; I am glad to have been a part of it, and would definitely recommend it to my peers. Finally, I acknowledge the work of everyone involved in the organization of the workshop. I hope that there are many more such workshops in the future, especially ones aimed at ECRs.
Isabelle Gärtner Roer and Aude Chambodut ask whether the Workshop addressed the participants’ RDM Challenges
A Blog post by Libby Liggins (2019 WDS Data Stewardship Award Winner)
For over four decades, scientists have been collecting DNA sequence data for thousands of the world’s species. In the biodiversity and eco-evolutionary sciences, these data are generated to describe new species, define their evolutionary relationships, determine the levels of dispersal among populations, and assess levels of genetic diversity across a species’ range. The rate at which we accrue these DNA sequences has increased over time as the use of genetic data has diversified, and the sequencing technologies used to decode the DNA sequences of organisms have become faster, cheaper, and much higher in throughput. As this trend continues into the future, it is anticipated that we may soon have more DNA sequences in a digital form than we have existing in the natural world.
This massive and growing data resource could now be consolidated for multiple species and populations and reused to better understand the world’s biodiversity at the genetic level. Genes are recognized as a fundamental component of the biodiversity hierarchy, but have received less attention than species- and ecosystem-level measures of biodiversity. In part, this may be because synthetic analyses of genetic data are challenging and sometimes impossible, as there has been no concerted effort towards the curation and stewardship of this valuable data resource. While funding agencies and publishers advocate deposition of DNA sequence data in open-access repositories (such as the National Center for Biotechnology Information and the European Bioinformatics Institute), they do not require the deposition of standardized metadata such as the sampling location, date, and habitat of the sampling event (Pope et al. 2015). This ‘metadata gap’ means that information essential for multispecies analyses to better understand biodiversity and evolutionary patterns across our globe has not been readily available.
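As a rough illustration of the ‘metadata gap’, the Python sketch below checks sequence records for the three kinds of metadata named above: sampling location, date, and habitat. The field names, the accession values, and the records themselves are hypothetical, chosen only for the example; they do not follow the schema of any real repository.

```python
# Field names are assumptions for this example, not a repository schema.
REQUIRED_FIELDS = ("sampling_location", "sampling_date", "habitat")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical records, invented for the example.
records = [
    {
        "accession": "EX0001",
        "sampling_location": "-36.85, 174.76",
        "sampling_date": "2015-03-02",
        "habitat": "rocky reef",
    },
    {"accession": "EX0002", "sampling_location": "", "sampling_date": None},
]

gaps = {record["accession"]: missing_metadata(record) for record in records}
for accession, missing in gaps.items():
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(accession, status)
```

A record like the second one can still be deposited as a valid DNA sequence, which is why the gap goes unnoticed until someone attempts a multispecies analysis.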
The Genomic Observatories MetaDatabase (GEOME; Deck et al. 2017) has recently provided a solution to this metadata gap. GEOME links ecologically and evolutionarily relevant metadata with DNA sequences uploaded to open-access repositories. The metadatabase incorporates the latest international standards for biodiversity and genomic data, and helps researchers store and access genetic data relevant to studies concerning large scale biodiversity and conservation problems. In conjunction with the open-access DNA sequence repositories, GEOME ensures that researchers and projects generating genetic data can adhere to the FAIR Principles (Findable, Accessible, Interoperable, Reusable; Wilkinson et al. 2016), promoting research community best-practice.
The Ira Moana Project logo. The Māori phrase Ira Moana could be interpreted as meaning ‘ocean genes’ or ‘dot in the ocean’. Both seem appropriate when thinking about the scale of DNA in the vastness of the ocean. The use of te reo Māori (Māori language) resonates with the project objectives that are uniquely New Zealand, as is the Māori language. Yet, moana is used to describe the ocean by many Pacific nations, reminding us of the connections that New Zealand’s biodiversity has with the wider Pacific region.
The Ira Moana Project has partnered with GEOME both to enable a collaborative network of researchers to adhere to these standards in community best-practice, and to deliver a searchable metadatabase for the genetic data of Aotearoa New Zealand’s marine organisms. The Project aims to build and maintain the most comprehensive national database of marine genetic data in the world, ensuring kaitiakitanga (guardianship and stewardship) and creating opportunities for data synthesis to inform New Zealand’s future research directions and conservation decisions. The Ira Moana Project builds on the success of the Diversity of the Indo-Pacific Network (DIPnet), which, through the use of GEOME and multinational collaboration, has created the largest population genetic database in the world. DIPnet consolidated over 200 genetic datasets for Indo-Pacific marine organisms, and is now delivering novel biodiversity insights for the Indo-Pacific Ocean (e.g., Crandall et al. 2018), which is the largest and one of the most threatened biogeographic regions on our globe.
The Ira Moana Project is similarly founded in concern for the marine environment. New Zealand is a marine nation—we have one of the largest exclusive maritime economic zones in the world, which sustains our marine and tourism industries, and provides significant recreational and social benefits for New Zealanders. Nationally, and as global citizens, we are under pressure to make informed decisions regarding commercial and recreational activities, and how they can be balanced with the protection of our marine ecosystems. Such decisions of environmental, economic, and societal impact need to be transparent and based on robust information, as well as including knowledge about biodiversity that stretches from ecosystems to genes. The Ira Moana Project has established that there are over 430 genetic datasets for New Zealand marine organisms, and is now working to consolidate these data for the benefit of future researchers and generations of New Zealanders.
The data lifecycle in genetic research. DNA sequence data are routinely deposited into open-access genetic data repositories (under OUTPUTS). Despite metadata being accrued at every step of research (*), starting with COLLECTION, the practice of depositing metadata into repositories such as the Genomic Observatories MetaDatabase (GEOME) is very recent. The Ira Moana Project is one of the projects using the infrastructure provided by GEOME. Stewardship of metadata alongside DNA sequence data ensures that genetic research in the biodiversity, ecological, and evolutionary sciences can be reproducible, that the genetic data can be reused, and that the provenance of the genetic data and the rights of the local communities involved in the research are maintained.
As the first national project to make use of the GEOME infrastructure, the Ira Moana Project has worked with GEOME to extend the capability of the metadatabase to additionally acknowledge Indigenous rights. It has become apparent that what is considered fair and equitable research practice within the research community may not be fair and equitable within broader society. Through collaboration with Local Contexts and Te Mana Raraunga (the Māori Data Sovereignty Network), the Ira Moana Project and GEOME are now beta-testing the capacity for researchers to add Notices (such as the Traditional Knowledge Notice; TK Notice) and new Biocultural Labels as metadata for DNA sequence data. Notices signal that there are accompanying Indigenous rights needing further attention for any responsible and equitable future use of the data. Biocultural Labels further allow the addition of provenance information and community expectations for future use based on Indigenous Data Sovereignty principles, including the CARE Principles (Collective Benefit, Authority to Control, Responsibility, Ethics) launched by the Global Indigenous Data Alliance. They thereby enable Indigenous stewardship and persistent recognition of Indigenous rights within an international framework (complying with the Nagoya Protocol to the Convention on Biological Diversity). The implementation of Notices and Biocultural Labels using GEOME infrastructure is a first for a biological resource and for genetic data, establishing new ethical standards in this research community.
Workshops and datathons for New Zealand researchers have encouraged uptake and use of the metadata infrastructure provided through the Ira Moana Project and GEOME. More than 85 researchers have now joined the Ira Moana Project Network; being part of the network means being ‘on-board’ both with the things that the Ira Moana Project is trying to achieve for New Zealand, and with the metadata standards that GEOME is accommodating for researchers worldwide. As there is a global community of researchers who generate genetic data, it will be some time before there is universal uptake of these newly recognized standards of best-practice. Nonetheless, we should be encouraged by the fact that, as a community, we have made similar transformations in our practice in the past; since the introduction of the Joint Data Archiving Policy, it has been considered standard practice to deposit genetic data into open-access repositories. As such, we anticipate that the Ira Moana Project metadatabase will continue to grow and serve New Zealanders, and that there will be increasing uptake of the services that GEOME provides to the research and wider community.
Literature cited
Crandall ED, Riginos C, Bird CE, Liggins L, Treml E, Beger M, Barber PH, Connolly SR, Cowman PF, DiBattista JD, et al. 2019. The molecular biogeography of the Indo-Pacific: Testing hypotheses with multispecies genetic patterns. Global Ecology and Biogeography. 58(5):403–418.
Deck J, Gaither MR, Ewing R, Bird CE, Davies N, Meyer C, Riginos C, Toonen RJ, Crandall ED. 2017. The Genomic Observatories Metadatabase (GEOME): A new repository for field and sampling event metadata associated with genetic samples. PLoS Biology. 15(8):e2002925.
Pope LC, Liggins L, Keyse J, Carvalho SB, Riginos C. 2015. Not the time or the place: the missing spatio‐temporal link in publicly available genetic data. Molecular Ecology. 24(15):3802–3809.
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 3