A blog post by Karen Payne (WDS-ITO Associate Director)
I would like to bring to your attention the following white paper, recently published by David Castle (WDS-SC member), Mark Leggott (Executive Director, Research Data Canada), and me. This paper is one of a set collected by Canada's New Digital Research Infrastructure Organization (NDRIO) as part of its needs assessment and strategic planning activities. We believe it is also of interest to the WDS community:
– The need for Canada to differentiate the national government’s role from that of commercial providers;
– The need to meet researchers where they are, by taking stock of the tools they already use; and most importantly,
– The need to support international coordination mechanisms such as the World Data System.
The article concludes that “a principled approach to building scientific infrastructure will best serve Canada and our international partners. No single community or country can address every consideration in the [digital research infrastructure] landscape, making it incumbent upon NDRIO to coordinate with international scientific federations as they marshal their strengths to address global challenges.”
A blog post by Karen Payne (WDS-ITO Associate Director)
Trustworthy Data Repositories (TDRs) are a key pillar within the Global Open Research Commons (GORC), utilized by researchers as they address societal grand challenges such as climate change, pandemics, and poverty. The realized vision of the GORC will provide frictionless access to all research resources, including data, publications, software, and compute resources; plus the metadata, vocabulary, and identification services that enable their discovery and use by humans and machines. Part of the mission of the WDS International Technology Office (WDS-ITO) is to ensure that WDS Members are well represented in the coordination bodies, infrastructures, and functional pipelines that connect TDRs, analytics, and computing resources, globally. As part of this work, the WDS-ITO has taken a leadership role within the Research Data Alliance’s (RDA’s) GORC Interest Group (IG), and the GORC Working Group (WG) on International Benchmarking.
The GORC IG is working on a set of deliverables to support coordination amongst organizations that are building commons, including a roadmap for global alignment to help set priorities for commons development and integration. In support of this roadmap, the GORC Benchmarking WG will develop and collect a set of benchmarks that organizations can use to internally measure their user engagement and development, gauge their maturity, and compare features across commons.
This WG is motivated by the broader goal of openly sharing data and related services across technologies, disciplines, and countries. The deliverables of the WG will inform roadmaps for development of the infrastructure necessary to meet this goal, while forging strong partnerships across the national-, regional-, and domain-focused commons that will be crucial to its success. Observable and measurable benchmarks will help create a tangible path for development, support strategic planning across science commons infrastructures, and help build a commons that is globally interoperable. They will also support developers as they seek resources to build the GORC, by helping them respond to funding agency requirements for measurable deliverables. WDS Members are a key component of this vision.
The work will build on that of previous RDA groups with which some WDS Members have been or are currently involved, such as the National Data Services IG, the Domain Repositories IG, the Data Fabric IG, and the Virtual Research Environment IG. These groups, and many others outside of RDA, have produced recommendations that speak to the functionality and features of various components of commons: for example, the re3data.org schema for collecting information on research data repositories for registration, and the European Open Science Cloud’s (EOSC’s) FAIR and Sustainability WGs, which seek to define the EOSC as a Minimum Viable Product. We will review these and other related outputs to see whether they have identified benchmarks that will support our goals.
The GORC International Benchmarking WG Case Statement is open for public review until Monday, 8 February 2021, and we have submitted a session proposal for RDA’s 17th Plenary Meeting to be held in April. We invite all WDS Members to provide comments or get involved in the RDA GORC IG and WG. If you have any questions, do not hesitate to reach out to Karen Payne (ito-director[at]oceannetworks[dot]ca) at the WDS-ITO. We would love to talk to you!
This article was written by Michael M. Crow and Greg Tananbaum to emphasize how Open Science contributed to the fight against the COVID-19 pandemic, and the importance of 'openness' not only for science and the economy but also for society. The article concludes with a paragraph that strongly resonates with the work of the World Data System:
There are hurdles to widespread adoption of open science practices, to be sure. Researchers need proper training on data management plans, reuse licensing and other good open science hygiene. Infrastructure must be developed and nurtured to preserve scientific data, curate it and render it actionable. And organizations must overcome their natural entropy, which makes tackling big, cross-cutting initiatives like open science challenging. While these obstacles are nontrivial, they are small in comparison to the scientific, economic, and societal benefits of open. In a moment of great peril, maintaining the status quo will ultimately prove more costly.
A blog post by Robert R. Downs (Senior Digital Archivist and Senior Staff Associate Officer of Research, Center for International Earth Science Information Network)
The recent EOS opinion article, Data Sets Are Foundational to Research. Why Don’t We Cite Them?, reflects the perspectives of a team of data stewards from five Distributed Active Archive Centers of NASA’s Earth Observing System Data and Information System. A salient issue that the authors emphasize in that piece is the need to properly cite data, since data citation has not yet been adopted as a norm across the Earth sciences. The authors have observed that data citation has increased during the past decade, but they hold higher expectations for community adoption of data citation practices and, in particular, of proper data citation practices. A renewed effort to promote data citation practices is necessary to remind the research community of the need to properly cite data used in the preparation of a journal article, report, or other publication.
It is gratifying to see that there has been progress in the adoption of data citation practices. Given its slow pace, however, we need to improve communication about the importance of data citation and the benefits that proper data citation offers all stakeholders. For example, at the risk of oversimplifying such benefits, we might say that, as a result of proper data citation practices:
– Article authors can inform readers about the data used in their work.
– Article readers can access the data used in studies of interest.
– Journal editors can model ethical publication practices.
– Data producers can be recognized for sharing their data.
– Data repositories and their host institutions can measure their effectiveness.
– Promotion committees can assess the research contributions of colleagues.
– Sponsors can see how their investments have been leveraged for scientific progress.
Some efforts have reinforced the importance of proper data citation along with techniques for how article authors can cite data. Data repositories display recommended data citations on data landing pages and in metadata and documentation. Journal editors have begun to require authors to cite the data that are used in the preparation of publications. But more must be done to inform colleagues about the importance of data citation so that the adoption of proper data citation practices becomes the norm when publishing research reports. Too often, data are not cited, or are cited without the necessary attribution information to enable the data to be accessed. Furthermore, references to data are not always included in the bibliography section of a publication, or the data reference that is included in the bibliography is incorrect or incomplete.
In many ways, citing data is similar to citing articles. Like articles, the bibliographic reference for data citations should include the following six elements to describe the data that have been used for a publication:
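Although the six elements themselves are not reproduced here, DataCite's recommended citation format (creator, publication year, title, version, publisher, and persistent identifier) gives a sense of what such a reference contains. As a sketch only, a dataset reference could be recorded as a BibTeX entry like the one below; every field value, including the DOI, is hypothetical:

```bibtex
% Illustrative only -- all values below are hypothetical.
% Fields follow DataCite's recommended citation elements:
% creator, publication year, title, version, publisher, identifier.
@misc{smith_seaice_2020,
  author       = {Smith, Jane and Chen, Wei},
  year         = {2020},
  title        = {Arctic Sea Ice Extent Observations},
  note         = {Version 2.1, dataset},
  howpublished = {Hypothetical Polar Data Archive},
  doi          = {10.0000/example-doi}
}
```

Whatever the reference format, the essential point is that all of the elements are present and that the identifier resolves to the data's landing page.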
Hopefully, the adoption of proper data citation practices will become more prevalent across the Earth science community, as well as within other research communities, as the research culture continues to evolve. Promoting and serving as exemplars for proper data citation practices could help to encourage others to properly cite data in their publications.
In 2016, the International Science Council’s Committee on Data (CODATA) and World Data System (WDS) formed a Task Group (TG) on the Validation, Curation, and Management of Citizen Science and Crowdsourced Data. The objectives of the TG were to better understand the ecosystem of data-generating citizen science, crowdsourcing, and volunteered geographic information (hereafter ‘citizen science’ or simply ‘CS’) to characterize the potential and challenges of these developments for science. This blog post represents a summary of findings from the resulting open access journal article ‘Still in Need of Norms: The State of the Data in Citizen Science’.
I served as co-chair of the TG, and currently serve as co-chair of its successor—which focuses on citizen science data for the Sustainable Development Goals (SDGs). My interest in this topic was piqued when, as chair of an earlier CODATA TG on Global Roads Data Development, I oversaw a validation exercise for OpenStreetMap (OSM) data in West Africa. OSM has since become an authoritative data source used by many countries and organizations, yet until recently it was considered incomplete outside of Europe and North America, and there were concerns about its spatial accuracy. Given the rapid rise in citizen science, I was curious to understand perspectives in the scientific community on the validity of these data streams, and to explore how the citizen science community had addressed data validation and data management issues. I was joined on the TG by enthusiastic colleagues, including Elaine Faustman (Vice-chair of the WDS-SC) and Rorie Edmunds (Acting WDS Executive Director). In addition, we hired a consultant, Anne Bowser of the US-based Woodrow Wilson Center, to coordinate a survey of citizen science efforts and compile the results.
The survey covered 36 CS efforts, a reasonably representative sample of projects across thematic areas (biodiversity, environmental justice, health, etc.), methods (extreme CS to crowdsourcing), and regions. Survey participants were recruited through a purposive landscape-sampling scheme designed to capture data management practices from a wide range of initiatives. While this is not a statistically representative sample, it is sufficient to gain an overview of the ‘state of the data’ in citizen science.
We asked our participants to describe the full range of data collection or processing tasks that were used in their citizen science research. For data management, we asked about quality assurance/quality control (QA/QC) processes, including those related to data collection, as well as volunteer training; instrument control, such as the use of a standardized instrument; and data verification/validation strategies, such as voucher collection (e.g., through a photo or specimen) or expert review. We asked questions on data access, including whether access to analyzed, aggregated, and/or raw data were provided, and how data discovery and dissemination were supported (if at all).
Results suggest that QA/QC methods are widespread, and include practices such as expert review, crowdsourced review, voucher collection, statistical identification of outliers, and replication or calibration across volunteers. Fourteen projects removed data considered suspect or unreliable, while nine contacted volunteers to get additional information on questionable data. A high percentage of projects conduct volunteer training before data collection and/or on an ongoing basis. Domain data standards are also commonly applied. Overall, CS data practices appear to be approaching best practices of science more generally, but a strong onus rests on CS data developers to ‘prove’ that their data are robust, in part because of past suspicions that citizen science data were not of research quality. Those concerns are waning. As mentioned, OSM roads and settlements data are now considered to be the best available open access data for humanitarian decision-making and research applications, and ornithological data from eBird are integrated into the Global Biodiversity Information Facility (GBIF)—a WDS Regular Member.
In terms of data infrastructure, many projects adopt existing data collection applications and online communities, such as iNaturalist, BioCollect, CitSci.org, and Spotteron; leverage existing crowdsourcing platforms, such as Zooniverse or OSM; or develop their own fit-for-purpose platform with robust infrastructure, often including backups and redundancies. Outsourcing infrastructure is common, but for some smaller projects, project principals on the science side often seemed largely unaware of the backend systems in place. Data security practices are generally appropriate, and different projects have different approaches to dealing with personally identifiable information (PII) on volunteers—ranging from anonymizing the data to adopting a social network approach where only fellow members of the community could access PII, to having volunteers opt in to share personal information.
Documentation of CS data can generally be improved. Thirteen respondents mentioned publishing information about the methodology or protocol, while eight documented limitations. Five projects offered fitness-for-use statements or use cases. A few projects offered simple disclaimers, such as ‘data are provided as is’.
Of perhaps greatest interest to the WDS community were the responses on data access. Eighteen projects make their data discoverable through the project website, 10 projects make data available through a topical or field-based repository (such as GBIF), 8 projects share their data through an institutional repository, 4 through a public sector data repository, and 2 through a publication-based repository. Only nine projects do not easily enable secondary users to find their data. Fourteen projects publish open data, and 13 offer data upon request, including by emailing the principal investigator. Six projects disseminate the data openly but require processes, such as creating user accounts, that effectively prohibit automated access, and seven projects state that their data are never available.
Persistent identifiers and data licensing are critical aspects of FAIR (Findable, Accessible, Interoperable, Reusable) data practices. Eleven projects offered Digital Object Identifiers to support reuse and citation; the other 25 either do not offer one, or participants did not know if they have them. Only 16 projects have a standardized license to support data reuse, most often Creative Commons (CC) licenses such as CC-BY and CC-BY-SA (which require attribution), but also the CC0 public domain dedication. A few apply non-commercial use restrictions. However, 18 project representatives could not identify any standardized license for their data, and two participants did not know whether their project had a license.
In most cases, raw and processed (cleaned, aggregated, summarized, visualized) data are provided by projects, but some projects only provide access to processed data. This takes a variety of forms. Nineteen projects share findings through project publications or whitepapers, while 16 share findings through peer-reviewed publications. Many projects note that scholarly publication was ‘a longer-term goal’. Only six projects provide no access to analyzed data. Sixteen projects provide tools for user-specified queries or downloads (with several also providing Application Programming Interfaces for machine queries); 14 make data available through web services or data visualizations, including maps; 10 offer bulk download options; and 5 provide custom analyses or services.
What do these findings mean?
A fundamental rationale for improving data management practices in citizen science is to ensure the ability of citizens, scientists, and policymakers to reuse the data for scientific research or policy purposes. While citizen science has emerged as a promising means to collect data on a massive scale and is maturing in regard to data practices, there is still much progress to be made in approaches to the data lifecycle, from acquisition to management to dissemination. Science as a whole often has difficulty keeping up with new norms, and CS is no different. Yet lags in CS may also reflect a lack of resources, particularly for smaller or startup citizen science efforts that struggle to maintain staff and funding and that perhaps find data management falls to the bottom of the to-do list. They may also reflect the passions of the first movers in the citizen science space—who were motivated by environmental justice concerns or scientific discovery, and were less concerned about long-term data stewardship.
The characterization of data practices in the journal article (and this blog post) is not intended as a criticism of the field, but rather an effort to identify areas where improvements are needed and to issue a call to action for greater maturation. In this spirit, my co-authors and I offer the following recommendations:
Data quality: While significant QA/QC checks are taken across the data lifecycle, these are not always documented in a standardized way. Citizen science practitioners should document their QA/QC practices on project websites and/or through formal QA/QC plans. Researchers seeking to advance the field could help develop controlled vocabularies for articulating common data quality practices that can be included in metadata for datasets and/or observations.
Data infrastructure: Citizen science practitioners should consider leveraging existing infrastructures across the data lifecycle, such as for data collection and data archiving (e.g., in large and stable data aggregation repositories). Researchers seeking to advance the field should fully document supporting infrastructures to make their strengths and limitations transparent and to increase their utility, as well as develop additional supporting infrastructures as needed.
Data documentation: Citizen science practitioners should make discovery metadata (structured descriptive information about datasets) available through data catalogues, and should share information on methods used to develop datasets on project websites. Researchers seeking to advance the field could develop controlled vocabularies for metadata documentation, particularly to enable fitness-for-purpose assessments.
Data access: In addition to discovery metadata, citizen science practitioners should select and use one or more open, machine-readable licenses, such as CC licenses. Researchers seeking to advance the field should identify, share information about, and (if necessary) develop long-term infrastructures for data discovery and preservation.
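To make the discovery metadata recommendation above more concrete: one widely used approach is to publish a schema.org Dataset description in JSON-LD on the project website, which data catalogues and search engines can harvest. The fragment below is a sketch only; the project name, URLs, and identifier are all hypothetical:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Community Stream Temperature Observations",
  "description": "Volunteer-collected stream temperature readings (hypothetical example).",
  "creator": { "@type": "Organization", "name": "Example Watershed Watch" },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "identifier": "https://doi.org/10.0000/example-doi",
  "keywords": ["citizen science", "water quality"],
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/stream-temps.csv"
  }
}
```

A record like this bundles several of the recommendations at once: discovery metadata, a machine-readable license, and a persistent identifier, all in a form that machines can index.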
The CODATA-WDS TG continues its work under a new banner, ‘Citizen Science Data for the SDGs’. A major focus is on citizen science data collection in Africa, a region that is lagging in the achievement of many SDGs—particularly those related to water, sanitation, and the urban environment.
The main focus of the report is to understand the training needs of public sector research that is becoming digitized as scientific disciplines evolve, data management becomes more prevalent and rigorous, and open science continues to be a call to action and emerging practice. Digitization of research across all disciplines has also attracted digital infrastructure and cybersecurity investments. At the same time, however, digitization both drives research competitiveness in new directions for scientists and demands greater expertise in new competencies for research support personnel. Is everyone keeping up with the pace of change?
The answer to this question is mixed, in three main ways. First, as the Venn diagram in the report (Fig. 1) illustrates, researchers and support personnel working in data-intensive sciences hold roles and responsibilities with distinct functional titles, but their competencies overlap. Illustrative examples from several case studies show that roles and competencies have been changing for some time. Second, because the mix of capacity and competencies is a moving target, the present challenge is to identify the skills needed as the composition of the research workforce changes. Third, training has been lagging behind the front wave of digitizing research, leaving skills gaps that may be ignored or go unnoticed.
The Expert Group convened by the OECD GSF, on which I served as a member, recognizes that not every OECD member state (or non-member, for that matter) has acknowledged the challenge of building workforce capacity and digital skills at the same pace, or with the same level of resource commitment. A ‘digital workforce capacity maturity model’ was developed to capture this diversity (Fig. 2). It serves as a rough indicator of which training is needed most urgently, according to where one lies on a spectrum of training depth.
The report also offers insights, organized initially as a matrix (Fig. 3), into who might do what to provide training. The ‘who’ are the main actors: national and regional governments; research agencies and professional science associations; research institutes and infrastructures; and universities. The ‘what’ includes a wide array of initiatives: defining needs; provisioning of training and community building; career path rewards; and broader enablers. This is more fully fleshed out in many examples from around the world, describing some of the initiatives that have been undertaken to develop training.
Recommendations are made for the various actors, and the report takes special note of what can and should be done at research universities and their associated libraries. The overall recommendation to OECD members is that policies recognizing and enabling both the need for workforce capacity growth and access to digital skills training must be embraced to maintain the competitiveness of national and internationally collaborative research, and thus achieve its highest goals.
The report was in its final stages of review and approval when the COVID-19 pandemic struck. As we observed in the conclusion of our Foreword, ’The COVID-19 pandemic highlights the importance and potential of data intensive science. All countries need to make digital skills and capacity for science a priority and they need to work together internationally to achieve this. To this end, the recommendations in this report are even more pertinent now than they were when they were first drafted in late 2019’. As we near the end of 2020, all indications are that the need to build workforce capacity and digital skills for data-intensive sciences has not only escalated, but must now address new realities, research priorities, urgent timelines for training, and challenging socioeconomic circumstances.