The International Science Council (ISC) and its World Data System (WDS) are extremely pleased to announce that the Oak Ridge Institute at the University of Tennessee (ORI at UT) has been chosen as the new host of the WDS International Programme Office (WDS-IPO). The new hosting is subject to the signing of a Memorandum of Understanding between ISC and ORI at UT, which is currently under ...
The International Science Council (ISC) cordially invites the WDS community to join a two-part online event (split across 25 and 26 January to cater for different time zones) that will kick off 2021 as a year of recovery and of transformations within reach. This event is open to everyone. Register here Part one: UN Roadmap for the COVID-19 Recovery 25 January, 16:00 CET (15:00 UTC) The ...
Proposals are invited for sessions at International Data Week’s SciDataCon 2021: Data to Improve our World. SciDataCon 2021 is an integral part of International Data Week 2021, which will be held both virtually and onsite in Seoul, Republic of Korea, on 8–11 November 2021. Session proposals should be submitted to: http://www.scidatacon.org/IDW2021/ . The deadline for proposals is ...
Together with the International Science Council’s Committee on Data (CODATA) and the Research Data Alliance (RDA), we are delighted to announce the launch of the new International Data Week (IDW) website , coordinated by the three convening organizations of IDW. IDW brings together data scientists, researchers, industry leaders, entrepreneurs, policymakers, and data stewards from ...
We Must Tear Down the Barriers That Impede Scientific Progress
We would like to bring to your attention the following article, published on 18 December 2020 in Scientific American, which we believe is of interest to the WDS community:
This article was written by Michael M. Crow and Greg Tananbaum to emphasize how Open Science contributed to the fight against the COVID-19 pandemic, and the importance of 'openness' not only for science and the economy but also for society. The article concludes with a paragraph that strongly resonates with the work of the World Data System:
There are hurdles to widespread adoption of open science practices, to be sure. Researchers need proper training on data management plans, reuse licensing and other good open science hygiene. Infrastructure must be developed and nurtured to preserve scientific data, curate it and render it actionable. And organizations must overcome their natural entropy, which makes tackling big, cross-cutting initiatives like open science challenging. While these obstacles are nontrivial, they are small in comparison to the scientific, economic, and societal benefits of open. In a moment of great peril, maintaining the status quo will ultimately prove more costly.
Promoting Proper Data Citation Practices
A Blog post by Robert R. Downs (Senior Digital Archivist and Senior Staff Associate Officer of Research, Center for International Earth Science Information Network)
The recent EOS opinion article, Data Sets Are Foundational to Research. Why Don’t We Cite Them?, reflects the perspectives of a team of data stewards from five Distributed Active Archive Centers of NASA’s Earth Observing System Data and Information System. A salient issue that the authors emphasize in that piece is the need to properly cite data, since data citation has evidently not yet been adopted as a norm across the Earth sciences. The authors have observed that data citation has increased during the past decade, but they also have higher expectations for community adoption of data citation practices and, in particular, proper data citation practices. A renewed effort to promote data citation is necessary to remind the research community of the need to properly cite data used during the preparation of a journal article, report, or other publication.
It is gratifying to see that there has been progress in the adoption of data citation practices. However, when considering the slow adoption of data citation practices, we need to improve communication about the importance of data citation and the benefits that proper data citation offers to all stakeholders. For example, at the risk of oversimplifying such benefits, we might say that, as a result of proper data citation practices:
– Article authors can inform readers about the data used in their work.
– Article readers can access the data used in studies of interest.
– Journal editors can model ethical publication practices.
– Data producers can be recognized for sharing their data.
– Data repositories and their host institutions can measure their effectiveness.
– Promotion committees can assess the research contributions of colleagues.
– Sponsors can see how their investments have been leveraged for scientific progress.
Some efforts have reinforced the importance of proper data citation along with techniques for how article authors can cite data. Data repositories display recommended data citations on data landing pages and in metadata and documentation. Journal editors have begun to require authors to cite the data that are used in the preparation of publications. But more must be done to inform colleagues about the importance of data citation so that the adoption of proper data citation practices becomes the norm when publishing research reports. Too often, data are not cited, or are cited without the necessary attribution information to enable the data to be accessed. Furthermore, references to data are not always included in the bibliography section of a publication, or the data reference that is included in the bibliography is incorrect or incomplete.
In many ways, citing data is similar to citing articles. Like articles, the bibliographic reference for data citations should include the following six elements to describe the data that have been used for a publication:
– Authors (data producers).
– Complete title (including version).
– Publication date.
– Publisher (data distributor).
– Persistent identifier.
– Date accessed.
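To make the six elements concrete, the following is a minimal sketch of how a data citation might be assembled from them. The helper function, dataset title, publisher, and DOI are all hypothetical, invented for illustration; in practice, repositories usually recommend an exact citation on each dataset's landing page, and that recommendation should take precedence.

```python
def format_data_citation(authors, title, year, publisher, doi, accessed):
    """Assemble a simple data citation from the six elements listed above.

    This is an illustrative sketch, not an official citation style.
    """
    return (f"{authors} ({year}). {title}. {publisher}. "
            f"https://doi.org/{doi}. Accessed {accessed}.")

# All values below are hypothetical examples.
citation = format_data_citation(
    authors="Doe, J. and Roe, R.",
    title="Example Surface Temperature Dataset, Version 2",  # title includes version
    year=2020,
    publisher="Example Data Center",
    doi="10.1234/example",  # hypothetical persistent identifier
    accessed="2021-01-15",
)
print(citation)
```

Note that the version travels with the title and the access date is recorded explicitly, so a reader can locate the exact data as they existed when the citing work was prepared.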
For those seeking recommendations on proper data citation, guidance materials are freely available online. The Data Preservation and Stewardship Committee of the Earth Science Information Partners (WDS Partner Member) recently updated its Data Citation Guidelines for Earth Science Data with detailed explanations and examples. The Quick Guide to Data Citation, produced by the International Association of Social Science Information Services and Technology, focuses on the simplicity of a data citation.
Hopefully, the adoption of proper data citation practices will become more prevalent across the Earth science community, as well as within other research communities, as the research culture continues to evolve. Promoting and serving as exemplars for proper data citation practices could help to encourage others to properly cite data in their publications.
Current Data Practices in Citizen Science
A Blog post by Alex de Sherbinin (WDS Scientific Committee Chair)
In 2016, the International Science Council’s Committee on Data (CODATA) and World Data System (WDS) formed a Task Group (TG) on the Validation, Curation, and Management of Citizen Science and Crowdsourced Data. The objectives of the TG were to better understand the ecosystem of data-generating citizen science, crowdsourcing, and volunteered geographic information (hereafter ‘citizen science’ or simply ‘CS’) to characterize the potential and challenges of these developments for science. This blog post represents a summary of findings from the resulting open access journal article ‘Still in Need of Norms: The State of the Data in Citizen Science’.
I served as co-chair of the TG, and currently serve as co-chair of its successor, which focuses on citizen science data for the Sustainable Development Goals (SDGs). My interest in this topic was piqued when, as chair of an earlier CODATA TG on Global Roads Data Development, I oversaw a validation exercise for OpenStreetMap (OSM) data in West Africa. OSM has since become an authoritative data source used by many countries and organizations, yet until recently it was considered incomplete outside of Europe and North America, and there were concerns about its spatial accuracy. Given the rapid rise in citizen science, I was curious to understand perspectives in the scientific community on the validity of these data streams, and to explore how the citizen science community had addressed data validation and data management issues. I was joined on the TG by enthusiastic colleagues, including Elaine Faustman (Vice-chair of the WDS-SC) and Rorie Edmunds (Acting WDS Executive Director). In addition, we hired a consultant, Anne Bowser of the US-based Woodrow Wilson Center, to coordinate a survey of citizen science efforts and compile the results.
The survey covered 36 CS efforts, spanning a reasonably representative sample of projects across thematic areas (biodiversity, environmental justice, health, etc.), methods (extreme CS to crowdsourcing), and regions. Survey participants were recruited through a purposive 'landscape sampling' scheme designed to capture data management practices from a wide range of initiatives. While this is not a statistically representative sample, it is sufficient to gain an overview of the ‘state of the data’ in citizen science.
We asked our participants to describe the full range of data collection or processing tasks that were used in their citizen science research. For data management, we asked about quality assurance/quality control (QA/QC) processes, including those related to data collection, as well as volunteer training; instrument control, such as the use of a standardized instrument; and data verification/validation strategies, such as voucher collection (e.g., through a photo or specimen) or expert review. We asked questions on data access, including whether access to analyzed, aggregated, and/or raw data were provided, and how data discovery and dissemination were supported (if at all).
Results suggest that QA/QC methods are widespread, and include practices such as expert review, crowdsourced review, voucher collection, statistical identification of outliers, and replication or calibration across volunteers. Fourteen projects removed data considered suspect or unreliable, while nine contacted volunteers to get additional information on questionable data. A high percentage of projects conduct volunteer training before data collection and/or on an ongoing basis. Domain data standards are also commonly applied. Overall, CS data practices appear to be approaching best practices of science more generally, but a strong onus rests on CS data developers to ‘prove’ that their data are robust, in part because of past suspicions that citizen science data were not of research quality. Those concerns are waning. As mentioned, OSM roads and settlements data are now considered to be the best available open access data for humanitarian decision-making and research applications, and ornithological data from eBird are integrated into the Global Biodiversity Information Facility (GBIF)—a WDS Regular Member.
In terms of data infrastructure, many projects adopt existing data collection applications and online communities, such as iNaturalist, BioCollect, CitSci.org, and Spotteron; leverage existing crowdsourcing platforms, such as Zooniverse or OSM; or develop their own fit-for-purpose platform with robust infrastructure, often including backups and redundancies. Outsourcing infrastructure is common, but in some smaller projects, the principals on the science side seemed largely unaware of the backend systems in place. Data security practices are generally appropriate, and projects take different approaches to dealing with personally identifiable information (PII) on volunteers: some anonymize the data, some adopt a social network approach where only fellow members of the community can access PII, and some have volunteers opt in to share personal information.
Documentation of CS data can generally be improved. Thirteen respondents mentioned publishing information about the methodology or protocol, while eight documented limitations. Five projects offered fitness-for-use statements or use cases. A few projects offered simple disclaimers, such as ‘data are provided as is’.
Of perhaps greatest interest to the WDS community were the responses on data access. Eighteen projects make their data discoverable through the project website, 10 make data available through a topical or field-based repository (such as GBIF), 8 share their data through an institutional repository, 4 through a public sector data repository, and 2 through a publication-based repository. Only nine projects do not easily enable secondary users to find their data. Fourteen projects publish open data, and 13 offer data upon request, including by emailing the principal investigator. Six projects disseminate the data openly but require processes, such as creating user accounts, that effectively prohibit automated access, and seven state that their data are never available.
Persistent identifiers and data licensing are critical aspects of FAIR (Findable, Accessible, Interoperable, Reusable) data practices. Eleven projects offer Digital Object Identifiers (DOIs) to support reuse and citation; of the other 25, most do not offer one, and some participants did not know whether theirs did. Only 16 projects have a standardized license to support data reuse, most often Creative Commons (CC) licenses such as CC-BY and CC-BY-SA (which require attribution), but also the CC0 public domain dedication. A few apply non-commercial use restrictions. However, 18 project representatives could not identify any standardized license for their data, and two participants did not know whether their project had a license.
In most cases, raw and processed (cleaned, aggregated, summarized, visualized) data are provided by projects, but some projects only provide access to processed data. This takes a variety of forms. Nineteen projects share findings through project publications or whitepapers, while 16 share findings through peer-reviewed publications. Many projects note that scholarly publication was ‘a longer-term goal’. Only six projects provide no access to analyzed data. Sixteen projects provide tools for user-specified queries or downloads (with several also providing Application Programming Interfaces for machine queries); 14 make data available through web services or data visualizations, including maps; 10 offer bulk download options; and 5 provide custom analyses or services.
What do these findings mean?
A fundamental rationale for improving data management practices in citizen science is to ensure the ability of citizens, scientists, and policymakers to reuse the data for scientific research or policy purposes. While citizen science has emerged as a promising means to collect data on a massive scale and is maturing in regard to data practices, there is still much progress to be made in approaches to the data lifecycle, from acquisition to management to dissemination. Science as a whole often has difficulty keeping up with new norms, and CS is no different. Yet lags in CS may also reflect a lack of resources, particularly for smaller or startup citizen science efforts that struggle to maintain staff and funding, and for which data management perhaps falls to the bottom of the to-do list. They may also reflect the passions of the first movers in the citizen science space, who were motivated by environmental justice concerns or scientific discovery and were less concerned about long-term data stewardship.
The characterization of data practices in the journal article (and this blog post) is not intended as a criticism of the field, but rather as an effort to identify areas where improvements are needed and as a call to action for greater maturation. In this spirit, my co-authors and I offer the following recommendations:
Data quality: While significant QA/QC checks are taken across the data lifecycle, these are not always documented in a standardized way. Citizen science practitioners should document their QA/QC practices on project websites and/or through formal QA/QC plans. Researchers seeking to advance the field could help develop controlled vocabularies for articulating common data quality practices that can be included in metadata for datasets and/or observations.
[Figure: Number of quality assurance/quality control (QA/QC) methods per project]
Data infrastructure: Citizen science practitioners should consider leveraging existing infrastructures across the data lifecycle, such as for data collection and data archiving (e.g., in large and stable data aggregation repositories). Researchers seeking to advance the field should fully document supporting infrastructures to make their strengths and limitations transparent and to increase their utility, as well as develop additional supporting infrastructures as needed.
Data documentation: Citizen science practitioners should make discovery metadata (structured descriptive information about datasets) available through data catalogues, and should share information on methods used to develop datasets on project websites. Researchers seeking to advance the field could develop controlled vocabularies for metadata documentation, particularly to enable fitness-for-purpose assessments.
Data access: In addition to discovery metadata, citizen science practitioners should select and use one or more open, machine-readable licenses, such as CC licenses. Researchers seeking to advance the field should identify, share information about, and (if necessary) develop long-term infrastructures for data discovery and preservation.
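As one concrete way to act on the documentation and access recommendations above, a project could publish a small machine-readable discovery record alongside its data. The sketch below builds one as schema.org/Dataset JSON-LD with an open CC license; the project name, description, and DOI are invented for illustration, and a real record should follow the metadata profile of the repository or catalogue being used.

```python
import json

# A minimal discovery-metadata record for a hypothetical citizen science
# dataset, expressed as schema.org/Dataset JSON-LD. All field values are
# illustrative placeholders.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Community Water Quality Observations (example)",
    "description": "Volunteer-collected stream measurements (illustrative).",
    "identifier": "https://doi.org/10.1234/example",  # hypothetical DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",  # open, machine-readable
    "creator": {"@type": "Organization", "name": "Example CS Project"},
    "dateModified": "2020-12-01",
}

# Serialize for embedding in a project web page or catalogue submission.
print(json.dumps(record, indent=2))
```

Embedding such a record in a project website makes the dataset harvestable by dataset search engines, and stating the license explicitly lets secondary users determine reuse conditions without contacting the project.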
The CODATA-WDS TG continues its work under a new banner, ‘Citizen Science Data for the SDGs’. A major focus is on citizen science data collection in Africa, a region that is lagging in the achievement of many SDGs—particularly those related to water, sanitation, and the urban environment.
Digital Skill and Workforce Capacity
A Blog post by David Castle (WDS Scientific Committee Member)
In July of this year, the Organisation for Economic Co-operation and Development (OECD) released its report, Building Digital Workforce Capacity and Skills for Data Intensive Science. Commissioned by the OECD’s Global Science Forum (GSF) in 2019, this is the 90th report in the OECD’s series of Science, Technology and Innovation Policy Papers.
The main focus of the report is to understand the training needs of public sector research that is becoming digitized as scientific disciplines evolve, data management becomes more prevalent and rigorous, and open science continues to be a call to action and emerging practice. Digitization of research across all disciplines has also attracted digital infrastructure and cybersecurity investments. At the same time, however, digitization both drives research competitiveness in new directions for scientists and demands greater expertise in new competencies for research support personnel. Is everyone keeping up with the pace of change?
Figure 1. Venn Diagram of Roles and Responsibilities
(Figure 3 from Building Digital Workforce Capacity and Skills for Data Intensive Science)
The answer to this question is mixed in three main ways. First, as the report’s Venn diagram illustrates (Fig. 1), researchers and support personnel working in data-intensive sciences hold roles and responsibilities with distinct functional titles, but their competencies overlap. Illustrative examples from several case studies show that roles and competencies have been changing for some time. Second, because the mix of capacity and competencies is a moving target, the present challenge is to identify the skills needed as the composition of the research workforce changes. Third, training has been lagging behind the front wave of digitizing research, leaving skills gaps that may be ignored or go unnoticed.
Figure 2. Digital Workforce Capacity Maturity Model
(Figure 5 from Building Digital Workforce Capacity and Skills for Data Intensive Science)
The Expert Group convened by the OECD GSF, on which I served as a member, recognized that not every OECD member state (or non-member, for that matter) has acknowledged the challenge of building workforce capacity and digital skills at the same pace, or with the same level of resource commitment. A ‘digital workforce capacity maturity model’ was developed to capture this diversity (Fig. 2). It serves as a rough indicator of which training is needed most urgently, according to where one lies on a spectrum of training depth.
Figure 3. Opportunities for Actors to Effect Change Across the Five Main Action Areas
(Table 2 from Building Digital Workforce Capacity and Skills for Data Intensive Science)
The report also offers insights, organized initially as a matrix (Fig. 3), into who might do what to provide training. The ‘who’ are the main actors: national and regional governments; research agencies and professional science associations; research institutes and infrastructures; and universities. The ‘what’ includes a wide array of initiatives: defining needs; provisioning of training and community building; career path rewards; and broader enablers. This is more fully fleshed out in many examples from around the world, describing some of the initiatives that have been undertaken to develop training.
Recommendations are made for the various actors, and the report takes special note of what can and should be done at research universities and their associated libraries. The overall recommendation to OECD members is that policies recognizing and enabling both the need for workforce capacity growth and access to digital skills training must be embraced to maintain the competitiveness of national and internationally collaborative research, and thus to achieve its highest goals.
The report was in its final stages of review and approval when the COVID-19 pandemic struck. As we observed in the conclusion of our Foreword, ’The COVID-19 pandemic highlights the importance and potential of data intensive science. All countries need to make digital skills and capacity for science a priority and they need to work together internationally to achieve this. To this end, the recommendations in this report are even more pertinent now than they were when they were first drafted in late 2019’. As we near the end of 2020, all indications are that the need to build workforce capacity and digital skills for data-intensive sciences has not only escalated, but must now address new realities, research priorities, urgent timelines for training, and challenging socioeconomic circumstances.