In 2016, the International Science Council’s Committee on Data (CODATA) and World Data System (WDS) formed a Task Group (TG) on the Validation, Curation, and Management of Citizen Science and Crowdsourced Data. The objectives of the TG were to better understand the ecosystem of data-generating citizen science, crowdsourcing, and volunteered geographic information (hereafter ‘citizen science’ or simply ‘CS’) to characterize the potential and challenges of these developments for science. This blog post represents a summary of findings from the resulting open access journal article ‘Still in Need of Norms: The State of the Data in Citizen Science’.
I served as co-chair of the TG, and currently serve as co-chair of its successor, which focuses on citizen science data for the Sustainable Development Goals (SDGs). My interest in this topic was piqued when, as chair of an earlier CODATA TG on Global Roads Data Development, I oversaw a validation exercise for OpenStreetMap (OSM) data in West Africa. OSM has since become an authoritative data source used by many countries and organizations, yet until recently it was considered incomplete outside of Europe and North America, and there were concerns about its spatial accuracy. Given the rapid rise in citizen science, I was curious to understand perspectives in the scientific community on the validity of these data streams, and to explore how the citizen science community had addressed data validation and data management issues. I was joined on the TG by enthusiastic colleagues, including Elaine Faustman (Vice-chair of the WDS-SC) and Rorie Edmunds (Acting WDS Executive Director). In addition, we hired a consultant, Anne Bowser of the US-based Woodrow Wilson Center, to coordinate a survey of citizen science efforts and compile the results.
The survey covered 36 CS efforts, spanning a reasonably representative sample of projects across thematic areas (biodiversity, environmental justice, health, etc.), methods (extreme CS to crowdsourcing), and regions. Survey participants were recruited following a purposive sampling scheme designed to capture data management practices from a wide range of initiatives through a landscape sampling methodology. While this is not a statistically representative sample, it is sufficient to gain an overview of the ‘state of the data’ in citizen science.
We asked our participants to describe the full range of data collection or processing tasks that were used in their citizen science research. For data management, we asked about quality assurance/quality control (QA/QC) processes, including those related to data collection, as well as volunteer training; instrument control, such as the use of a standardized instrument; and data verification/validation strategies, such as voucher collection (e.g., through a photo or specimen) or expert review. We asked questions on data access, including whether access to analyzed, aggregated, and/or raw data was provided, and how data discovery and dissemination were supported (if at all).
Results suggest that QA/QC methods are widespread, and include practices such as expert review, crowdsourced review, voucher collection, statistical identification of outliers, and replication or calibration across volunteers. Fourteen projects removed data considered suspect or unreliable, while nine contacted volunteers to get additional information on questionable data. A high percentage of projects conduct volunteer training before data collection and/or on an ongoing basis. Domain data standards are also commonly applied. Overall, CS data practices appear to be approaching best practices of science more generally, but a strong onus rests on CS data developers to ‘prove’ that their data are robust, in part because of past suspicions that citizen science data were not of research quality. Those concerns are waning. As mentioned, OSM roads and settlements data are now considered to be the best available open access data for humanitarian decision-making and research applications, and ornithological data from eBird are integrated into the Global Biodiversity Information Facility (GBIF)—a WDS Regular Member.
In terms of data infrastructure, many projects adopt existing data collection applications and online communities, such as iNaturalist, BioCollect, CitSci.org, and Spotteron; leverage existing crowdsourcing platforms, such as Zooniverse or OSM; or develop their own fit-for-purpose platform with robust infrastructure, often including backups and redundancies. Outsourcing infrastructure is common, but for some smaller projects, project principals on the science side often seemed largely unaware of the backend systems in place. Data security practices are generally appropriate, and different projects have different approaches to dealing with personally identifiable information (PII) on volunteers—ranging from anonymizing the data to adopting a social network approach where only fellow members of the community could access PII, to having volunteers opt in to share personal information.
Documentation of CS data can generally be improved. Thirteen respondents mentioned publishing information about the methodology or protocol, while eight documented limitations. Five projects offered fitness-for-use statements or use cases. A few projects offered simple disclaimers, such as ‘data are provided as is’.
Of perhaps greatest interest to the WDS community were the responses on data access. Eighteen projects make their data discoverable through the project website, 10 projects make data available through a topical or field-based repository (such as GBIF), 8 projects share their data through an institutional repository, 4 through a public sector data repository, and 2 through a publication-based repository. Only nine projects do not easily enable secondary users to find their data. Fourteen projects publish open data, and 13 offer data upon request, including by emailing the principal investigator. Six projects disseminate the data openly, but require processes like creating user accounts that effectively prohibit automated access, and seven projects state that their data are never available.
Persistent identifiers and data licensing are critical aspects of FAIR (Findable, Accessible, Interoperable, Reusable) data practices. Eleven projects offer Digital Object Identifiers to support reuse and citation; the other 25 either do not offer them, or participants did not know whether their project has them. Only 16 projects have a standardized license to support data reuse, most often Creative Commons (CC) licenses such as CC-BY and CC-BY-SA (which require attribution), but also the CC0 public domain dedication. A few apply non-commercial use restrictions. However, 18 project representatives could not identify any standardized license for their data, and two participants did not know whether their project had a license or not.
In most cases, raw and processed (cleaned, aggregated, summarized, visualized) data are provided by projects, but some projects only provide access to processed data. This takes a variety of forms. Nineteen projects share findings through project publications or whitepapers, while 16 share findings through peer-reviewed publications. Many projects note that scholarly publication was ‘a longer-term goal’. Only six projects provide no access to analyzed data. Sixteen projects provide tools for user-specified queries or downloads (with several also providing Application Programming Interfaces for machine queries); 14 make data available through web services or data visualizations, including maps; 10 offer bulk download options; and 5 provide custom analyses or services.
What do these findings mean?
A fundamental rationale for improving data management practices in citizen science is to ensure the ability of citizens, scientists, and policymakers to reuse the data for scientific research or policy purposes. While citizen science has emerged as a promising means to collect data on a massive scale and is maturing in regard to data practices, there is still much progress to be made in approaches to the data lifecycle, from acquisition to management to dissemination. Science as a whole often has difficulty keeping up with new norms, and CS is no different. Yet lags in CS may also reflect a lack of resources, particularly for smaller or startup citizen science efforts that struggle to maintain staff and funding and that perhaps find data management falls to the bottom of the to-do list. They may also reflect the passions of the first movers in the citizen science space, who were motivated by environmental justice concerns or scientific discovery, and were less concerned about long-term data stewardship.
The characterization of data practices in the journal article (and this blog post) is not intended as a criticism of the field, but rather as an effort to identify areas where improvements are needed and to issue a call to action for greater maturation. In this spirit, my co-authors and I offer the following recommendations:
Data quality: While significant QA/QC checks are carried out across the data lifecycle, these are not always documented in a standardized way. Citizen science practitioners should document their QA/QC practices on project websites and/or through formal QA/QC plans. Researchers seeking to advance the field could help develop controlled vocabularies for articulating common data quality practices that can be included in metadata for datasets and/or observations.
Data infrastructure: Citizen science practitioners should consider leveraging existing infrastructures across the data lifecycle, such as for data collection and data archiving (e.g., in large and stable data aggregation repositories). Researchers seeking to advance the field should fully document supporting infrastructures to make their strengths and limitations transparent and to increase their utility, as well as develop additional supporting infrastructures as needed.
Data documentation: Citizen science practitioners should make discovery metadata (structured descriptive information about datasets) available through data catalogues, and should share information on methods used to develop datasets on project websites. Researchers seeking to advance the field could develop controlled vocabularies for metadata documentation, particularly to enable fitness-for-purpose assessments.
Data access: In addition to discovery metadata, citizen science practitioners should select and use one or more open, machine-readable licenses, such as CC licenses. Researchers seeking to advance the field should identify, share information about, and (if necessary) develop long-term infrastructures for data discovery and preservation.
The CODATA-WDS TG continues its work under a new banner, ‘Citizen Science Data for the SDGs’. A major focus is on citizen science data collection in Africa, a region that is lagging in the achievement of many SDGs—particularly those related to water, sanitation, and the urban environment.
The main focus of the report is to understand the training needs of public sector research that is becoming digitized as scientific disciplines evolve, data management becomes more prevalent and rigorous, and open science continues to be a call to action and emerging practice. Digitization of research across all disciplines has also attracted digital infrastructure and cybersecurity investments. At the same time, however, digitization both drives research competitiveness in new directions for scientists and demands greater expertise in new competencies for research support personnel. Is everyone keeping up with the pace of change?
The answer to this question is mixed in three main ways. First, as the Venn diagram from the report visualizes it (Fig. 1), there are roles and responsibilities for researchers and support personnel working in data-intensive sciences that have functional titles, but whose competencies overlap. Illustrative examples from several case studies show that roles and competencies have been changing for some time. Second, because the mix of capacity and competencies is a moving target, the present challenge is to identify the skills needed as the composition of the research workforce changes. Third, training has been lagging behind the front wave of digitizing research, leaving skills gaps that may be ignored or go unnoticed.
The Expert Group convened by the OECD GSF, on which I served as a member, realized that not every OECD member state, or non-member for that matter, has recognized the challenge of building workforce capacity and digital skills at the same pace, or with the same level of resource commitment. A ‘digital workforce capacity maturity model’ was developed to capture this diversity (Fig. 2). It serves as a rough indicator of what training is needed most urgently, according to where one lies on a spectrum of training depth.
The report also offers insights, organized initially as a matrix (Fig. 3), into who might do what to provide training. The ‘who’ are the main actors: national and regional governments; research agencies and professional science associations; research institutes and infrastructures; and universities. The ‘what’ includes a wide array of initiatives: defining needs; provisioning of training and community building; career path rewards; and broader enablers. This is more fully fleshed out in many examples from around the world, describing some of the initiatives that have been undertaken to develop training.
Recommendations are made for the various actors, and the report takes special note of what can and should be done at research universities and their associated libraries. The overall recommendation to OECD members is that policies recognizing and enabling both the need for workforce capacity growth and access to digital skills training must be embraced to maintain the competitiveness of national and internationally collaborative research, and thus to achieve its highest goals.
The report was in its final stages of review and approval when the COVID-19 pandemic struck. As we observed in the conclusion of our Foreword, ‘The COVID-19 pandemic highlights the importance and potential of data intensive science. All countries need to make digital skills and capacity for science a priority and they need to work together internationally to achieve this. To this end, the recommendations in this report are even more pertinent now than they were when they were first drafted in late 2019’. As we get nearer to the end of 2020, all indications are that the need to build workforce capacity and digital skills for data-intensive sciences has not only escalated, but now must address new realities, research priorities, urgent timelines for training, and challenging socioeconomic circumstances.
A Blog post by Seiya Terada (WDS-ITO Co-op Student)
I was very fortunate to have the opportunity to work as a co-op student at the World Data System – International Technology Office (WDS-ITO). The skills I developed and the experience that I gained from this 8-month work term were not something that I could learn in school, only from being in a professional working environment.
During my co-op term, I had opportunities to work on many projects, including creating websites, visualizations, presentation material, and much more. Some projects were more challenging than others, but I had lots of fun learning as I worked on them. The first big project I worked on was making a WDS Member visualization with Adobe After Effects. The visualization shows a globe that spins a full 360 degrees while highlighting the location of each WDS Member. This was my first time using After Effects, let alone making an animation-type visualization, so I had a hard time at first. I learned the basics of After Effects using online resources, then I learned to use more advanced features like masking, which I applied to the animation. The biggest struggle in making the animation was keeping the file size small, since it is to be used on the WDS-ITO website. This meant keeping the animation to a bare minimum, so that the file doesn’t get bloated.
The project I am particularly proud of and had the most fun working on was the website I made for the Research Metadata Schemas Working Group (WG) of the Research Data Alliance. The website hosts visualizations that are based on data from a survey conducted by the WG. As a software engineering undergrad, I was excited that I had a chance to build a website from scratch using my coding skills. I had never used HTML to build a website until this project, and I had not even taken any courses on it at university, so everything was new to me. I therefore had to learn HTML syntax as well as coding practices by using online resources before I started working on the website. I realized that building a sleek website from scratch with my current knowledge would have taken forever, so I decided to use a website template I found online to fill in my knowledge gaps, and tweaked it to fit what I needed. The skills and experiences I gained from these projects are something I will never forget moving forward with my career.
Overall, I had a lot of fun working as a co-op student and it was a good experience. Although some of the projects were challenging, I was able to learn a lot and develop skills that I did not have before. The work environment was relaxed and easy to work in. I was also able to make a lot of unforgettable memories along the way thanks to the people I worked with. This whole experience will definitely help me with my career moving forward.
We would like to bring to your attention the following report, published by the OECD, that may be of interest to the WDS community:
OECD (2020), “Building digital workforce capacity and skills for data-intensive science”, OECD Science, Technology and Industry Policy Papers, No. 90, OECD Publishing, Paris, https://doi.org/10.1787/e08aa3bb-en
This report was commissioned by the OECD Global Science Forum to identify: the skills needed for data-intensive science, the challenges for building sustainable capacity as these needs evolve, and the policy actions that can be taken by different actors to address these needs. The report includes policy recommendations for various actors and good practice examples to support these recommendations, and also notes the value of international cooperation in skills capacity efforts.
A Blog post by Karen Payne (WDS-ITO Associate Director)
You spoke. We listened.
The WDS International Technology Office (WDS-ITO) was created to support Member Organizations of WDS as they develop their data repositories in the areas of data and metadata management, infrastructure, and interoperability. In order to respond most effectively to Member needs, last year the WDS-ITO, with the support of the WDS International Program Office, conducted a survey to evaluate your areas of interest and determine what types of projects you would like WDS to support. Our key finding was a list of potential WDS-ITO projects, ranked according to interest. You can read the report of the survey here. We discovered that the top two areas of interest were: 1) adding semantic markup to metadata and 2) harvestable metadata services. In response, the WDS-ITO has secured funds from Canada’s national New Digital Research Infrastructure Organization (NDRIO) to hire two full-time staff members to work on these projects. The funding provides dedicated resources to develop collaborative partnerships among the WDS-ITO, its members, and relevant international and Canadian interest groups to increase availability and interoperability of metadata assets globally.
Over the next year, the WDS-ITO will be working with the Research Data Alliance Research Metadata Schemas Working Group (WG) to help provide repositories with guidance and tools to add Schema.org markup to metadata. As a first step, the WDS-ITO has prototyped an online visualization tool based on a survey of current practices in using schemas to describe research datasets. The tool shows how some communities have crosswalked common metadata terms to Schema.org properties, and can be useful to repositories that are interested in knowing how other repositories are utilizing Schema.org terms. It can also be used for consensus building among communities of practice that have not yet created a crosswalk between their metadata format of choice and Schema.org properties. We will continue to build on that tool, and provide other guidance to WDS Members to help make their metadata more ‘web friendly’ in the coming months.
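To give a rough sense of what a crosswalk to Schema.org properties can look like in practice, here is a minimal Python sketch. It is an illustration only, not the WG's crosswalk or the WDS-ITO tool: the source field names are loosely modelled on Dublin Core, and the mapping table, the to_schema_org function, and the example record are all hypothetical. The target properties (name, description, identifier, creator, keywords, datePublished, license) are existing Schema.org properties that apply to the Dataset type.

```python
import json

# Hypothetical crosswalk from Dublin Core-style field names to Schema.org
# Dataset properties. Illustrative only; a real community crosswalk may
# map these terms differently.
CROSSWALK = {
    "title": "name",
    "description": "description",
    "identifier": "identifier",
    "creator": "creator",
    "subject": "keywords",
    "date": "datePublished",
    "rights": "license",
}


def to_schema_org(record):
    """Convert a flat metadata record into a Schema.org Dataset JSON-LD dict.

    Fields with no entry in CROSSWALK are returned separately, so gaps in
    the mapping stay visible instead of being silently dropped.
    """
    dataset = {"@context": "https://schema.org/", "@type": "Dataset"}
    unmapped = {}
    for field, value in record.items():
        prop = CROSSWALK.get(field)
        if prop is None:
            unmapped[field] = value
        elif prop == "creator":
            # Assumption: the creator is a person; an Organization type
            # would be needed for institutional creators.
            dataset[prop] = {"@type": "Person", "name": value}
        else:
            dataset[prop] = value
    return dataset, unmapped


if __name__ == "__main__":
    # Hypothetical repository record, invented for this example.
    example = {
        "title": "Example geomagnetic index series",
        "creator": "A. Researcher",
        "rights": "https://creativecommons.org/licenses/by/4.0/",
        "coverage": "global",
    }
    dataset, unmapped = to_schema_org(example)
    print(json.dumps(dataset, indent=2))
    print("Fields without a Schema.org mapping:", sorted(unmapped))
```

The resulting JSON-LD could be embedded in a dataset landing page inside a script element of type application/ld+json. Reporting the unmapped fields separately is a deliberate choice here: those are the terms a community would still need to agree on when building its own crosswalk.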
As part of our support for those groups interested in harvestable metadata, the WDS-ITO has created a WG of WDS Members who are interested in standing up harvestable metadata services. This WDS Harvestable Metadata Services (HMetS) WG is co-chaired by two members of the WDS Scientific Committee: Aude Chambodut, Director of the International Service of Geomagnetic Indices in Strasbourg (WDS Regular Member) and Juanle Wang, Director of the WDC for Renewable Resources and Environment in Beijing (WDS Regular Member). The HMetS WG is coordinated by Alicia Urquidi Diaz, the WDS-ITO’s first employee! To date, eight WDS Member Representatives have expressed interest in participating in the WG, and we welcome any other Members who would like to join.
This project is designed around three objectives:
Documenting use cases: the current challenges faced by WDS Members who wish to create harvestable services, including their current infrastructure.
Helping develop implementation plans, written by Members, that define a pathway to creating harvestable metadata services.
Producing a paper that identifies lessons learned and guidance materials that can be used by the wider Research Data Management community.
The HMetS WG will convene regular online meetings, and bring in presenters who can speak to some of the pathways and long-term benefits of creating harvestable metadata services.
Both of the above work packages will draw on the expertise of and synchronize with ongoing research data management activities in Canada, with the ultimate goal of opening up more metadata records to the international scientific community.
You can read the NDRIO funding announcement here in English and French.
In this blog post, Varsha Khodiyar (Data Curation Manager, Research Data and New Product Development) describes why Springer Nature has endorsed the TRUST Principles and their importance to data management within the research community.
For more information on the TRUST Principles and how your organization can endorse them, please see our news article.