Current Data Practices in Citizen Science
A Blog post by Alex de Sherbinin (WDS Scientific Committee Chair)
In 2016, the International Science Council’s Committee on Data (CODATA) and World Data System (WDS) formed a Task Group (TG) on the Validation, Curation, and Management of Citizen Science and Crowdsourced Data. The objectives of the TG were to better understand the ecosystem of data-generating citizen science, crowdsourcing, and volunteered geographic information (hereafter ‘citizen science’ or simply ‘CS’) and to characterize the potential and challenges of these developments for science. This blog post summarizes findings from the resulting open access journal article ‘Still in Need of Norms: The State of the Data in Citizen Science’.
I served as co-chair of the TG, and currently serve as co-chair of its successor—which focuses on citizen science data for the Sustainable Development Goals (SDGs). My interest in this topic was piqued when, as chair of an earlier CODATA TG on Global Roads Data Development, I oversaw a validation exercise for OpenStreetMap (OSM) data in West Africa. Today, OSM has become an authoritative data source used by many countries and organizations, yet until recently its coverage outside of Europe and North America was considered incomplete, and there were concerns about its spatial accuracy. Given the rapid rise in citizen science, I was curious to understand perspectives in the scientific community on the validity of these data streams, and to explore how the citizen science community had addressed data validation and data management issues. I was joined on the TG by enthusiastic colleagues, including Elaine Faustman (Vice-chair of the WDS-SC) and Rorie Edmunds (Acting WDS Executive Director). In addition, we hired a consultant, Anne Bowser of the US-based Woodrow Wilson Center, to coordinate a survey of citizen science efforts and compile the results.
The survey covered 36 CS efforts, a reasonably representative sample of projects across thematic areas (biodiversity, environmental justice, health, etc.), methods (extreme CS to crowdsourcing), and regions. Survey participants were recruited through a purposive ‘landscape sampling’ scheme designed to capture data management practices from a wide range of initiatives. While this is not a statistically representative sample, it is sufficient to gain an overview of the ‘state of the data’ in citizen science.
We asked our participants to describe the full range of data collection or processing tasks that were used in their citizen science research. For data management, we asked about quality assurance/quality control (QA/QC) processes, including those related to data collection, as well as volunteer training; instrument control, such as the use of a standardized instrument; and data verification/validation strategies, such as voucher collection (e.g., through a photo or specimen) or expert review. We asked questions on data access, including whether access to analyzed, aggregated, and/or raw data was provided, and how data discovery and dissemination were supported (if at all).
Results suggest that QA/QC methods are widespread, and include practices such as expert review, crowdsourced review, voucher collection, statistical identification of outliers, and replication or calibration across volunteers. Fourteen projects removed data considered suspect or unreliable, while nine contacted volunteers to get additional information on questionable data. A high percentage of projects conducted volunteer training before data collection and/or on an ongoing basis, and domain data standards were commonly applied. Overall, CS data practices appear to be approaching the best practices of science more generally, but a strong onus rests on CS data developers to ‘prove’ that their data are robust, in part because of past suspicions that citizen science data were not of research quality. Those concerns are waning. As mentioned, OSM roads and settlements data are now considered the best available open access data for humanitarian decision-making and research applications, and ornithological data from eBird are integrated into the Global Biodiversity Information Facility (GBIF)—a WDS Regular Member.
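As a concrete illustration of one of these practices, statistical screening for outliers can be as simple as flagging values that sit far from the pool of volunteer reports. The sketch below is hypothetical (the field names, data, and threshold are illustrative, not drawn from any surveyed project):

```python
from statistics import mean, stdev

def flag_outliers(observations, z_threshold=3.0):
    """Flag observations whose value lies more than z_threshold
    standard deviations from the mean of all reported values."""
    values = [obs["value"] for obs in observations]
    mu, sigma = mean(values), stdev(values)
    flagged = []
    for obs in observations:
        z = abs(obs["value"] - mu) / sigma if sigma else 0.0
        if z > z_threshold:
            flagged.append({**obs, "z_score": round(z, 2)})
    return flagged

# Hypothetical water-quality pH readings from five volunteers.
readings = [
    {"volunteer": "v1", "value": 7.1},
    {"volunteer": "v2", "value": 7.3},
    {"volunteer": "v3", "value": 6.9},
    {"volunteer": "v4", "value": 7.0},
    {"volunteer": "v5", "value": 14.2},  # pH 14.2 is physically implausible
]
suspect = flag_outliers(readings, z_threshold=1.5)
print(suspect)  # only v5 is flagged
```

In practice, projects combining a screen like this with the follow-up step several respondents described—contacting the volunteer about the questionable record—can recover data rather than simply discarding it.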
In terms of data infrastructure, many projects adopt existing data collection applications and online communities, such as iNaturalist, BioCollect, CitSci.org, and Spotteron; leverage existing crowdsourcing platforms, such as Zooniverse or OSM; or develop their own fit-for-purpose platform with robust infrastructure, often including backups and redundancies. Outsourcing infrastructure is common, but in some smaller projects, the principals on the science side seemed largely unaware of the backend systems in place. Data security practices are generally appropriate, and projects take different approaches to dealing with personally identifiable information (PII) on volunteers—ranging from anonymizing the data, to adopting a social network approach in which only fellow members of the community can access PII, to having volunteers opt in to sharing personal information.
Documentation of CS data can generally be improved. Thirteen respondents mentioned publishing information about the methodology or protocol, while eight documented limitations. Five projects offered fitness-for-use statements or use cases. A few projects offered simple disclaimers, such as ‘data are provided as is’.
Of perhaps greatest interest to the WDS community were the responses on data access. Eighteen projects make their data discoverable through the project website, ten make data available through a topical or field-based repository (such as GBIF), eight share their data through an institutional repository, four through a public sector data repository, and two through a publication-based repository. Only nine projects do not make it easy for secondary users to find their data. Fourteen projects publish open data, and thirteen offer data upon request, including by emailing the principal investigator. Six projects disseminate their data openly but require processes, such as creating user accounts, that effectively prohibit automated access, and seven projects state that their data are never available.
Persistent identifiers and data licensing are critical aspects of FAIR (Findable, Accessible, Interoperable, Reusable) data practices. Eleven projects offer Digital Object Identifiers (DOIs) to support reuse and citation; the other twenty-five either do not offer one, or participants did not know whether they do. Only sixteen projects have a standardized license to support data reuse, most often Creative Commons (CC) licenses such as CC BY and CC BY-SA (which require attribution), but also the CC0 public domain dedication; a few apply non-commercial use restrictions. Eighteen project representatives could not identify any standardized license for their data, and two did not know whether their project had a license or not.
In most cases, projects provide both raw and processed (cleaned, aggregated, summarized, visualized) data, but some provide access only to processed data. This takes a variety of forms. Nineteen projects share findings through project publications or whitepapers, while sixteen share findings through peer-reviewed publications; many noted that scholarly publication was ‘a longer-term goal’. Only six projects provide no access to analyzed data. Sixteen projects provide tools for user-specified queries or downloads (with several also providing Application Programming Interfaces for machine queries); fourteen make data available through web services or data visualizations, including maps; ten offer bulk download options; and five provide custom analyses or services.
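To make the machine-query pathway concrete: GBIF, which aggregates eBird and other CS datasets, exposes a public occurrence-search API. The sketch below composes a query URL and then parses a hand-written sample response rather than making a live call; the endpoint and field names follow GBIF's v1 API as I understand it, so treat them as assumptions to verify against GBIF's own documentation.

```python
import json
from urllib.parse import urlencode

# GBIF's public occurrence-search endpoint (v1 API).
GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/search"

def build_search_url(**params):
    """Compose a search URL for a machine query against GBIF."""
    return f"{GBIF_OCCURRENCE_API}?{urlencode(params)}"

url = build_search_url(scientificName="Turdus migratorius",
                       country="US", limit=5)

# Shape of a (truncated) response; a real client would fetch `url`
# with urllib.request and json-decode the body instead.
sample_response = json.loads("""
{"count": 2,
 "results": [
   {"species": "Turdus migratorius",
    "decimalLatitude": 40.7, "decimalLongitude": -74.0},
   {"species": "Turdus migratorius",
    "decimalLatitude": 41.9, "decimalLongitude": -87.6}]}
""")
coords = [(r["decimalLatitude"], r["decimalLongitude"])
          for r in sample_response["results"]]
print(url)
print(coords)
```

This kind of programmatic access is exactly what account-creation walls prevent, which is why the six projects mentioned above are open in principle but closed to automated reuse.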
What do these findings mean?
A fundamental rationale for improving data management practices in citizen science is to ensure the ability of citizens, scientists, and policymakers to reuse the data for scientific research or policy purposes. While citizen science has emerged as a promising means to collect data on a massive scale and is maturing in regard to data practices, there is still much progress to be made in approaches to the data lifecycle, from acquisition to management to dissemination. Science as a whole often struggles to keep up with new norms, and CS is no different. Yet lags in CS may also reflect a lack of resources, particularly for smaller or startup citizen science efforts that struggle to maintain staff and funding, and for which data management perhaps falls to the bottom of the to-do list. They may also reflect the passions of the first movers in the citizen science space—who were motivated by environmental justice concerns or scientific discovery, and were less concerned about long-term data stewardship.
The characterization of data practices in the journal article (and this blog post) is not intended as a criticism of the field, but rather an effort to identify areas where improvements are needed and to provide a call to action and greater maturation. In this spirit, my co-authors and I offer the following recommendations:
Data quality: While significant QA/QC checks are applied across the data lifecycle, these are not always documented in a standardized way. Citizen science practitioners should document their QA/QC practices on project websites and/or through formal QA/QC plans. Researchers seeking to advance the field could help develop controlled vocabularies for articulating common data quality practices that can be included in metadata for datasets and/or observations.
[Figure: Number of quality assurance/quality control (QA/QC) methods per project]
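One way such a controlled vocabulary could work in practice is as a fixed set of machine-readable terms embedded in dataset metadata. The terms below are invented for illustration; no such vocabulary has been standardized yet:

```python
from enum import Enum

class QAQCMethod(Enum):
    """An illustrative controlled vocabulary of common QA/QC practices,
    suitable for embedding in dataset- or observation-level metadata."""
    EXPERT_REVIEW = "expert-review"
    CROWDSOURCED_REVIEW = "crowdsourced-review"
    VOUCHER_COLLECTION = "voucher-collection"
    OUTLIER_SCREENING = "statistical-outlier-screening"
    VOLUNTEER_CALIBRATION = "volunteer-calibration"

# A project would declare its practices with fixed terms rather than
# free text, so secondary users can filter and compare datasets.
dataset_metadata = {
    "title": "Community Stream Monitoring (hypothetical)",
    "qaqc_methods": [QAQCMethod.EXPERT_REVIEW.value,
                     QAQCMethod.OUTLIER_SCREENING.value],
}
print(dataset_metadata["qaqc_methods"])
```

The value of the fixed terms is that a catalogue can answer questions like "show me all datasets with expert review" without parsing prose descriptions.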
Data infrastructure: Citizen science practitioners should consider leveraging existing infrastructures across the data lifecycle, such as for data collection and data archiving (e.g., in large and stable data aggregation repositories). Researchers seeking to advance the field should fully document supporting infrastructures to make their strengths and limitations transparent and to increase their utility, as well as develop additional supporting infrastructures as needed.
Data documentation: Citizen science practitioners should make discovery metadata (structured descriptive information about datasets) available through data catalogues, and should share information on methods used to develop datasets on project websites. Researchers seeking to advance the field could develop controlled vocabularies for metadata documentation, particularly to enable fitness-for-purpose assessments.
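For instance, discovery metadata can be published as a schema.org Dataset record (JSON-LD), a format many data catalogues and search engines index. Everything below (the project name, DOI, and description) is a placeholder, not a real project:

```python
import json

# A minimal schema.org Dataset record in JSON-LD; the identifier and
# license fields double as the persistent-ID and reuse-terms metadata
# discussed elsewhere in this post.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Citizen Science Bird Observations",
    "description": "Volunteer-collected observations with expert review.",
    "identifier": "https://doi.org/10.xxxx/placeholder",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["citizen science", "biodiversity"],
}
print(json.dumps(record, indent=2))
```

A record like this, served alongside the project website, is a low-cost step toward the FAIR goals of findability and reusability.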
Data access: In addition to discovery metadata, citizen science practitioners should select and use one or more open, machine-readable licenses, such as CC licenses. Researchers seeking to advance the field should identify, share information about, and (if necessary) develop long-term infrastructures for data discovery and preservation.
The CODATA-WDS TG continues its work under a new banner, ‘Citizen Science Data for the SDGs’. A major focus is on citizen science data collection in Africa, a region that is lagging in the achievement of many SDGs—particularly those related to water, sanitation, and the urban environment.