Postnatal nurturing – giving data at risk a better chance of a long and fruitful life
A Blog post by David J. Patterson (WDS Scientific Committee member)
Big data gurus and advocates for a cyberinfrastructure or big data science describe a data-centric future in which massive quantities of digital data will be available for reuse in research, artificial intelligence, making predictions, or engaging in data-driven discovery. In the world of biology, this translates into the expectation is that molecular sequence information will be available from the nucleotide repositories such as GenBank, or that any and all occurrence data can be found at GBIF. It is also presumed that the data will have been vetted and, in all aspects, are trustworthy.
The vision is flawed. An unknown but large fraction of newborn digital data does not make it beyond the maternity ward. If data are to be properly prepared for re-use in the big data world, they must have moved a long way from the hands of its creators and into the custody of data managers, and repositories that will guarantee access to vetted content in perpetuity.
There are hundreds of thousands of sources where digital information is born. The long tail of parents include individual researchers, research teams, research programs, legacy data recovery projects, local, state, national governmental bodies and international initiatives. These parents rarely have the understanding or skillsets to ensure that their newborn will mature appropriately for a rôle in the big data world. For this to happen, data must be handed on to those who specialise in data management and curation. These adoptive parents will shepherd the content through the maturation process that will make it ready for repositories that are designed to make trustworthy data and services available to the public. The challenges to completion of the path are numerous. The first step is simply to make the data visible and accessible. Bad data need to be set aside or put back on the right path. For content to be discoverable, standardized metadata and ontologies need to be added so that the data can be found in the appropriate context. Interoperability requires access through appropriate services and for the data to be clothed with standardized ontologies and metadata. Just as the idiosyncratic swaddling clothes must be set aside, the new descriptors will need to be embellished with increasing detail, and be continually corrected and improved. Provenance metadata will help creators and managers gain credit for their effort, and will open up a pathway through which concerns about the data can be expressed. There will be problems that are specific to particular disciplines. As an example, relationships among taxa in ‘evolutionary trees’ which are created by algorithms become less trustworthy as new information and new algorithms emerge. In the biodiversity sciences, taxa may be mis-identified. Further, with the passage of time, new species are discovered - a process that renders ambiguous taxa identified by earlier less stringent criteria The ecosystem through which the content moves must provide the support that ensures continued fitness for purpose. Confidentiality and ethical concerns vary with subject matter but also have to be addressed.
As data mature, they will move from the hundreds of thousands of parents to a small number of data repositories that are funded using models that guarantee the persistence of their services. As far as is feasible, we expect the managers and repositories to apply the FAIR principles to the content they hold. Then, if the holder of the baton can meet the expectations of the CoreTrustSeal accreditation, the data will have found a secure and persistent home, with data ready for reuse. Fifty or so repositories have gained the CoreTrustSeal certification. But, as we have seen from the recent US governmental shutdown in December 2018 and January 2019, even major and certified data suppliers cannot be relied upon and may blink out unpredictably.
Many components already exist, nor are they joined up. Not only do most data fall by the wayside, much is not fit for a rôle in a data-centric world. The data are too contextualised, descriptors are incomplete or inaccurate. Few, if any, of the big data world providers allow users to correct errors. The consequence is that users of open data have to work with contaminated material. The World Data System is charged by the International Science Council to promote universal and equitable access to scientific data and information and increasing the capacity to generate new knowledge. WDS is especially concerned with the trustworthiness of the data and services. We will move further faster when we acknowledge that the research and discovery paradigm needs to be complemented with an investment in infrastructure and services. That investment will provide the framework and support that is required for data to live long and to prosper.
Credit for child image: Traitlin Burke