Full article title Global data quality assessment and the situated nature of “best” research practices in biology
Journal Data Science Journal
Author(s) Leonelli, Sabina
Author affiliation(s) University of Adelaide and University of Exeter
Primary contact Email: s dot Leonelli at exeter dot ac dot uk
Year published 2017
Volume and issue 16
Page(s) 32
DOI 10.5334/dsj-2017-032
ISSN 1683-1470
Distribution license Creative Commons Attribution 4.0 International
Website http://datascience.codata.org/articles/10.5334/dsj-2017-032/
Download http://datascience.codata.org/articles/10.5334/dsj-2017-032/galley/690/download/ (PDF)


This paper reflects on the relation between international debates around data quality assessment and the diversity characterizing research practices, goals and environments within the life sciences. Since the emergence of molecular approaches, many biologists have focused their research, and related methods and instruments for data production, on the study of genes and genomes. While this trend is now shifting, prominent institutions and companies with stakes in molecular biology continue to set standards for what counts as "good science" worldwide, resulting in the use of specific data production technologies as proxy for assessing data quality. This is problematic considering (1) the variability in research cultures, goals and the very characteristics of biological systems, which can give rise to countless different approaches to knowledge production; and (2) the existence of research environments that produce high-quality, significant datasets despite not availing themselves of the latest technologies. Ethnographic research carried out in such environments evidences a widespread fear among researchers that providing extensive information about their experimental set-up will affect the perceived quality of their data, making their findings vulnerable to criticisms by better-resourced peers. These fears can make scientists resistant to sharing data or describing their provenance. To counter this, debates around open data need to include critical reflection on how data quality is evaluated, and the extent to which that evaluation requires a localized assessment of the needs, means and goals of each research environment.

Keywords: data quality, research assessment, peer review, scientific publication, research methods, data generation

Introduction: Open data and the assessment of data quality in the life sciences

Much of the international discussion around open science, and particularly debates around open data, is concerned with how to assess and monitor the quality and reliability of data being disseminated through repositories and databases.[1] Finding reliable ways to guarantee data quality is of great import when attempting to incentivize data sharing and re-use, since trust in the reliability of data available online is crucial to researchers considering them as a starting point for – or even just complement to – their ongoing work.[2][3][4]

Indeed, the quality and reliability of data hosted by digital databases are key to the success of open data, particularly in the wake of the "replicability crisis" recently experienced by fields such as psychology and biomedicine[5], and given the constant acceleration of the pace at which researchers produce and publish results.[6] However, the wide variation among the methods, materials, goals and techniques used in pluralistic fields such as biology, as well as the diverse ways in which data can be evaluated depending on the goals of the investigation at hand, make it hard to set common standards and establish international guidelines for evaluating data quality.[1] Attempts to implement peer review of the datasets donated to digital databases are also proving problematic, given the constraints in resources, personnel and expertise experienced by most data infrastructures, and the scarce time and rewards available to researchers contributing expertise to such efforts. This problem is aggravated by the speed with which standards, technologies and knowledge change and develop in any given domain, which makes it difficult, time-intensive and expensive to maintain and update databases and related quality standards as needed.

This paper examines the relation between international discussions around how to evaluate data quality, and the existing diversity characterizing research work within the life sciences, particularly in relation to biologists’ access to and use of instruments, infrastructures and materials. Since the molecular bandwagon took off in Europe and the U.S. in the 1950s, the majority of resources and attention within biology has been dedicated to creating methods and technologies to study the lowest levels of organization of organisms, particularly genomics.[7][8] This trend is now reversing, with substantial interest returning to the ways in which environmental, phenotypic and epigenetic factors interact with molecular components.[9][10][11] However, countries which adopted and supported the molecular approach – including Japan, China and Singapore – continue to set the standards for what counts as "good science" worldwide. In practice, this means that the technologies and methods fostered by top research sites in these countries – such as, most glaringly, next generation sequencing methods and instruments – are often taken as exemplary of best laboratory practice, to the point that the use of software and machines popular in those locations is widely treated as a proxy for the quality of the resulting findings.

This situation turns out to be problematic when considering the sophisticated relationship between the goals and interests of researchers at different locations, the specific characteristics of each target system in biology, and the methods devised to study those systems. These factors may vary and be combined in myriad ways, giving rise to countless different ways to conduct and validate research, and thus to assess the quality of relevant data. It is also troubling when considering research environments that do not have the financial and infrastructural resources to avail themselves of the latest software or instruments, but which are nevertheless producing high-quality data of potential biological significance – because of the materials they have access to, their innovative conceptual or methodological approach, or their focus on questions and phenomena of little interest to researchers based elsewhere. All too often, researchers working in such environments are afraid that lack of access to the latest technologies will affect the perceived quality and reliability of their data, and will make their findings vulnerable to criticisms by better-resourced peers. These fears can result in researchers being unwilling to share their data and/or to describe the specific circumstances and tools through which they were obtained, thus making it impossible for others to build on their research and replicate it elsewhere.

Against this background, this paper defends the idea that debates around open data can and should foster critical reflection on how data quality ought to be evaluated, and the extent to which this involves a localized assessment of the challenges, limitations and imperfections characterizing all research environments. To this end, I first reflect on existing models of data quality assessment in the life sciences and illustrate why the use of specific technologies for data production can end up being deployed as a proxy for data quality. I then discuss the problems with this approach to data quality assessment, focusing both on the history of molecular biology to date and on contemporary perceptions of technological expectations and standards by researchers in both African and European countries. I stress how technologies for data production and dissemination have become markers for researchers’ identity and perception of their own role and status within their fields, in ways that are potentially damaging both to researchers' careers and to scientific advancement as a whole.

This discussion is based on observations acquired in the course of ethnographic visits to biological laboratories in Wales, Britain, the United States, Belgium, Germany, Kenya and South Africa; extensive interviews with researchers working on those sites conducted between 2012 and 2016; and discussions on open data and data quality carried out with African members of the Global Young Academy (GYA) as part of my work as coordinator for the open science working group (https://globalyoungacademy.net/activities/open-science/).[a] I conclude that it is essential for research data to be evaluated in a manner that is localized and context-sensitive, and open data advocates and policies can play a critical role in fostering constructive and inclusive practices of data quality assessment.


  a. The empirical research for this paper was carried out by me within research sites in Wales, Britain, the United States, Germany and Belgium, and by Louise Bezuidenhout within sites in South Africa and Kenya (for more details on the latter research and related methods, see the paper by Bezuidenhout in this special issue). Given the sensitive nature of the interview materials, the raw data underpinning this paper cannot be openly disseminated; however, a digested and anonymized version of the data is provided on Figshare.[12]


This presentation is faithful to the original, with only a few minor changes to formatting. In some cases important information was missing from the references, and that information was added. The original article listed citations alphabetically; here they are listed in order of appearance due to the way the wiki works. In several cases, the original article cited sources inline but failed to list them in the References section; these have been omitted here.