Journal:Global data quality assessment and the situated nature of “best” research practices in biology

Full article title Global data quality assessment and the situated nature of “best” research practices in biology
Journal Data Science Journal
Author(s) Leonelli, Sabina
Author affiliation(s) University of Adelaide and University of Exeter
Primary contact Email: s dot Leonelli at exeter dot ac dot uk
Year published 2017
Volume and issue 16
Page(s) 32
DOI 10.5334/dsj-2017-032
ISSN 1683-1470
Distribution license Creative Commons Attribution 4.0 International
Website http://datascience.codata.org/articles/10.5334/dsj-2017-032/
Download http://datascience.codata.org/articles/10.5334/dsj-2017-032/galley/690/download/ (PDF)

Abstract

This paper reflects on the relation between international debates around data quality assessment and the diversity characterizing research practices, goals and environments within the life sciences. Since the emergence of molecular approaches, many biologists have focused their research, and related methods and instruments for data production, on the study of genes and genomes. While this trend is now shifting, prominent institutions and companies with stakes in molecular biology continue to set standards for what counts as "good science" worldwide, resulting in the use of specific data production technologies as proxy for assessing data quality. This is problematic considering (1) the variability in research cultures, goals and the very characteristics of biological systems, which can give rise to countless different approaches to knowledge production; and (2) the existence of research environments that produce high-quality, significant datasets despite not availing themselves of the latest technologies. Ethnographic research carried out in such environments evidences a widespread fear among researchers that providing extensive information about their experimental set-up will affect the perceived quality of their data, making their findings vulnerable to criticisms by better-resourced peers. These fears can make scientists resistant to sharing data or describing their provenance. To counter this, debates around open data need to include critical reflection on how data quality is evaluated, and the extent to which that evaluation requires a localized assessment of the needs, means and goals of each research environment.

Keywords: data quality, research assessment, peer review, scientific publication, research methods, data generation

Introduction: Open data and the assessment of data quality in the life sciences

Much of the international discussion around open science, and particularly debates around open data, is concerned with how to assess and monitor the quality and reliability of data being disseminated through repositories and databases.[1] Finding reliable ways to guarantee data quality is of great import when attempting to incentivize data sharing and re-use, since trust in the reliability of data available online is crucial to researchers considering them as a starting point for – or even just a complement to – their ongoing work.[2][3][4]

Indeed, the quality and reliability of data hosted by digital databases is key to the success of open data, particularly in the wake of the "replicability crisis" recently experienced by fields such as psychology and biomedicine[5], and given the constant acceleration of the pace at which researchers produce and publish results.[6] However, the wide variation among the methods, materials, goals and techniques used in pluralistic fields such as biology, as well as the diverse ways in which data can be evaluated depending on the goals of the investigation at hand, makes it hard to set common standards and establish international guidelines for evaluating data quality.[1] Attempts to implement peer review of the datasets donated to digital databases are also proving problematic, given the constraints in resources, personnel and expertise experienced by most data infrastructures, and the scarce time and rewards available to researchers contributing expertise to such efforts. This problem is aggravated by the speed with which standards, technologies and knowledge change and develop in any given domain, which makes it difficult, time-intensive and expensive to maintain and update databases and related quality standards as needed.

This paper examines the relation between international discussions around how to evaluate data quality, and the existing diversity characterizing research work within the life sciences, particularly in relation to biologists’ access to and use of instruments, infrastructures and materials. Since the molecular bandwagon took off in Europe and the U.S. in the 1950s, the majority of resources and attention within biology has been dedicated to creating methods and technologies to study the lowest levels of organization of organisms, particularly genomics.[7][8] This trend is now reversing, with substantial interest returning to the ways in which environmental, phenotypic and epigenetic factors interact with molecular components.[9][10][11] However, countries which adopted and supported the molecular approach – including Japan, China and Singapore – continue to set the standards for what counts as "good science" worldwide. In practice, this means that the technologies and methods fostered by top research sites in these countries – such as, most glaringly, next-generation sequencing methods and instruments – are often taken as exemplary of best laboratory practice, to the point that the software and machines popular in those locations are widely used as a proxy for assessing the quality of the resulting findings.

This situation turns out to be problematic when considering the sophisticated relationship between the goals and interests of researchers at different locations, the specific characteristics of each target system in biology, and the methods devised to study those systems. These factors may vary and be combined in myriad ways, giving rise to countless different ways to conduct and validate research, and thus to assess the quality of relevant data. It is also troubling when considering research environments that do not have the financial and infrastructural resources to avail themselves of the latest software or instrument, but which are nevertheless producing high-quality data of potential biological significance – because of the materials they have access to, their innovative conceptual or methodological approach, or their focus on questions and phenomena of little interest to researchers based elsewhere. All too often, researchers working in such environments are afraid that lack of access to the latest technologies will affect the quality and reliability of their data, and will make their findings vulnerable to criticisms by better-resourced peers. These fears can result in researchers being unwilling to share their data and/or to describe the specific circumstances and tools through which they were obtained, thus making it impossible for others to build on their research and replicate it elsewhere.

Against this background, this paper defends the idea that debates around open data can and should foster critical reflection on how data quality can and should be evaluated, and the extent to which this involves a localized assessment of the challenges, limitations and imperfections characterizing all research environments. To this aim, I first reflect on existing models of data quality assessment in the life sciences and illustrate why the use of specific technologies for data production can end up being deployed as a proxy for data quality. I then discuss the problems with this approach to data quality assessment, focusing both on the history of molecular biology to date and on contemporary perceptions of technological expectations and standards by researchers in both African and European countries. I stress how technologies for data production and dissemination have become markers for researchers’ identity and perception of their own role and status within their fields, in ways that are potentially damaging both to researchers' careers and to scientific advancement as a whole.

This discussion is based on observations acquired in the course of ethnographic visits to biological laboratories in Wales, Britain, the United States, Belgium, Germany, Kenya and South Africa; extensive interviews with researchers working at those sites conducted between 2012 and 2016; and discussions on open data and data quality carried out with African members of the Global Young Academy (GYA) as part of my work as coordinator for the open science working group (https://globalyoungacademy.net/activities/open-science/).[a] I conclude that it is essential for research data to be evaluated in a manner that is localized and context-sensitive, and that open data advocates and policies can play a critical role in fostering constructive and inclusive practices of data quality assessment.

Existing approaches to research data quality assessment

Data quality is a notoriously slippery and multifaceted notion, which has long been the subject of scholarly discussion. A comprehensive review of such debates is provided by Luciano Floridi and Phyllis Illari[13], who highlight how the various approaches available, while usefully focusing on aspects such as error detection and countering misinformation, are ultimately tied to domain-specific estimations of what counts as quality and reliability (and for what purposes) that cannot be transferred easily across fields, and sometimes even across specific cases of data use. This does little to help the development and implementation of mechanisms that can guarantee the quality of the vast amounts of research data stored in large digital repositories for open consultation. Data dissemination through widely available data infrastructures is characteristic of the current open data landscape, which fits the current policy agenda in making research results visible and potentially re-usable by anybody with the skills and interest to explore them. This mode of data dissemination relies on the assumption that the data made accessible online are of sufficient quality to be useful for further investigation. At the same time, data curators and researchers are well aware that this assumption is problematic and easy to challenge. This is, first, because no data type is "intrinsically" trustworthy, but rather data are regarded as reliable on the basis of the methods, instruments, commitments, values and goals employed by the people who generate them[1]; and second, because while it is possible to evaluate the quality of data through a review of related metadata, this evaluation typically requires expert skills that not all prospective data users possess or care to exercise.[14][b]

The problems involved in continuing to develop large research data collections without clear quality benchmarks are widely recognized by academies, institutions and expert bodies involved in open data debates, and debates over data quality feature regularly in meetings of the Research Data Alliance, CODATA and many other learned societies and organizations around the world. While it is impossible to summarize these extensive debates within the scope of this paper, I now briefly examine six modes of data quality evaluation that have been widely employed so far within the sciences, and which continue to hold sway while new solutions are being developed and tested.

The first and most common mode of data quality evaluation consists of traditional peer review of research articles where data appear as evidence for scientific claims. The idea here is that whenever scientific publications are refereed, reviewers also need to assess the quality of the data used as evidence for the claims being made, and will not approve of publications grounded on untrustworthy data. Data attached to peer-reviewed publications are therefore often assumed to be of high quality and can therefore be openly disseminated without problems. However, there are reasons to doubt the effectiveness of this strategy in the current research environment. First, it only works for data extracted from journal publications, and is of little use when it comes to data that have not yet been analyzed for publication – thus restricting the scope of databases in ways that many find unacceptable, particularly in the current big data landscape where the velocity with which data are generated has dramatically increased, and a key reason for open dissemination of data is precisely to facilitate their interpretation. It is also not clear that peer review of publications is a reliable way to peer review data. As noted by critics of this approach, traditional peer review focuses on the credibility of methods and claims made in the given publication, not on data per se (which are anyhow often presented within unstructured "supplementary information" sections, when they are presented at all[18]). Reviewers do not usually evaluate whether data could usefully be employed to answer research questions other than the one being asked in the paper, and as a result, they provide a skewed evaluation. This could be regarded as an advantage of peer review, since through this system data are always contextualized and assessed in relation to a particular research goal; yet, it does not help to assess the quality of data in contexts of dissemination and re-use. Thus, data curators in charge of retrieving and assessing the quality of data originally published in association with papers need to employ considerable domain-specific expertise to be able to extract the data from existing publications and make them findable and usable. An example of this is the well-known Gene Ontology, whose curators annotate data about gene products by mining published sources and adapting them to common standards and terminology used within the database, which involves considerable labor and skill.[19][20]
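
To make concrete how a curated resource such as the Gene Ontology records the provenance behind each annotation, the short Python sketch below tallies annotations by their evidence code in a GO annotation (GAF) file – the tab-delimited format distributed by the GO Consortium, in which column 7 states how each gene product–term link was established (e.g., IDA, "inferred from direct assay", versus IEA, "inferred from electronic annotation" without curator review). This is a minimal illustrative sketch, not drawn from the article itself; the input file name is hypothetical and stands in for any locally downloaded GAF 2.x file.

    # Minimal sketch: summarize Gene Ontology annotations by evidence code.
    # Assumes a locally downloaded GAF 2.x file; "goa_example.gaf" is hypothetical.
    from collections import Counter

    # GO evidence codes backed by direct experimental work, as opposed to
    # purely computational inference such as IEA.
    EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

    def summarize_evidence(gaf_path):
        """Count annotations per evidence code in a GAF file."""
        counts = Counter()
        with open(gaf_path) as handle:
            for line in handle:
                if not line.strip() or line.startswith("!"):  # skip header/comment lines
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 7:
                    counts[fields[6]] += 1  # column 7 holds the evidence code
        return counts

    if __name__ == "__main__":
        counts = summarize_evidence("goa_example.gaf")
        experimental = sum(n for code, n in counts.items() if code in EXPERIMENTAL_CODES)
        print("Total annotations:", sum(counts.values()))
        print("Experimentally supported:", experimental)

A breakdown of this kind is one simple way a prospective re-user might gauge how much of a dataset rests on curator-reviewed experimental evidence rather than automated inference – part of the expert assessment of metadata discussed above.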

Footnotes

  a. The empirical research for this paper was carried out by me within research sites in Wales, Britain, the United States, Germany and Belgium, and by Louise Bezuidenhout within sites in South Africa and Kenya (for more details on the latter research and related methods, see the paper by Bezuidenhout in this special issue). Given the sensitive nature of the interview materials, the raw data underpinning this paper cannot be openly disseminated; however, a digested and anonymized version of the data is provided on Figshare.[12]
  b. It has also been argued that data quality does not matter within big data collections, because existing data can be triangulated with other datasets documenting the same phenomenon, and datasets that corroborate each other can justifiably be viewed as more reliable.[15] Against this view, I and others have pointed out that triangulation only works when there are enough datasets that document the same phenomenon from different angles, which is not always the case in scientific research.[16][17]

References

  1. Cai, L.; Zhu, Y. (2015). "The challenges of data quality and data quality assessment in the big data era". Data Science Journal 14: 2. doi:10.5334/dsj-2015-002. 
  2. Ossorio, P.N. (2011). "Bodies of data: Genomic data and bioscience data sharing". Social Research 78 (3): 907–932. PMC PMC3984581. PMID 24733955. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3984581. 
  3. Borgman, C.L. (2012). "The conundrum of sharing research data". Journal of the American Society for Information Science and Technology 63 (6): 1059–1078. doi:10.1002/asi.22634. 
  4. Leonelli, S. (2016). Data-Centric Biology: A Philosophical Study. University of Chicago Press. pp. 288. ISBN 9780226416472. http://press.uchicago.edu/ucp/books/book/chicago/D/bo24957334.html. 
  5. Allison, D.B.; Brown, A.W.; George, B.J.; Kaiser, K.A. (2016). "Reproducibility: A tragedy of errors". Nature 530 (7588): 27–9. doi:10.1038/530027a. PMC PMC4831566. PMID 26842041. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4831566. 
  6. Pulverer, B. (2015). "Reproducibility blues". EMBO Journal 34 (22): 2721–4. doi:10.15252/embj.201570090. PMC PMC4682652. PMID 26538323. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4682652. 
  7. Nowotny, H.; Testa, G. (2011). Naked Genes. The MIT Press. pp. 152. ISBN 9780262014939. https://mitpress.mit.edu/books/naked-genes. 
  8. Müller-Wille, S.W.; Rheinberger, H.J. (2012). A Cultural History of Heredity. University of Chicago Press. pp. 288. ISBN 9780226545721. http://www.press.uchicago.edu/ucp/books/book/chicago/C/bo8787518.html. 
  9. Barnes, B.; Dupré, J. (2009). Genomes and What to Make of Them. University of Chicago Press. pp. 288. ISBN 9780226172965. http://press.uchicago.edu/ucp/books/book/chicago/G/bo5705879.html. 
  10. Dupré, J. (2012). Processes of Life. Oxford University Press. pp. 320. ISBN 9780199691982. https://global.oup.com/academic/product/processes-of-life-9780199691982?cc=us&lang=en&. 
  11. Müller-Wille, S.W.; Rheinberger, H.J. (2017). The Gene: From Genetics to Postgenomics. University of Chicago Press. pp. 176. ISBN 9780226510002. http://press.uchicago.edu/ucp/books/book/chicago/G/bo20952390.html. 
  12. Bezuidenhout, L.; Rappert, B.; Leonelli, S. (2016). "Beyond the Digital Divide: Sharing Research Data across Developing and Developed Countries". figshare. https://figshare.com/articles/Beyond_the_Digital_Divide_Sharing_Research_Data_across_Developing_and_Developed_Countries/3203809/1. 
  13. Floridi, L.; Illari, P. (2014). The Philosophy of Information Quality. Springer International Publishing. pp. 315. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  14. Leonelli, S. (2016). "Locating ethics in data science: Responsibility and accountability in global and distributed knowledge production systems". Philosophical Transactions of the Royal Society A 374 (2083): 20160122. doi:10.1098/rsta.2016.0122. PMC PMC5124067. PMID 28336799. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124067. 
  15. Mayer-Schönberger, V.; Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray. pp. 256. ISBN 9781848547926. 
  16. Leonelli, S. (2014). "What Difference Does Quantity Make? On the Epistemology of Big Data in Biology". Big Data and Society 1 (1). doi:10.1177/2053951714534395. PMC PMC4340542. PMID 25729586. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340542. 
  17. Calude, C.S.; Longo, G. (2016). "The Deluge of Spurious Correlations in Big Data". Foundations of Science 21: 1–18. doi:10.1007/s10699-016-9489-4. 
  18. Morey, R.D.; Chambers, C.D.; Etchells, P.J. et al. (2016). "The Peer Reviewers' Openness Initiative: Incentivizing open research practices through peer review". Royal Society Open Science 3 (1): 150547. doi:10.1098/rsos.150547. PMC PMC4736937. PMID 26909182. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4736937. 
  19. Leonelli, S.; Diehl, A.D.; Christie, K.R. et al. (2011). "How the gene ontology evolves". BMC Bioinformatics 12: 325. doi:10.1186/1471-2105-12-325. PMC PMC3166943. PMID 21819553. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166943. 
  20. Blake, J.A.; Christie, K.R.; Dolan, M.E. et al. (2015). "Gene Ontology Consortium: Going forward". Nucleic Acids Research 43 (D1): D1049–56. doi:10.1093/nar/gku1179. PMC PMC4383973. PMID 25428369. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383973. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article had citations listed alphabetically; they are listed in the order they appear here due to the way the wiki works. In several cases, the original article cited sources inline but failed to list them in the References section; these have been omitted here.