
Full article title: Global data quality assessment and the situated nature of “best” research practices in biology
Journal: Data Science Journal
Author(s): Leonelli, Sabina
Author affiliation(s): University of Adelaide and University of Exeter
Primary contact: Email: s dot Leonelli at exeter dot ac dot uk
Year published: 2017
Volume and issue: 16
Page(s): 32
DOI: 10.5334/dsj-2017-032
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: http://datascience.codata.org/articles/10.5334/dsj-2017-032/
Download: http://datascience.codata.org/articles/10.5334/dsj-2017-032/galley/690/download/ (PDF)

Abstract

This paper reflects on the relation between international debates around data quality assessment and the diversity characterizing research practices, goals and environments within the life sciences. Since the emergence of molecular approaches, many biologists have focused their research, and related methods and instruments for data production, on the study of genes and genomes. While this trend is now shifting, prominent institutions and companies with stakes in molecular biology continue to set standards for what counts as "good science" worldwide, resulting in the use of specific data production technologies as a proxy for assessing data quality. This is problematic considering (1) the variability in research cultures, goals and the very characteristics of biological systems, which can give rise to countless different approaches to knowledge production; and (2) the existence of research environments that produce high-quality, significant datasets despite not availing themselves of the latest technologies. Ethnographic research carried out in such environments evidences a widespread fear among researchers that providing extensive information about their experimental set-up will affect the perceived quality of their data, making their findings vulnerable to criticisms by better-resourced peers. These fears can make scientists resistant to sharing data or describing their provenance. To counter this, debates around open data need to include critical reflection on how data quality is evaluated, and the extent to which that evaluation requires a localized assessment of the needs, means and goals of each research environment.

Keywords: data quality, research assessment, peer review, scientific publication, research methods, data generation

Introduction: Open data and the assessment of data quality in the life sciences

Much of the international discussion around open science, and particularly debates around open data, is concerned with how to assess and monitor the quality and reliability of data being disseminated through repositories and databases.[1] Finding reliable ways to guarantee data quality is of great import when attempting to incentivize data sharing and re-use, since trust in the reliability of data available online is crucial to researchers considering them as a starting point for – or even just a complement to – their ongoing work.[2][3][4]

Indeed, the quality and reliability of data hosted by digital databases is key to the success of open data, particularly in the wake of the "replicability crisis" recently experienced by fields such as psychology and biomedicine[5], and given the constant acceleration of the pace at which researchers produce and publish results.[6] However, the wide variation among the methods, materials, goals and techniques used in pluralistic fields such as biology, as well as the diverse ways in which data can be evaluated depending on the goals of the investigation at hand, make it hard to set common standards and establish international guidelines for evaluating data quality.[1] Attempts to implement peer review of the datasets donated to digital databases are also proving problematic, given the constraints in resources, personnel and expertise experienced by most data infrastructures, and the scarce time and rewards available to researchers contributing expertise to such efforts. This problem is aggravated by the speed with which standards, technologies and knowledge change and develop in any given domain, which makes it difficult, time-intensive and expensive to maintain and update databases and related quality standards as needed.

This paper examines the relation between international discussions around how to evaluate data quality, and the existing diversity characterizing research work within the life sciences, particularly in relation to biologists’ access to and use of instruments, infrastructures and materials. Since the molecular bandwagon took off in Europe and the U.S. in the 1950s, the majority of resources and attention within biology have been dedicated to creating methods and technologies to study the lowest levels of organization of organisms, particularly genomics.[7][8] This trend is now reversing, with substantial interest returning to the ways in which environmental, phenotypic and epigenetic factors interact with molecular components.[9][10][11] However, countries which adopted and supported the molecular approach – including Japan, China and Singapore – continue to set the standards for what counts as "good science" worldwide. In practice, this means that the technologies and methods fostered by top research sites in these countries – such as, most glaringly, next generation sequencing methods and instruments – are often taken as exemplary of best laboratory practice, to the point that the software and machines popular in those locations are widely used as a proxy for assessing the quality of the resulting findings.

This situation turns out to be problematic when considering the sophisticated relationship between the goals and interests of researchers at different locations, the specific characteristics of each target system in biology, and the methods devised to study those systems. These factors may vary and be combined in myriad ways, giving rise to countless different ways to conduct and validate research, and thus to assess the quality of relevant data. It is also troubling when considering research environments that do not have the financial and infrastructural resources to avail themselves of the latest software or instruments, but which are nevertheless producing high-quality data of potential biological significance – because of the materials they have access to, their innovative conceptual or methodological approach, or their focus on questions and phenomena of little interest to researchers based elsewhere. All too often, researchers working in such environments are afraid that lack of access to the latest technologies will affect the quality and reliability of their data, and will make their findings vulnerable to criticisms by better-resourced peers. These fears can result in researchers being unwilling to share their data and/or to describe the specific circumstances and tools through which they were obtained, thus making it impossible for others to build on their research and replicate it elsewhere.

Against this background, this paper defends the idea that debates around open data can and should foster critical reflection on how data quality can and should be evaluated, and the extent to which this involves a localized assessment of the challenges, limitations and imperfections characterizing all research environments. To this aim, I first reflect on existing models of data quality assessment in the life sciences and illustrate why the use of specific technologies for data production can end up being deployed as a proxy for data quality. I then discuss the problems with this approach to data quality assessment, focusing both on the history of molecular biology to date and on contemporary perceptions of technological expectations and standards by researchers in both African and European countries. I stress how technologies for data production and dissemination have become markers for researchers’ identity and perception of their own role and status within their fields, in ways that are potentially damaging both to researchers' careers and to scientific advancement as a whole.

This discussion is based on observations acquired in the course of ethnographic visits to biological laboratories in Wales, Britain, the United States, Belgium, Germany, Kenya and South Africa; extensive interviews with researchers working on those sites conducted between 2012 and 2016; and discussions on open data and data quality carried out with African members of the Global Young Academy (GYA) as part of my work as coordinator for the open science working group (https://globalyoungacademy.net/activities/open-science/).[a] I conclude that it is essential for research data to be evaluated in a manner that is localized and context-sensitive, and that open data advocates and policies can play a critical role in fostering constructive and inclusive practices of data quality assessment.

Existing approaches to research data quality assessment

Data quality is a notoriously slippery and multifaceted notion, which has long been the subject of scholarly discussion. A comprehensive review of such debates is provided by Luciano Floridi and Phyllis Illari[13], who highlight how the various approaches available, while usefully focusing on aspects such as error detection and countering misinformation, are ultimately tied to domain-specific estimations of what counts as quality and reliability (and for what purposes) that cannot be transferred easily across fields, and sometimes even across specific cases of data use. This does not help towards the development and implementation of mechanisms that can guarantee the quality of the vast amounts of research data stored in large digital repositories for open consultation. Data dissemination through widely available data infrastructures is characteristic of the current open data landscape, which fits the current policy agenda in making research results visible and potentially re-usable by anybody with the skills and interest to explore them. This mode of data dissemination relies on the assumption that the data made accessible online are of sufficient quality to be useful for further investigation. At the same time, data curators and researchers are well aware that this assumption is problematic and easy to challenge. This is, first, because no data type is "intrinsically" trustworthy, but rather data are regarded as reliable on the basis of the methods, instruments, commitments, values and goals employed by the people who generate them[1]; and second, because while it is possible to evaluate the quality of data through a review of related metadata, this evaluation typically requires expert skills that not all prospective data users possess or care to exercise.[14][b]
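
To make the metadata point more concrete, here is a minimal sketch in Python of how a repository might score the completeness of a dataset's provenance metadata. The field names and the threshold are invented for this illustration and are not drawn from any actual repository's schema; and, as noted above, such a score cannot by itself replace expert judgement about whether the documented methods suit a given research purpose.

```python
# Illustrative sketch only: the field names and threshold below are invented
# for this example and do not reflect any particular repository's schema.

REQUIRED_METADATA = ["organism", "instrument", "protocol", "date_collected",
                     "lab_of_origin", "processing_pipeline"]

def metadata_completeness(record: dict) -> float:
    """Fraction of expected provenance fields that are actually filled in."""
    present = [field for field in REQUIRED_METADATA if record.get(field)]
    return len(present) / len(REQUIRED_METADATA)

def needs_expert_review(record: dict, threshold: float = 0.8) -> bool:
    """A completeness score says nothing about whether the documented protocol
    suited the research goal; records below the threshold (and arguably all
    records) still call for domain-expert judgement."""
    return metadata_completeness(record) < threshold

# A hypothetical sequencing dataset with sparse provenance information.
dataset = {"organism": "Arabidopsis thaliana",
           "instrument": "Illumina HiSeq",
           "protocol": None,
           "date_collected": "2015-06-12"}

print(metadata_completeness(dataset))  # 0.5 (3 of 6 fields filled in)
print(needs_expert_review(dataset))    # True
```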

The problems involved in continuing to develop large research data collections without clear quality benchmarks are widely recognized by academies, institutions and expert bodies involved in open data debates, and debates over data quality feature regularly in meetings of the Research Data Alliance, CODATA and many other learned societies and organizations around the world. While it is impossible to summarize these extensive debates within the scope of this paper, I now briefly examine six modes of data quality evaluation that have been widely employed so far within the sciences, and which continue to hold sway while new solutions are being developed and tested.

The first and most common mode of data quality evaluation consists of traditional peer review of research articles where data appear as evidence for scientific claims. The idea here is that whenever scientific publications are refereed, reviewers also need to assess the quality of the data used as evidence for the claims being made, and will not approve of publications grounded on untrustworthy data. Data attached to peer-reviewed publications are therefore often assumed to be of high quality and can therefore be openly disseminated without problems. However, there are reasons to doubt the effectiveness of this strategy in the current research environment. This only works for data extracted from journal publications, and is of little use when it comes to data that have not yet been analyzed for publication – thus restricting the scope of databases in ways that many find unacceptable, particularly in the current big data landscape where the velocity with which data are generated has dramatically increased, and a key reason for open dissemination of data is precisely to facilitate their interpretation. It is also not clear that peer review of publications is a reliable way to peer review data. As noted by critics of this approach, traditional peer review focuses on the credibility of methods and claims made in the given publication, not on data per se (which are anyhow often presented within unstructured "supplementary information" sections, when they are presented at all[18]). Reviewers are not usually evaluating whether data could usefully be employed to answer research questions other than the one being asked in the paper, and as a result, they provide a skewed evaluation. This could be regarded as an advantage of peer review, since through this system data are always contextualized and assessed in relation to a particular research goal; yet, it does not help to assess the quality of data in contexts of dissemination and re-use. Thus, data curators in charge of retrieving and assessing the quality of data originally published in association with papers need to employ considerable domain-specific expertise to be able to extract the data from existing publications and make them findable and usable. An example of this is the well-known Gene Ontology, whose curators annotate data about gene products by mining published sources and adapting them to common standards and terminology used within the database, which involves considerable labor and skill.[19][20]

Indeed, a second mode of data quality assessment currently in use relies on evaluations by data curators in charge of data infrastructures. The argument in this case is that these researchers are experts in data dissemination – they are the data equivalent of a librarian for traditional manuscripts – and are therefore best equipped to assess whether or not the data considered for online dissemination are trustworthy and of good enough quality for re-use. Hence, in the Gene Ontology case (cited above), curators not only select which data are of relevance to the categories used in the database, but also assign "confidence rankings" to the data depending on what they perceive as the reliability of the source – a mechanism that certainly assigns considerable responsibility for data quality assessment to those who manage data infrastructures. This solution works reasonably well for relatively small and well-financed data collections, but fails as soon as the funding required to support data curation ceases to exist, or the volume of data becomes so large as to make manual curation impossible. Also, this type of data quality assessment is only as reliable as the curators in charge, especially in cases where data users are too far removed from the development and maintenance of databases to be able or willing to give feedback and check on curators' decisions.
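
As a rough illustration of how such confidence rankings can be operationalized, the sketch below uses real Gene Ontology evidence codes (such as IDA, "inferred from direct assay", and IEA, "inferred from electronic annotation") but maps them to numeric tiers that are invented for this example; they are not the scoring scheme of the Gene Ontology or of any particular database.

```python
# Illustrative sketch: the evidence codes are real Gene Ontology codes, but the
# numeric confidence tiers are invented for this example; they are not the
# scheme used by GO curators or any particular database.

EVIDENCE_CONFIDENCE = {
    "EXP": 3,  # inferred from experiment
    "IDA": 3,  # inferred from direct assay
    "ISS": 2,  # inferred from sequence or structural similarity
    "IEA": 1,  # inferred from electronic annotation, no curator judgement
}

def confidence_tier(evidence_code: str) -> int:
    """Map a curator-assigned evidence code to a coarse confidence tier;
    unknown codes fall to the lowest tier pending curator review."""
    return EVIDENCE_CONFIDENCE.get(evidence_code, 1)

# Hypothetical annotations of the kind a curator might extract from papers.
annotations = [
    {"gene_product": "geneX", "go_term": "GO:0006355", "evidence": "IDA"},
    {"gene_product": "geneY", "go_term": "GO:0008150", "evidence": "IEA"},
]
for annotation in annotations:
    annotation["confidence"] = confidence_tier(annotation["evidence"])
    print(annotation)
```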

A third mode of data quality assessment is thus to leave decisions around data quality to those who have generated the data in the first place, which avoids potential misunderstandings between data producers, reviewers and curators. Again, this solution is not ideal. For one thing, existing databases have a hard time getting data producers to post and appropriately annotate their own data (cases such as PomBase, where over half of the authors of relevant papers post and annotate datasets themselves, are few and far between, and typically occur in relatively small and close-knit communities where trust and accountability are high[21]). Furthermore, whatever standards data producers use to evaluate the quality of their data will unavoidably be steeped in the research culture, habits and methods of their own community and subfield, as well as the goals and materials used in their own research. This means that data producers do not typically have the ability to compare different datasets and evaluate their own data in relation to data produced by other research environments, as would be required when assembling a large data infrastructure. Whenever data leave their context of production and enter new contexts of potential re-use, new standards for quality and reliability may well be required, which in turn calls for external assessment and validation from outside the research environment where data were originally generated.

A fourth method for data quality assessment consists in the employment of automated processes and algorithms, which have the potential to reduce dramatically the manual labor associated with data curation. There is no doubt that automation facilitates a variety of techniques to test the validity, reliability and veracity of data being disseminated, particularly in the context of data linkage facilities and infrastructures.[22] However, such tools typically need to make substantive general assumptions about what types of data are most reliable, which are hard to defend given the user-related nature of data quality metrics and their dependence on the context and goals of data assessment. An interesting model for the development of future data quality assessment processes within the life sciences is provided by the many quality assessment tools used to evaluate clinical data in biomedical research, though that approach relies again on the exercise of human judgement, which in turn results in contentious disparities in its application.
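
To indicate what such automated checks can and cannot do, the following minimal sketch runs validity and completeness rules over a submitted CSV table. The expected columns, valid units and the range rule are invented for illustration; note that they encode precisely the kind of general assumptions about what counts as reliable data that are questioned above.

```python
# Minimal sketch of automated validity and completeness checks of the kind a
# data infrastructure might run on a submitted CSV table. The expected columns,
# valid units and range rule are invented for illustration.

import csv

EXPECTED_COLUMNS = {"sample_id", "organism", "measurement", "unit"}
VALID_UNITS = {"ng/ul", "mg/l", "counts"}

def validate_row(row: dict) -> list:
    """Return the list of rule violations for a single record."""
    errors = []
    if not row.get("sample_id"):
        errors.append("missing sample_id")
    if row.get("unit") not in VALID_UNITS:
        errors.append(f"unrecognized unit: {row.get('unit')}")
    try:
        if float(row.get("measurement") or "") < 0:
            errors.append("negative measurement")
    except ValueError:
        errors.append("non-numeric measurement")
    return errors

def validate_submission(path: str) -> dict:
    """Run the checks over a CSV file and collect violations per row index."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return {"file": [f"missing columns: {sorted(missing)}"]}
        return {i: errs
                for i, row in enumerate(reader)
                if (errs := validate_row(row))}
```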

As a fifth option, there have been attempts to crowdsource quality assessment by enabling prospective data users to grade the quality of data that they find available on digital databases. While this method holds great promise, it is hard to apply consistently and reliably in a situation where researchers receive little or no credit for engaging with the curation and reuse of existing data sources, and providing feedback to data infrastructures that may enhance their usefulness and long-term sustainability. As a result of the lack of incentive to participate in the curation of open data, most databases operating within the life sciences receive little feedback from their users, despite the (sometimes considerable) effort put into creating channels for users to provide comments and assess the data being disseminated. Moreover, it is perfectly possible that users' judgements differ considerably depending on their research goals and methodological commitments.
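
The fabricated example below illustrates why raw crowdsourced ratings are hard to interpret: users with different research goals may score the same dataset very differently, so an overall average conceals exactly the context-dependence at issue.

```python
# Fabricated ratings illustrating why raw crowdsourced scores are hard to read:
# users with different research goals judge the same dataset differently, so an
# overall average hides exactly the disagreement that matters.

from collections import defaultdict
from statistics import mean

ratings = [
    {"dataset": "leaf_expression_2015", "goal": "expression profiling", "score": 5},
    {"dataset": "leaf_expression_2015", "goal": "expression profiling", "score": 4},
    {"dataset": "leaf_expression_2015", "goal": "stress physiology", "score": 2},
]

by_goal = defaultdict(list)
for rating in ratings:
    by_goal[(rating["dataset"], rating["goal"])].append(rating["score"])

print(round(mean(r["score"] for r in ratings), 2))  # 3.67: the average looks middling
for key, scores in sorted(by_goal.items()):
    print(key, mean(scores))                         # per-goal means diverge: 4.5 vs 2
```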

Given the difficulties encountered by the methods listed above, researchers involved in data quality assessments (for instance, related to data publication or to the inclusion of data into a database) may resort to a sixth, unofficial and implicit method: the reliance on specific technologies for data production as proxy markers for data quality. In this case, specific pieces of equipment, methods and materials are taken to be intrinsically reliable and thus to enhance – if not guarantee – the chance that data produced through those techniques and tools will be of good quality. Within the life sciences, prominent examples of such proxies include the use of next generation sequencing machines and mass spectrometry in model organism biology, microbiome research and systems biology; light-producing reporter genes produced by reputable companies in cell and developmental biology; and de novo gene synthesis and design/simulation software in synthetic biology. These tools are strongly embedded in leading research repertoires within biology and are extensively adopted by laboratories around the world.[23][24] They are typically easy to verify, with well-established protocols in place and little additional expertise or labor needed, giving rise to what philosopher Ulrich Krohs calls "convenience experimentation."[25] And they are typically a good fit for existing open data infrastructures and formats, which are often developed alongside such technologies as part of the same repertoire (as in the case of sequencing data[26]).

Footnotes

  1. The empirical research for this paper was carried out by me within research sites in Wales, Britain, the United States, Germany and Belgium, and by Louise Bezuidenhout within sites in South Africa and Kenya (for more details on the latter research and related methods, see the paper by Bezuidenhout in this special issue). Given the sensitive nature of the interview materials, the raw data underpinning this paper cannot be openly disseminated; however, a digested and anonymized version of the data is provided on Figshare.[12]
  2. It has also been argued that data quality does not matter within big data collections, because existing data can be triangulated with other datasets documenting the same phenomenon, and datasets that corroborate each other can justifiably be viewed as more reliable.[15] Against this view, I and others have pointed out that triangulation only works when there are enough datasets that document the same phenomenon from different angles, which is not always the case in scientific research.[16][17]

References

  1. Cai, L.; Zhu, Y. (2015). "The challenges of data quality and data quality assessment in the big data era". Data Science Journal 14: 2. doi:10.5334/dsj-2015-002.
  2. Ossorio, P.N. (2011). "Bodies of data: Genomic data and bioscience data sharing". Social Research 78 (3): 907-932. PMC PMC3984581. PMID 24733955. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3984581. 
  3. Borgman, C.L. (2012). "The conundrum of sharing research data". Journal for the Association for Information Science and Technology 63 (6): 1059–1078. doi:10.1002/asi.22634. 
  4. Leonelli, S. (2016). Data-Centric Biology: A Philosophical Study. University of Chicago Press. pp. 288. ISBN 9780226416472. http://press.uchicago.edu/ucp/books/book/chicago/D/bo24957334.html. 
  5. Allison, D.B.; Brown, A.W.; George, B.J.; Kaiser, K.A. (2016). "Reproducibility: A tragedy of errors". Nature 530 (7588): 27–9. doi:10.1038/530027a. PMC PMC4831566. PMID 26842041. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4831566.
  6. Pulverer, B. (2015). "Reproducibility blues". EMBO Journal 34 (22): 2721-4. doi:10.15252/embj.201570090. PMC PMC4682652. PMID 26538323. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4682652.
  7. Nowotny, H.; Testa, G. (2011). Naked Genes. The MIT Press. pp. 152. ISBN 9780262014939. https://mitpress.mit.edu/books/naked-genes. 
  8. Müller-Wille, S.W.; Rheinberger, H.J. (2012). A Cultural History of Heredity. University of Chicago Press. pp. 288. ISBN 9780226545721. http://www.press.uchicago.edu/ucp/books/book/chicago/C/bo8787518.html. 
  9. Barnes, B.; Dupré, J. (2009). Genomes and What to Make of Them. University of Chicago Press. pp. 288. ISBN 9780226172965. http://press.uchicago.edu/ucp/books/book/chicago/G/bo5705879.html. 
  10. Dupré, J. (2012). Processes of Life. Oxford University Press. pp. 320. ISBN 9780199691982. https://global.oup.com/academic/product/processes-of-life-9780199691982?cc=us&lang=en&. 
  11. Müller-Wille, S.W.; Rheinberger, H.J. (2017). The Gene: From Genetics to Postgenomics. University of Chicago Press. pp. 176. ISBN 9780226510002. http://press.uchicago.edu/ucp/books/book/chicago/G/bo20952390.html. 
  12. Bezuidenhout, L.; Rappert, B.; Leonelli, S. (2016). "Beyond the Digital Divide: Sharing Research Data across Developing and Developed Countries". figshare. https://figshare.com/articles/Beyond_the_Digital_Divide_Sharing_Research_Data_across_Developing_and_Developed_Countries/3203809/1. 
  13. Floridi, L.; Illari, P. (2014). The Philosophy of Information Quality. Springer International Publishing. pp. 315. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  14. Leonelli, S. (2016). "Locating ethics in data science: Responsibility and accountability in global and distributed knowledge production systems". Philosophical Transactions of the Royal Society A 374 (2083): 20160122. doi:10.1098/rsta.2016.0122. PMC PMC5124067. PMID 28336799. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124067.
  15. Mayer-Schönberger, V.; Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray. pp. 256. ISBN 9781848547926.
  16. Leonelli, S. (2014). "What Difference Does Quantity Make? On the Epistemology of Big Data in Biology". Big Data and Society 1 (1). doi:10.1177/2053951714534395. PMC PMC4340542. PMID 25729586. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340542.
  17. Calude, C.S.; Longo, G. (2016). "The Deluge of Spurious Correlations in Big Data". Foundations of Science 21: 1–18. doi:10.1007/s10699-016-9489-4.
  18. Morey, R.D.; Chambers, C.D.; Etchells, P.J. et al. (2016). "The Peer Reviewers' Openness Initiative: Incentivizing open research practices through peer review". Royal Society Open Science 3 (1): 150547. doi:10.1098/rsos.150547. PMC PMC4736937. PMID 26909182. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4736937.
  19. Leonelli, S.; Diehl, A.D.; Christie, K.R. et al. (2011). "How the gene ontology evolves". BMC Bioinformatics 12: 325. doi:10.1186/1471-2105-12-325. PMC PMC3166943. PMID 21819553. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166943.
  20. Blake, J.A.; Christie, K.R.; Dolan, M.E. et al. (2015). "Gene Ontology Consortium: Going forward". Nucleic Acids Research 43 (DB1): D1049-56. doi:10.1093/nar/gku1179. PMC PMC4383973. PMID 25428369. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383973.
  21. McDowall, M.D.; Harris, M.A.; Lock, A. et al. (2015). "PomBase 2015: Updates to the fission yeast database". Nucleic Acids Research 43 (DB1): D656-61. doi:10.1093/nar/gku1040. PMC PMC4383888. PMID 25361970. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383888.
  22. Kambatla, K.; Kollias, G.; Kumar, V.; Grama, A. (2014). "Trends in big data analytics". Journal of Parallel and Distributed Computing 74 (7): 2561-2573. doi:10.1016/j.jpdc.2014.01.003.
  23. Ankeny, R.A.; Leonelli, S. (2015). "Valuing Data in Postgenomic Biology: How Data Donation and Curation Practices Challenge the Scientific Publication System". Postgenomics: Perspectives on Biology after the Genome. Duke University Press. pp. 126–149. ISBN 9780822358947. https://www.dukeupress.edu/postgenomics. 
  24. Ankeny, R.A.; Leonelli, S. (2016). "Repertoires: A post-Kuhnian perspective on scientific change and collaborative research". Studies in History and Philosophy of Science 60: 18–28. doi:10.1016/j.shpsa.2016.08.003. PMID 27938718.
  25. Krohs, U. (2012). "Convenience experimentation". Studies in History and Philosophy of Biological and Biomedical Sciences 43 (1): 52-7. doi:10.1016/j.shpsc.2011.10.005. PMID 22326072.
  26. Leonelli, S.; Ankeny, R.A. (2015). "Repertoires: How to Transform a Project into a Research Community". BioScience 65 (7): 701–708. doi:10.1093/biosci/biv061. PMC PMC4580990. PMID 26412866. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4580990.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article had citations listed alphabetically; they are listed in the order they appear here due to the way the wiki works. In several cases, the original article cited sources inline (such as Primiero 2014 and Stegenga 2014) but failed to list them in the References section; these have been omitted here.