Revision as of 17:14, 11 October 2020

Full article title Towards a contextual approach to data quality
Journal Data
Author(s) Canali, Stefano
Author affiliation(s) Leibniz University Hannover
Primary contact Email: stefano dot canali at philos dot uni-hannover dot de
Year published 2020
Volume and issue 5(4)
Article # 90
DOI 10.3390/data5040090
ISSN 2306-5729
Distribution license Creative Commons Attribution 4.0 International
Website https://www.mdpi.com/2306-5729/5/4/90/htm
Download https://www.mdpi.com/2306-5729/5/4/90/pdf (PDF)

Abstract

This essay delves into the need for a framework for approaching data quality in the context of scientific research. First, the concept of "quality" as a property of information, evidence, and data is presented, and research in the philosophy of information, science, and biomedicine is reviewed. Based on this review, the essay argues for a more purpose-dependent and contextual approach to data quality in scientific research, whereby the quality of a dataset depends on the context of use of the dataset as much as on the dataset itself. The rationale for the approach is then exemplified through a discussion of current critiques and debates on scientific quality, showcasing how data quality can be approached contextually.

Keywords: research data management, scientific epistemology, data quality, FAIR, reproducibility crisis

Introduction

Determining the quality of scientific data is a task of key importance for any research project and involves considerations at conceptual, practical, and methodological levels. The task has arguably become even more pressing in recent years, as a result of the ways in which the volume, variety, value, volatility, veracity, and validity of scientific data have changed with the rise of data-intensive methods in the sciences.[1] At the start of the last decade, many commentators argued that these changes would bring dramatic shifts to the scientific method and would per se make science better, thanks to fully automated reasoning, more data-driven methods, less theorizing, and more objectivity.[2] However, analyses of the use of data-intensive methods in the sciences have shown that the feasibility and benefits of these methods are not automatic results of these changes, but crucially rest upon the transparency, validity, and quality of data practices.[3] As a consequence, there are currently various attempts at implementing guidelines to maintain and promote the quality of datasets, developing ways and tools to measure it, and conceptualizing the notion of quality.[4][5][6]

This essay focuses on the latter line of research and discusses the following question: what are high-quality data? At the essay's core is a framework for data quality that suggests a contextual approach, whereby quality should be seen as a result of the context where a dataset is used, and not only of the intrinsic features of the data. This approach is based on the integration of philosophical discussions on the quality of data, information, and evidence. The next section begins by reviewing analyses of quality in different areas of philosophical research, particularly in the philosophy of information, science, and biomedicine. Then, shared results from this review are identified and integrated, with those results arguably pointing towards the need for a contextual approach. A discussion of what the approach entails and how it can be used in practice follows, looking at current debates on quality in the scientific and philosophical literature. Finally, the conclusion summarizes the discussion and proposes directions for future research.

Quality as a property of information, evidence, and data

Quality has been discussed in areas of philosophical work highly engaged with research practices and debates in the sciences. In this context, three main areas of research can be identified whose results are particularly significant for conceptualizations of quality and yet have only partially been applied to issues in data quality. These results, and their integration, represent important contributions to more general and interdisciplinary discussions on data quality. As such, this essay proposes that quality can be discussed as a property of three closely related notions: information, data, and evidence.

First, research on quality has traditionally focused on information quality, which became prominent in computer science in the 1990s. In this context, an influential line of research started to move beyond traditional interpretations of quality in terms of solely accuracy, developing a multi-dimensional and purpose-dependent view whereby a piece of information is of high quality insofar as it is fit for a certain purpose.[7] This line of research has developed into two main approaches since the 1990s: surveying the opinions and definitions of academics and practitioners from an “empirical” point of view; and studying the different dimensions of quality and the interrelations between these from a theoretical and “ontological” perspective.[8] The empirical approach has expanded conceptualizations of information quality to include not only traditional dimensions such as accuracy, but also objectivity, completeness, relevance, security, access, and timeliness; here, the goal has primarily been to categorize these dimensions, rather than to define them.[9] On the other hand, the goal of the ontological approach has been to understand how to connect different dimensions of information quality (such as those surveyed through the empirical approach[10]) and conceptualize and measure potential disconnections as errors.[11]

These discussions have been picked up and analyzed in the area of research known as "philosophy of information." According to Phyllis Illari and Luciano Floridi, computer science has not fully embraced the purpose-dependent approach to information quality in all of its implications, and theoretical understandings of information quality are still in search of a way of applying the approach to concrete contexts.[6] With these problems and goals in mind, Illari has suggested that information quality suffers from a "rock-and-a-hard-place" problem.[12] While information quality is defined as information that is fit for purpose, many still think that some aspects and dimensions of information quality should be independent of specific purposes (the rock). At the same time, there is a sense in which quality should make information fit for multiple if not all purposes; a piece of information that is fit for a specific purpose, but not for others, will not be considered of high quality (the hard place). As a way of going beyond the impasse, Illari has argued that we should classify information quality on the basis of a relational model, which links the different dimensions of quality to specific purposes and uses.[12] Therefore, Illari conceives of quality as a property of information that is highly dependent on its context, i.e., the specific uses, aims, and purposes we want to employ a piece of information for. In other words, quality cannot be independent of fitness for a specific purpose, and cannot consist in a single fit-for-any-purpose.

A similar push for the purpose-dependent and contextual approach can be identified in a second area of philosophical analyses, which have more specifically focused on the use of data in the context of scientific practice. The increasing volume and variety of data used in the sciences—with related and differing levels of veracity, validity, volatility, and value—have created a number of potential benefits as well as challenges for scientific epistemology.[13] Determining and assessing quality is one of the main challenges of data-intensive science because of the diversity of data sources and integration practices, the often short “timespan” and relevance of data, the difficulties of providing quality assessments and evaluations in a timely manner, and the overall lack of unified standards.[4]

Partly as a result of these shifts, philosophers of science have recently expanded their focus on data as an important component of scientific epistemology.[14] In this context, some analyses have focused on the tools that are used to calibrate, standardize, and assess the quality of data in the sciences. For instance, data quality assessment tools are often applied to clinical studies, in the form of scales or checklists about specific aspects of the study, with the goal of checking whether the study, e.g., makes use of specific statistical methods, sufficiently describes subject withdrawal, etc. According to Jacob Stegenga, there are two main issues affecting the use of these tools in the biomedical context: a poor level of inter-rater reliability, i.e., different users of the tools achieve different instead of similar results; and a low level of inter-tool reliability, i.e., different types of tools give different instead of similar results when assessing the same study.[15] Stegenga has argued that this can be conceptualized as a result of the underdetermination of the evidential significance of data: there is no uniquely correct way of estimating information quality, and different results will always be obtained in relation to the context, users, and type of study. These results can be interpreted in similar terms to the aforementioned analysis by Illari[12], as pointing to the crucial role that the context where data are analyzed and used plays in determining their quality. Quality is not an intrinsic property of data that only depends on the characteristics of the data itself: quality will differ depending on contextual features, such as the tools used to assess quality, who uses them, their purposes, etc.


References

  1. Leonelli, S. (2020). "Scientific Research and Big Data". Stanford Encyclopedia of Philosophy Archive (Summer 2020). https://plato.stanford.edu/archives/sum2020/entries/science-big-data/. 
  2. Canali, S. (2016). "Big Data, epistemology and causality: Knowledge in and knowledge out in EXPOsOMICS". Big Data & Society 3 (2). doi:10.1177/2053951716669530. 
  3. Leonelli, S. (2014). "What difference does quantity make? On the epistemology of Big Data in biology". Big Data & Society 1 (1). doi:10.1177/2053951714534395. 
  4. Cai, L.; Zhu, Y. (2015). "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era". Data Science Journal 14: 2. doi:10.5334/dsj-2015-002. 
  5. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC 4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  6. Illari, P.; Floridi, L. (2014). "Chapter 2: Information Quality, Data and Philosophy". In Floridi, L.; Illari, P. (eds.). The Philosophy of Information Quality. Springer International Publishing. pp. 5–23. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  7. Wang, R.Y.; Reddy, M.P.; Kon, H.B. (1995). "Toward quality data: An attribute-based approach". Decision Support Systems 13 (3–4): 349–72. doi:10.1016/0167-9236(93)E0050-N. 
  8. Wang, R.Y. (1998). "A product perspective on total data quality management". Communications of the ACM 41 (2): 58–65. doi:10.1145/269012.269022. 
  9. Batini, C.; Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques. Springer. ISBN 9783540331728. 
  10. Wand, Y.; Wang, R.Y. (1996). "Anchoring data quality dimensions in ontological foundations". Communications of the ACM 39 (11): 86–95. doi:10.1145/240455.240479. 
  11. Primiero, G. (2014). "Chapter 7: Algorithmic Check of Standards for Information Quality Dimensions". In Floridi, L.; Illari, P. (eds.). The Philosophy of Information Quality. Springer International Publishing. pp. 107–34. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  12. Illari, P. (2014). "Chapter 14: IQ: Purpose and Dimensions". In Floridi, L.; Illari, P. (eds.). The Philosophy of Information Quality. Springer International Publishing. pp. 281–301. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  13. Leonelli, S.; Tempini, N. (2020). Data Journeys in the Sciences. Springer. doi:10.1007/978-3-030-37177-7. ISBN 9783030371777. 
  14. Leonelli, S. (2016). Data-Centric Biology: A Philosophical Study. University of Chicago Press. ISBN 9780226416502. 
  15. Stegenga, J. (2013). "Down with the Hierarchies". Topoi 33: 313–22. doi:10.1007/s11245-013-9189-4. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.