Journal:Data sharing at scale: A heuristic for affirming data cultures
|Full article title||Data sharing at scale: A heuristic for affirming data cultures|
|Journal||Data Science Journal|
|Author(s)||Poirier, Lindsay; Costelloe-Kuehn, Brandon|
|Author affiliation(s)||University of California - Davis, Rensselaer Polytechnic Institute|
|Primary contact||Email: lnpoirier at ucdavis dot edu|
|Volume and issue||18(1)|
|Distribution license||Creative Commons Attribution 4.0 International|
Addressing the most pressing contemporary social, environmental, and technological challenges will require integrating insights and sharing data across disciplines, geographies, and cultures. Strengthening international data sharing networks will not only demand advancing technical, legal, and logistical infrastructure for publishing data in open, accessible formats; it will also require recognizing, respecting, and learning to work across diverse data cultures. This essay introduces a heuristic for pursuing richer characterizations of the “data cultures” at play in international, interdisciplinary data sharing. The heuristic prompts cultural analysts to query the contexts of data sharing for a particular discipline, institution, geography, or project at seven scales: the meta, macro, meso, micro, techno, data, and nano. The essay articulates examples of the diverse cultural forces acting upon and interacting with researchers in different communities at each scale. The heuristic we introduce in this essay aims to elicit from researchers the beliefs, values, practices, incentives, and restrictions that impact how they think about and approach data sharing. Rather than represent an effort to iron out differences between disciplines, this essay instead intends to showcase and affirm the diversity of traditions and modes of analysis that have shaped how data gets collected, organized, and interpreted in diverse settings.
Keywords: data sharing, data culture, ethnography, data friction, metadata
In the 1980s, the European Organization for Nuclear Research (CERN) was the most prominent particle physics laboratory in the world and at the cutting edge of coordinating international scientific research. Herwig Schopper, Director-General for CERN, 1981–1988, describes the time as provoking “a new ‘sociology’ for international scientific collaboration”; with over 30 countries participating in experiments, the challenges for keeping track of researchers, workflows, and scientific data were enormous.
CERN hired Tim Berners-Lee as a contract programmer in 1980. To help keep track of projects, he toyed with designing Enquire, a knowledge organization system that enabled users to organize their data by creating links between documents stored in separate locations. Berners-Lee landed a fellowship in the Data Acquisition and Control division in 1983, a time when CERN was upgrading its computing infrastructure to better network globally distributed researchers in laboratories that each followed their own methods, used their own operating systems, and often spoke different languages. In describing the systems that were proposed for addressing these challenges, Berners-Lee writes:
I had seen numerous developers arrive at CERN to tout systems that “helped” people organize information. They’d say, “To use this system all you have to do is divide all your documents into four categories” or “You just have to save your data as a WordWonderful document” or whatever. I saw one protagonist after the next shot down in flames by indignant researchers because the developers were forcing them to reorganize their work to fit the system. I would have to create a system with common rules that would be acceptable to everyone. This meant as close as possible to no rules at all.
The challenge was not to compel researchers to adopt a new standard; instead the challenge was learning to recognize and respect the different data cultures that guided how diverse researchers approached their work.[a] Berners-Lee’s Enquire eventually evolved into a proposal for what became the World Wide Web, perhaps the most widely adopted information infrastructure in the world, in large part because the system has very few rules prescribing how users should organize their knowledge within it.
Today, we are contending with a sociology for international scientific collaboration on a much larger scale. Addressing the most pressing contemporary social, environmental, and technological challenges will require integrating insights and sharing data across disciplines, geographies, and cultures. Research into the socio-technical challenges of data sharing has begun to characterize complications that arise as researchers in different communities work to align their data cultures. The process of integrating complex and heterogeneous data generated in different geographies, according to different disciplinary standards, and motivated by different epistemic commitments and incentive structures, can produce “friction,” demanding that researchers make compromises to find common ground. Different disciplines may speak different “languages,” making it difficult to devise shared schemas and ontologies. Perhaps most notably, researchers in different settings often have diverse rationales for valuing data preservation, contextualization, integration, and dissemination. Strengthening international data sharing networks will not only demand advancing technical, legal, and logistical infrastructure for publishing data in open, accessible ways; it will also require recognizing, respecting, and learning to work across diverse data cultures. As Berners-Lee observed of collaborative research practice at CERN in the 1980s, prescriptively forcing researchers to reorganize their work to fit a standard limits adoption and collaboration. This essay, informed by our work exploring diverse data sharing communities at the Research Data Alliance (RDA), will introduce a heuristic we’ve developed in order to pursue richer characterizations of the “data cultures” at play.
The Research Data Alliance (RDA) is an international community of researchers aiming to design and sustain the socio-technical infrastructure needed to enable open research data sharing across geographies and disciplines. We became involved in the RDA as cultural anthropologists looking to advance frameworks for data sharing in our own field. However, we found that RDA’s bi-annual plenaries are rich ethnographic field-sites for examining how commitments guiding scientific practice inform how data gets produced, structured, and semantically-enriched, and how researchers in different communities think about, value, and practice data sharing.[b] We have been developing the heuristic presented in this paper since 2017, when we were asked, as ethnographers of data practice, to collaborate in a session at the RDA Plenary 10 in Montreal on addressing barriers to adoption of RDA outputs. We devised a series of questions that researchers might ask themselves in considering the aims, assumptions, and commitments brought to their work, their congruence with the RDA output, and the infrastructure and incentives available for enabling adoption. We revised the questions in 2019 for a workshop addressing the socio-technical challenges of international and interdisciplinary data sharing.
Data sharing at scale
Global, interdisciplinary data sharing is a cultural system, one that assembles many actors, institutions, technologies, and frameworks. It is a system animated by a diverse set of forces operating at many different locations and across many different scales. Understanding this system will demand that we learn to simultaneously observe the multiple forces acting upon and interacting with researchers and the data they produce.
Scale has historically referred to many things. For instance, geographers may refer to spatial scales that designate different geopolitical boundaries; computer scientists may refer to nested IT infrastructures (e.g., data, computers, networks, the internet). Scholars studying the history and social dynamics of information infrastructures have shown how examining data systems across macro, meso, and micro scales of society can reveal the complexity of sociotechnical forces in shaping knowledge, modernity, and computing history. When we refer to scale in the context of cultural systems, we aim to evoke the frames of reference (diverse in their breadth and modes of cultural ordering) that anthropologists have crafted to examine how culture is enacted and mediated. When we talk of frames of reference, we are not referring to particular places, contexts, or phenomena, but rather to focusing devices that order what the ethnographer pays attention to (e.g., customs, politics, discourse). Kim Fortun, a cultural anthropologist who has written extensively on theories of ethnographic practice, argues that “scale is a heuristic, which, like all heuristics, provides a way of seeing that frames and orients perspective. At its best, scale provides a way to see many types of action in motion at once, evoking a sense of the system at hand.” Fortun proposes seven “strata” (including meta, macro, meso, micro, techno, nano, and natural) for guiding cultural analysis and argues that examining cultural systems across these scales can help constitute the “meta-data” needed to make sense of the systems.
In this paper, we adapt Fortun’s approach in order to outline a heuristic for examining the data cultures of different research communities. We do so bearing in mind that all heuristics for framing perspective delimit insight, and that phenomena are constantly crossing these scales that are only analytically (and uneasily) separable. Further, the issues that emerge within each stratum involve cultural forces playing out across diverse spatial, temporal, and infrastructural scales. Data cultures can be characterized as an accumulation of phenomena oscillating between various sites at different times in the context of the different cultural frames of reference we outline below. While the heuristic we introduce provides but one way of unpacking complex cultural systems, it does serve as a starting point, enabling comparative perspective between field-sites, informing examinations of how cultural systems evolve, and signposting the various forces that can impact data sharing. The heuristic we outline below (Figure 1) can serve as a template for querying researchers and examining data cultures within the context of a particular discipline, institution, geography, or project.[c]
Meta-level analysis, or the way forces characterized by the scales below get talked about, queries the dominant discourse and counter-narratives guiding how a community values data sharing. Different communities have differently prioritized investments in open science and the data infrastructure needed to support it. While advocates in all disciplines may struggle to communicate the value of open science to their administrators, funders, and peers, the conversation has advanced much further in certain disciplines. Early policies regarding data sharing in the Human Genome Project propelled discourse around open science to the forefront in genomics research, setting global expectations that sequencing data would be publicly available. Meta-level analysis may also consider how geopolitics shape different discourses around open science. In low- and middle-income countries, some researchers have voiced concerns that opening data can exacerbate existing global inequalities by ushering in new opportunities for extraction.
Macro: Legal, political-economic, and financial
Macro-level analysis attends to the financial and legal structures that support the work of data sharing communities and organizations. While funders globally are increasingly requiring researchers to deposit data in public access repositories, their willingness to fund data infrastructure development and maintenance differs drastically. For instance, the European Union has consistently prioritized open science in its research and innovation funding programs (including Horizon 2020 and Horizon Europe), supporting the propagation of consulting bodies such as GoFAIR and the European Research Infrastructures Initiative, as well as domain-specific bodies such as DARIAH and the European Marine Biological Resource Center. In other parts of the world, financial resources supporting research infrastructure are not as readily available.
Macro-level analysis also considers differences in legal structures that disciplines operate within. For instance, social scientists must comply with country-specific research ethics laws, and public health researchers must comply with country-specific laws for safeguarding a patient’s privacy with respect to medical information. Such legislation restricts the degree to which researchers can engage in interdisciplinary data sharing.
Meso-level analysis focuses attention on organizations and networks. Ethnographic research examining barriers to data sharing and collaborative practice has often focused here, examining how a lack of incentives for sharing data, engaging in collaborative work, and investing time in the design and maintenance of research infrastructure has inhibited participation among some communities in data sharing networks. Notably, this dearth of impetus is felt disproportionately in certain disciplines (such as those where collaborative publications are discouraged), in certain institutions (such as those that do not count the design and maintenance of research infrastructure as “service” in tenure cases), and at different career stages. Early career researchers, for instance, are often concerned that publishing their data in open repositories may lead to their research findings being “scooped up” before they have an opportunity to establish their credibility as scholars.
Micro: Research practices and customs
Micro-level analysis focuses attention on customs and practice, including both data practice and research practice. For instance, Broom et al. argue that data sharing imperatives can undermine qualitative researchers’ understanding of what it means to “do” qualitative research. Since many qualitative researchers understand their data to be deeply contextually situated and to contain indeterminacies that only an “expert” eye can perceive, they have voiced concern that adopting data publishing into their workflows erodes the integrity of doing qualitative research.[d] On the other hand, in some research communities, data sharing procedures have already fundamentally reshaped and advanced research practice. Leonelli and Ankeny have documented how the development of community databases for model organisms has provoked a shift in biological research practice away from single species analysis towards more comparative, cross-species research. Micro-level analysis also considers the time researchers can devote to implementing data sharing protocols.
Techno-level analysis attends to the availability, accessibility, and fitness of technologies and data standards for supporting data sharing practices. In certain research communities, suites of data sharing technologies and standards have already been networked into data repositories that can support domain-specific data publishing and management without requiring advanced data infrastructure expertise on the part of data depositors. While certain communities can submit their data to domain-specific data repositories (such as the earth data repositories networked through DataONE) or region-specific data repositories (such as Research Data Australia), researchers in other data domains and regions do not necessarily have access to infrastructure designed specifically to meet their data management needs. Further, at some institutions, librarians and information scientists are more readily available for helping researchers address the technical challenges of implementing data infrastructure to support sharing and management. The California Digital Library, for instance, supports the entire University of California system in addressing challenges of data curation and preservation.
Data: Data architecture
Data-level analysis focuses attention on data architecture and configuration and the extent to which the logic of data sharing standards embodies the assumptions that different communities bring to their research practice. Some disciplines, such as evolutionary biology and chemistry, have a long tradition of grouping and sorting entities according to various taxonomic systems, while in other research domains the very act of discretizing knowledge runs counter to researchers’ epistemological commitments. For instance, Pulsifer et al. have shown how designing formal data management processes for documenting Inuit knowledge posed a unique set of challenges because the complexity and dynamism of relationships characterized in indigenous narratives are not amenable to the standardized data practices Western science promotes. At the data level, we consider the ways in which the complexity of a community’s context-specific knowledge inevitably gets transformed to fit into data sharing infrastructures (because all infrastructures structure and delimit entities and their flows), and how new data infrastructures can be designed to better represent diverse knowledge forms.
Nano: Individual beliefs and values
Nano-level analysis focuses attention on the embodied beliefs researchers bring to data sharing practice, e.g., why they value data sharing and what they hope to get out of collaboration.[e] For instance, in many of the natural sciences, an oft-cited motivator for advancing robust data sharing infrastructure is to confront a crisis of the scientific method: that a great deal of published research has not been documented in such a way that its results can be reproduced. The humanities, however, are typically guided by a different set of motivations; in cultural anthropology, for instance, sharing data and encouraging its reuse can help ensure that interpretations of cultural objects are not univocal but instead represent an array of perspectives. Unpacking the ideological underpinnings of different data cultures highlights the fallacy of designing one-size-fits-all solutions. To advance global and interdisciplinary data sharing in an inclusive way, we need infrastructures and policies that affirm the wide array of stakes that animate different research communities’ investment in data sharing and open science.
Conclusion: Affirming culture
We all tend to treat diversity as a problem. It’s here to stay and it’s beautiful.
—Dr. Devika P. Madalli, RDA Technical Advisory Board and Consultant, RDA Plenary 13, Philadelphia, PA (April 2019)
Over the past five years, we have found ourselves in numerous data sharing workshops, meetings, and plenaries where “culture” gets cast as a problem to be fixed. We have heard folks say “if only we could get everyone to speak the same language” or “if only we can align different data sharing cultures.” However, in working to tame culture in data sharing practices, there is great risk that the nuances that make interdisciplinary research so robust and appropriate for tackling complex, multi-scaled, and multi-dimensional problems will be eclipsed. The heuristic we introduce in this essay aims to elicit from researchers the beliefs, values, practices, incentives, and restrictions that impact how they think about and approach data sharing. This is not an effort to iron out differences between disciplines, but instead to showcase and affirm the diversity of traditions and modes of analysis that have shaped how data gets collected, organized, and interpreted in diverse settings.
This is a key and often overlooked component within efforts to design and implement data sharing policy and infrastructure. While designers of data sharing infrastructure often attempt to gather feedback from diverse domain communities when developing new standards, tools, or frameworks, we have found that they attempt to structure the feedback in ways that control for difference. For instance, designers may distribute use case templates that ask representatives in different data domains to outline scenarios, triggers, motivations, goals, costs, and risks involved in a particular data practice. While the structure of the document allows the designers to compare and contrast key factors of a data practice across communities, it also presets the conditions for comparative perspective in ways that can eclipse more fundamental differences. This becomes evident when analyzing how researchers value data in the first place, how they leverage theory, what they hope to gain through collaboration, the assumptions they have about language and representation, and the unique historical and institutional conditions that have shaped their communities. These considerations can have a profound effect on how data sharing practices get taken up in different settings. Studying data cultures at scale can help to foreground these often neglected considerations, animating capacity to design data sharing infrastructure and policies that are not only acceptable to everyone, but also affirm and respect the diversity of cultures that guide global and interdisciplinary research practice.
The ideas developed in this essay were supported through the RDA/US Data Share fellowship sponsored through a grant from the Alfred P. Sloan Foundation. We also thank Kim Fortun and Mike Fortun for helping us scope our involvement in RDA and reviewing multiple iterations of this heuristic.
The authors have no competing interests to declare.
- We refer to “researchers” here quite expansively to denote any individual involved in the collection, designation, analysis, stewardship, and/or use of empirical data. This may refer to scientists, humanists, industrial analysts, and government actors in a variety of locations.
- In positioning the RDA as a field-site, we have methodologically employed what Fassin and Rechtman refer to as “observant participation” in the study of data cultures. Ethnographic observation has come second to and has been inflected by our own participation in the organization. This research was carried out under the approval of Institutional Review Boards at Rensselaer Polytechnic Institute and the University of California Davis.
- As RDA working groups plan for the design of new data sharing infrastructure, this heuristic may be used as a template for interviews or surveys designed to elicit from diverse communities the data cultures that shape their thinking, their practice, and the resources available to them. For communities seeking to adopt RDA outputs, the heuristic may be a tool to help analyze and make sense of the diverse cultural forces that shape possibilities for infrastructural implementation.
- Ann Zimmerman similarly demonstrates how the locally-situated knowledge ecologists acquire in fieldwork can be difficult to translate into public data through available standards and thus often gets left behind.
- Of all the scales presented here, the distinction between the first (meta/discursive) and the last (nano/individual beliefs and values) is perhaps the most difficult to stabilize. To understand this particular crossing of scale, we find Althusser’s concept of interpellation useful, which in its simplest terms involves the processes by which ideology (embodied in the various scales presented here) conditions and even constitutes individual subjects’ identities, beliefs, and values.
- Schopper, H. (28 March 2014). "The 1980s: spurring collaboration". CERN Courier. https://cerncourier.com/a/viewpoint-the-1980s-spurring-collaboration/.
- Berners-Lee, T. (2000). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. HarperCollins. p. 15. ISBN 9780062515872.
- Borgman, C.L. (2012). "The conundrum of sharing research data". Journal of the American Society for Information Science and Technology 63 (6): 1059–78. doi:10.1002/asi.22634.
- Edwards, P.N.; Mayernik, M.S.; Batcheller, A.L. et al. (2011). "Science friction: Data, metadata, and collaboration". Social Studies of Science 41 (5): 667–690. doi:10.1177/0306312711413314.
- Fassin, D.; Rechtman, R.; Gomme, R. (2009). The Empire of Trauma: An Inquiry into the Condition of Victimhood. Princeton University Press. p. 11. ISBN 9780691137537.
- Edwards, P.N. (2004). "Infrastructure and Modernity: Force, Time, and Social Organization in the History of Sociotechnical Systems". In Misa, T.J.; Brey, P.; Feenberg, A.. Modernity and Technology. MIT Press. pp. 185–266. ISBN 9780262633109.
- Acker, A. (2015). "Toward a Hermeneutics of Data". IEEE Annals of the History of Computing 37 (3): 70–75. doi:10.1109/MAHC.2015.68.
- Fortun, K. (2009). "Scaling and Visualizing Multi-sited Ethnography". In Falzon, M.-A.. Multi-Sited Ethnography: Theory, Praxis and Locality in Contemporary Research. Routledge. pp. 75–6. ISBN 9780754673187.
- Kaye, J.; Heeney, C.; Hawkins, N. et al. (2009). "Data sharing in genomics — Re-shaping scientific practice". Nature Reviews Genetics 10: 331–35. doi:10.1038/nrg2573.
- Serwadda, D.; Ndebele, P.; Grabowski, M.K. et al. (2018). "Open data sharing and the Global South—Who benefits?". Science 359 (6376): 642–43. doi:10.1126/science.aap8395.
- van Panhuis, W.G.; Paul, P.; Emerson, C. et al. (2014). "A systematic review of barriers to data sharing in public health". BMC Public Health 14: 1144. doi:10.1186/1471-2458-14-1144.
- Bahlai, C.; Bartlett, L.J.; Burgio, K.R. et al. (2019). "Open Science Isn't Always Open to All Scientists". American Scientist 107 (2): 78. doi:10.1511/2019.107.2.78. https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists.
- Broom, A.; Cheshire, L.; Emmison, M. (2009). "Qualitative Researchers’ Understandings of Their Practice and the Implications for Data Archiving and Sharing". Sociology 43 (6): 1163–80. doi:10.1177/0038038509345704.
- Zimmerman, A.S. (2008). "New Knowledge from Old Data: The Role of Standards in the Sharing and Reuse of Ecological Data". Science, Technology, & Human Values 33 (5): 631–52. doi:10.1177/0162243907306704.
- Leonelli, S.; Ankeny, R.A. (2012). "Re-thinking organisms: The impact of databases on model organism biology". Studies in History and Philosophy of Science Part C 43 (1): 29–36. doi:10.1016/j.shpsc.2011.10.003.
- Acord, S.K.; Harley, D. (2012). "Credit, time, and personality: The human challenges to sharing scholarly work using Web 2.0". New Media & Society 15 (3): 379–397. doi:10.1177/1461444812465140.
- Tenopir, C.; Allard, S.; Douglass, K. et al. (2011). "Data Sharing by Scientists: Practices and Perceptions". PLoS One 6 (6): e21101. doi:10.1371/journal.pone.0021101.
- Pulsifer, P.L.; Laidler, G.J.; Taylor, D.R.F. et al. (2011). "Towards an Indigenist data management program: reflections on experiences developing an atlas of sea ice knowledge and use". The Canadian Geographer 55 (1): 108–24. doi:10.1111/j.1541-0064.2010.00348.x.
- Althusser, L. (1971). "Ideology and Ideological State Apparatuses (Notes towards an Investigation)". Lenin and Philosophy and other essays. Monthly Review Press. pp. 127–86. ISBN 0902308122.
- Jasny, B.R.; Chin, G.; Chong, L. et al. (2011). "Again, and Again, and Again …". Science 334 (6060): 1225. doi:10.1126/science.334.6060.1225.
- Poirier, L.; Fortun, K.; Costelloe-Kuehn, B. et al. (4 July 2019). "Metadata, Digital Infrastructure, and the Data Ideologies of Cultural Anthropology". PECE. https://worldpece.org/content/metadata-digital-infrastructure-and-data-ideologies-cultural-anthropology.
This presentation is faithful to the original, with only a few minor changes to presentation and grammar. In some cases important information was missing from the references, and that information was added. The original article had citations listed alphabetically; they are listed in the order they appear here due to the way the wiki works. To more easily differentiate footnotes from references, the original footnotes (which were numbered) were updated to use lowercase letters.