Journal:From biobank and data silos into a data commons: Convergence to support translational medicine

Full article title	From biobank and data silos into a data commons: Convergence to support translational medicine
Journal	Journal of Translational Medicine
Author(s)	Asiimwe, Rebecca; Lam, Stephanie; Leung, Samuel; Wang, Shanzhao; Wan, Rachel; Tinker, Anna; McAlpine, Jessica N.; Woo, Michelle M.M.; Huntsman, David G.; Talhouk, Aline
Author affiliation(s)	BC Cancer Research Centre, BC Children’s Hospital Research Institute, University of British Columbia, OVCARE
Primary contact	Email: a dot talhouk at ubc dot ca
Year published	2021
Volume and issue	19
Article #	493
DOI	10.1186/s12967-021-03147-z
ISSN	1479-5876
Distribution license	Creative Commons Attribution 4.0 International
Website	https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-021-03147-z
Download	https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf (PDF)

Abstract

Background: To drive translational medicine, modern day biobanks need to integrate with other sources of data (e.g., clinical, genomics) to support novel data-intensive research. Currently, vast amounts of research and clinical data remain in silos, held and managed by individual researchers, operating under different standards and governance structures; such a framework impedes sharing and effective use of data.

Methods: In this article, we describe the journey of British Columbia’s Gynecological Cancer Research Program (OVCARE) in moving a traditional tumor biobank, outcomes unit, and a collection of data silos into an integrated data commons to support data standardization and resource sharing under collaborative governance, as a means of providing the gynecologic cancer research community in British Columbia access to tissue samples and associated clinical and molecular data from thousands of patients.

Results: Through several engagements with stakeholders from various research institutions within our research community, we identified priorities and assessed infrastructure needs required to optimize and support data collections, storage, and sharing, under three main research domains: (1) biospecimen collections, (2) molecular and genomics data, and (3) clinical data. We further built a governance model and a resource portal to implement protocols and standard operating procedures (SOPs) for seamless collections management and governance of interoperable data, making genomic and clinical data available to the broader research community.

Conclusions: Proper infrastructure for data collection, sharing, and governance is a translational research imperative. We have consolidated our data holdings into a data commons, along with standardized operating procedures to meet research and ethics requirements of the gynecologic cancer community in British Columbia. The developed infrastructure brings together diverse data and computing frameworks, as well as tools and applications for managing, analyzing, and sharing data. Our data commons bridges data access gaps and barriers to precision medicine and approaches for diagnostics, treatment, and prevention of gynecological cancers by providing access to large datasets required for data-intensive science.

Keywords: biobanks, biospecimens, biobank technologies, precision medicine, data commons, laboratory information management systems, LIMS, federated systems, data governance

Background

The collection, storage, management, and distribution of human biospecimens for diagnostic pathology [1,2,3] can be traced as far back as the 1900s. [3] To meet research needs in the postgenomic era, modern day biorepositories [4] support scientists to derive disease-specific insights [5] by aiding the investigation of genetic underpinnings [6,7,8], elucidating etiology, and evaluating disease progression and therapeutic response; they are the backbone of precision medicine [9, 10], as well as biomedical and translational research. [1, 2, 11]

The last decade has seen advances in biotechnology such as next-generation sequencing (NGS), and the emergence of “omics” techniques for precision medicine (e.g., genomics, transcriptomics, proteomics, metabolomics, and epigenomics). These innovations coincided with breakthroughs in computing, artificial intelligence (AI), and analytics, enabling discrimination between diseases with greater precision. [12] This has created an unprecedented demand for high-quality biospecimens and associated data, including clinical, molecular, imaging, and other types of data generated during research. [11] Innovations in database, cloud storage, and computing infrastructures to support data-intensive science have further contributed to revolutionizing resources available to address modern research needs. [13, 14] As federated models for aggregating data and biomaterials have emerged as favored approaches for identifying enough patients with specific clinical or molecular features, the importance of interoperability between biobanks and related databases has been accentuated. [4, 7, 15] Specimen collections have become virtual [13], flexible, and interoperable, hosted on internationally harmonized infrastructures [7] and optimized for secondary research. [7, 13] Present-day research environments and needs have led to the development and implementation of data commons [16, 17], bringing together within a research community diverse data and computing infrastructures, as well as tools and applications for managing, analyzing, and sharing interoperable data. This has created an opportunity to maximize collaborations and to extend the value generated from primary data collection. [18]

In 2016, as part of British Columbia’s multidisciplinary gynecological cancer research team (OVCARE), we undertook a comprehensive review of the landscape of the data assets available within our local research environment and assessed the infrastructure needs required to support data storage and sharing within our research community. Herein, we describe the roadmap undertaken for the creation of a data commons, transforming a traditional tumor biobank and a collection of data silos into an integrated and comprehensive infrastructure to support the current and future research needs of an expanding team.

Results

Matching technical solutions to research needs

OVCARE started in 2000 as an initiative between the British Columbia Cancer Agency, the University of British Columbia, and the Vancouver Coastal Health Research Institute to accelerate research discoveries and their translation to clinical settings, as well as to improve the lives of women with ovarian cancer or those at risk. Today, OVCARE is an internationally recognized multidisciplinary team of physicians and scientists who are breaking new ground in improving diagnosis, prevention, and treatment of all gynecological cancers. [19,20,21,22,23,24,25,26,27]

OVCARE’s research has been powered by the gynecological tumor bank and the Cheryl Brown Gynecological Cancers Outcomes Unit. Through the course of research, a plethora of molecular and genomic data was generated and historically held by the researchers who generated it. Similarly, clinical data from chart reviews were obtained to support clinical studies and held by clinicians. These data were in incompatible formats that needed significant manual manipulation and curation to be integrated. Moreover, each collection was governed by different ethics agreements that restricted the use of data and kept it in silos. This was becoming a barrier to novel data-intensive research that requires the integration of multiple data sources; undertaking such projects was challenging, time consuming, and prone to errors. The OVCARE leadership recognized that current research needs were not being met through existing infrastructure.

A broad stakeholder engagement effort kicked off in 2016, with the objective of working with researchers, clinicians, scientists, and technicians at various institutions to map out a collective future vision, identify research needs, and rethink existing infrastructure. Engagements with key stakeholders identified research priorities, which were expanded into a list of fundamental requirements (Table 1) relevant to the collection and optimization of biospecimen, clinical, and molecular/genomics data, as well as a governance model for the resulting infrastructure. In addition to generating efficiency, limiting errors, and honoring patient consent, fundamental research requirements included maximizing secondary use of data, which enables data collected for one purpose to be reused in a completely different context. For example, records of chemotherapy drugs dispensed at our pharmacy are collected for administrative purposes (billing) but can also be linked with patient phenotype, genotype, and outcome to investigate which patients benefit most from these therapies. A second important need was to generate novel research hypotheses by simultaneously considering various data that could not previously be examined together; patterns that were not obvious before may emerge to drive future innovative research. A third was to use translational studies to help inform patient care, and to use data generated from patient care to ask new research questions, with the goal of continuously filling gaps in the understanding of disease etiology and progression. In upcoming sections, we describe these requirements in greater detail.

Table 1. Summary of fundamental research and infrastructural needs of OVCARE’s research community
	Fundamental research requirements
1	Generate efficiencies in data collection, storage, and analysis to maximize utility of collected data
2	Limit errors in data handling and ensure reproducibility of research findings
3	Protect patients’ privacy and honor their consent
4	Optimize secondary and continuous use of data generated from research and clinical care
5	Facilitate the recruitment of patients in various clinical studies
6	Identify specimens from patients with specific clinical, molecular, and genomic characteristics
7	Integration of medical and clinical data with molecular information to enable the discovery and testing of new associations and hypotheses towards translational research
8	Organize data towards a learning healthcare system where translation is bi-directional, meaning evidence-based research is used to inform practice, and the data generated during clinical care is in turn used to inform guidelines, generate hypotheses, and trigger pragmatic trials
	Functional and infrastructural IT requirements
1	Allow batch data imports and exports
2	Facilitate validation of data entered to minimize errors (e.g., returning an error message when text is entered instead of a numeric value)
3	Easy-to-use and customizable user interfaces
4	Support both prospective and retrospective data collection mechanisms
5	Adapt to changing needs between studies and projects, as well as over time
6	Track biospecimen locations, usage, and shipment to both local and offsite storage locations
7	Support multi-tenancy for the banking of biospecimens from distributed and diverse studies led by different investigators interested in sharing resources
8	Adherence to best practices in privacy and security, such as support for data encryption, audit trails on all user actions, and data changes for regulatory compliance; configurable user privileges; role-based access control; and adherence to federal regulations with respect to de-identification of specimens and tracking of consent
9	Support interoperability and integration with other institutions, systems, and data sources to facilitate data sharing
10	Potential to scale up biospecimen and user capacity at no added cost
11	Stable and mature vendor and community support
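
As a rough illustration of IT requirement 2 in Table 1 (returning an error message when text is entered instead of a numeric value), the following minimal Python sketch checks a data-entry record against an expected type per field. The field names and schema are hypothetical and are not taken from OVCARE's forms or from any particular LIMS.

```python
from typing import Any

# Hypothetical form schema: field name -> expected Python type.
# These field names are illustrative assumptions only.
SCHEMA = {"age_at_diagnosis": int, "tumor_size_mm": float, "histotype": str}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return human-readable error messages (an empty list means the record is valid)."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in record:
            errors.append(f"{name}: missing value")
            continue
        value = record[name]
        is_bool = isinstance(value, bool)  # bool is a subclass of int in Python
        # Accept whole numbers where a float is expected.
        if expected is float and isinstance(value, int) and not is_bool:
            continue
        if is_bool or not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    # Text entered in a numeric field produces an error message instead of being stored.
    print(validate_record(
        {"age_at_diagnosis": "sixty", "tumor_size_mm": 12.5, "histotype": "HGSC"}
    ))
```

Returning a list of messages, rather than raising on the first problem, lets a data-entry form report every invalid field at once.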

Biospecimen collection

OVCARE employs two models for biospecimen recruitment. The first is a general banking model, with broad scientific aims, and where specimens are obtained from consented participants and stored until needed. The second is a study-based banking model, where participants are recruited to address specific study aims, with a pre-defined protocol and pre-planned specimen collection. To accommodate both approaches, the biorepository infrastructure needed to manage accrual of specimens in a patient-centric approach, retain the context of the patient’s clinical history, and support basic biospecimen collection, storage, and distribution across multiple studies at different sites, under both recruitment models. This includes inventory control, the ability to track sample availability and location, as well as track generated derivatives (e.g., xenografts and organoids). The infrastructure needed to be adaptable to changing needs between studies and projects, as well as over time, with the ability to preserve the natural history of the data. Access control that varied for different user groups was a critical feature, enabling adherence to regulatory requirements and health research best practices. Data security, de-identification of specimens, and tracking of consent were also important for the same reasons, in addition to the need to operate and manage the biorepository with minimal support from institutional and research IT.
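
The patient-centric inventory, derivative tracking, and group-specific access control described above might be modeled along the following lines. This is only a sketch under stated assumptions: the role names, permission names, and record fields are hypothetical, and it does not reflect the data model of any specific LIMS.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "technician": {"view_specimen", "update_location"},
    "researcher": {"view_specimen"},
    "coordinator": {"view_specimen", "view_clinical", "update_location"},
}


@dataclass
class Specimen:
    specimen_id: str
    patient_id: str                   # keeps every record tied to its patient
    kind: str                         # e.g. "tumor tissue", "organoid", "xenograft"
    location: str                     # current storage location
    parent_id: Optional[str] = None   # set when the specimen is a derivative


def can(role: str, action: str) -> bool:
    """Role-based access check: unknown roles have no permissions."""
    return action in ROLE_PERMISSIONS.get(role, set())


def derivatives_of(specimen_id: str, inventory: list[Specimen]) -> list[Specimen]:
    """Return all direct and indirect derivatives of a specimen.

    Assumes parent links contain no cycles.
    """
    found = []
    frontier = [specimen_id]
    while frontier:
        current = frontier.pop()
        for specimen in inventory:
            if specimen.parent_id == current:
                found.append(specimen)
                frontier.append(specimen.specimen_id)
    return found
```

Storing a parent link on each derivative, rather than a list of children on the parent, means a new organoid or xenograft can be registered without modifying the record of the specimen it came from.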

We compiled a comprehensive list of requirements (Additional file 2: Table S1) from our stakeholder meetings, and we used it to guide our scan of the landscape of existing laboratory information management systems (LIMS) (Additional file 2: Tables S2–S11, and Fig. 1). This resulted in the identification of OpenSpecimen [28], a LIMS based on caTissue [29], a mature system with over 15 years of use by the research community. OpenSpecimen addressed more requirements from our list than the other options we considered. It is an open-source software solution with commercial support, in use by over 70 biobanks across 20 countries. The commercial support ensures ongoing software testing, updating, and continuous improvement. This is in addition to the availability of technical support, and access to a community of experienced users through active forums.

Figure 1. Needs-to-biobank mapping and the number of requirements fulfilled by each LIMS. a. Tiled plot of the mapping of each biospecimen research need to the biobank solution meeting that need. Surveyed biobanks are plotted on the y-axis and research needs (desired biobank features) are plotted on the x-axis, grouped and colored by feature class. b. Bar plot of the overall number of features provided by a specific LIMS. The LIMS solutions are plotted on the y-axis, and the number of features provided is plotted on the x-axis.
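
The tally summarized in panel b of Figure 1 amounts to counting, for each candidate system, how many requirements it meets. A minimal Python sketch follows; the system names and values are placeholders and are not the survey results reported in Additional file 2.

```python
# Hypothetical requirement-by-system matrix. "LIMS A/B/C" and the True/False
# values are placeholders, not the results of the survey described in the text.
FEATURES = {
    "LIMS A": {"batch import": True, "audit trail": True, "multi-tenancy": False},
    "LIMS B": {"batch import": True, "audit trail": True, "multi-tenancy": True},
    "LIMS C": {"batch import": False, "audit trail": True, "multi-tenancy": False},
}


def rank_by_requirements_met(features: dict[str, dict[str, bool]]) -> list[tuple[str, int]]:
    """Count the requirements each system meets and sort from most to fewest."""
    counts = {name: sum(met.values()) for name, met in features.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, count in rank_by_requirements_met(FEATURES):
        print(f"{name}: {count} requirement(s) met")
```

A plain count treats every requirement as equally important; weighting critical requirements more heavily would be a straightforward variation.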

In OpenSpecimen, biospecimens can be processed individually or in bulk, with rapid barcode-based scanning available to enter information on multiple patient samples at once. This enabled high-throughput processing and efficient migration from our legacy LIMS. Options for data annotation and storage management allowed us to optimize specimen storage, a costly resource in our research community (e.g., −80 °C freezers). [29]

The OpenSpecimen LIMS enabled customization of data entry forms via a graphical user interface (GUI, or web interface) to match study-specific needs without requiring software development. The platform met most of our IT requirements as it supported role-based access control and provided an audit trail of every user operation. [30] The system was also easy to use, with graphics-based queries that enabled searching for stored data about participants, biospecimens, or projects without requiring any programming, making moderately complex queries accessible to most users. Queries could also be performed via REST API (Representational State Transfer Application Programming Interface) using a SQL (Structured Query Language)-like query language. This facilitated automation of data downloads for analytics pipelines through the incorporation of query scripts.
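
A query script of the kind mentioned above could, in outline, prepare an HTTP request carrying the query text. The Python sketch below only builds the request and does not send it. The base URL, endpoint path, payload key, bearer-token authentication, and the query string are all assumptions made for illustration; they are not the documented OpenSpecimen API, and the vendor's API reference should be consulted for the real endpoint and query syntax.

```python
import json
import urllib.request

# ASSUMPTIONS: every value below is a placeholder invented for this sketch.
BASE_URL = "https://lims.example.org"   # hypothetical server
QUERY_PATH = "/rest/query"              # hypothetical endpoint path


def build_query_request(query_text: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a SQL-like query string."""
    payload = json.dumps({"query": query_text}).encode("utf-8")  # hypothetical payload shape
    return urllib.request.Request(
        url=BASE_URL + QUERY_PATH,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    # The query text is illustrative only, not valid syntax for any real system.
    request = build_query_request("select specimen.label where specimen.type = 'Tissue'", "TOKEN")
    print(request.get_method(), request.full_url)
    # An analytics pipeline would then send it, e.g. urllib.request.urlopen(request),
    # and write the response to disk for downstream processing.
```

Separating request construction from sending keeps the script testable without network access or credentials.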

The system enables standalone plugins through a software development kit. These plugins can be made publicly available to the community. For example, the tissue microarray (TMA) plugin can manage TMAs on OpenSpecimen by linking to donor blocks and describing details of experiments done on the different slices of the TMA blocks. Finally, interoperability with other systems was important to expand linkage within the data commons. The vendor provides integration with electronic data capture applications (REDCap, OpenClinica), electronic medical record (EMR) systems (EPIC, Velos), and pathology systems (CoPath, Cerner, Aperio), as well as systems using Health Level 7 (HL7) messages, a capability which can further support inclusion of participant and biospecimen information from distributed systems.
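
To give a sense of what consuming HL7 messages involves, the following Python sketch extracts a patient identifier and name from a synthetic HL7 v2-style message. It assumes the conventional v2 layout (segments separated by carriage returns, fields by `|`, components by `^`); the sample message is invented, and a production integration would rely on the vendor's integration module or a dedicated HL7 library rather than hand-rolled parsing.

```python
# Synthetic HL7 v2-style message, invented for illustration (not real patient data).
SAMPLE = "\r".join([
    "MSH|^~\\&|EMR|HOSPITAL|LIMS|BIOBANK|202101011200||ADT^A04|MSG0001|P|2.5",
    "PID|1||123456^^^HOSPITAL^MR||DOE^JANE||19700101|F",
])


def parse_segments(message: str) -> dict[str, list[str]]:
    """Map each segment name (e.g. "PID") to its list of pipe-delimited fields.

    Simplification: if a segment type repeats, only the last occurrence is kept.
    """
    segments = {}
    for line in message.split("\r"):
        if line:
            fields = line.split("|")
            segments[fields[0]] = fields
    return segments


def patient_summary(message: str) -> dict[str, str]:
    """Pull the identifier (PID-3) and name (PID-5) out of the PID segment."""
    pid = parse_segments(message)["PID"]
    identifier = pid[3].split("^")[0]          # first component of PID-3
    family, given = pid[5].split("^")[:2]      # PID-5 is family^given^...
    return {"patient_id": identifier, "family_name": family, "given_name": given}


if __name__ == "__main__":
    print(patient_summary(SAMPLE))
```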

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.
