Difference between revisions of "Journal:From biobank and data silos into a data commons: Convergence to support translational medicine"

From LIMSWiki
(Saving and adding more.)
m (→‎Notes: Corrected tag)
(5 intermediate revisions by the same user not shown)
Line 19: Line 19:
|download    = [https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf] (PDF)
|download    = [https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf] (PDF)
}}
}}
{{ombox
| type      = notice
| image    = [[Image:Emblem-important-yellow.svg|40px]]
| style    = width: 500px;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.
}}
==Abstract==
==Abstract==
'''Background''': To drive [[Translational research|translational medicine]], modern-day [[biobank]]s need to integrate with other sources of data (e.g., [[Health informatics|clinical]], [[genomics]]) to support novel data-intensive research. Currently, vast amounts of research and clinical data remain in silos, held and managed by individual researchers, operating under different standards and governance structures; such a framework impedes sharing and effective use of data.
'''Background''': To drive [[Translational research|translational medicine]], modern-day [[biobank]]s need to integrate with other sources of data (e.g., [[Health informatics|clinical]], [[genomics]]) to support novel data-intensive research. Currently, vast amounts of research and clinical data remain in silos, held and managed by individual researchers, operating under different standards and governance structures; such a framework impedes sharing and effective use of data.
Line 38: Line 31:


==Background==
==Background==
The collection, storage, management, and distribution of human biospecimens for diagnostic pathology [1,2,3] can be traced as far back as the 1900s. [3] To meet research needs in the postgenomic era, modern day biorepositories [4] support scientists to derive disease-specific insights [5] by aiding the investigation of genetic underpinnings [6,7,8], elucidating etiology, and evaluating disease progression and therapeutic response; they are the backbone of precision medicine [9, 10], as well as biomedical and [[translational research]]. [1, 2, 11]
The collection, storage, management, and distribution of human biospecimens for diagnostic pathology<ref name=":0">{{Cite journal |last=Vaught |first=Jim |date=2016-01-06 |title=Biobanking Comes of Age: The Transition to Biospecimen Science |url=https://www.annualreviews.org/doi/10.1146/annurev-pharmtox-010715-103246 |journal=Annual Review of Pharmacology and Toxicology |volume=56 |issue=1 |pages=211–228 |doi=10.1146/annurev-pharmtox-010715-103246 |issn=0362-1642}}</ref><ref name=":1">{{Cite journal |last=Vaught |first=Jim |last2=Kelly |first2=Andrea |last3=Hewitt |first3=Robert |date=2009-09 |title=A Review of International Biobanks and Networks: Success Factors and Key Benchmarks |url=https://doi.org/10.1089/bio.2010.0003 |journal=Biopreservation and Biobanking |volume=7 |issue=3 |pages=143–150 |doi=10.1089/bio.2010.0003 |issn=1947-5535 |pmc=PMC4046743 |pmid=24835880}}</ref><ref name=":2">{{Cite book |last=Eiseman, E.; Haga, S.B. |date=1999 |title=Handbook of Human Tissue Sources: A National Resource of Human Tissue Samples |url=https://www.rand.org/pubs/monograph_reports/MR954.html |language=en |publisher=RAND Corporation |pages=251 |doi=10.7249/mr954 |isbn=978-0-8330-2766-5}}</ref> can be traced as far back as the 1900s.<ref name=":2" /> To meet research needs in the postgenomic era, modern day biorepositories<ref name=":3">{{Cite journal |last=Coppola |first=Luigi |last2=Cianflone |first2=Alessandra |last3=Grimaldi |first3=Anna Maria |last4=Incoronato |first4=Mariarosaria |last5=Bevilacqua |first5=Paolo |last6=Messina |first6=Francesco |last7=Baselice |first7=Simona |last8=Soricelli |first8=Andrea |last9=Mirabelli |first9=Peppino |last10=Salvatore |first10=Marco |date=2019-05-22 |title=Biobanking in health care: evolution and future directions |url=https://doi.org/10.1186/s12967-019-1922-3 |journal=Journal of Translational Medicine |volume=17 |issue=1 |pages=172 |doi=10.1186/s12967-019-1922-3 |issn=1479-5876 |pmc=PMC6532145 |pmid=31118074}}</ref> support 
scientists to derive disease-specific insights<ref>{{Cite journal |last=Greenberg |first=Benjamin |last2=Christian |first2=Jennifer |last3=Meltzer Henry |first3=Leslie |last4=Leavy |first4=Michelle |last5=Moore |first5=Helen |date=2018-02 |title=Biorepositories |url=https://effectivehealthcare.ahrq.gov/topics/registries-guide-4th-edition/white-paper-2016-3 |doi=10.23970/ahrqregistriesbio}}</ref> by aiding the investigation of genetic underpinnings<ref>{{Cite journal |last=Cortes |first=Adrian |last2=Albers |first2=Patrick K. |last3=Dendrou |first3=Calliope A. |last4=Fugger |first4=Lars |last5=McVean |first5=Gil |date=2020-01 |title=Identifying cross-disease components of genetic risk across hospital data in the UK Biobank |url=https://www.nature.com/articles/s41588-019-0550-4 |journal=Nature Genetics |language=en |volume=52 |issue=1 |pages=126–134 |doi=10.1038/s41588-019-0550-4 |issn=1546-1718 |pmc=PMC6974401 |pmid=31873298}}</ref><ref name=":4">{{Cite journal |last=Harris |first=Jennifer R. |last2=Burton |first2=Paul |last3=Knoppers |first3=Bartha Maria |last4=Lindpaintner |first4=Klaus |last5=Bledsoe |first5=Marianna |last6=Brookes |first6=Anthony J. |last7=Budin-Ljøsne |first7=Isabelle |last8=Chisholm |first8=Rex |last9=Cox |first9=David |last10=Deschênes |first10=Mylène |last11=Fortier |first11=Isabel |date=2012-11 |title=Toward a roadmap in global biobanking for health |url=https://www.nature.com/articles/ejhg201296 |journal=European Journal of Human Genetics |language=en |volume=20 |issue=11 |pages=1105–1111 |doi=10.1038/ejhg.2012.96 |issn=1476-5438 |pmc=PMC3477856 |pmid=22713808}}</ref><ref>{{Cite journal |last=Cole |first=Joanne B. |last2=Florez |first2=Jose C. |last3=Hirschhorn |first3=Joel N. 
|date=2020-03-19 |title=Comprehensive genomic analysis of dietary habits in UK Biobank identifies hundreds of genetic associations |url=https://www.nature.com/articles/s41467-020-15193-0 |journal=Nature Communications |language=en |volume=11 |issue=1 |pages=1467 |doi=10.1038/s41467-020-15193-0 |issn=2041-1723 |pmc=PMC7081342 |pmid=32193382}}</ref>, elucidating etiology, and evaluating disease progression and therapeutic response; they are the backbone of precision medicine<ref>{{Cite web |last=Collins |first=Francis S. |last2=Varmus |first2=Harold |date=2015-02-25 |title=A New Initiative on Precision Medicine |work=https://doi.org/10.1056/NEJMp1500523 |url=https://www.nejm.org/doi/10.1056/NEJMp1500523 |language=en |pages=793–5 |doi=10.1056/nejmp1500523 |pmc=PMC5101938 |pmid=25635347 |accessdate=2022-01-05}}</ref><ref>{{Citation |last=Liu |first=Angen |last2=Pollard |first2=Kai |date=2015 |editor-last=Karimi-Busheri |editor-first=Feridoun |title=Biobanking for Personalized Medicine |url=http://link.springer.com/10.1007/978-3-319-20579-3_5 |work=Biobanking in the 21st Century |publisher=Springer International Publishing |place=Cham |volume=864 |pages=55–68 |doi=10.1007/978-3-319-20579-3_5 |isbn=978-3-319-20578-6 |accessdate=2022-01-05}}</ref>, as well as biomedical and [[translational research]].<ref name=":0" /><ref name=":1" /><ref name=":5">{{Cite journal |last=De Souza |first=Yvonne G. |last2=Greenspan |first2=John S. |date=2013-01-28 |title=Biobanking past, present and future |url=https://doi.org/10.1097/QAD.0b013e32835c1244 |journal=AIDS |volume=27 |issue=3 |pages=303–312 |doi=10.1097/qad.0b013e32835c1244 |issn=0269-9370 |pmc=PMC3894636 |pmid=23135167}}</ref>


The last decade has seen advances in biotechnology such as [[next-generation sequencing]] (NGS), and the emergence of “omics” techniques for precision medicine (e.g., [[genomics]], transcriptomics, proteomics, metabolomics, and epigenomics). These innovations coincided with breakthroughs in computing, [[artificial intelligence]] (AI), and analytics, enabling discrimination between disease with greater precision. [12] This has created an unprecedented demand for high-quality biospecimens and associated data, including [[Health informatics|clinical]], [[Molecular diagnostics|molecular]], [[Medical imaging|imaging]], and other types of data generated during research. [11] Innovations in database [[Cloud computing|cloud storage]] and computing infrastructures to support data-intensive science have further contributed to revolutionizing resources available to address modern research needs. [13, 14] As federated models for aggregating data and biomaterials have emerged as favored approaches for identifying enough patients with specific clinical or molecular features, the importance of interoperability between [[biobank]]s and related databases has been accentuated. [4, 7, 15] Specimen collections have become virtual [13], flexible, and interoperable, hosted on internationally harmonized infrastructures [7] and optimized for secondary research. [7, 13] Present-day research environments and needs have led to the development and implementation of [[Open data#Policies and strategies|data commons]] [16, 17], bringing together within a research community diverse data and computing infrastructures, as well as tools and applications for managing, [[Data analysis|analyzing]], and sharing interoperable data. This has created an opportunity to maximize collaborations and to extend the value generated from primary data collection. [18]
The last decade has seen advances in biotechnology such as [[next-generation sequencing]] (NGS), and the emergence of “omics” techniques for precision medicine (e.g., [[genomics]], transcriptomics, proteomics, metabolomics, and epigenomics). These innovations coincided with breakthroughs in computing, [[artificial intelligence]] (AI), and analytics, enabling discrimination between disease with greater precision.<ref>{{Cite journal |last=Uddin |first=Mohammed |last2=Wang |first2=Yujiang |last3=Woodbury-Smith |first3=Marc |date=2019-11-21 |title=Artificial intelligence for precision medicine in neurodevelopmental disorders |url=https://www.nature.com/articles/s41746-019-0191-0 |journal=npj Digital Medicine |language=en |volume=2 |issue=1 |pages=1–10 |doi=10.1038/s41746-019-0191-0 |issn=2398-6352 |pmc=PMC6872596 |pmid=31799421}}</ref> This has created an unprecedented demand for high-quality biospecimens and associated data, including [[Health informatics|clinical]], [[Molecular diagnostics|molecular]], [[Medical imaging|imaging]], and other types of data generated during research.<ref name=":5" /> Innovations in database [[Cloud computing|cloud storage]] and computing infrastructures to support data-intensive science have further contributed to revolutionizing resources available to address modern research needs.<ref name=":6">{{Cite web |last=Pandya, J.; Cognitive World |date=12 August 2019 |title=Biobanking Is Changing The World |work=Forbes |url=https://www.forbes.com/sites/cognitiveworld/2019/08/12/biobanking-is-changing-the-world/?sh=4be6f9443792 |accessdate=16 August 2020}}</ref><ref>{{Cite journal |last=Lee |first=Jae-Eun |date=2018-07-31 |title=Artificial Intelligence in the Future Biobanking: Current Issues in the Biobank and Future Possibilities of Artificial Intelligence |url=https://biomedres.us/fulltexts/BJSTR.MS.ID.001511.php |journal=Biomedical Journal of Scientific & Technical Research |volume=7 |issue=3 |doi=10.26717/BJSTR.2018.07.001511}}</ref> As 
federated models for aggregating data and biomaterials have emerged as favored approaches for identifying enough patients with specific clinical or molecular features, the importance of interoperability between [[biobank]]s and related databases has been accentuated.<ref name=":3" /><ref name=":4" /><ref>{{Cite journal |last=Kiehntopf |first=Michael |last2=Krawczak |first2=Michael |date=2011-07-15 |title=Biobanking and international interoperability: samples |url=https://doi.org/10.1007/s00439-011-1068-8 |journal=Human Genetics |language=en |volume=130 |issue=3 |pages=369–376 |doi=10.1007/s00439-011-1068-8 |issn=0340-6717}}</ref> Specimen collections have become virtual<ref name=":6" />, flexible, and interoperable, hosted on internationally harmonized infrastructures<ref name=":4" /> and optimized for secondary research.<ref name=":4" /><ref name=":6" /> Present-day research environments and needs have led to the development and implementation of [[Open data#Policies and strategies|data commons]]<ref>{{Cite journal |last=Grossman |first=Robert L. |last2=Heath |first2=Allison |last3=Murphy |first3=Mark |last4=Patterson |first4=Maria |last5=Wells |first5=Walt |date=2016-09 |title=A Case for Data Commons: Toward Data Science as a Service |url=https://ieeexplore.ieee.org/document/7548983/ |journal=Computing in Science Engineering |volume=18 |issue=5 |pages=10–20 |doi=10.1109/MCSE.2016.92 |issn=1558-366X |pmc=PMC5636009 |pmid=29033693}}</ref><ref>{{Cite journal |last=Jensen |first=Mark A. |last2=Ferretti |first2=Vincent |last3=Grossman |first3=Robert L. |last4=Staudt |first4=Louis M. 
|date=2017-07-27 |title=The NCI Genomic Data Commons as an engine for precision medicine |url=https://doi.org/10.1182/blood-2017-03-735654 |journal=Blood |volume=130 |issue=4 |pages=453–459 |doi=10.1182/blood-2017-03-735654 |issn=0006-4971 |pmc=PMC5533202 |pmid=28600341}}</ref>, bringing together within a research community diverse data and computing infrastructures, as well as tools and applications for managing, [[Data analysis|analyzing]], and sharing interoperable data. This has created an opportunity to maximize collaborations and to extend the value generated from primary data collection.<ref>{{Cite journal |last=Hinkson |first=Izumi V. |last2=Davidsen |first2=Tanja M. |last3=Klemm |first3=Juli D. |last4=Chandramouliswaran |first4=Ishwar |last5=Kerlavage |first5=Anthony R. |last6=Kibbe |first6=Warren A. |date=2017 |title=A Comprehensive Infrastructure for Big Data in Cancer Research: Accelerating Cancer Research and Precision Medicine |url=https://www.frontiersin.org/articles/10.3389/fcell.2017.00083/full |journal=Frontiers in Cell and Developmental Biology |language=English |volume=5 |at=83 |doi=10.3389/fcell.2017.00083 |issn=2296-634X |pmc=PMC5613113 |pmid=28983483}}</ref>


In 2016, as part of British Columbia’s multidisciplinary gynecological cancer research team (OVCARE), we undertook a comprehensive review of the landscape of the data assets available within our local research environment and assessed the infrastructure needs required to support data storage and sharing within our research community. Herein, we describe the roadmap undertaken for the creation of a data commons, transforming a traditional tumor biobank and a collection of data silos into an integrated and comprehensive infrastructure to support the current and future research needs of an expanding team.
In 2016, as part of British Columbia’s multidisciplinary gynecological cancer research team (OVCARE), we undertook a comprehensive review of the landscape of the data assets available within our local research environment and assessed the infrastructure needs required to support data storage and sharing within our research community. Herein, we describe the roadmap undertaken for the creation of a data commons, transforming a traditional tumor biobank and a collection of data silos into an integrated and comprehensive infrastructure to support the current and future research needs of an expanding team.
Line 46: Line 39:
==Results==
==Results==
===Matching technical solutions to research needs===
===Matching technical solutions to research needs===
OVCARE started in 2000 as an initiative between the British Columbia Cancer Agency, the University of British Columbia, and the Vancouver Coastal Health Research Institute to accelerate research discoveries and their translation to the clinical settings, as well as improve the lives of women with ovarian cancer or those at risk. Today, OVCARE is an internationally recognized multidisciplinary team of physicians and scientists who are breaking new ground in improving diagnosis, prevention, and treatment of all gynecological cancers. [19,20,21,22,23,24,25,26,27]
OVCARE started in 2000 as an initiative between the British Columbia Cancer Agency, the University of British Columbia, and the Vancouver Coastal Health Research Institute to accelerate research discoveries and their translation to the clinical settings, as well as improve the lives of women with ovarian cancer or those at risk. Today, OVCARE is an internationally recognized multidisciplinary team of physicians and scientists who are breaking new ground in improving diagnosis, prevention, and treatment of all gynecological cancers.<ref>{{Cite journal |last=Köbel |first=Martin |last2=Rahimi |first2=Kurosh |last3=Rambau |first3=Peter F. |last4=Naugler |first4=Christopher |last5=Le Page |first5=Cécile |last6=Meunier |first6=Liliane |last7=de Ladurantaye |first7=Manon |last8=Lee |first8=Sandra |last9=Leung |first9=Samuel |last10=Goode |first10=Ellen L. |last11=Ramus |first11=Susan J. |date=2016-09 |title=An Immunohistochemical Algorithm for Ovarian Carcinoma Typing |url=https://doi.org/10.1097/PGP.0000000000000274 |journal=International Journal of Gynecological Pathology |volume=35 |issue=5 |pages=430–441 |doi=10.1097/pgp.0000000000000274 |issn=0277-1691 |pmc=PMC4978603 |pmid=26974996}}</ref><ref>{{Cite journal |last=Shah |first=Sohrab P. |last2=Köbel |first2=Martin |last3=Senz |first3=Janine |last4=Morin |first4=Ryan D. |last5=Clarke |first5=Blaise A. |last6=Wiegand |first6=Kimberly C. |last7=Leung |first7=Gillian |last8=Zayed |first8=Abdalnasser |last9=Mehl |first9=Erika |last10=Kalloger |first10=Steve E. |last11=Sun |first11=Mark |date=2009-06-25 |title=Mutation of FOXL2 in Granulosa-Cell Tumors of the Ovary |url=https://doi.org/10.1056/NEJMoa0902542 |journal=New England Journal of Medicine |volume=360 |issue=26 |pages=2719–2729 |doi=10.1056/NEJMoa0902542 |issn=0028-4793 |pmid=19516027}}</ref><ref>{{Cite journal |last=Wiegand |first=Kimberly C. |last2=Shah |first2=Sohrab P. |last3=Al-Agha |first3=Osama M. 
|last4=Zhao |first4=Yongjun |last5=Tse |first5=Kane |last6=Zeng |first6=Thomas |last7=Senz |first7=Janine |last8=McConechy |first8=Melissa K. |last9=Anglesio |first9=Michael S. |last10=Kalloger |first10=Steve E. |last11=Yang |first11=Winnie |date=2010-10-14 |title=ARID1A Mutations in Endometriosis-Associated Ovarian Carcinomas |url=https://doi.org/10.1056/NEJMoa1008433 |journal=New England Journal of Medicine |volume=363 |issue=16 |pages=1532–1543 |doi=10.1056/NEJMoa1008433 |issn=0028-4793 |pmc=PMC2976679 |pmid=20942669}}</ref><ref>{{Cite journal |last=Errico |first=Alessia |date=2014-06 |title=SMARCA4 mutated in SCCOHT |url=https://www.nature.com/articles/nrclinonc.2014.63 |journal=Nature Reviews Clinical Oncology |language=en |volume=11 |issue=6 |pages=302–302 |doi=10.1038/nrclinonc.2014.63 |issn=1759-4782}}</ref><ref>{{Cite journal |last=Wang |first=Yi Kan |last2=Bashashati |first2=Ali |last3=Anglesio |first3=Michael S. |last4=Cochrane |first4=Dawn R. |last5=Grewal |first5=Diljot S. |last6=Ha |first6=Gavin |last7=McPherson |first7=Andrew |last8=Horlings |first8=Hugo M. |last9=Senz |first9=Janine |last10=Prentice |first10=Leah M. |last11=Karnezis |first11=Anthony N. |date=2017-06 |title=Genomic consequences of aberrant DNA repair mechanisms stratify ovarian cancer histotypes |url=https://www.nature.com/articles/ng.3849 |journal=Nature Genetics |language=en |volume=49 |issue=6 |pages=856–865 |doi=10.1038/ng.3849 |issn=1546-1718}}</ref><ref>{{Cite journal |last=Talhouk |first=Aline |last2=McConechy |first2=Melissa K. |last3=Leung |first3=Samuel |last4=Yang |first4=Winnie |last5=Lum |first5=Amy |last6=Senz |first6=Janine |last7=Boyd |first7=Niki |last8=Pike |first8=Judith |last9=Anglesio |first9=Michael |last10=Kwon |first10=Janice S. |last11=Karnezis |first11=Anthony N. 
|date=2017 |title=Confirmation of ProMisE: A simple, genomics-based clinical classifier for endometrial cancer |url=https://onlinelibrary.wiley.com/doi/abs/10.1002/cncr.30496 |journal=Cancer |language=en |volume=123 |issue=5 |pages=802–813 |doi=10.1002/cncr.30496 |issn=1097-0142}}</ref><ref>{{Cite journal |last=Karnezis |first=Anthony N. |last2=Leung |first2=Samuel |last3=Magrill |first3=Jamie |last4=McConechy |first4=Melissa K. |last5=Yang |first5=Winnie |last6=Chow |first6=Christine |last7=Kobel |first7=Martin |last8=Lee |first8=Cheng-Han |last9=Huntsman |first9=David G. |last10=Talhouk |first10=Aline |last11=Kommoss |first11=Friederich |date=2017 |title=Evaluation of endometrial carcinoma prognostic immunohistochemistry markers in the context of molecular classification |url=https://onlinelibrary.wiley.com/doi/abs/10.1002/cjp2.82 |journal=The Journal of Pathology: Clinical Research |language=en |volume=3 |issue=4 |pages=279–293 |doi=10.1002/cjp2.82 |issn=2056-4538 |pmc=PMC5653931 |pmid=29085668}}</ref><ref>{{Cite journal |last=Talhouk |first=Aline |last2=Hoang |first2=Lien N. |last3=McConechy |first3=Melissa K. |last4=Nakonechny |first4=Quentin |last5=Leo |first5=Joyce |last6=Cheng |first6=Angela |last7=Leung |first7=Samuel |last8=Yang |first8=Winnie |last9=Lum |first9=Amy |last10=Köbel |first10=Martin |last11=Lee |first11=Cheng-Han |date=2016-10-01 |title=Molecular classification of endometrial carcinoma on diagnostic specimens is highly concordant with final hysterectomy: Earlier prognostic information to guide treatment |url=https://www.gynecologiconcology-online.net/article/S0090-8258(16)30959-3/abstract |journal=Gynecologic Oncology |language=English |volume=143 |issue=1 |pages=46–53 |doi=10.1016/j.ygyno.2016.07.090 |issn=0090-8258 |pmc=PMC5521211 |pmid=27421752}}</ref><ref>{{Cite journal |last=McAlpine |first=Jessica N. |last2=Leung |first2=Samuel C. Y. 
|last3=Cheng |first3=Angela |last4=Miller |first4=Dianne |last5=Talhouk |first5=Aline |last6=Gilks |first6=C. Blake |last7=Karnezis |first7=Anthony N. |date=2017 |title=Human papillomavirus (HPV)-independent vulvar squamous cell carcinoma has a worse prognosis than HPV-associated disease: a retrospective cohort study |url=https://onlinelibrary.wiley.com/doi/abs/10.1111/his.13205 |journal=Histopathology |language=en |volume=71 |issue=2 |pages=238–246 |doi=10.1111/his.13205 |issn=1365-2559}}</ref>


OVCARE’s research has been powered by the gynecological tumor bank and the Cheryl Brown Gynecological Cancers Outcomes Unit. Through the course of research, a plethora of molecular and genomic data were historically held by the researchers who generated them. Similarly, clinical data from chart reviews were obtained to support clinical studies and held by clinicians. These data were in incompatible formats that needed significant manual manipulation and curation to be integrated. Moreover, each collection was governed by different ethics agreements that restricted the use of the data and kept them in silos. This was becoming a barrier to novel data-intensive research requiring the integration of multiple data sources; undertaking such projects was challenging, time-consuming, and prone to errors. The OVCARE leadership had recognized that current research needs were not being met through existing infrastructure.
OVCARE’s research has been powered by the gynecological tumor bank and the Cheryl Brown Gynecological Cancers Outcomes Unit. Through the course of research, a plethora of molecular and genomic data were historically held by the researchers who generated them. Similarly, clinical data from chart reviews were obtained to support clinical studies and held by clinicians. These data were in incompatible formats that needed significant manual manipulation and curation to be integrated. Moreover, each collection was governed by different ethics agreements that restricted the use of the data and kept them in silos. This was becoming a barrier to novel data-intensive research requiring the integration of multiple data sources; undertaking such projects was challenging, time-consuming, and prone to errors. The OVCARE leadership had recognized that current research needs were not being met through existing infrastructure.
Line 53: Line 46:


{|  
{|  
  | STYLE="vertical-align:top;"|
  | style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="70%"
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="70%"
  |-
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 1.''' Summary of fundamental research and infrastructural needs of OVCARE’s research community
   | colspan="2" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 1.''' Summary of fundamental research and infrastructural needs of OVCARE’s research community
  |-
  |-
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;"|
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;" |
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;"|Fundamental research requirements
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;" |Fundamental research requirements
  |-
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;"|1
   | style="background-color:white; padding-left:10px; padding-right:10px;" |1
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Generate efficiencies in data collection, storage, and analysis to maximize utility of collected data
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Generate efficiencies in data collection, storage, and analysis to maximize utility of collected data
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|2
   | style="background-color:white; padding-left:10px; padding-right:10px;" |2
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Limit errors in data handling and ensure reproducibility of research findings
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Limit errors in data handling and ensure reproducibility of research findings
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|3
   | style="background-color:white; padding-left:10px; padding-right:10px;" |3
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Protect patients’ privacy and honor their consent
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Protect patients’ privacy and honor their consent
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|4
   | style="background-color:white; padding-left:10px; padding-right:10px;" |4
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Optimize secondary and continuous use of data generated from research and clinical care
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Optimize secondary and continuous use of data generated from research and clinical care
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|5
   | style="background-color:white; padding-left:10px; padding-right:10px;" |5
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Facilitate the recruitment of patients in various clinical studies
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Facilitate the recruitment of patients in various clinical studies
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|6
   | style="background-color:white; padding-left:10px; padding-right:10px;" |6
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Identify specimens from patients with specific clinical, molecular, and genomic characteristics
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Identify specimens from patients with specific clinical, molecular, and genomic characteristics
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|7
   | style="background-color:white; padding-left:10px; padding-right:10px;" |7
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Integrate medical and clinical data with molecular information to enable the discovery and testing of new associations and hypotheses towards translational research
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Integrate medical and clinical data with molecular information to enable the discovery and testing of new associations and hypotheses towards translational research
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|8
   | style="background-color:white; padding-left:10px; padding-right:10px;" |8
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Organize data towards a learning healthcare system where translation is bi-directional, meaning evidence-based research is used to inform practice, and the data generated during clinical care is in turn used to inform guidelines, generate hypotheses, and trigger pragmatic trials
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Organize data towards a learning healthcare system where translation is bi-directional, meaning evidence-based research is used to inform practice, and the data generated during clinical care is in turn used to inform guidelines, generate hypotheses, and trigger pragmatic trials
  |-  
  |-  
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;"|
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;" |
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;"|Functional and infrastructural IT requirements
   ! style="background-color:#dddddd; padding-left:10px; padding-right:10px;" |Functional and infrastructural IT requirements
  |-  
  |-  
   | style="background-color:white; padding-left:10px; padding-right:10px;"|1
   | style="background-color:white; padding-left:10px; padding-right:10px;" |1
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Allow batch data imports and exports
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Allow batch data imports and exports
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|2
   | style="background-color:white; padding-left:10px; padding-right:10px;" |2
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Facilitate validation of data entered to minimize errors (e.g., returning an error message when text is entered instead of a numeric value)
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Facilitate validation of data entered to minimize errors (e.g., returning an error message when text is entered instead of a numeric value)
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|3
   | style="background-color:white; padding-left:10px; padding-right:10px;" |3
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Easy-to-use and customizable user interfaces
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Easy-to-use and customizable user interfaces
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|4
   | style="background-color:white; padding-left:10px; padding-right:10px;" |4
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Support both prospective and retrospective data collection mechanisms
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Support both prospective and retrospective data collection mechanisms
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|5
   | style="background-color:white; padding-left:10px; padding-right:10px;" |5
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Adapt to changing needs between studies and projects, as well as over time
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Adapt to changing needs between studies and projects, as well as over time
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|6
   | style="background-color:white; padding-left:10px; padding-right:10px;" |6
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Track biospecimen locations, usage, and shipment to both local and offsite storage locations
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Track biospecimen locations, usage, and shipment to both local and offsite storage locations
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|7
   | style="background-color:white; padding-left:10px; padding-right:10px;" |7
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Support multi-tenancy for the banking of biospecimens from distributed and diverse studies led by different investigators interested in sharing resources
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Support multi-tenancy for the banking of biospecimens from distributed and diverse studies led by different investigators interested in sharing resources
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|8
   | style="background-color:white; padding-left:10px; padding-right:10px;" |8
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Adherence to best practices in privacy and security, such as support for data [[encryption]]; [[audit trail]]s on all user actions and data changes for regulatory compliance; configurable user privileges; role-based access control; and adherence to federal regulations with respect to de-identification of specimens and tracking of consent
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Adherence to best practices in privacy and security, such as support for data [[encryption]]; [[audit trail]]s on all user actions and data changes for regulatory compliance; configurable user privileges; role-based access control; and adherence to federal regulations with respect to de-identification of specimens and tracking of consent
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|9
   | style="background-color:white; padding-left:10px; padding-right:10px;" |9
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Support interoperability and integration with other institutions, systems, and data sources to facilitate data sharing
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Support interoperability and integration with other institutions, systems, and data sources to facilitate data sharing
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|10
   | style="background-color:white; padding-left:10px; padding-right:10px;" |10
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Potential to scale up biospecimen and user capacity at no added cost
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Potential to scale up biospecimen and user capacity at no added cost
  |-   
  |-   
   | style="background-color:white; padding-left:10px; padding-right:10px;"|11
   | style="background-color:white; padding-left:10px; padding-right:10px;" |11
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Stable and mature vendor and community support
   | style="background-color:white; padding-left:10px; padding-right:10px;" |Stable and mature vendor and community support
  |-  
  |-  
|}
|}
Line 127: Line 120:
OVCARE employs two models for biospecimen recruitment. The first is a general banking model, with broad scientific aims, and where specimens are obtained from consented participants and stored until needed. The second is a study-based banking model, where participants are recruited to address specific study aims, with a pre-defined protocol and pre-planned specimen collection. To accommodate both approaches, the biorepository infrastructure needed to manage accrual of specimens in a patient-centric approach, retain the context of the patient’s clinical history, and support basic biospecimen collection, storage, and distribution across multiple studies at different sites, under both recruitment models. This includes inventory control, the ability to track sample availability and location, as well as track generated derivatives (e.g., xenografts and organoids). The infrastructure needed to be adaptable to changing needs between studies and projects, as well as over time, with the ability to preserve the natural history of the data. Access control that varied for different user groups was a critical feature, enabling adherence to regulatory requirements and health research best practices. Data security, de-identification of specimens, and tracking of consent were also important for the same reasons, in addition to the need to operate and manage the biorepository with minimal support from institutional and research IT.
OVCARE employs two models for biospecimen recruitment. The first is a general banking model, with broad scientific aims, and where specimens are obtained from consented participants and stored until needed. The second is a study-based banking model, where participants are recruited to address specific study aims, with a pre-defined protocol and pre-planned specimen collection. To accommodate both approaches, the biorepository infrastructure needed to manage accrual of specimens in a patient-centric approach, retain the context of the patient’s clinical history, and support basic biospecimen collection, storage, and distribution across multiple studies at different sites, under both recruitment models. This includes inventory control, the ability to track sample availability and location, as well as track generated derivatives (e.g., xenografts and organoids). The infrastructure needed to be adaptable to changing needs between studies and projects, as well as over time, with the ability to preserve the natural history of the data. Access control that varied for different user groups was a critical feature, enabling adherence to regulatory requirements and health research best practices. Data security, de-identification of specimens, and tracking of consent were also important for the same reasons, in addition to the need to operate and manage the biorepository with minimal support from institutional and research IT.


We compiled a comprehensive list of requirements (Additional file 2: Table S1) from our stakeholder meetings, and we used it to guide our scan of the landscape of existing [[laboratory information management system]]s (LIMS) (Additional file 2: Tables S2–S11 and Fig. 1). This resulted in the identification of [[OpenSpecimen]] [28], a LIMS based on caTissue [29], a mature system with over 15 years of use by the research community. OpenSpecimen addressed more requirements from our list than the other options we considered. It is an open-source software solution with commercial support, in use by over 70 biobanks across 20 countries. The commercial support ensures ongoing software testing, updating, and continuous improvement. This is in addition to the availability of technical support and access to a community of experienced users through active forums.
We compiled a comprehensive list of requirements (Additional file 2: Table S1) from our stakeholder meetings, and we used it to guide our scan of the landscape of existing [[laboratory information management system]]s (LIMS) (Additional file 2: Tables S2–S11 and Fig. 1). This resulted in the identification of [[OpenSpecimen]]<ref>{{Cite web |title=OpenSpecimen |url=https://www.openspecimen.org/ |publisher=Krishagni Solutions Pvt. Ltd |accessdate=17 August 2016}}</ref>, a LIMS based on caTissue<ref name=":7">{{Cite journal |last=McIntosh, L.D.; Sharma, M.K.; Mulvihill, D. |date=2015-10-01 |title=caTissue Suite to OpenSpecimen: Developing an extensible, open source, web-based biobanking management system |url=https://www.sciencedirect.com/science/article/pii/S1532046415001884 |journal=Journal of Biomedical Informatics |language=en |volume=57 |pages=456–464 |doi=10.1016/j.jbi.2015.08.020 |issn=1532-0464 |pmc=PMC4772150 |pmid=26325296}}</ref>, a mature system with over 15 years of use by the research community. OpenSpecimen addressed more requirements from our list than the other options we considered. It is an open-source software solution with commercial support, in use by over 70 biobanks across 20 countries. The commercial support ensures ongoing software testing, updating, and continuous improvement. This is in addition to the availability of technical support and access to a community of experienced users through active forums.




Line 133: Line 126:
{{clear}}
{{clear}}
{|  
{|  
  | STYLE="vertical-align:top;"|
  | style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="800px"
{| border="0" cellpadding="5" cellspacing="0" width="800px"
  |-
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' Needs-to-biobank mapping and the number of requirements fulfilled by each LIMS. '''a.''' Tiled plot of the mapping of each biospecimen research need to the biobank solution meeting that need. Surveyed biobanks are plotted on the y-axis and research needs (desired biobank features) are plotted on the x-axis, grouped and colored by feature class. '''b.''' Bar plot of the overall number of features provided by each LIMS. The LIMS solutions are plotted on the y-axis, and the number of features provided is plotted on the x-axis.</blockquote>
   | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 1.''' Needs-to-biobank mapping and the number of requirements fulfilled by each LIMS. '''a.''' Tiled plot of the mapping of each biospecimen research need to the biobank solution meeting that need. Surveyed biobanks are plotted on the y-axis and research needs (desired biobank features) are plotted on the x-axis, grouped and colored by feature class. '''b.''' Bar plot of the overall number of features provided by each LIMS. The LIMS solutions are plotted on the y-axis, and the number of features provided is plotted on the x-axis.</blockquote>
  |-  
  |-  
|}
|}
|}
|}


In this LIMS, biospecimens can be processed individually or in bulk, with rapid [[barcode]]-based scanning available to enter information on multiple patient samples at once. This enabled high-throughput processing and efficient migration from our legacy LIMS. Options for data annotation and storage management allowed us to optimize specimen storage, a costly resource in our research community (e.g., −80 °C freezers). [29]
In this LIMS, biospecimens can be processed individually or in bulk, with rapid [[barcode]]-based scanning available to enter information on multiple patient samples at once. This enabled high-throughput processing and efficient migration from our legacy LIMS. Options for data annotation and storage management allowed us to optimize specimen storage, a costly resource in our research community (e.g., −80 °C freezers).<ref name=":7" />


The OpenSpecimen LIMS enabled customization of data entry forms via a graphical user interface (GUI, or web interface) to match study-specific needs without requiring software development. The platform met most of our IT requirements, as it supported role-based access control and provided an [[audit trail]] of every user operation. [30] The system was also easy to use, with graphics-based queries that enabled searching for stored data about participants, biospecimens, or projects without requiring any programming, making moderately complex queries accessible to most users. Queries could also be performed via a REST API (Representational State Transfer Application Programming Interface) using a SQL (Structured Query Language)-like query language. This facilitated automation of data downloads for analytics pipelines through the incorporation of query scripts.
The OpenSpecimen LIMS enabled customization of data entry forms via a graphical user interface (GUI, or web interface) to match study-specific needs without requiring software development. The platform met most of our IT requirements, as it supported role-based access control and provided an [[audit trail]] of every user operation.<ref>{{Cite web |title=OpenSpecimen: Biobanking LIMS Features |url=https://www.openspecimen.org/biobanking-lims-features/ |publisher=Krishagni Solutions Pvt. Ltd |accessdate=17 August 2016}}</ref> The system was also easy to use, with graphics-based queries that enabled searching for stored data about participants, biospecimens, or projects without requiring any programming, making moderately complex queries accessible to most users. Queries could also be performed via a REST API (Representational State Transfer Application Programming Interface) using a SQL (Structured Query Language)-like query language. This facilitated automation of data downloads for analytics pipelines through the incorporation of query scripts.
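
As an illustration of how a query script might feed an analytics pipeline, the following is a minimal sketch only. The server address, the <code>/query</code> route, the JSON field name, and the query text are all hypothetical placeholders and are not taken from OpenSpecimen's documented API; the vendor's REST documentation should be consulted for the actual routes and query syntax.

```python
import base64
import json
import urllib.request

# Hypothetical values for illustration only; not OpenSpecimen's documented API.
BASE_URL = "https://lims.example.org/rest"
QUERY_PATH = "/query"  # assumed route


def build_query_request(base_url, user, password, query_text):
    """Build (but do not send) an authenticated POST carrying a SQL-like query."""
    body = json.dumps({"query": query_text}).encode("utf-8")  # assumed field name
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url + QUERY_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def run_query(request, timeout=30):
    """Send the request and decode a JSON response (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_query_request(
        BASE_URL,
        "pipeline_user",
        "secret",
        "select specimen.label where specimen.type = 'Frozen Tissue'",  # made-up syntax
    )
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the script testable without network access; a scheduled job would call <code>run_query</code> and write the result to wherever the pipeline expects its input.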


The system enables standalone plugins through a software development kit. These plugins can be made publicly available to the community. For example, the tissue microarray (TMA) plugin can manage TMAs on OpenSpecimen by linking to donor blocks and describing details of experiments done on the different slices of the TMA blocks. Finally, interoperability with other systems was important to expand linkage within the data commons. The vendor provides integration with [[electronic data capture]] applications (REDCap, [[OpenClinica]]), [[electronic medical record]] (EMR) systems (EPIC, Velos), and pathology systems (CoPath, Cerner, Aperio), as well as systems using [[Health Level 7]] (HL7) messages, a capability which can further support inclusion of participant and biospecimen information from distributed systems.
The system enables standalone plugins through a software development kit. These plugins can be made publicly available to the community. For example, the tissue microarray (TMA) plugin can manage TMAs on OpenSpecimen by linking to donor blocks and describing details of experiments done on the different slices of the TMA blocks. Finally, interoperability with other systems was important to expand linkage within the data commons. The vendor provides integration with [[electronic data capture]] applications (REDCap, [[OpenClinica]]), [[electronic medical record]] (EMR) systems (EPIC, Velos), and pathology systems (CoPath, Cerner, Aperio), as well as systems using [[Health Level 7]] (HL7) messages, a capability which can further support inclusion of participant and biospecimen information from distributed systems.
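
To make the HL7 linkage more concrete, the sketch below shows how participant identifiers could be read from an HL7 version 2 message, in which segments are separated by carriage returns, fields by <code>|</code>, and components by <code>^</code>. It is an illustrative parser written for this example, not the vendor's integration code, and the sample message is synthetic.

```python
def parse_hl7_v2(message):
    """Split an HL7 v2 message into {segment_id: [field lists]}."""
    segments = {}
    # Segments are separated by carriage returns; tolerate newlines as well.
    for line in message.replace("\n", "\r").split("\r"):
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_summary(segments):
    """Extract the identifier (PID-3) and name (PID-5) from the first PID segment."""
    pid = segments["PID"][0]
    # pid[0] is the segment name, so pid[3] is PID-3 and pid[5] is PID-5.
    identifier = pid[3].split("^")[0]
    family, given = (pid[5].split("^") + [""])[:2]
    return {"id": identifier, "family": family, "given": given}


# Synthetic message for illustration; no real patient data.
SAMPLE = "\r".join([
    r"MSH|^~\&|LAB|SITE_A|LIMS|SITE_B|202101011200||ORU^R01|MSG0001|P|2.5",
    "PID|1||STUDY-0042^^^SITE_A||DOE^JANE",
])

if __name__ == "__main__":
    print(patient_summary(parse_hl7_v2(SAMPLE)))
```

A production interface would normally rely on a maintained HL7 library and handle repetitions, escape sequences, and de-identification before any record reaches the biobank.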
Line 150: Line 143:
Various molecular and genomics data are generated through the course of research, including next-generation sequencing, proteomics, gene expression, targeted sequencing, and immunohistochemical data. These data are primarily generated to answer specific research hypotheses and were supported by public, government, and philanthropic funds, with an implicit obligation to minimize duplication of efforts and to optimize their secondary use in later research. The ability to consider all this data simultaneously can uncover novel patterns, trends, and unknown correlations. This may prompt new hypotheses and spark new insights into novel research directions. To achieve this level of integration, we would need to track which analytical assay was performed on which samples and link back to those data. To facilitate the interrogation of this complex data, an exploration tool was needed to visualize resulting multidimensional datasets and simultaneously investigate molecular profiles and clinical attributes.
Various molecular and genomics data are generated through the course of research, including next-generation sequencing, proteomics, gene expression, targeted sequencing, and immunohistochemical data. These data are primarily generated to answer specific research hypotheses and were supported by public, government, and philanthropic funds, with an implicit obligation to minimize duplication of efforts and to optimize their secondary use in later research. The ability to consider all this data simultaneously can uncover novel patterns, trends, and unknown correlations. This may prompt new hypotheses and spark new insights into novel research directions. To achieve this level of integration, we would need to track which analytical assay was performed on which samples and link back to those data. To facilitate the interrogation of this complex data, an exploration tool was needed to visualize resulting multidimensional datasets and simultaneously investigate molecular profiles and clinical attributes.


We adopted the cBio Cancer Genomics Portal [31], one of the most recommended and widely used [32,33,34,35,36] pan-cancer analytics web tools to facilitate interactive exploration, [[Data mining|mining]], analysis, and visualization of multidimensional datasets derived from tumor samples collected from various cancer studies. [31, 37] Developed at the Memorial Sloan Kettering Cancer Center (MSK), this platform is used by large cancer genomic studies (TCGA [38], TARGET [39]), and publicly available data can be downloaded and queried alongside our own collections.
We adopted the cBioPortal for Cancer Genomics<ref name=":8">{{Cite journal |last=Cerami |first=Ethan |last2=Gao |first2=Jianjiong |last3=Dogrusoz |first3=Ugur |last4=Gross |first4=Benjamin E. |last5=Sumer |first5=Selcuk Onur |last6=Aksoy |first6=Bülent Arman |last7=Jacobsen |first7=Anders |last8=Byrne |first8=Caitlin J. |last9=Heuer |first9=Michael L. |last10=Larsson |first10=Erik |last11=Antipin |first11=Yevgeniy |date=2012-05-01 |title=The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data |url=https://cancerdiscovery.aacrjournals.org/content/2/5/401 |journal=Cancer Discovery |volume=2 |issue=5 |pages=401–404 |doi=10.1158/2159-8290.cd-12-0095 |pmc=PMC3956037 |pmid=22588877}}</ref>, one of the most recommended and widely used<ref>{{Cite journal |last=Bell |first=D. |last2=Berchuck |first2=A. |last3=Birrer |first3=M. |last4=Chien |first4=J. |last5=Cramer |first5=D. W. |last6=Dao |first6=F. |last7=Dhir |first7=R. |last8=DiSaia |first8=P. |last9=Gabra |first9=H. |last10=Glenn |first10=P. |last11=Godwin |first11=A. K. 
|date=2011-06 |title=Integrated genomic analyses of ovarian carcinoma |url=https://www.nature.com/articles/nature10166 |journal=Nature |language=en |volume=474 |issue=7353 |pages=609–615 |doi=10.1038/nature10166 |issn=1476-4687 |pmc=PMC3163504 |pmid=21720365}}</ref><ref>{{Cite journal |last=Jonckheere |first=Nicolas |last2=Van Seuningen |first2=Isabelle |date=2018-09-20 |title=Integrative analysis of the cancer genome atlas and cancer cell lines encyclopedia large-scale genomic databases: MUC4/MUC16/MUC20 signature is associated with poor survival in human carcinomas |url=https://doi.org/10.1186/s12967-018-1632-2 |journal=Journal of Translational Medicine |volume=16 |issue=1 |pages=259 |doi=10.1186/s12967-018-1632-2 |issn=1479-5876 |pmc=PMC6149062 |pmid=30236127}}</ref><ref name=":9">{{Cite journal |last=Cui |first=Xiangrong |last2=Jing |first2=Xuan |last3=Yi |first3=Qin |last4=Long |first4=Chunlan |last5=Tan |first5=Bin |last6=Li |first6=Xin |last7=Chen |first7=Xueni |last8=Huang |first8=Yue |last9=Xiang |first9=Zhongping |last10=Tian |first10=Jie |last11=Zhu |first11=Jing |date=2017-12-14 |title=Systematic analysis of gene expression alterations and clinical outcomes of STAT3 in cancer |url=https://www.oncotarget.com/article/23226/text/ |journal=Oncotarget |language=en |volume=9 |issue=3 |pages=3198–3213 |doi=10.18632/oncotarget.23226 |issn=1949-2553 |pmc=PMC5790457 |pmid=29423040}}</ref><ref>{{Cite journal |last=Koboldt |first=Daniel C. |last2=Fulton |first2=Robert S. |last3=McLellan |first3=Michael D. |last4=Schmidt |first4=Heather |last5=Kalicki-Veizer |first5=Joelle |last6=McMichael |first6=Joshua F. |last7=Fulton |first7=Lucinda L. |last8=Dooling |first8=David J. |last9=Ding |first9=Li |last10=Mardis |first10=Elaine R. |last11=Wilson |first11=Richard K. 
|date=2012-10 |title=Comprehensive molecular portraits of human breast tumours |url=https://www.nature.com/articles/nature11412 |journal=Nature |language=en |volume=490 |issue=7418 |pages=61–70 |doi=10.1038/nature11412 |issn=1476-4687 |pmc=PMC3465532 |pmid=23000897}}</ref><ref>{{Cite journal |last=Nagasawa |first=Saya |last2=Ikeda |first2=Kazuhiro |last3=Horie-Inoue |first3=Kuniko |last4=Sato |first4=Sho |last5=Takeda |first5=Satoru |last6=Hasegawa |first6=Kosei |last7=Inoue |first7=Satoshi |date=2020 |title=Identification of novel mutations of ovarian cancer-related genes from RNA-sequencing data for Japanese epithelial ovarian cancer patients |url=https://www.jstage.jst.go.jp/article/endocrj/67/2/67_EJ19-0283/_article |journal=Endocrine Journal |volume=67 |issue=2 |pages=219–229 |doi=10.1507/endocrj.EJ19-0283}}</ref> pan-cancer analytics web tools to facilitate interactive exploration, [[Data mining|mining]], analysis, and visualization of multidimensional datasets derived from tumor samples collected from various cancer studies.<ref name=":8" /><ref name=":10">{{Cite journal |last=Gao |first=Jianjiong |last2=Aksoy |first2=Bülent Arman |last3=Dogrusoz |first3=Ugur |last4=Dresdner |first4=Gideon |last5=Gross |first5=Benjamin |last6=Sumer |first6=S. 
Onur |last7=Sun |first7=Yichao |last8=Jacobsen |first8=Anders |last9=Sinha |first9=Rileen |last10=Larsson |first10=Erik |last11=Cerami |first11=Ethan |date=2013-04-02 |title=Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal |url=https://www.science.org/doi/10.1126/scisignal.2004088 |journal=Science Signaling |language=en |volume=6 |issue=269 |doi=10.1126/scisignal.2004088 |issn=1945-0877 |pmc=PMC4160307 |pmid=23550210}}</ref> Developed at the Memorial Sloan Kettering Cancer Center (MSK), this platform is used by large cancer genomic studies (TCGA<ref>{{Cite web |last=National Cancer Institute |title=The Cancer Genome Atlas Program |url=https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga |publisher=National Institutes of Health |accessdate=12 April 2021}}</ref>, TARGET<ref>{{Cite web |last=National Cancer Institute, Office of Cancer Genomics |title=TARGET: Therapeutically Applicable Research To Generate Effective Treatments |url=https://ocg.cancer.gov/programs/target |publisher=National Institutes of Health |accessdate=15 April 2021}}</ref>), and publicly available data can be downloaded and queried alongside our own collections.


The cBio Cancer Genomics Portal enables the collection of various genomic data on each tumor sample, including non-synonymous mutations, copy-number alterations (CNAs), mRNA and microRNA expression data, DNA methylation data, protein data, and phosphoprotein level data. [31] Each of these data types is integrated and stored at the gene level to allow investigators to probe for the presence of specific biological events (e.g., gene mutations, deletions, amplifications, and expression levels in each sample) [37], and compare discrete genomic events and patterns across samples and across multiple integrated data types. [31] Stored gene-level data is integrated with de-identified clinical data to probe patient clinical outcomes to support the development or testing of hypotheses on frequently altered genes in specific cancers. [31, 37] In addition, it enables the investigation of the prognostic roles of certain genes in gynecological and other cancers [34], correlations between mutations, expression profiles, clinicopathological features, and potential diagnostic and therapeutic targets for certain cancer types.
The cBioPortal enables the collection of various genomic data on each tumor sample, including non-synonymous mutations, copy-number alterations (CNAs), mRNA and microRNA expression data, DNA methylation data, protein data, and phosphoprotein level data.<ref name=":8" /> Each of these data types is integrated and stored at the gene level to allow investigators to probe for the presence of specific biological events (e.g., gene mutations, deletions, amplifications, and expression levels in each sample)<ref name=":10" />, and compare discrete genomic events and patterns across samples and across multiple integrated data types.<ref name=":8" /> Stored gene-level data is integrated with de-identified clinical data to probe patient clinical outcomes to support the development or testing of hypotheses on frequently altered genes in specific cancers.<ref name=":8" /><ref name=":10" /> In addition, it enables the investigation of the prognostic roles of certain genes in gynecological and other cancers<ref name=":9" />, correlations between mutations, expression profiles, clinicopathological features, and potential diagnostic and therapeutic targets for certain cancer types.
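
The idea of storing events at the gene level and comparing them across samples can be sketched as follows. This is a toy, in-memory illustration with made-up sample identifiers and alteration records; it is not cBioPortal's data model or code, only one way to picture a per-gene alteration frequency query.

```python
from collections import defaultdict

# Synthetic records for illustration: (sample_id, gene, alteration_type).
ALTERATIONS = [
    ("S1", "TP53", "mutation"),
    ("S1", "CCNE1", "amplification"),
    ("S2", "TP53", "mutation"),
    ("S3", "BRCA1", "deletion"),
    ("S3", "TP53", "mutation"),
]
ALL_SAMPLES = ["S1", "S2", "S3", "S4"]


def alteration_frequency(alterations, samples):
    """Return the fraction of samples carrying any alteration in each gene."""
    altered = defaultdict(set)
    for sample_id, gene, _kind in alterations:
        altered[gene].add(sample_id)
    total = len(samples)
    return {gene: len(ids) / total for gene, ids in altered.items()}


def samples_with(alterations, gene, kind):
    """List the samples carrying a given alteration type in a given gene."""
    return sorted({s for s, g, k in alterations if g == gene and k == kind})


if __name__ == "__main__":
    print(alteration_frequency(ALTERATIONS, ALL_SAMPLES))
    print(samples_with(ALTERATIONS, "TP53", "mutation"))
```

Joining the sample identifiers returned by such a query to de-identified clinical records is what allows outcome comparisons between altered and unaltered groups.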


====Clinical data====
====Clinical data====
Clinical data at OVCARE are obtained and collected for the purpose of evaluation of patient outcomes and improvement of the quality of patient care and research. Some of these data were historically managed by the Cheryl Brown Outcomes Unit for the purpose of outcomes research on ovarian cancer patients referred to BC Cancer, the provincial tertiary cancer center. The BC Cancer Registry provided the Cheryl Brown Outcomes Unit regular data updates, such as the identification of patients with cancer and their vital statistics, which were supplemented by exhaustive chart reviews. In addition to the Cheryl Brown Outcomes Unit, clinicians often conducted chart reviews for other clinical studies; the resulting data was held separately. In 2016, the scope of data collection at the Cheryl Brown Outcomes Unit was limited to ovarian cancer and did not take full advantage of other available data. Collecting clinical data was resource-intensive and the effort needed was not sustainable in the long run. Moreover, the mandate of the Cheryl Brown Outcomes Unit expanded to enable OVCARE’s researchers to study all gynecological cancers in the province of BC, especially those cancers that do not require referral to a cancer center (e.g., in BC, up to 50% of patients with endometrial cancer are treated by gynecologists in their communities). Thus, an important priority for the team was to create efficiencies in clinical data collection and to standardize, integrate, and link all gynecologic cancer clinical data from various sources and consolidate clinical data in a single database. This would allow researchers to understand what clinical data is already available, thereby streamlining their own data collection strategies which would, in turn, directly contribute to a master database. 
To maximize the re-use of clinical data, standardization of [[Ontology (information science)|ontologies]] across projects was needed, as well as the creation of infrastructure to serve as permanent storage with an easy-to-use data collection interface adaptable to fit the needs of various research projects. This would allow standardization of data collection, to the greatest extent possible, and minimization of errors. Consequently, this would improve the overall quality of data, maximize interoperability and reusability, and optimize data analysis. Management of sensitive clinical data requires [[Cybersecurity|security]], [[Information privacy|privacy]], and the use of tools and technology with institutional approval. We also needed rigorous security and privacy measures, and comprehensive audit trails for tracking data manipulation, exports, and downloads for both single and multi-centered research studies, including tracking data access.
Clinical data at OVCARE are obtained and collected for the purpose of evaluation of patient outcomes and improvement of the quality of patient care and research. Some of these data were historically managed by the Cheryl Brown Outcomes Unit for the purpose of outcomes research on ovarian cancer patients referred to BC Cancer, the provincial tertiary cancer center. The BC Cancer Registry provided the Cheryl Brown Outcomes Unit regular data updates, such as the identification of patients with cancer and their vital statistics, which were supplemented by exhaustive chart reviews. In addition to the Cheryl Brown Outcomes Unit, clinicians often conducted chart reviews for other clinical studies; the resulting data was held separately. In 2016, the scope of data collection at the Cheryl Brown Outcomes Unit was limited to ovarian cancer and did not take full advantage of other available data. Collecting clinical data was resource-intensive and the effort needed was not sustainable in the long run. Moreover, the mandate of the Cheryl Brown Outcomes Unit expanded to enable OVCARE’s researchers to study all gynecological cancers in the province of BC, especially those cancers that do not require referral to a cancer center (e.g., in BC, up to 50% of patients with endometrial cancer are treated by gynecologists in their communities). Thus, an important priority for the team was to create efficiencies in clinical data collection and to standardize, integrate, and link all gynecologic cancer clinical data from various sources and consolidate clinical data in a single database. This would allow researchers to understand what clinical data is already available, thereby streamlining their own data collection strategies which would, in turn, directly contribute to a master database. 
To maximize the re-use of clinical data, standardization of [[Ontology (information science)|ontologies]] across projects was needed, as well as the creation of infrastructure to serve as permanent storage with an easy-to-use data collection interface adaptable to fit the needs of various research projects. This would allow standardization of data collection, to the greatest extent possible, and minimization of errors. Consequently, this would improve the overall quality of data, maximize interoperability and reusability, and optimize data analysis. Management of sensitive clinical data requires [[Cybersecurity|security]], [[Information privacy|privacy]], and the use of tools and technology with institutional approval. We also needed rigorous security and privacy measures, and comprehensive audit trails for tracking data manipulation, exports, and downloads for both single and multi-centered research studies, including tracking data access.


To support OVCARE’s clinical data requirements, we adopted Research Electronic Data Capture (REDCap), a widely used, free, and flexible web-based application<ref>{{Cite journal |last=Harris, P.A.; Taylor, R.; Thielke, R. et al. |date=2009-04-01 |title=Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support |url=https://www.sciencedirect.com/science/article/pii/S1532046408001226 |journal=Journal of Biomedical Informatics |language=en |volume=42 |issue=2 |pages=377–381 |doi=10.1016/j.jbi.2008.08.010 |issn=1532-0464 |pmc=PMC2700030 |pmid=18929686}}</ref><ref name=":11">{{Cite journal |last=Harris, P.A.; Taylor, R.; Minor, B.L. et al. |date=2019-07-01 |title=The REDCap consortium: Building an international community of software platform partners |url=https://www.sciencedirect.com/science/article/pii/S1532046419301261 |journal=Journal of Biomedical Informatics |language=en |volume=95 |pages=103208 |doi=10.1016/j.jbi.2019.103208 |issn=1532-0464 |pmc=PMC7254481 |pmid=31078660}}</ref> developed at Vanderbilt University for clinical and translational research. It is one of the most popular research electronic data capture systems, implemented in 141 countries for over 1,000,000<ref>{{Cite web |title=RedCap |url=https://www.project-redcap.org/ |publisher=National Institutes of Health |accessdate=15 April 2021}}</ref> studies, including at our institutions. REDCap’s flexible design supports permanent database collections, which can be augmented by patient- or study-centric surveys and data collection forms, and includes a rich set of modules that support today’s diverse and multi-scaled biomedical research operations.<ref name=":11" />

====Governance structure====
To manage the various integrated datasets (biospecimen, molecular, genomic, and clinical data), we needed to ensure proper governance, protocols, and standard operating procedures (SOPs) to support data sharing, streamline data requests and inquiries, undertake scientific review or requests, and ensure availability of ethics approval. We envisioned a single portal application for all requests and queries, with a backend database keeping track of details of requesting researchers, description of projects, and required resources, as well as their associated ethics application and certificates of approval. This infrastructure would facilitate compliance with ethics and maintain a log of all activities.

We adopted Oracle Corporation's Oracle Application Express (APEX)<ref>{{Cite web |title=Oracle Apex |url=https://apex.oracle.com/en/ |publisher=Oracle Corporation |accessdate=14 April 2021}}</ref> to develop this portal application. Already available at our institution, APEX is a low-code, data-driven platform for rapid development and deployment of scalable and secure web applications. Applications are implemented in a preconfigured environment; all development was done through a web interface that is mostly GUI-based. The middle-tier functions of the web application software stack, such as parsing Hypertext Transfer Protocol (HTTP) requests and session management, are fully automated, and all operational aspects of the system (data backup, software patches and updates) are managed by institutional IT.

==Implementation==
{{clear}}
{| 
  | style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 2.''' OVCARE’s data commons infrastructure and software stack. The overall data commons infrastructure comprises five main components: (1) a clinical database (REDCap) that consolidates and manages clinical data collections from the BC Cancer Registry and the Cheryl Brown Gynecological Cancers Outcomes Unit; (2) a Laboratory Information Management System (OpenSpecimen) that stores and manages biospecimens collected from consented participants at different hospital sites (i.e., Vancouver General Hospital, the University of British Columbia Hospital, BC Cancer Vancouver, and now a few more centers in BC); (3) the cBioPortal that supports the exploration, analysis, and visualization of clinical attributes and molecular profiles from patient tumor samples; (4) the OVCARE Resource Portal (ORP) that governs data and resource sharing based on stipulated protocols, SOPs, and research ethics; and (5) the Research Community (this includes the OVCARE internal research and informatics team, and the broader research community that OVCARE serves). Each of the components (REDCap, OpenSpecimen, cBio Cancer Genomics Portal, ORP) identified to meet our research needs is separately hosted in our hospital’s computing environment and programmatically interlinked through API calls. The data from the different domains are interlinked using system-wide unique identifiers that link patients to their biospecimen collections and molecular/genomics data. To access the amassed clinical and biospecimen collections, authenticated researchers in the OVCARE research community send data and sample acquisition requests to the ORP, through which those requests are met by informatics staff if all stipulated requirements, including ethics approval, are met. 
Upon successful data and sample acquisition, researchers conduct their respective studies, and the data generated (raw or processed, and/or biospecimen derivatives) from their research are returned to OVCARE, making them available for re-purposing/secondary use. Furthermore, molecular data returned to the data commons are linked back to the available and stored patient biospecimens. Together with clinical outcomes, these molecular profiles are further explored, analyzed, and visualized using the cBioPortal.</blockquote>
  |- 
|}
|}
{{clear}}
{| 
  | style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 3.''' Implementation timeline of OVCARE’s data commons</blockquote>
  |- 
|}
|}

In early 2017, we completed a survey of existing biobanking solutions to select one that provided the best fit to our needs at that time. In June 2017, a test server was obtained to run local instances of the selected LIMS, OpenSpecimen, to conduct functionality, integration, and unit testing of all components of this software. This enabled us to evaluate OpenSpecimen's features firsthand and to determine the required resources to operate the infrastructure with optimal performance in our current computing and research environment. We tested for performance and evaluated operation workflows by diverse types of users, both technical and nontechnical, to perform daily biobanking activities. We fully adopted OpenSpecimen in December of 2017. Following this migration, we worked with researchers to gather available genomic datasets and link their availability to the respective biospecimen in OpenSpecimen, as well as indicate where data are held. As we continue to expand this resource, we will add availability of images of pathology slides associated with each tumor block and link to them. To prototype the cBioPortal integration, we gathered molecular data for one ovarian cancer subtype, collected from prior studies, and integrated them with specimen availability and key clinical outcomes in cBioPortal using specimen ID. We recently launched this prototype and it is currently under evaluation.

For clinical data, we expanded the mandate of the Cheryl Brown Outcomes Unit to include clinical and outcome data on all gynecological cancer patients diagnosed in British Columbia. We also obtained ethics approval to permanently retain clinical and outcomes data from all clinical studies in our group. We maximized the data we can receive from administrative sources, such as the BC Cancer Registry, as this provides access to clinical data for all patients and minimizes the need for broad chart reviews (Fig. 4). We included elements such as the date of diagnosis, date of last clinical appointment, vital statistics, International Classification of Diseases (ICD)-10 morphology codes, tumor stage, and grade. We are presently investigating additional data, such as treatments received (chemotherapy and radiation therapy). The second step of clinical data integration involved adding clinical studies with chart reviews. To enable that, we needed to map different data elements to unique concepts. This further facilitated the identification of variables that are of greatest interest to researchers in our group. We then developed consistent data definitions, standards, and semantics for each data element to ensure that all data can be integrated within the data commons. Future data collection will consult these data standards to ensure prospectively harmonized clinical data.
{{clear}}
{| 
  | style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1200px"
  |-
   | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 4.''' Clinical and outcome data on all gynecological cancer patients diagnosed in British Columbia. In the tiled plot, data elements (demographic, medical history, pathology, chemotherapy, radiation, surgery, and quality of life data) were plotted on the y-axis against gynecological cancer patients (patient 1 to n) on the x-axis. Darker tiles indicate availability of data on a patient per data element. Clinical studies (study 1 to n) are interested in certain patients with available data on specific data elements. Subsets of patients overlap between clinical studies.</blockquote>
  |- 
|}
|}

Finally, to manage all data assets and resources, we developed the OVCARE Resource Portal (ORP). Designed and customized to fit the needs of OVCARE users, this solution was implemented in APEX and launched in June 2020. This portal has helped to consolidate workflows and all data and resource requests, helping to ensure proper governance and compliance with protocols, SOPs, and research ethics board requirements.
Each of these implementations (REDCap, OpenSpecimen, and cBio Cancer Genomics Portal) is hosted separately on the [[hospital]]’s research IT network and is accessible solely to [[Informatics (academic field)|informatics]] staff. Only the resource portal is accessible for researchers to make requests. Data are integrated through unique identifiers that link the various tables from each database at the patient level or at the specimen level. Data linkage to fulfill various study requirements is done programmatically through API calls.
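
The identifier-based linkage described above can be illustrated with a minimal Python sketch. It is not the OVCARE implementation: the field names (<code>patient_id</code>, <code>specimen_id</code>), the sample rows, and the <code>link_records</code> helper are all hypothetical, and in practice the three inputs would be retrieved from each system through its own API rather than defined inline.

```python
from collections import defaultdict

# Hypothetical extracts from the three systems; field names are illustrative only.
clinical = [
    {"patient_id": "P001", "stage": "III"},
    {"patient_id": "P002", "stage": "I"},
]
specimens = [
    {"specimen_id": "S10", "patient_id": "P001", "type": "tumor block"},
    {"specimen_id": "S11", "patient_id": "P001", "type": "buffy coat"},
]
molecular = [{"specimen_id": "S10", "assay": "WGS"}]


def link_records(clinical, specimens, molecular):
    """Join assays to specimens (specimen level) and specimens to patients (patient level)."""
    assays_by_specimen = defaultdict(list)
    for row in molecular:
        assays_by_specimen[row["specimen_id"]].append(row["assay"])

    specimens_by_patient = defaultdict(list)
    for row in specimens:
        linked = dict(row, assays=assays_by_specimen.get(row["specimen_id"], []))
        specimens_by_patient[row["patient_id"]].append(linked)

    # One output row per clinical record, with its specimens nested inside.
    return [
        dict(row, specimens=specimens_by_patient.get(row["patient_id"], []))
        for row in clinical
    ]


linked = link_records(clinical, specimens, molecular)
```

Keeping the join keyed on the shared identifiers means each system can remain separately hosted, with records combined only when a study request calls for it.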
To request data, researchers create user accounts on the ORP and, if needed, associate the principal investigator profile with their account. Authenticated researchers can then submit information (study proposal, ethics approval, and study requirements) on the study for which resources will be requested. A project reference number created for progress tracking is then issued to the researcher, and an ORP-generated email is sent to the informatics staff notifying them of a new study proposal. Received proposals are subsequently processed and sent for review and approval by a committee of reviewers selected from the OVCARE community, after which resource requests are fulfilled. Researchers return to the data commons any raw and processed data that result from their studies, as well as any derivatives produced by their research (e.g., cell lines, DNA extractions, organoids).
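
One way to picture this request lifecycle is as a small state machine. The Python sketch below is an illustration only: the status names, the <code>ResourceRequest</code> class, and the rule that approval requires an ethics certificate on file are assumptions drawn from the description above, not the ORP's actual APEX data model.

```python
from dataclasses import dataclass, field

# Hypothetical status names and allowed transitions between them.
TRANSITIONS = {
    "submitted": {"under_review"},
    "under_review": {"approved", "rejected"},
    "approved": {"fulfilled"},
    "fulfilled": {"data_returned"},
}


@dataclass
class ResourceRequest:
    reference: str               # project reference number issued for tracking
    has_ethics_approval: bool    # whether an ethics certificate is on file
    status: str = "submitted"
    history: list = field(default_factory=list)

    def advance(self, new_status):
        """Move to new_status if the transition is allowed, logging the change."""
        if new_status not in TRANSITIONS.get(self.status, set()):
            raise ValueError(f"cannot move from {self.status} to {new_status}")
        if new_status == "approved" and not self.has_ethics_approval:
            raise ValueError("ethics approval is required before approval")
        self.history.append((self.status, new_status))
        self.status = new_status


request = ResourceRequest(reference="ORP-0001", has_ethics_approval=True)
request.advance("under_review")
request.advance("approved")
```

The <code>history</code> list stands in for the activity log that the governance model calls for.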
==Discussion==
We have described the journey followed towards implementing a data commons to benefit the gynecologic cancer community in British Columbia. This infrastructure democratizes access to resources shared by the entire community and brings together the whole gynecological cancer community in BC to work towards a common goal: to reduce death and suffering for women with gynecologic malignancies. To safeguard our data assets and maximize their utility, we have created a unified infrastructure, along with standardized operating procedures, to meet research and ethics needs. The core expertise in data management and informatics which was developed in this process generated efficiencies in data collection to maximize the value of data and stretch research funds by optimizing their secondary use. The proposed governance structure streamlines requests and ensures scientific integrity of projects while adhering to privacy, security, and ethical disclosure of patient-specific data.
Through our investigations, we found that no single solution can meet all the different data needs. Rather, the integration of multiple solutions can help us achieve the desired outcome. While the software and technology stack used to implement the current infrastructure will serve us for the near future (i.e., the next five years), the data storage and management field is moving at a very fast pace, and we may need to re-assess our requirements soon. In choosing our software stack, we needed to weigh open-source and open-access software, which provides affordable solutions and more control but carries the risks of limited support and of the software no longer being maintained, against corporate software, which provides more technical support and liability but can be very costly to set up and maintain. To mitigate these risks, we went with hybrid models where possible and selected software that had an active community of users and that enabled some degree of customization.
The data we collected as part of primary research or for administrative purposes needed to be harmonized for integration. For example, some data sources report "tumor grade" as "high" or "low," while others report numeric grades, e.g., 1, 2, 3, 4. And while gender is occasionally reported as "male" and "female," it can also be represented as "M" and "F," "1" and "0," or "1" and "2."<ref name=":12">{{Cite book |last=Olson |first=Steve |last2=Downey |first2=Autumn S. |date=2013 |others=Institute of Medicine (U.S.), Institute of Medicine (U.S.), National Cancer Policy Forum (U.S.), Institute of Medicine (U.S.), Institute of Medicine (U.S.), Institute of Medicine (U.S.) |title=Sharing clinical research data: workshop summary |url=https://www.worldcat.org/title/mediawiki/oclc/853280017 |publisher=The National Academies Press |place=Washington, D.C |isbn=978-0-309-26874-5 |oclc=853280017}}</ref> Integration of such data presents “unique technical, semantic, and ethical challenges”<ref name=":13">{{Cite journal |last=Seneviratne |first=Martin G. |last2=Kahn |first2=Michael G. |last3=Hernandez-Boussard |first3=Tina |date=2019 |title=Merging heterogeneous clinical data to enable knowledge discovery |url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447393/ |journal=Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing |volume=24 |pages=439–443 |issn=2335-6928 |pmc=6447393 |pmid=30864344}}</ref> and could also result in large amounts of unusable data due to loss in translation. Developing standards ''a priori'' streamlines semantics and ontologies, avoids data wastage, increases data quality, and supports effective data integration, sharing, and reusability, while also saving significant time and costs required to pool, process, and share data.<ref name=":12" /><ref>{{Cite journal |last=Huser, V.; Sastry, C.; Breymaier, M. et al. |date=2015-10-01 |title=Standardizing data exchange for clinical research protocols and case report forms: An assessment of the suitability of the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM) |url=https://www.sciencedirect.com/science/article/pii/S1532046415001331 |journal=Journal of Biomedical Informatics |language=en |volume=57 |pages=88–99 |doi=10.1016/j.jbi.2015.06.023 |issn=1532-0464 |pmc=PMC4714951 |pmid=26188274}}</ref> Future efforts to connect with other biorepositories and similar databases from other centers rely on adopting standardized ontologies to facilitate data sharing. Policies for ensuring data quality and security were also defined, including establishing team and user roles and data access levels, ensuring that all processes from data acquisition to distribution are compliant with stipulated policies and research ethics.
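
The coding example above shows why harmonization has to be source-aware: the value "1" cannot be interpreted without knowing which data source produced it. The following Python sketch makes that concrete. The source names, the code assignments in <code>SEX_CODES</code>, and the <code>harmonize_sex</code> helper are hypothetical; real mappings would come from each source's own documentation.

```python
# Hypothetical per-source code lists; which code means what is an assumption here.
SEX_CODES = {
    "registry": {"M": "male", "F": "female"},
    "study_a": {"1": "male", "0": "female"},
    "study_b": {"1": "male", "2": "female"},
}


def harmonize_sex(source, raw_value):
    """Map a source-specific code to a shared vocabulary; return None if unmapped."""
    value = str(raw_value).strip()
    if value.lower() in ("male", "female"):
        return value.lower()
    # Look up the code in the mapping for this particular source only.
    return SEX_CODES.get(source, {}).get(value.upper())
```

Returning <code>None</code> for unmapped values, rather than guessing, lets such records be flagged for review instead of being silently miscoded. A field such as tumor grade would need the same treatment, with an explicitly agreed mapping between categorical and numeric scales.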
The data commons is overseen by three principal investigators, including an informatician, a medical oncologist, and a gynecological oncologist. The team that operationalizes this infrastructure includes a part-time database manager and a data scientist who work on various data integrations. A lab technician and a clinical coordinator, with the help of various co-op students, facilitate specimen acquisition, storage, and data collection. Occasional consultations with pathology and oncology fellows are needed.
Our team continues to curate and harmonize available data to maximize their utility. For example, in the next year, we will add digital pathology images and have the ability to upload our collection to data enclaves, where it can be linked to other administrative data, including health service utilization and prescription drugs. This will result in a very rich data ecosystem, which will be ripe for novel scientific discovery and can enable research never before possible.
In the very near future, we are expanding our data commons to make it more patient-centric. We are launching an online consent process so that we can reach a broader patient population and invite them to participate in research. We are also adding patient-reported outcomes (PROs) to the data commons.
==Conclusions==
In contrast to traditional biorepositories, the consolidation of heterogeneous datasets and biospecimens from various distributed systems, clinical studies, and research institutions, into a data commons presents important opportunities to drive translational medicine. A seamless data environment for clinical and research data can be achieved through shared policies and technologies, and privacy-preserving open computer architectures and storage platforms.
The success and sustainability of data commons rely first and foremost on fostering a scientific community capable of using the open and connected data environment. Secondly, the appropriate technological solution suitable for each type of data needs to be in place; no single solution can be adapted to all data collections, so multiple solutions should be integrated. Lastly, the proper governance structure is needed to grapple with the unique challenges presented in cross-institutional and multi-disciplinary research, resource integration, data sharing, and data harmonization for greater interoperability.
In this paper, we present methods developed and applied to successfully establish a federated and scalable infrastructure that extends OVCARE’s traditional tumor biobank, outcomes unit, and a collection of data silos, into an integrated data commons. To this end, we gathered and analyzed all research requirements of participating institutions under three main domains: (1) biospecimen collections, (2) molecular and genomics data, and (3) clinical data. We then built a governance model and a resource portal to effectuate protocols and standard operating procedures, to support data and biomaterials aggregation, sharing, harmonization and governance, across all participating institutions. We believe such infrastructures will help break barriers to the access of large datasets required to elucidate and improve our understanding of complex and rare diseases, providing powerful opportunities for knowledge discovery and translation towards improved patient care.
==Methods==
===Needs assessment===
To identify research needs and gather infrastructural requirements, stakeholders were engaged from all participating institutions. We held discussions and one-on-one meetings with individual researchers, as well as brainstorming meetings to map out general research direction and requirements for the upcoming five to 10 years. Further discussions were conducted with institutional research IT to understand security, data management, and sustainability requirements. Identified direction and priorities were expanded into a list of requirements (Additional file 2: Table S1) relevant to the collection and optimization of biospecimen, clinical, and molecular/genomics data, as well as a governance model of the resulting infrastructure.
===Technical solutions===
For each of the domain-specific requirements (governance, biospecimen, clinical and molecular/genomics data), technical solutions were identified to meet the needs established under that domain. Solutions required for managing clinical and molecular/genomics data (REDCap and cBio Cancer Genomics Portal respectively) were previously well established, tested, implemented, and proven to meet the needs emerging from these two data domains in our research environment.
To identify a LIMS solution that met all or most of the identified biospecimen requirements, we surveyed the biorepository and LIMS environment (Additional file 1) and identified nine prominent software solutions, which we comparatively evaluated. Based on publications and online documentation, we collected and analyzed data on all identified biobanking software and examined the features and functionality of each with respect to our requirements (Additional file 2: Table S12). We also conducted meetings, interviews, and live interactive demos with various software vendors. A list of features per identified platform (Additional file 2: Tables S2–S11) was generated, against which each of our requirements was considered to identify the solution that best addressed our needs (Additional file 2: Table S12). Results from this survey were presented in a second stakeholder meeting where we discussed the suitability and utility of the identified LIMS, and we decided to further evaluate OpenSpecimen.
Based on collected biospecimen data, we defined database concepts (entities, attributes, relationships, and constraints) and customized the backend OpenSpecimen database (running [[MySQL]]). We obtained a test server and installed a Linux-based local instance of OpenSpecimen (implemented in Java and [[Apache Tomcat]]) in our computing environment. During these pilot runs, frequent inquiries were made with software vendors on features, components, integration, and interoperability functions, including the identification of missing requirements. Following successful tests, data from legacy systems were consolidated into the server by leveraging OpenSpecimen’s batch uploads utility. We further designed and developed the user interface and configured and customized OpenSpecimen to our unique requirements before moving it into production.
===Data standardization and integration===
The vision of modern translational medicine largely hinges on the integration of large-scale clinical and molecular profiles of patients to derive hypotheses and novel insights into a patient’s disease.<ref name=":13" /><ref>{{Cite journal |last=De Maria Marchiano |first=Ruggero |last2=Di Sante |first2=Gabriele |last3=Piro |first3=Geny |last4=Carbone |first4=Carmine |last5=Tortora |first5=Giampaolo |last6=Boldrini |first6=Luca |last7=Pietragalla |first7=Antonella |last8=Daniele |first8=Gennaro |last9=Tredicine |first9=Maria |last10=Cesario |first10=Alfredo |last11=Valentini |first11=Vincenzo |date=2021-03-18 |title=Translational Research in the Era of Precision Medicine: Where We Are and Where We Will Go |url=https://doi.org/10.3390/jpm11030216 |journal=Journal of Personalized Medicine |volume=11 |issue=3 |pages=216 |doi=10.3390/jpm11030216 |issn=2075-4426 |pmc=PMC8002976 |pmid=33803592}}</ref><ref>{{Cite journal |last=Tian |first=Q. |last2=Price |first2=N. D. |last3=Hood |first3=L. |date=2012 |title=Systems cancer medicine: towards realization of predictive, preventive, personalized and participatory (P4) medicine |url=https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2796.2011.02498.x |journal=Journal of Internal Medicine |language=en |volume=271 |issue=2 |pages=111–121 |doi=10.1111/j.1365-2796.2011.02498.x |issn=1365-2796 |pmc=PMC3978383 |pmid=22142401}}</ref> The data at OVCARE is derived from multiple disparate sources. To consolidate data from several databases, we began rigorous data validation and quality control checks. We extensively reviewed all biospecimen data, which included:
*checking, locating, and uploading all physical consent forms to ensure a digital record in our database;
*uploading all physical biospecimen requisition forms;
*reviewing all pathology diagnoses (by pathologists with gynecological subspecialty); and
*locating and confirming availability of all specimens.
The process of integrating molecular and genomics datasets into OpenSpecimen required close collaboration with researchers with expertise in the interpretation of these data. At the start of 2019, we obtained and consolidated from all OVCARE researchers any previously collected “-omics” datasets. As a first step, we mapped the omics data back to specimens and created tags indicating their availability in OpenSpecimen patient profiles. The second step of this process started in April 2020 with the implementation of cBioPortal for data visualization and analytics.
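The first step above, mapping omics datasets back to specimens and tagging their availability, can be illustrated with a minimal sketch. It assumes each omics record carries a specimen identifier and an assay name; the field names <code>specimen_id</code> and <code>assay</code>, the function name, and the sample records are hypothetical and are not taken from OVCARE's systems or from OpenSpecimen.

```python
from collections import defaultdict


def tag_omics_availability(omics_records):
    """Group assay names by specimen ID so each specimen carries availability tags."""
    tags = defaultdict(set)
    for record in omics_records:
        tags[record["specimen_id"]].add(record["assay"])
    # Sorted lists keep the tags stable for display or export.
    return {specimen: sorted(assays) for specimen, assays in tags.items()}


# Hypothetical records for illustration only (not OVCARE data).
omics_records = [
    {"specimen_id": "SP-001", "assay": "RNA-seq"},
    {"specimen_id": "SP-001", "assay": "targeted sequencing"},
    {"specimen_id": "SP-002", "assay": "proteomics"},
]
availability = tag_omics_availability(omics_records)
```

A lookup such as <code>availability["SP-001"]</code> then lists the assays on record for that specimen, which is the kind of information an availability tag would surface in a patient profile.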
To consolidate clinical data, we devised a two-step approach whereby we use a minimal set of data elements available on all patients, supplemented by data available from other studies on various subsets of patients. We evaluated all available data elements that can be obtained from administrative sources (e.g., BC Cancer Registry) for accuracy, consistency, and completeness. We selected a set of data elements that met our quality standards. We deployed a pipeline that regularly performs quality checks on data elements against a set of rules that can be applied programmatically to validate the integrity, consistency, and logic between various elements before their integration. Only data that passed quality checks would be merged with a permanent clinical database; data that failed quality checks were further investigated with data stewards to determine sources of error. Clinical outcomes data from the BC Cancer Registry were de-identified before being merged with a permanent database hosted in REDCap, and updated quarterly.
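A rule-based quality check of the kind described above might look like the following sketch. It is not the pipeline used at OVCARE; the field names (<code>patient_id</code>, <code>age_at_diagnosis</code>, <code>diagnosis_date</code>, <code>death_date</code>) and the three rules are assumptions chosen to show one integrity rule, one consistency rule, and one cross-field logic rule.

```python
from datetime import date


def has_patient_id(record):
    # Integrity: every record must carry an identifier.
    return bool(record.get("patient_id"))


def has_plausible_age(record):
    # Consistency: age must be an integer in a plausible range.
    age = record.get("age_at_diagnosis")
    return isinstance(age, int) and 0 <= age <= 120


def death_not_before_diagnosis(record):
    # Logic between elements: a death date cannot precede the diagnosis date.
    death, diagnosis = record.get("death_date"), record.get("diagnosis_date")
    return death is None or (diagnosis is not None and death >= diagnosis)


# Hypothetical rule set: (description, predicate) pairs.
RULES = [
    ("patient_id is present", has_patient_id),
    ("age at diagnosis is between 0 and 120", has_plausible_age),
    ("death date is not before diagnosis date", death_not_before_diagnosis),
]


def split_by_quality(records):
    """Return (passed, failed); failed entries pair a record with its failed rules."""
    passed, failed = [], []
    for record in records:
        failures = [text for text, rule in RULES if not rule(record)]
        if failures:
            failed.append((record, failures))
        else:
            passed.append(record)
    return passed, failed


# Hypothetical records for illustration only.
records = [
    {"patient_id": "P1", "age_at_diagnosis": 61,
     "diagnosis_date": date(2015, 3, 1), "death_date": None},
    {"patient_id": "P2", "age_at_diagnosis": 58,
     "diagnosis_date": date(2016, 5, 1), "death_date": date(2014, 1, 1)},
]
passed, failed = split_by_quality(records)
```

In this sketch, records in <code>passed</code> would proceed to the merge, while each entry in <code>failed</code> keeps the list of violated rules so that a data steward can trace the source of the error.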
To complement data available from the Registry, the second step of our process involved integrating clinical data obtained through clinical studies and held in silos. To ensure that data can be aggregated, compared, analyzed, shared, and reused across studies, data standards were defined to resolve standardization discrepancies.<ref name=":12" /> Unique data variables were aggregated from seven clinical studies to understand the breadth of the data in our clinical database. We created a standardized data dictionary with the goal of mapping data elements to the same data concepts across all clinical data collections in British Columbia, and these concepts in turn can be matched with a common data model, OMOP-CDM,<ref>{{Cite web |title=OMOP Common Data Model |url=https://www.ohdsi.org/data-standardization/the-common-data-model/ |publisher=OHDSI |accessdate=02 July 2021}}</ref> to maximize interoperability with external datasets.
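The role of a standardized data dictionary can be sketched as a lookup that maps each study's local variable names and codes to a shared concept. Everything in the example is an assumption made for illustration: the study names, variable names, codes, and concept labels are hypothetical, and the mapping to OMOP-CDM concepts is deliberately left out.

```python
# Hypothetical data dictionary:
# (study, local variable) -> (standard concept, optional value recoding).
DATA_DICTIONARY = {
    ("study_a", "hist"): ("histotype", {"HGSC": "high-grade serous", "EC": "endometrioid"}),
    ("study_b", "histology_type"): ("histotype", {"1": "high-grade serous", "2": "endometrioid"}),
    ("study_a", "dx_age"): ("age_at_diagnosis", None),
    ("study_b", "age"): ("age_at_diagnosis", None),
}


def harmonize(study, row):
    """Translate one study-specific row into standard concepts.

    Returns the standardized row and the list of variables that have
    no dictionary entry, so they can be reviewed rather than dropped silently.
    """
    standardized, unmapped = {}, []
    for variable, value in row.items():
        entry = DATA_DICTIONARY.get((study, variable))
        if entry is None:
            unmapped.append(variable)
            continue
        concept, recoding = entry
        # Values without a recoding entry pass through unchanged.
        standardized[concept] = recoding.get(value, value) if recoding else value
    return standardized, unmapped


row_a, unmapped_a = harmonize("study_a", {"hist": "HGSC", "dx_age": 61})
row_b, unmapped_b = harmonize("study_b", {"histology_type": "1", "age": 61, "site": "x"})
```

Once both rows are expressed in the same concepts they can be compared or pooled directly, and variables such as the hypothetical <code>site</code> are flagged for a decision on whether they belong in the dictionary.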
===Data governance, ethics and standard operating procedures===
Following standardization and aggregation of all our data sources, we developed a centralized governance model and defined protocols, SOPs, and policies governing data access, storage, protection, sharing, and permissible use across OVCARE’s research community. To implement the governance framework, we designed, developed, tested, and deployed the OVCARE Resource Portal (ORP). The portal was developed using Oracle APEX to provide an online interface for all internal research and collaborating teams to request resources, including biospecimen, clinical, molecular, and imaging data, as well as informatics and data analytics support.
==Supplemental information and data availability==
The LIMS survey data analyzed during the current study are available in Additional file 2: Tables S2–S12. The data are also publicly available on the websites (features section) of each surveyed LIMS (Additional file 1).
*[https://static-content.springer.com/esm/art%3A10.1186%2Fs12967-021-03147-z/MediaObjects/12967_2021_3147_MOESM1_ESM.docx Additional file 1]: Evaluation of identified biobanking laboratory information management systems (.docx)
*[https://static-content.springer.com/esm/art%3A10.1186%2Fs12967-021-03147-z/MediaObjects/12967_2021_3147_MOESM2_ESM.xlsx Additional file 2]: OVCARE data commons: requirements identification and mapping desired biobanking features to solutions meeting the need (.xlsx)
==Abbreviations==
'''AI''': artificial intelligence
'''APEX''': Oracle Application Express
'''API''': application programming interface
'''CBGOU''': Cheryl Brown Gynecological Cancers Outcomes Unit
'''CNAs''': copy number alterations
'''GUI''': graphical user interface
'''HL7''': Health Level Seven
'''HTTP''': Hypertext Transfer Protocol
'''ICD''': International Classification of Diseases
'''LIMS''': laboratory information management systems
'''mRNA''': messenger RNA
'''MSK''': Memorial Sloan Kettering Cancer Center
'''MySQL''': My Structured Query Language
'''NGS''': next-generation sequencing
'''OMOP-CDM''': Observational Medical Outcomes Partnership-Common Data Model
'''ORP''': OVCARE Resource Portal
'''OVCARE''': Ovarian Cancer Research Program
'''PRO''': patient-reported outcomes
'''REDCap''': Research Electronic Data Capture
'''REST''': representational state transfer
'''SOPs''': standard operating procedures
'''SQL''': Structured Query Language
'''TARGET''': Tumor Alterations Relevant for Genomics-Driven Therapy
'''TCGA''': The Cancer Genome Atlas
'''TMA''': tissue microarray
==Acknowledgements==
The authors are profoundly grateful to all the women who donated their samples for research. Without their generosity, advancements in gynecological cancer research and care would not be possible. The authors wish to extend special thanks to Jane & Maurice Wong and the Gray Family for their foresight in funding the work of the data commons which has and will continue to be a tremendous resource for researchers. The authors also acknowledge the funding from the BC Cancer Foundation, the VGH & UBC Hospital Foundation, the University of British Columbia, and Ovarian Cancer Canada (to OVCARE, BC’s gynecologic cancer research team).
===Author contributions===
AT in collaboration with MW and SL conceptualized OVCARE’s transition into a data commons in consultations with DH, JNM and AT. RA conducted the LIMS survey, analysis and results interpretation in collaboration with SL and AT. Data standardization and integration was conducted by SL, SW, SL and RW under the supervision of AT. SL, SL, RW, MW, and AT established the data commons governance model, policies and standard operating procedures in consultation with DH, JNM and AT. Design, testing and implementation of data solutions was conducted by SL, RA and SL, under the supervision of AT. Manuscript composition and drafting was conducted by RA, and AT with SL, SL, RW, and SW. All authors read and approved the final manuscript.
===Funding===
This work was funded by donations from Jane & Maurice Wong and the Gray Family. Funding was also obtained from the BC Cancer Foundation, the VGH & UBC Hospital Foundation, the University of British Columbia, and Ovarian Cancer Canada (to OVCARE, BC’s gynecologic cancer research team).


===Competing interests===
The authors declare that they have no competing interests.


==References==


<!--Place all category tags here-->
[[Category:LIMSwiki journal articles (added in 2022)]]
[[Category:LIMSwiki journal articles (all)]]
[[Category:LIMSwiki journal articles on clinical informatics]]
[[Category:LIMSwiki journal articles on library informatics]]
[[Category:LIMSwiki journal articles on health informatics]]
[[Category:LIMSwiki journal articles on research]]

Revision as of 18:59, 7 February 2022

Full article title From biobank and data silos into a data commons: Convergence to support translational medicine
Journal Journal of Translational Medicine
Author(s) Asiimwe, Rebecca; Lam, Stephanie; Leung, Samuel; Wang, Shanzhao; Wan, Rachel; Tinker, Anna; McAlpine, Jessica N.; Woo, Michelle M.M.; Huntsman, David G.; Talhouk, Aline
Author affiliation(s) BC Cancer Research Centre, BC Children’s Hospital Research Institute, University of British Columbia, OVCARE
Primary contact Email: a dot talhouk at ubc dot ca
Year published 2021
Volume and issue 19
Article # 493
DOI 10.1186/s12967-021-03147-z
ISSN 1479-5876
Distribution license Creative Commons Attribution 4.0 International
Website https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-021-03147-z
Download https://translational-medicine.biomedcentral.com/track/pdf/10.1186/s12967-021-03147-z.pdf (PDF)

Abstract

Background: To drive translational medicine, modern day biobanks need to integrate with other sources of data (e.g., clinical, genomics) to support novel data-intensive research. Currently, vast amounts of research and clinical data remain in silos, held and managed by individual researchers, operating under different standards and governance structures; such a framework impedes sharing and effective use of data.

In this article, we describe the journey of British Columbia’s Gynecological Cancer Research Program (OVCARE) in moving a traditional tumor biobank, outcomes unit, and a collection of data silos into an integrated data commons to support data standardization and resource sharing under collaborative governance, as a means of providing the gynecologic cancer research community in British Columbia access to tissue samples and associated clinical and molecular data from thousands of patients.

Results: Through several engagements with stakeholders from various research institutions within our research community, we identified priorities and assessed infrastructure needs required to optimize and support data collections, storage, and sharing, under three main research domains: (1) biospecimen collections, (2) molecular and genomics data, and (3) clinical data. We further built a governance model and a resource portal to implement protocols and standard operating procedures (SOPs) for seamless collections management and governance of interoperable data, making genomic and clinical data available to the broader research community.

Conclusions: Proper infrastructure for data collection, sharing, and governance is a translational research imperative. We have consolidated our data holdings into a data commons, along with standardized operating procedures to meet research and ethics requirements of the gynecologic cancer community in British Columbia. The developed infrastructure brings together diverse data and computing frameworks, as well as tools and applications for managing, analyzing, and sharing data. Our data commons bridges data access gaps and barriers to precision medicine and approaches for diagnostics, treatment, and prevention of gynecological cancers by providing access to large datasets required for data-intensive science.

Keywords: biobanks, biospecimens, biobank technologies, precision medicine, data commons, laboratory information management systems, LIMS, federated systems, data governance

Background

The collection, storage, management, and distribution of human biospecimens for diagnostic pathology[1][2][3] can be traced as far back as the 1900s.[3] To meet research needs in the postgenomic era, modern day biorepositories[4] support scientists to derive disease-specific insights[5] by aiding the investigation of genetic underpinnings[6][7][8], elucidating etiology, and evaluating disease progression and therapeutic response; they are the backbone of precision medicine[9][10], as well as biomedical and translational research.[1][2][11]

The last decade has seen advances in biotechnology such as next-generation sequencing (NGS), and the emergence of “omics” techniques for precision medicine (e.g., genomics, transcriptomics, proteomics, metabolomics, and epigenomics). These innovations coincided with breakthroughs in computing, artificial intelligence (AI), and analytics, enabling discrimination between disease with greater precision.[12] This has created an unprecedented demand for high-quality biospecimens and associated data, including clinical, molecular, imaging, and other types of data generated during research.[11] Innovations in database cloud storage and computing infrastructures to support data-intensive science have further contributed to revolutionizing resources available to address modern research needs.[13][14] As federated models for aggregating data and biomaterials have emerged as favored approaches for identifying enough patients with specific clinical or molecular features, the importance of interoperability between biobanks and related databases has been accentuated.[4][7][15] Specimen collections have become virtual[13], flexible, and interoperable, hosted on internationally harmonized infrastructures[7] and optimized for secondary research.[7][13] Present-day research environments and needs have led to the development and implementation of data commons[16][17], bringing together within a research community diverse data and computing infrastructures, as well as tools and applications for managing, analyzing, and sharing interoperable data. This has created an opportunity to maximize collaborations and to extend the value generated from primary data collection.[18]

In 2016, as part of British Columbia’s multidisciplinary gynecological cancer research team (OVCARE), we undertook a comprehensive review of the landscape of the data assets available within our local research environment and assessed the infrastructure needs required to support data storage and sharing within our research community. Herein, we describe the roadmap undertaken for the creation of a data commons, transforming a traditional tumor biobank and a collection of data silos into an integrated and comprehensive infrastructure to support the current and future research needs of an expanding team.

Results

Matching technical solutions to research needs

OVCARE started in 2000 as an initiative between the British Columbia Cancer Agency, the University of British Columbia, and the Vancouver Coastal Health Research Institute to accelerate research discoveries and their translation to the clinical settings, as well as improve the lives of women with ovarian cancer or those at risk. Today, OVCARE is an internationally recognized multidisciplinary team of physicians and scientists who are breaking new ground in improving diagnosis, prevention, and treatment of all gynecological cancers.[19][20][21][22][23][24][25][26][27]

OVCARE’s research has been powered by the gynecological tumor bank and the Cheryl Brown Gynecological Cancers Outcomes Unit. Through the course of research, a plethora of molecular and genomic data was historically held by the researchers who generated it. Similarly, clinical data from chart reviews were obtained to support clinical studies and held by clinicians. These data were in incompatible formats that needed significant manual manipulation and curation to be integrated. Moreover, each collection was governed by different ethics agreements that restricted the use of data and kept it in silos. This was becoming a barrier to novel data-intensive research, which requires the integration of multiple data sources; undertaking such projects was challenging, time consuming, and prone to errors. The OVCARE leadership recognized that current research needs were not being met through existing infrastructure.

A broad stakeholder engagement effort in 2016 kicked off, with the objective to work with researchers, clinicians, scientists, and technicians at various institutions, to map out a collective future vision, identifying research needs and re-thinking present infrastructure. Engagements with key stakeholders identified research priorities, which were expanded into a list of fundamental requirements (Table 1) relevant to the collection and optimization of biospecimen, clinical, and molecular/genomics data, as well as a governance model of the resulting infrastructure. In addition to generating efficiency, limiting errors and honoring patient consent, fundamental research requirements included the maximization of secondary use of data, which enables data collected for one purpose to be reused in a completely different context. For example, chemotherapy drugs dispensed at our pharmacy are collected for administrative purposes (billing) but can also be used to link with patient phenotype, genotype, and outcome to investigate which patients benefit from these therapies more than others. Another important need was to generate novel research hypotheses by considering simultaneously various data that could never before be considered at the same time. Patterns that may not have been obvious previously may emerge to drive future innovative research. Another important need was to use translational studies to help inform patient care, as well as use data generated from patient care to ask new research questions, with the goal of continuously trying to better fill gaps in understanding of disease etiology and progression. In upcoming sections, we further describe more of these requirements in greater detail.

Table 1. Summary of fundamental research and infrastructural needs of OVCARE’s research community
Fundamental research requirements
1 Generate efficiencies in data collection, storage, and analysis to maximize utility of collected data
2 Limit errors in data handling and ensure reproducibility of research findings
3 Protect patients’ privacy and honor their consent
4 Optimize secondary and continuous use of data generated from research and clinical care
5 Facilitate the recruitment of patients in various clinical studies
6 Identify specimens from patients with specific clinical, molecular, and genomic characteristics
7 Integration of medical and clinical data with molecular information to enable the discovery and testing of new associations and hypotheses towards translational research
8 Organize data towards a learning healthcare system where translation is bi-directional, meaning evidence-based research is used to inform practice, and the data generated during clinical care is in turn used to inform guidelines, generate hypotheses, and trigger pragmatic trials
Functional and infrastructural IT requirements
1 Allow batch data imports and exports
2 Facilitate validation of data entered to minimize errors (e.g., returning an error message when text is entered instead of a numeric value)
3 Easy-to-use and customizable user interfaces
4 Support both prospective and retrospective data collection mechanisms
5 Adapt to changing needs between studies and projects, as well as over time
6 Track biospecimen locations, usage, and shipment to both local and offsite storage locations
7 Support multi-tenancy for the banking of biospecimens from distributed and diverse studies led by different investigators interested in sharing resources
8 Adherence to best practices in privacy and security, such as support for data encryption, audit trails on all user actions, and data changes for regulatory compliance; configurable user privileges; role-based access control; and adherence to federal regulations with respect to de-identification of specimens and tracking of consent
9 Support interoperability and integration with other institutions, systems, and data sources to facilitate data sharing
10 Potential to scale-up biospecimen and user capacity at no added cost
11 Stable and mature vendor and community support

Biospecimen collection

OVCARE employs two models for biospecimen recruitment. The first is a general banking model, with broad scientific aims, and where specimens are obtained from consented participants and stored until needed. The second is a study-based banking model, where participants are recruited to address specific study aims, with a pre-defined protocol and pre-planned specimen collection. To accommodate both approaches, the biorepository infrastructure needed to manage accrual of specimens in a patient-centric approach, retain the context of the patient’s clinical history, and support basic biospecimen collection, storage, and distribution across multiple studies at different sites, under both recruitment models. This includes inventory control, the ability to track sample availability and location, as well as track generated derivatives (e.g., xenografts and organoids). The infrastructure needed to be adaptable to changing needs between studies and projects, as well as over time, with the ability to preserve the natural history of the data. Access control that varied for different user groups was a critical feature, enabling adherence to regulatory requirements and health research best practices. Data security, de-identification of specimens, and tracking of consent were also important for the same reasons, in addition to the need to operate and manage the biorepository with minimal support from institutional and research IT.

We compiled a comprehensive list of requirements (Additional file 2: Table S1) from our stakeholder meetings, and we used it to guide our scan of the landscape of existing laboratory information management systems (LIMS) (Additional file 2: Tables S2–S11, and Fig. 1). This resulted in the identification of OpenSpecimen[28], a LIMS based on caTissue[29], a mature system with over 15 years of use by the research community. OpenSpecimen addressed more requirements from our list in comparison to other options we considered. It is an open-source software solution with commercial support, in use by over 70 biobanks across 20 countries. The commercial support ensures ongoing software testing, updating, and continuous improvement. This is in addition to the availability of technical support, and access to a community of experienced users through active forums.


Figure 1. Needs-to-biobank mapping and the number of requirements fulfilled by each LIMS. a. Tiled plot of the mapping of each biospecimen research need to the biobank solution meeting that need. Surveyed biobanks are plotted on the y-axis and research needs (desired biobank features) are plotted on the x-axis, grouped and colored by feature class. b. Barplot on the overall number of features provided by a specific LIMS. The LIMS solutions are plotted on the y-axis, and the number of features provided are plotted on the x-axis.

In this LIMS, biospecimens can be processed individually or in bulk, with rapid barcode-based scanning available to enter information on multiple patient samples at once. This enabled high-throughput processing and efficient migration from our legacy LIMS. Options for data annotation and storage management allowed us to optimize specimen storage, a costly resource in our research community (e.g., −80 °C freezers).[29]

The OpenSpecimen LIMS enabled customization of data entry forms via a graphical user interface (GUI, or web interface) to match study-specific needs without requiring software development. The platform met most of our IT requirements as it supported role-based access control and provided an audit trail of every user operation.[30] The system was also easy to use, with graphics-based queries that enable searching for stored data about participants, biospecimens, or projects without requiring any programming, making moderately complex queries accessible to most users. Queries could also be performed via REST API (Representational State Transfer Application Programming Interface) using a SQL (Structured Query Language)-like query language. This facilitated automation of data downloads for analytics pipelines through the incorporation of query scripts.

The system enables standalone plugins through a software development kit. These plugins can be made publicly available to the community. For example, the tissue microarray (TMA) plugin can manage TMAs on OpenSpecimen by linking to donor blocks and describing details of experiments done on the different slices of the TMA blocks. Finally, interoperability with other systems was important to expand linkage within the data commons. The vendor provides integration with electronic data capture applications (REDCap, OpenClinica), electronic medical record (EMR) systems (EPIC, Velos), and pathology systems (CoPath, Cerner, Aperio), as well as systems using Health Level 7 (HL7) messages, a capability which can further support inclusion of participant and biospecimen information from distributed systems.

Molecular and genomics data

Various molecular and genomics data are generated through the course of research, including next-generation sequencing, proteomics, gene expression, targeted sequencing, and immunohistochemical data. These data are primarily generated to answer specific research hypotheses and were supported by public, government, and philanthropic funds, with an implicit obligation to minimize duplication of efforts and to optimize their secondary use in later research. The ability to consider all this data simultaneously can uncover novel patterns, trends, and unknown correlations. This may prompt new hypotheses and spark new insights into novel research directions. To achieve this level of integration, we would need to track which analytical assay was performed on which samples and link back to those data. To facilitate the interrogation of this complex data, an exploration tool was needed to visualize resulting multidimensional datasets and simultaneously investigate molecular profiles and clinical attributes.

We adopted the cBioPortal for Cancer Genomics[31], one of the most recommended and widely used[32][33][34][35][36] pan-cancer analytics web tools to facilitate interactive exploration, mining, analysis, and visualization of multidimensional datasets derived from tumor samples collected from various cancer studies.[31][37] Developed at the Memorial Sloan Kettering Cancer Center (MSK), this platform is used by large cancer genomic studies (TCGA[38], TARGET[39]), and publicly available data can be downloaded and queried alongside our own collections.

The cBioPortal enables the collection of various genomic data on each tumor sample, including non-synonymous mutations, copy-number alterations (CNAs), mRNA and microRNA expression data, DNA methylation data, protein data, and phosphoprotein level data.[31] Each of these data types is integrated and stored at the gene level to allow investigators to probe for the presence of specific biological events (e.g., gene mutations, deletions, amplifications, and expression levels in each sample)[37], and compare discrete genomic events and patterns across samples and across multiple integrated data types.[31] Stored gene-level data is integrated with de-identified clinical data to probe patient clinical outcomes to support the development or testing of hypotheses on frequently altered genes in specific cancers.[31][37] In addition, it enables the investigation of the prognostic roles of certain genes in gynecological and other cancers[34], correlations between mutations, expression profiles, clinicopathological features, and potential diagnostic and therapeutic targets for certain cancer types.

Clinical data

Clinical data at OVCARE are obtained and collected for the purpose of evaluation of patient outcomes and improvement of the quality of patient care and research. Some of these data were historically managed by the Cheryl Brown Outcomes Unit for the purpose of outcomes research on ovarian cancer patients referred to BC Cancer, the provincial tertiary cancer center. The BC Cancer Registry provided the Cheryl Brown Outcomes Unit regular data updates, such as the identification of patients with cancer and their vital statistics, which were supplemented by exhaustive chart reviews. In addition to the Cheryl Brown Outcomes Unit, clinicians often conducted chart reviews for other clinical studies; the resulting data was held separately. In 2016, the scope of data collection at the Cheryl Brown Outcomes Unit was limited to ovarian cancer and did not take full advantage of other available data. Collecting clinical data was resource-intensive and the effort needed was not sustainable in the long run. Moreover, the mandate of the Cheryl Brown Outcomes Unit expanded to enable OVCARE’s researchers to study all gynecological cancers in the province of BC, especially those cancers that do not require referral to a cancer center (e.g., in BC, up to 50% of patients with endometrial cancer are treated by gynecologists in their communities). Thus, an important priority for the team was to create efficiencies in clinical data collection and to standardize, integrate, and link all gynecologic cancer clinical data from various sources and consolidate clinical data in a single database. This would allow researchers to understand what clinical data is already available, thereby streamlining their own data collection strategies which would, in turn, directly contribute to a master database. 
To maximize the re-use of clinical data, standardization of ontologies across projects was needed, as well as the creation of infrastructure to serve as permanent storage with an easy-to-use data collection interface adaptable to fit the needs of various research projects. This would allow standardization of data collection, to the greatest extent possible, and minimization of errors. Consequently, this would improve the overall quality of data, maximize interoperability and reusability, and optimize data analysis. Management of sensitive clinical data requires security, privacy, and the use of tools and technology with institutional approval. We also needed rigorous security and privacy measures, and comprehensive audit trails for tracking data manipulation, exports, and downloads for both single and multi-centered research studies, including tracking data access.

To support OVCARE’s clinical data requirements, we adopted Research Electronic Data Capture (REDCap), a widely used, free and flexible web-based application[40][41] developed at Vanderbilt University for clinical and translational research. It is one of the most popular research electronic data capture systems, used in 141 countries for over 1,000,000 studies[42], including at our institutions. REDCap’s flexible design supports permanent database collections, which can be augmented by patient-centric or study-centric surveys and data collection forms, and includes a rich set of modules that support today’s diverse and multi-scaled biomedical research operations.[41]

Governance structure

To manage the various integrated datasets (biospecimen, molecular, genomic, and clinical data), we needed to ensure proper governance, protocols, and standard operating procedures (SOPs) to support data sharing, streamline data requests and inquiries, undertake scientific review of requests, and ensure availability of ethics approval. We envisioned a single portal application for all requests and queries, with a backend database keeping track of details of requesting researchers, description of projects, and required resources, as well as their associated ethics application and certificates of approval. This infrastructure would facilitate compliance with ethics and maintain a log of all activities.

We adopted Oracle Corporation's Oracle Application Express (APEX)[43] to develop this portal application. Already available at our institution, APEX is a low-code, data-driven platform for rapid development and deployment of scalable and secure web applications. Applications are implemented in a preconfigured environment; all development was done through a web interface that is mostly GUI-based. The middle-tier functions of the web application software stack, such as parsing Hypertext Transfer Protocol (HTTP) requests and session management, are fully automated, and all operational aspects of the system (data backup, software patches and updates) are managed by institutional IT.

Implementation

The various components of the data commons infrastructure and software identified to meet the domain-specific needs described in the previous section are illustrated in Fig. 2. This infrastructure is implemented behind institutional firewalls, with only the resource portal accessible through the world wide web. The path to implementing this infrastructure was not linear and continues to evolve, despite the linear timeline presented in Fig. 3.


Figure 2. OVCARE’s data commons infrastructure and software stack. The overall data commons infrastructure comprises five main components: (1) a clinical database (REDCap) that consolidates and manages clinical data collections from the BC Cancer Registry and the Cheryl Brown Gynecological Cancers Outcomes Unit; (2) a laboratory information management system (OpenSpecimen) that stores and manages biospecimens collected from consented participants at different hospital sites (i.e., Vancouver General Hospital, the University of British Columbia Hospital, BC Cancer Vancouver, and now a few more centers in BC); (3) the cBioPortal that supports the exploration, analysis, and visualization of clinical attributes and molecular profiles from patient tumor samples; (4) the OVCARE Resource Portal (ORP) that governs data and resource sharing based on stipulated protocols, SOPs, and research ethics; and (5) the research community (this includes the OVCARE internal research and informatics team, and the broader research community that OVCARE serves). Each of the components (REDCap, OpenSpecimen, cBio Cancer Genomics Portal, ORP) identified to meet our research needs is separately hosted in our hospital’s computing environment and programmatically interlinked through API calls. The data from the different domains are interlinked using system-wide unique identifiers that link patients to their biospecimen collections and molecular/genomics data. To access the amassed clinical and biospecimen collections, authenticated researchers in the OVCARE research community send data and sample acquisition requests to the ORP, through which those requests are met by informatics staff if all stipulated requirements, including ethics approval, are met. Upon successful data and sample acquisition, researchers conduct their respective studies, and the data generated from their research (raw or processed, and/or biospecimen derivatives) are returned to OVCARE, making them available for re-purposing/secondary use. Furthermore, molecular data returned to the data commons are linked back to the available and stored patient biospecimens. Together with clinical outcomes, these molecular profiles are further explored, analyzed, and visualized using the cBioPortal.


Figure 3. Implementation timeline of OVCARE’s data commons

In early 2017, we completed a survey of existing biobanking solutions to select one that provided the best fit to our needs at that time. In June 2017, a test server was obtained to run local instances of the selected LIMS, OpenSpecimen, to conduct functionality, integration, and unit testing of all components of this software. This enabled us to evaluate OpenSpecimen's features firsthand and to determine the required resources to operate the infrastructure with optimal performance in our current computing and research environment. We tested for performance and evaluated operation workflows by diverse types of users, both technical and nontechnical, to perform daily biobanking activities. We fully adopted OpenSpecimen in December of 2017. Following this migration, we worked with researchers to gather available genomic datasets and link their availability to the respective biospecimen in OpenSpecimen, as well as indicate where data are held. As we continue to expand this resource, we will add images of pathology slides associated with each tumor block and link to them. To prototype the cBioPortal integration, we gathered molecular data for one ovarian cancer subtype, collected from prior studies, and integrated them with specimen availability and key clinical outcomes in cBioPortal using specimen ID. We recently launched this prototype, and it is currently under evaluation.

For clinical data, we expanded the mandate of the Cheryl Brown Outcomes Unit to include clinical and outcome data on all gynecological cancer patients diagnosed in British Columbia. We also obtained ethics approval to permanently retain clinical and outcomes data from all clinical studies in our group. We maximized data we can receive from administrative sources, such as the BC Cancer Registry, as this provides access to clinical data for all patients and minimizes the need for broad chart reviews (Fig. 4). We included elements, such as the date of diagnosis, date of last clinical appointment, vital statistics, International Classification of Diseases (ICD)-10 morphology codes, tumour stage, and grade. We are presently investigating additional data, such as systemic therapy (chemotherapy and radiation therapy received). The second step of clinical data integration involved adding clinical studies with chart reviews. To enable that, we needed to map different data elements to unique concepts. This further facilitated the identification of variables that are of greatest interest to researchers in our group. We then developed consistent data definitions, standards, and semantics for each data element to ensure that all data can be integrated within the data commons. Future data collection will consult these data standards to ensure prospectively harmonized clinical data.


Figure 4. Clinical and outcome data on all gynecological cancer patients diagnosed in British Columbia. In the tiled plot, data elements (demographic, medical history, pathology, chemotherapy, radiation, surgery, and quality of life data) were plotted on the y-axis against gynecological cancer patients (patient 1 to n) on the x-axis. Darker tiles indicate availability of data on a patient per data element. Clinical studies (study 1 to n) are interested in certain patients with available data on specific data elements. Subsets of patients overlap between clinical studies.

Finally, to manage all data assets and resources, we developed the OVCARE Resource Portal (ORP). Designed and customized to fit the needs of OVCARE users, this solution was implemented in APEX and launched in June 2020. This portal has helped to consolidate workflows and all data and resource requests, helping to ensure proper governance and compliance with protocols, SOPs, and research ethics board requirements.

Each of these implementations (REDCap, OpenSpecimen, and cBio Cancer Genomics Portal) is hosted separately on the hospital’s research IT network and is solely accessible to informatics staff. Only the resource portal is accessible for researchers to make requests. Data are integrated through unique identifiers that link the various tables from each database at the patient level or at the specimen level. Data linkage to fulfill various study requirements is done programmatically through API calls.
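
The identifier-based linkage described above can be pictured with a minimal Python sketch. The record layouts, the field names (patient_id, specimen_id), and the sample values are hypothetical and chosen only for illustration; in the actual infrastructure the records would be retrieved through each system's API rather than held in in-memory lists.

```python
from collections import defaultdict

# Hypothetical extracts from three systems; all names and values are illustrative.
clinical = [
    {"patient_id": "P001", "stage": "III"},
    {"patient_id": "P002", "stage": "I"},
]
specimens = [
    {"specimen_id": "S10", "patient_id": "P001", "type": "tumor block"},
    {"specimen_id": "S11", "patient_id": "P001", "type": "buffy coat"},
    {"specimen_id": "S20", "patient_id": "P002", "type": "tumor block"},
]
molecular = [
    {"specimen_id": "S10", "assay": "targeted sequencing"},
]


def link_records(clinical, specimens, molecular):
    """Build one nested view per patient by joining on the shared identifiers."""
    # Specimen-level link: molecular data -> specimen.
    assays_by_specimen = defaultdict(list)
    for row in molecular:
        assays_by_specimen[row["specimen_id"]].append(row["assay"])

    # Patient-level link: specimen -> patient.
    specimens_by_patient = defaultdict(list)
    for row in specimens:
        specimens_by_patient[row["patient_id"]].append({
            "specimen_id": row["specimen_id"],
            "type": row["type"],
            "assays": assays_by_specimen.get(row["specimen_id"], []),
        })

    return {
        row["patient_id"]: {
            "stage": row["stage"],
            "specimens": specimens_by_patient.get(row["patient_id"], []),
        }
        for row in clinical
    }


linked = link_records(clinical, specimens, molecular)
```

Keeping the join keys separate (patient level and specimen level) mirrors the two linkage levels named in the text; a study request would then select from the linked view only the patients and elements it is approved to receive.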

To request data, researchers create user accounts on the ORP, and if needed, associate the principal investigator profile to their account. Authenticated researchers can then submit information (study proposal, ethics approval, and study requirements) on the study for which resources will be requested. A project reference number created for progress tracking is then issued to the researcher and an ORP-generated email sent to the informatics staff notifying them of a new study proposal. Received proposals are subsequently processed and sent for review and approval by a committee of reviewers selected from the OVCARE community, after which resource requests are fulfilled. Researchers return to the data commons any raw and processed data that results from their studies, as well as any derivatives produced by their research (e.g., cell lines, DNA extractions, organoids).
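
As a rough illustration of this request workflow, the sketch below models a request as a record that moves through a fixed sequence of stages. It is not the ORP implementation (the portal is built in APEX); the stage names, the REQ-0001 reference format, and the ResourceRequest class are assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import count

# Stage names follow the steps in the text; their exact order is an assumption.
STAGES = ("submitted", "under_review", "approved", "fulfilled", "results_returned")
_reference_numbers = count(1)


@dataclass
class ResourceRequest:
    researcher: str
    has_ethics_approval: bool
    stage: str = STAGES[0]
    reference: str = field(init=False)
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Hypothetical reference format; the real numbering scheme is not described.
        self.reference = f"REQ-{next(_reference_numbers):04d}"
        self.history.append(self.stage)

    def advance(self):
        """Move to the next stage; approval is refused without ethics approval."""
        position = STAGES.index(self.stage)
        if position == len(STAGES) - 1:
            raise ValueError("request is already complete")
        next_stage = STAGES[position + 1]
        if next_stage == "approved" and not self.has_ethics_approval:
            raise ValueError("ethics approval is required before approval")
        self.stage = next_stage
        self.history.append(next_stage)  # activity log for auditing
        return self.stage
```

The history list stands in for the activity log that the portal maintains, so that every transition of a request can be audited later.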

Discussion

We have described the journey followed towards implementing a data commons to benefit the gynecologic cancer community in British Columbia. This infrastructure democratizes access to resources shared by the entire community and brings together the whole gynecological cancer community in BC to work towards a common goal: to reduce death and suffering for women with gynecologic malignancies. To safeguard our data assets and maximize their utility, we have created a unified infrastructure, along with standardized operating procedures, to meet research and ethics needs. The core expertise in data management and informatics which was developed in this process generated efficiencies in data collection to maximize the value of data and stretch research funds by optimizing their secondary use. The proposed governance structure streamlines requests and ensures scientific integrity of projects while adhering to privacy, security, and ethical disclosure of patient-specific data.

Through our investigations we found that no single solution can meet all the different data needs. Rather, the integration of multiple solutions can help us achieve the desired outcome. While the software and technology stack used to implement the current infrastructure will serve us for the near future (i.e., the next five years), the data storage and management field is moving at a very fast pace, and we may need to re-assess our requirements soon. In choosing our software stack, we needed to balance between risks associated with open-source and open-access, which provided affordable solutions and more control but with the downside of little available support and the possibility of the software no longer being maintained, versus going with a corporate software solution that provides more technical support and liability but can be potentially very costly to set up and maintain. To mitigate this, we went with hybrid models where possible and selected software that had an active community of users and that enabled some degree of customization.

The data we collected as part of primary research or for administrative purposes needed to be harmonized for integration. For example, some data sources report "tumor grade" as "high" or "low," while others report numeric grades, e.g., 1, 2, 3, 4. And while gender is occasionally reported as "male" and "female," it can also be represented as "M" and "F," "1" and "0," or "1" and "2."[44] Integration of such data presents “unique technical, semantic, and ethical challenges”[45] and could also result in large amounts of unusable data due to loss in translation. Developing standards a priori streamlines semantics and ontologies, avoids data wastage, increases data quality, and supports effective data integration, sharing, and reusability, while also saving significant time and costs required to pool, process, and share data.[44][46] Future efforts to connect with other biorepositories and similar databases from other centers rely on adopting standardized ontologies to facilitate data sharing. Policies for ensuring data quality and security were also defined, including establishing team and user roles and data access levels, ensuring that all processes from data acquisition to distribution are compliant with stipulated policies and research ethics.
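
The coding problem in the gender example can be sketched in a few lines of Python. Because the same raw code (e.g., "1") can carry different meanings in different sources, the sketch keeps one code map per source. The source names and the assignment of codes to categories are invented for illustration and do not describe any actual OVCARE dataset. No mapping is attempted for tumor grade, since translating numeric grades to "high" or "low" is a clinical decision the text does not specify.

```python
# One code map per source. Which code denotes which category is an
# assumption made for this example only.
GENDER_MAPS = {
    "source_a": {"male": "male", "female": "female"},
    "source_b": {"M": "male", "F": "female"},
    "source_c": {"1": "male", "0": "female"},
    "source_d": {"1": "male", "2": "female"},
}


def harmonize(value, source, maps=GENDER_MAPS):
    """Return the standard category, or None when the value cannot be mapped."""
    return maps.get(source, {}).get(str(value).strip())


# The raw code "2" is valid in source_d but meaningless in source_c.
in_source_d = harmonize(2, "source_d")
in_source_c = harmonize("2", "source_c")
```

Returning None for an unmapped value, rather than guessing, lets such records be flagged for review instead of being silently lost in translation.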

The data commons is overseen by three principal investigators, including an informatician, a medical oncologist, and a gynecological oncologist. The team that operationalizes this infrastructure includes a part-time database manager and a data scientist who work on various data integrations. A lab technician and a clinical coordinator, with the help of various co-op students, facilitate specimen acquisition, storage, and data collection. Occasional consultations with pathology and oncology fellows are needed.

Our team continues to curate and harmonize available data to maximize their utility. For example, in the next year, we will add digital pathology images and have the ability to upload our collection to data enclaves, where it can be linked to other administrative data, including health service utilization and prescription drugs. This will result in a very rich data ecosystem, which will be ripe for novel scientific discovery and can enable research never before possible.

In the very near future, we are expanding our data commons to make it more patient-centric. We are launching an online consent process so that we can reach a broader patient population and invite them to participate in research. We are also adding patient-reported outcomes (PRO) to the data commons.

Conclusions

In contrast to traditional biorepositories, the consolidation of heterogeneous datasets and biospecimens from various distributed systems, clinical studies, and research institutions, into a data commons presents important opportunities to drive translational medicine. A seamless data environment for clinical and research data can be achieved through shared policies and technologies, and privacy-preserving open computer architectures and storage platforms.

The success and sustainability of data commons rely first and foremost on fostering a scientific community capable of using the open and connected data environment. Secondly, the appropriate technological solution suitable for each type of data needs to be in place; there is no single solution that can be adapted to all data collections, so multiple solutions should be integrated. Lastly, the proper governance structure is needed to grapple with the unique challenges presented in cross-institutional and multi-disciplinary research, resource integration, data sharing, and data harmonization for greater interoperability.

In this paper, we present methods developed and applied to successfully establish a federated and scalable infrastructure that extends OVCARE’s traditional tumor biobank, outcomes unit, and a collection of data silos, into an integrated data commons. To this end, we gathered and analyzed all research requirements of participating institutions under three main domains: (1) biospecimen collections, (2) molecular and genomics data, and (3) clinical data. We then built a governance model and a resource portal to effectuate protocols and standard operating procedures, to support data and biomaterials aggregation, sharing, harmonization and governance, across all participating institutions. We believe such infrastructures will help break barriers to the access of large datasets required to elucidate and improve our understanding of complex and rare diseases, providing powerful opportunities for knowledge discovery and translation towards improved patient care.

Methods

Needs assessment

To identify research needs and gather infrastructural requirements, stakeholders were engaged from all participating institutions. Discussions and one-on-one meetings with individual researchers, as well as brainstorming meetings to map out general research direction and requirements for the upcoming five to 10 years were held. Further discussions were conducted with institutional research IT to understand security, data management, and sustainability requirements. Identified direction and priorities were expanded into a list of requirements (Additional file 2: Table S1) relevant to the collection and optimization of biospecimen, clinical, and molecular/genomics data, as well as a governance model of the resulting infrastructure.

Technical solutions

For each of the domain-specific requirements (governance, biospecimen, clinical and molecular/genomics data), technical solutions were identified to meet the needs established under that domain. Solutions required for managing clinical and molecular/genomics data (REDCap and cBio Cancer Genomics Portal respectively) were previously well established, tested, implemented, and proven to meet the needs emerging from these two data domains in our research environment.

To identify a LIMS solution that met all or most of the identified biospecimen requirements, we surveyed the biorepository and LIMS environment (Additional file 1) and identified nine prominent software solutions, which we comparatively evaluated. Based on publications and online documentation, we collected and analyzed data on all identified biobanking software and examined the features and functionality of each with respect to our requirements (Additional file 2: Table S12). We also conducted meetings, interviews, and live interactive demos with various software vendors. A list of features per identified platform (Additional file 2: Tables S2–S11) was generated, against which each of our requirements was compared to identify the solution that best addressed our needs (Additional file 2: Table S12). Results from this survey were presented in a second stakeholder meeting where we discussed the suitability and utility of the identified LIMS, and we decided to further evaluate OpenSpecimen.

Based on collected biospecimen data, we defined database concepts (entities, attributes, relationships, and constraints) and customized the backend OpenSpecimen database (running MySQL). We obtained a test server (implemented in Java and Apache Tomcat) and installed a Linux-based local instance of OpenSpecimen in our computing environment. During these pilot runs, frequent inquiries were made with software vendors on features, components, integration, and interoperability functions, including the identification of missing requirements. Following successful tests, data from legacy systems was then consolidated into the server by leveraging OpenSpecimen’s batch uploads utility. We further designed and developed the user interface and configured and customized OpenSpecimen to our unique requirements before moving it into production.

Data standardization and integration

The vision of modern translational medicine largely hinges on the integration of large-scale clinical and molecular profiles of patients to derive hypotheses and novel insights into a patient’s disease.[45][47][48] The data at OVCARE is derived from multiple disparate sources. To consolidate data from several databases, we began rigorous data validation and quality control checks. We extensively reviewed all biospecimen data, which included:

  • checking, locating, and uploading all physical consent forms to ensure a digital record in our database;
  • uploading all physical biospecimen requisition forms;
  • reviewing all pathology diagnoses (by pathologists with gynecological subspecialty); and
  • locating and confirming availability of all specimens.

The process of integrating molecular and genomics datasets into OpenSpecimen required close collaboration with researchers with expertise in the interpretation of these data. At the start of 2019, we obtained and consolidated from all OVCARE researchers any previously collected “-omics” datasets. As a first step, we mapped the omics data back to specimens and created tags indicating their availability in OpenSpecimen patient profiles. The second step of this process started in April 2020 with the implementation of cBioPortal for data visualization and analytics.

To consolidate clinical data, we derived a two-step approach whereby we use a minimal set of data elements available on all patients, supplemented by data available from other studies on various subsets of patients. We evaluated all available data elements which can be obtained from administrative sources (e.g., BC Cancer Registry) for accuracy, consistency, and completeness. We selected a set of data elements that met our quality standards. We deployed a pipeline that regularly performs quality checks on data elements against a set of rules that can be applied programmatically to validate the integrity, consistency, and logic between various elements before their integration. Only data that passed quality checks would be merged with a permanent clinical database; data that failed quality checks were further investigated with data stewards to determine sources of error. Clinical outcomes data from the BC Cancer Registry were de-identified before being merged with a permanent database hosted in REDCap, and updated quarterly.
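
A rule-based quality check of this kind might look like the following minimal sketch. The rules, the field names (diagnosis_date, last_appointment, vital_status), and the allowed values are assumptions made for illustration; the rules actually used in the pipeline are not listed in the text.

```python
from datetime import date

# Each rule is a (name, predicate) pair; rules and field names are hypothetical.
RULES = [
    ("diagnosis date present",
     lambda r: r.get("diagnosis_date") is not None),
    ("last appointment not before diagnosis",
     lambda r: r.get("diagnosis_date") is None
     or r.get("last_appointment") is None
     or r["last_appointment"] >= r["diagnosis_date"]),
    ("vital status is a known value",
     lambda r: r.get("vital_status") in {"alive", "deceased"}),
]


def run_checks(records, rules=RULES):
    """Split records into those passing every rule and those failing any rule."""
    passed, failed = [], []
    for record in records:
        broken = [name for name, is_ok in rules if not is_ok(record)]
        if broken:
            failed.append((record, broken))  # kept for review with data stewards
        else:
            passed.append(record)
    return passed, failed


records = [
    {"patient_id": "P001", "diagnosis_date": date(2015, 3, 1),
     "last_appointment": date(2019, 6, 1), "vital_status": "alive"},
    {"patient_id": "P002", "diagnosis_date": date(2018, 1, 1),
     "last_appointment": date(2017, 1, 1), "vital_status": "not recorded"},
]
passed, failed = run_checks(records)
```

Only the records in passed would go on to be merged; each failed record is paired with the names of the rules it broke, which gives data stewards a starting point for tracing the source of the error.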

To complement data available from the Registry, the second step of our process involved integrating clinical data obtained through clinical studies and held in silos. To ensure that data can be aggregated, compared, analyzed, shared, and reused across studies, data standards were defined to resolve standardization discrepancies.[44] Unique data variables were aggregated from seven clinical studies to understand the breadth of the data in our clinical database. We created a standardized data dictionary with the goal of mapping data elements to the same data concepts across all clinical data collections in British Columbia; these concepts can in turn be matched with a common data model, OMOP-CDM[49], to maximize interoperability with external datasets.
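
The role of the data dictionary can be illustrated with a short sketch in which study-specific variable names are mapped to shared concepts. The concept names, study labels, and variable names below are hypothetical, and the further step of matching concepts to OMOP-CDM is not shown.

```python
# Hypothetical data dictionary: one standard concept per entry, with the
# variable name each study uses for it.
DATA_DICTIONARY = {
    "date_of_diagnosis": {"study_1": "dx_date", "study_2": "DiagnosisDate"},
    "tumour_stage": {"study_1": "stage", "study_2": "figo_stage"},
}


def to_standard(record, study, dictionary=DATA_DICTIONARY):
    """Rename a study record's variables to standard concepts.

    Returns the renamed record and a sorted list of variables that have
    no entry in the dictionary yet.
    """
    lookup = {
        names[study]: concept
        for concept, names in dictionary.items()
        if study in names
    }
    standard = {lookup[key]: value for key, value in record.items() if key in lookup}
    unmapped = sorted(key for key in record if key not in lookup)
    return standard, unmapped


standard, unmapped = to_standard(
    {"dx_date": "2015-03-01", "stage": "II", "bmi": 24}, "study_1"
)
```

Reporting the unmapped variables separately makes it easy to see which study elements still need a dictionary entry before that study can be fully integrated.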

Data governance, ethics and standard operating procedures

Following standardization and aggregation of all our data sources, we developed a centralized governance model and defined protocols, SOPs, and policies governing data access, storage, protection, sharing, and permissible use across OVCARE’s research community. To implement the governance framework, we designed, developed, tested, and deployed the OVCARE Resource Portal (ORP). The portal was developed using Oracle APEX to provide an online interface for all internal research and collaborating teams to request resources, including biospecimen, clinical, molecular, and imaging data, as well as informatics and data analytics support.

Supplemental information and data availability

The LIMS survey data analyzed during the current study are available in Additional file 2: Tables S2–S12. The data are also publicly available on the websites (features section) of each surveyed LIMS (Additional file 1).

  • Additional file 1: Evaluation of identified biobanking laboratory information management systems (.docx)
  • Additional file 2: OVCARE data commons: requirements identification and mapping desired biobanking features to solutions meeting the need (.xlsx)

Abbreviations

AI: artificial intelligence

APEX: Oracle Application Express

API: application programming interface

CBGOU: Cheryl Brown Gynecological Cancers Outcomes Unit

CNAs: copy number alterations

GUI: graphical user interface

HL7: Health Level Seven

HTTP: Hypertext Transfer Protocol

ICD: International Classification of Diseases

LIMS: laboratory information management systems

mRNA: messenger RNA

MSK: Memorial Sloan Kettering Cancer Center

MySQL: My Structured Query Language

NGS: next-generation sequencing

OMOP-CDM: Observational Medical Outcomes Partnership-Common Data Model

ORP: OVCARE Resource Portal

OVCARE: Ovarian Cancer Research Program

PRO: patient-reported outcomes

REDCap: Research Electronic Data Capture

REST: representational state transfer

SOPs: standard operating procedures

SQL: Structured Query Language

TARGET: Tumor Alterations Relevant for Genomics-Driven Therapy

TCGA: The Cancer Genome Atlas

TMA: tissue microarray

Acknowledgements

The authors are profoundly grateful to all the women who donated their samples for research. Without their generosity, advancements in gynecological cancer research and care would not be possible. The authors wish to extend special thanks to Jane & Maurice Wong and the Gray Family for their foresight in funding the work of the data commons which has and will continue to be a tremendous resource for researchers. The authors also acknowledge the funding from the BC Cancer Foundation, the VGH & UBC Hospital Foundation, the University of British Columbia, and Ovarian Cancer Canada (to OVCARE, BC’s gynecologic cancer research team).

Author contributions

AT in collaboration with MW and SL conceptualized OVCARE’s transition into a data commons in consultations with DH, JNM and AT. RA conducted the LIMS survey, analysis and results interpretation in collaboration with SL and AT. Data standardization and integration was conducted by SL, SW, SL and RW under the supervision of AT. SL, SL, RW, MW, and AT established the data commons governance model, policies and standard operating procedures in consultation with DH, JNM and AT. Design, testing and implementation of data solutions was conducted by SL, RA and SL, under the supervision of AT. Manuscript composition and drafting was conducted by RA, and AT with SL, SL, RW, and SW. All authors read and approved the final manuscript.

Funding

This work was funded by donations from Jane & Maurice Wong and the Gray Family. Funding was also obtained from the BC Cancer Foundation, the VGH & UBC Hospital Foundation, the University of British Columbia, and Ovarian Cancer Canada (to OVCARE, BC’s gynecologic cancer research team).

Competing interests

The authors declare that they have no competing interests.

References

  1. Vaught, Jim (6 January 2016). "Biobanking Comes of Age: The Transition to Biospecimen Science". Annual Review of Pharmacology and Toxicology 56 (1): 211–228. doi:10.1146/annurev-pharmtox-010715-103246. ISSN 0362-1642. https://www.annualreviews.org/doi/10.1146/annurev-pharmtox-010715-103246. 
  2. Vaught, Jim; Kelly, Andrea; Hewitt, Robert (1 September 2009). "A Review of International Biobanks and Networks: Success Factors and Key Benchmarks". Biopreservation and Biobanking 7 (3): 143–150. doi:10.1089/bio.2010.0003. ISSN 1947-5535. PMC PMC4046743. PMID 24835880. https://doi.org/10.1089/bio.2010.0003. 
  3. Eiseman, E.; Haga, S.B. (1999) (in en). Handbook of Human Tissue Sources: A National Resource of Human Tissue Samples. RAND Corporation. pp. 251. doi:10.7249/mr954. ISBN 978-0-8330-2766-5. https://www.rand.org/pubs/monograph_reports/MR954.html. 
  4. Coppola, Luigi; Cianflone, Alessandra; Grimaldi, Anna Maria; Incoronato, Mariarosaria; Bevilacqua, Paolo; Messina, Francesco; Baselice, Simona; Soricelli, Andrea et al. (22 May 2019). "Biobanking in health care: evolution and future directions". Journal of Translational Medicine 17 (1): 172. doi:10.1186/s12967-019-1922-3. ISSN 1479-5876. PMC PMC6532145. PMID 31118074. https://doi.org/10.1186/s12967-019-1922-3. 
  5. Greenberg, Benjamin; Christian, Jennifer; Meltzer Henry, Leslie; Leavy, Michelle; Moore, Helen (1 February 2018). Biorepositories. doi:10.23970/ahrqregistriesbio. https://effectivehealthcare.ahrq.gov/topics/registries-guide-4th-edition/white-paper-2016-3. 
  6. Cortes, Adrian; Albers, Patrick K.; Dendrou, Calliope A.; Fugger, Lars; McVean, Gil (1 January 2020). "Identifying cross-disease components of genetic risk across hospital data in the UK Biobank" (in en). Nature Genetics 52 (1): 126–134. doi:10.1038/s41588-019-0550-4. ISSN 1546-1718. PMC PMC6974401. PMID 31873298. https://www.nature.com/articles/s41588-019-0550-4. 
  7. Harris, Jennifer R.; Burton, Paul; Knoppers, Bartha Maria; Lindpaintner, Klaus; Bledsoe, Marianna; Brookes, Anthony J.; Budin-Ljøsne, Isabelle; Chisholm, Rex et al. (1 November 2012). "Toward a roadmap in global biobanking for health" (in en). European Journal of Human Genetics 20 (11): 1105–1111. doi:10.1038/ejhg.2012.96. ISSN 1476-5438. PMC PMC3477856. PMID 22713808. https://www.nature.com/articles/ejhg201296. 
  8. Cole, Joanne B.; Florez, Jose C.; Hirschhorn, Joel N. (19 March 2020). "Comprehensive genomic analysis of dietary habits in UK Biobank identifies hundreds of genetic associations" (in en). Nature Communications 11 (1): 1467. doi:10.1038/s41467-020-15193-0. ISSN 2041-1723. PMC PMC7081342. PMID 32193382. https://www.nature.com/articles/s41467-020-15193-0. 
  9. Collins, Francis S.; Varmus, Harold (25 February 2015). "A New Initiative on Precision Medicine" (in en). https://doi.org/10.1056/NEJMp1500523. pp. 793–5. doi:10.1056/nejmp1500523. PMC PMC5101938. PMID 25635347. https://www.nejm.org/doi/10.1056/NEJMp1500523. Retrieved 2022-01-05. 
  10. Liu, Angen; Pollard, Kai (2015), Karimi-Busheri, Feridoun, ed., "Biobanking for Personalized Medicine", Biobanking in the 21st Century (Cham: Springer International Publishing) 864: 55–68, doi:10.1007/978-3-319-20579-3_5, ISBN 978-3-319-20578-6, http://link.springer.com/10.1007/978-3-319-20579-3_5. Retrieved 2022-01-05 
  11. De Souza, Yvonne G.; Greenspan, John S. (28 January 2013). "Biobanking past, present and future". AIDS 27 (3): 303–312. doi:10.1097/qad.0b013e32835c1244. ISSN 0269-9370. PMC PMC3894636. PMID 23135167. https://doi.org/10.1097/QAD.0b013e32835c1244. 
  12. Uddin, Mohammed; Wang, Yujiang; Woodbury-Smith, Marc (21 November 2019). "Artificial intelligence for precision medicine in neurodevelopmental disorders" (in en). npj Digital Medicine 2 (1): 1–10. doi:10.1038/s41746-019-0191-0. ISSN 2398-6352. PMC PMC6872596. PMID 31799421. https://www.nature.com/articles/s41746-019-0191-0. 
  13. Pandya, J.; Cognitive World (12 August 2019). "Biobanking Is Changing The World". Forbes. https://www.forbes.com/sites/cognitiveworld/2019/08/12/biobanking-is-changing-the-world/?sh=4be6f9443792. Retrieved 16 August 2020. 
  14. Lee, Jae-Eun (31 July 2018). "Artificial Intelligence in the Future Biobanking: Current Issues in the Biobank and Future Possibilities of Artificial Intelligence". Biomedical Journal of Scientific & Technical Research 7 (3). doi:10.26717/BJSTR.2018.07.001511. https://biomedres.us/fulltexts/BJSTR.MS.ID.001511.php. 
  15. Kiehntopf, Michael; Krawczak, Michael (15 July 2011). "Biobanking and international interoperability: samples" (in en). Human Genetics 130 (3): 369–376. doi:10.1007/s00439-011-1068-8. ISSN 0340-6717. https://doi.org/10.1007/s00439-011-1068-8. 
  16. Grossman, Robert L.; Heath, Allison; Murphy, Mark; Patterson, Maria; Wells, Walt (1 September 2016). "A Case for Data Commons: Toward Data Science as a Service". Computing in Science Engineering 18 (5): 10–20. doi:10.1109/MCSE.2016.92. ISSN 1558-366X. PMC PMC5636009. PMID 29033693. https://ieeexplore.ieee.org/document/7548983/. 
  17. Jensen, Mark A.; Ferretti, Vincent; Grossman, Robert L.; Staudt, Louis M. (27 July 2017). "The NCI Genomic Data Commons as an engine for precision medicine". Blood 130 (4): 453–459. doi:10.1182/blood-2017-03-735654. ISSN 0006-4971. PMC PMC5533202. PMID 28600341. https://doi.org/10.1182/blood-2017-03-735654. 
  18. Hinkson, Izumi V.; Davidsen, Tanja M.; Klemm, Juli D.; Chandramouliswaran, Ishwar; Kerlavage, Anthony R.; Kibbe, Warren A. (2017). "A Comprehensive Infrastructure for Big Data in Cancer Research: Accelerating Cancer Research and Precision Medicine" (in English). Frontiers in Cell and Developmental Biology 5: 83. doi:10.3389/fcell.2017.00083. ISSN 2296-634X. PMC PMC5613113. PMID 28983483. https://www.frontiersin.org/articles/10.3389/fcell.2017.00083/full. 
  19. Köbel, Martin; Rahimi, Kurosh; Rambau, Peter F.; Naugler, Christopher; Le Page, Cécile; Meunier, Liliane; de Ladurantaye, Manon; Lee, Sandra et al. (1 September 2016). "An Immunohistochemical Algorithm for Ovarian Carcinoma Typing". International Journal of Gynecological Pathology 35 (5): 430–441. doi:10.1097/pgp.0000000000000274. ISSN 0277-1691. PMC PMC4978603. PMID 26974996. https://doi.org/10.1097/PGP.0000000000000274. 
  20. Shah, Sohrab P.; Köbel, Martin; Senz, Janine; Morin, Ryan D.; Clarke, Blaise A.; Wiegand, Kimberly C.; Leung, Gillian; Zayed, Abdalnasser et al. (25 June 2009). "Mutation of FOXL2 in Granulosa-Cell Tumors of the Ovary". New England Journal of Medicine 360 (26): 2719–2729. doi:10.1056/NEJMoa0902542. ISSN 0028-4793. PMID 19516027. https://doi.org/10.1056/NEJMoa0902542. 
  21. Wiegand, Kimberly C.; Shah, Sohrab P.; Al-Agha, Osama M.; Zhao, Yongjun; Tse, Kane; Zeng, Thomas; Senz, Janine; McConechy, Melissa K. et al. (14 October 2010). "ARID1A Mutations in Endometriosis-Associated Ovarian Carcinomas". New England Journal of Medicine 363 (16): 1532–1543. doi:10.1056/NEJMoa1008433. ISSN 0028-4793. PMC PMC2976679. PMID 20942669. https://doi.org/10.1056/NEJMoa1008433. 
  22. Errico, Alessia (1 June 2014). "SMARCA4 mutated in SCCOHT" (in en). Nature Reviews Clinical Oncology 11 (6): 302–302. doi:10.1038/nrclinonc.2014.63. ISSN 1759-4782. https://www.nature.com/articles/nrclinonc.2014.63. 
  23. Wang, Yi Kan; Bashashati, Ali; Anglesio, Michael S.; Cochrane, Dawn R.; Grewal, Diljot S.; Ha, Gavin; McPherson, Andrew; Horlings, Hugo M. et al. (1 June 2017). "Genomic consequences of aberrant DNA repair mechanisms stratify ovarian cancer histotypes" (in en). Nature Genetics 49 (6): 856–865. doi:10.1038/ng.3849. ISSN 1546-1718. https://www.nature.com/articles/ng.3849. 
  24. Talhouk, Aline; McConechy, Melissa K.; Leung, Samuel; Yang, Winnie; Lum, Amy; Senz, Janine; Boyd, Niki; Pike, Judith et al. (2017). "Confirmation of ProMisE: A simple, genomics-based clinical classifier for endometrial cancer" (in en). Cancer 123 (5): 802–813. doi:10.1002/cncr.30496. ISSN 1097-0142. https://onlinelibrary.wiley.com/doi/abs/10.1002/cncr.30496. 
  25. Karnezis, Anthony N.; Leung, Samuel; Magrill, Jamie; McConechy, Melissa K.; Yang, Winnie; Chow, Christine; Köbel, Martin; Lee, Cheng-Han et al. (2017). "Evaluation of endometrial carcinoma prognostic immunohistochemistry markers in the context of molecular classification" (in en). The Journal of Pathology: Clinical Research 3 (4): 279–293. doi:10.1002/cjp2.82. ISSN 2056-4538. PMC PMC5653931. PMID 29085668. https://onlinelibrary.wiley.com/doi/abs/10.1002/cjp2.82. 
  26. Talhouk, Aline; Hoang, Lien N.; McConechy, Melissa K.; Nakonechny, Quentin; Leo, Joyce; Cheng, Angela; Leung, Samuel; Yang, Winnie et al. (1 October 2016). "Molecular classification of endometrial carcinoma on diagnostic specimens is highly concordant with final hysterectomy: Earlier prognostic information to guide treatment" (in English). Gynecologic Oncology 143 (1): 46–53. doi:10.1016/j.ygyno.2016.07.090. ISSN 0090-8258. PMC PMC5521211. PMID 27421752. https://www.gynecologiconcology-online.net/article/S0090-8258(16)30959-3/abstract. 
  27. McAlpine, Jessica N.; Leung, Samuel C. Y.; Cheng, Angela; Miller, Dianne; Talhouk, Aline; Gilks, C. Blake; Karnezis, Anthony N. (2017). "Human papillomavirus (HPV)-independent vulvar squamous cell carcinoma has a worse prognosis than HPV-associated disease: a retrospective cohort study" (in en). Histopathology 71 (2): 238–246. doi:10.1111/his.13205. ISSN 1365-2559. https://onlinelibrary.wiley.com/doi/abs/10.1111/his.13205. 
  28. "OpenSpecimen". Krishagni Solutions Pvt. Ltd. https://www.openspecimen.org/. Retrieved 17 August 2016. 
  29. McIntosh, L.D.; Sharma, M.K.; Mulvihill, D. (1 October 2015). "caTissue Suite to OpenSpecimen: Developing an extensible, open source, web-based biobanking management system" (in en). Journal of Biomedical Informatics 57: 456–464. doi:10.1016/j.jbi.2015.08.020. ISSN 1532-0464. PMC PMC4772150. PMID 26325296. https://www.sciencedirect.com/science/article/pii/S1532046415001884. 
  30. "OpenSpecimen: Biobanking LIMS Features". Krishagni Solutions Pvt. Ltd. https://www.openspecimen.org/biobanking-lims-features/. Retrieved 17 August 2016. 
  31. Cerami, Ethan; Gao, Jianjiong; Dogrusoz, Ugur; Gross, Benjamin E.; Sumer, Selcuk Onur; Aksoy, Bülent Arman; Jacobsen, Anders; Byrne, Caitlin J. et al. (1 May 2012). "The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data". Cancer Discovery 2 (5): 401–404. doi:10.1158/2159-8290.cd-12-0095. PMC PMC3956037. PMID 22588877. https://cancerdiscovery.aacrjournals.org/content/2/5/401. 
  32. Bell, D.; Berchuck, A.; Birrer, M.; Chien, J.; Cramer, D. W.; Dao, F.; Dhir, R.; DiSaia, P. et al. (1 June 2011). "Integrated genomic analyses of ovarian carcinoma" (in en). Nature 474 (7353): 609–615. doi:10.1038/nature10166. ISSN 1476-4687. PMC PMC3163504. PMID 21720365. https://www.nature.com/articles/nature10166. 
  33. Jonckheere, Nicolas; Van Seuningen, Isabelle (20 September 2018). "Integrative analysis of the cancer genome atlas and cancer cell lines encyclopedia large-scale genomic databases: MUC4/MUC16/MUC20 signature is associated with poor survival in human carcinomas". Journal of Translational Medicine 16 (1): 259. doi:10.1186/s12967-018-1632-2. ISSN 1479-5876. PMC PMC6149062. PMID 30236127. https://doi.org/10.1186/s12967-018-1632-2. 
  34. Cui, Xiangrong; Jing, Xuan; Yi, Qin; Long, Chunlan; Tan, Bin; Li, Xin; Chen, Xueni; Huang, Yue et al. (14 December 2017). "Systematic analysis of gene expression alterations and clinical outcomes of STAT3 in cancer" (in en). Oncotarget 9 (3): 3198–3213. doi:10.18632/oncotarget.23226. ISSN 1949-2553. PMC PMC5790457. PMID 29423040. https://www.oncotarget.com/article/23226/text/. 
  35. Koboldt, Daniel C.; Fulton, Robert S.; McLellan, Michael D.; Schmidt, Heather; Kalicki-Veizer, Joelle; McMichael, Joshua F.; Fulton, Lucinda L.; Dooling, David J. et al. (1 October 2012). "Comprehensive molecular portraits of human breast tumours" (in en). Nature 490 (7418): 61–70. doi:10.1038/nature11412. ISSN 1476-4687. PMC PMC3465532. PMID 23000897. https://www.nature.com/articles/nature11412. 
  36. Nagasawa, Saya; Ikeda, Kazuhiro; Horie-Inoue, Kuniko; Sato, Sho; Takeda, Satoru; Hasegawa, Kosei; Inoue, Satoshi (2020). "Identification of novel mutations of ovarian cancer-related genes from RNA-sequencing data for Japanese epithelial ovarian cancer patients". Endocrine Journal 67 (2): 219–229. doi:10.1507/endocrj.EJ19-0283. https://www.jstage.jst.go.jp/article/endocrj/67/2/67_EJ19-0283/_article. 
  37. Gao, Jianjiong; Aksoy, Bülent Arman; Dogrusoz, Ugur; Dresdner, Gideon; Gross, Benjamin; Sumer, S. Onur; Sun, Yichao; Jacobsen, Anders et al. (2 April 2013). "Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal" (in en). Science Signaling 6 (269). doi:10.1126/scisignal.2004088. ISSN 1945-0877. PMC PMC4160307. PMID 23550210. https://www.science.org/doi/10.1126/scisignal.2004088. 
  38. National Cancer Institute. "The Cancer Genome Atlas Program". National Institutes of Health. https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga. Retrieved 12 April 2021. 
  39. National Cancer Institute, Office of Cancer Genomics. "TARGET: Therapeutically Applicable Research To Generate Effective Treatments". National Institutes of Health. https://ocg.cancer.gov/programs/target. Retrieved 15 April 2021. 
  40. Harris, P.A.; Taylor, R.; Thielke, R. et al. (1 April 2009). "Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support" (in en). Journal of Biomedical Informatics 42 (2): 377–381. doi:10.1016/j.jbi.2008.08.010. ISSN 1532-0464. PMC PMC2700030. PMID 18929686. https://www.sciencedirect.com/science/article/pii/S1532046408001226. 
  41. Harris, P.A.; Taylor, R.; Minor, B.L. et al. (1 July 2019). "The REDCap consortium: Building an international community of software platform partners" (in en). Journal of Biomedical Informatics 95: 103208. doi:10.1016/j.jbi.2019.103208. ISSN 1532-0464. PMC PMC7254481. PMID 31078660. https://www.sciencedirect.com/science/article/pii/S1532046419301261. 
  42. "REDCap". National Institutes of Health. https://www.project-redcap.org/. Retrieved 15 April 2021. 
  43. "Oracle Apex". Oracle Corporation. https://apex.oracle.com/en/. Retrieved 14 April 2021. 
  44. Olson, Steve; Downey, Autumn S. (2013). Sharing clinical research data: workshop summary. Institute of Medicine (U.S.), National Cancer Policy Forum (U.S.). Washington, D.C: The National Academies Press. ISBN 978-0-309-26874-5. OCLC 853280017. https://www.worldcat.org/title/mediawiki/oclc/853280017. 
  45. Seneviratne, Martin G.; Kahn, Michael G.; Hernandez-Boussard, Tina (2019). "Merging heterogeneous clinical data to enable knowledge discovery". Pacific Symposium on Biocomputing 24: 439–443. ISSN 2335-6928. PMC PMC6447393. PMID 30864344. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447393/. 
  46. Huser, V.; Sastry, C.; Breymaier, M. et al. (1 October 2015). "Standardizing data exchange for clinical research protocols and case report forms: An assessment of the suitability of the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM)" (in en). Journal of Biomedical Informatics 57: 88–99. doi:10.1016/j.jbi.2015.06.023. ISSN 1532-0464. PMC PMC4714951. PMID 26188274. https://www.sciencedirect.com/science/article/pii/S1532046415001331. 
  47. De Maria Marchiano, Ruggero; Di Sante, Gabriele; Piro, Geny; Carbone, Carmine; Tortora, Giampaolo; Boldrini, Luca; Pietragalla, Antonella; Daniele, Gennaro et al. (18 March 2021). "Translational Research in the Era of Precision Medicine: Where We Are and Where We Will Go". Journal of Personalized Medicine 11 (3): 216. doi:10.3390/jpm11030216. ISSN 2075-4426. PMC PMC8002976. PMID 33803592. https://doi.org/10.3390/jpm11030216. 
  48. Tian, Q.; Price, N. D.; Hood, L. (2012). "Systems cancer medicine: towards realization of predictive, preventive, personalized and participatory (P4) medicine" (in en). Journal of Internal Medicine 271 (2): 111–121. doi:10.1111/j.1365-2796.2011.02498.x. ISSN 1365-2796. PMC PMC3978383. PMID 22142401. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2796.2011.02498.x. 
  49. "OMOP Common Data Model". OHDSI. https://www.ohdsi.org/data-standardization/the-common-data-model/. Retrieved 2 July 2021. 

Notes

This presentation is faithful to the original, with only a few minor changes to formatting. Where important information was missing from the references, that information was added.