Journal:Building infrastructure for African human genomic data management

From LIMSWiki
Full article title: Building infrastructure for African human genomic data management
Journal: Data Science Journal
Author(s): Parker, Ziyaad; Maslamoney, Suresh; Meintjes, Ayton; Botha, Gerrit; Panji, Sumir; Hazelhurst, Scott; Mulder, Nicola
Author affiliation(s): University of Cape Town, University of the Witwatersrand
Primary contact: Email: ziyaad dot parker at uct dot ac dot za
Year published: 2019
Volume and issue: 18(1)
Page(s): 47
DOI: 10.5334/dsj-2019-047
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: https://datascience.codata.org/articles/10.5334/dsj-2019-047/
Download: https://datascience.codata.org/articles/10.5334/dsj-2019-047/galley/894/download/ (PDF)

Abstract

Human genomic data are large and complex, and require adequate infrastructure for secure storage and transfer. The National Institutes of Health (NIH) and The Wellcome Trust have funded multiple projects on genomic research, including the Human Heredity and Health in Africa (H3Africa) initiative, and data are required to be deposited into the public domain. The European Genome-phenome Archive (EGA) is a repository for sequence and genotype data where data access is controlled by access committees. Access is determined by a formal application procedure for the purpose of secure storage and distribution, which must be in line with the informed consent of the study participants. H3Africa researchers based in Africa and generating their own data can benefit tremendously from the data sharing capabilities of the internet by using the appropriate technologies. The H3Africa Data Archive is a joint effort between the H3Africa data generating projects, H3ABioNet, and the EGA to store and submit genomic data to public repositories. H3ABioNet maintains the security of the H3Africa Data Archive, ensures ethical security compliance, supports users with data submission, and facilitates data transfers. The goal is to ensure efficient data flow between researchers, the archive, and the EGA or other public repositories. To comply with the H3Africa data sharing and release policy, nine months after the data are in secure storage, H3ABioNet converts the data into an Extensible Markup Language (XML) format ready for submission to the EGA. This article describes the infrastructure that has been developed for African human genomic data management.

Keywords: genomic data, data archive, H3Africa data, African genomic data

Introduction

Advances in high-throughput genomic technologies are laying the foundations for the goal of precision medicine to be realized.[1][2] Decreasing costs and the capacity to generate larger volumes of human genomic data at faster rates are enabling population-level genomics studies to be conducted.[3][4] However, most of the current population-level genomics studies and data generated to date have a significant population representational bias, with the majority of genome sequences being derived from individuals of European and North American ancestry, regions that have been early adopters of genomic technologies.[4][5] African researchers, in general, have been late adopters of high-throughput technologies for use in population genomics due to more limited resources and funding. To address this critical gap in scientific knowledge about African genomics and population variation, and inspired by the African Society for Human Genetics, the National Institutes of Health (NIH) and The Wellcome Trust, through the Human Heredity and Health in Africa (H3Africa) program, have funded multiple genomics projects led by African investigators.[6][7] To support the H3Africa projects in terms of provisioning of infrastructure for secure data storage, management, and computing, the NIH has also funded a Pan-African Bioinformatics Network for H3Africa (H3ABioNet).[8]

The H3Africa Consortium consists of multiple projects and sites distributed across Africa, most of which are generating genomic data linked to clinical data for specific diseases. The principal H3Africa funders (NIH and the Wellcome Trust) require any project data generated to be deposited into a data repository accessible by the scientific community.[9][10] In order to facilitate the storage and accessibility of H3Africa genomics data, significant infrastructure, procedures, and policies were established. Part of H3ABioNet’s mandate is to develop processes and implement an infrastructure that will enable the ingestion, validation, annotation, secure storage, and submission of the African genomics data to the controlled access European Genome-phenome Archive (EGA).[11] This has been achieved through the development of the H3Africa Data Archive, which also ensures a copy of the genomic data is securely stored and retained on the African continent.[8][12] This article describes the infrastructure that has been developed, which to our knowledge, is the first formalized human genomic data archive on the continent.

Methods

In order to establish the H3Africa Archive, and to make the submission process seamless, a data storage infrastructure had to be created and new processes and policies developed and adhered to.

Data submission and access policy

Genomic data associated with phenotype data enables the possibility of re-identification of study participants; hence all human genomic data and its accompanying phenotypic data need to be governed by a controlled access policy.[13][14] H3Africa is distinctive in that biospecimens are also being collected and stored at one of the three H3Africa biorepositories, so researchers can request access to genomic data and/or biospecimens. As a consortium, H3Africa has developed its own data submission and access policy, which takes into account the genomic and phenotype data generated, as well as policies for the access to and transfer of biospecimens.[15] A single H3Africa Data and Biospecimen Access Committee (DBAC) has been established to oversee the secondary use of both the data, which are being deposited in the EGA, and the biospecimens in the H3Africa biorepositories. The sharing and access policies and the DBAC guidelines seek to provide a balance between protecting the rights of individuals and their data, while at the same time not acting as a barrier to advancing scientific knowledge. A data requester will need to identify the data in the EGA and apply for data access.[11] The data access request is routed to the DBAC, which reviews it to determine whether the intended research use is in line with the H3Africa data and access policy, and to confirm that the requester is a bona fide researcher. Once the data request has been reviewed, the DBAC will provide a decision to approve or reject it.

Types of data being accepted

The principal data types being collected for submission to the H3Africa Data Archive and the EGA include genomic sequence data, genotype array files, the associated phenotypes and metadata that are collected along with the samples, and the results of any analysis conducted. Genomic sequence data mainly consists of short DNA sequence reads in FASTQ format.[16][17] The types of data and associated files for the H3Africa research projects are summarized in Table 1.
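As an illustration of the FASTQ layout mentioned above, the following is a minimal Python sketch of a reader that checks the basic structure of each record. It assumes the common four-line-per-record layout (an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence). The names `FastqRecord` and `parse_fastq` are hypothetical and are not part of any H3ABioNet or EGA tooling.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str   # header text without the leading "@"
    sequence: str  # nucleotide string
    quality: str   # per-base quality characters, same length as sequence


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text lines.

    Assumption: four lines per record (no wrapped sequences).
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record starting at {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator does not start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)
```

A file could be checked by passing an open text handle, for example `for record in parse_fastq(open("reads.fastq")): ...`, where the file name is a placeholder. A check of this kind only covers the record structure; it does not verify that technical reads such as adapters or barcodes have been removed.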

Table 1. Description of data types for submission
| Exome/Whole Genome Sequence | 16S rRNA Microbiome studies | Genome Wide Association studies/genotyping arrays |
|---|---|---|
| Study type and description | Study type and description | Study type and description |
| Sequencing platform and technology used | Sequencing platform and technology used | Genotyping array model/name and description of the software and version used for calling the genotypes |
| FASTQ files linked with de-identified participant ID (minus technical reads such as adapters, linkers, barcodes) | FASTQ files linked with de-identified participant ID (minus technical reads such as adapters, linkers, barcodes) | Raw intensity files linked with de-identified participant IDs (IDATs, CELs) |
| Binary alignment files (BAMs, de-multiplexed) linked with de-identified participant ID | | Manifest file describing SNP or probe content on the genotyping array |
| Associated phenotypic data collected | Associated phenotypic data collected | Associated phenotypic data collected |
| Variant calling files (VCFs) | Final analyses BIOM files (at minimum must contain OTUs) | Final reports and analysis files generated |
| Mapping file indicating the relationship between the submitted files | Mapping file indicating the relationship between the submitted files | Mapping file indicating the relationship between the submitted files (completed Array Format template) |
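
The contents of Table 1 can be read as a per-study-type checklist. The Python sketch below shows one way such a checklist might be encoded and used to report what a submission package still lacks. The study-type keys, item labels, and the function `missing_items` are all hypothetical names chosen for illustration, and treating every row of Table 1 as mandatory is an assumption; the actual requirements are those set by the H3Africa Data Archive.

```python
from typing import Dict, Iterable, List

# Hypothetical labels derived from the rows of Table 1.
# Assumption: every listed item is required for its study type.
REQUIRED_ITEMS: Dict[str, List[str]] = {
    "sequencing": [
        "study_description",
        "platform_description",
        "fastq",
        "bam",
        "phenotype_data",
        "vcf",
        "mapping_file",
    ],
    "microbiome_16s": [
        "study_description",
        "platform_description",
        "fastq",
        "phenotype_data",
        "biom",
        "mapping_file",
    ],
    "genotyping_array": [
        "study_description",
        "array_and_calling_software_description",
        "raw_intensity_files",
        "array_manifest",
        "phenotype_data",
        "final_reports",
        "mapping_file",
    ],
}


def missing_items(study_type: str, provided: Iterable[str]) -> List[str]:
    """Return the checklist items not yet present, in Table 1 order."""
    if study_type not in REQUIRED_ITEMS:
        raise KeyError(f"unknown study type: {study_type!r}")
    provided_set = set(provided)
    return [item for item in REQUIRED_ITEMS[study_type] if item not in provided_set]
```

For example, `missing_items("microbiome_16s", ["study_description", "fastq"])` would list the platform description, phenotype data, BIOM file, and mapping file as outstanding.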

References

  1. Christensen, K.D.; Dukhovny, D.; Siebert, U. et al. (2015). "Assessing the Costs and Cost-Effectiveness of Genomic Sequencing". Journal of Personalized Medicine 5 (4): 470–86. doi:10.3390/jpm5040470. PMC PMC4695866. PMID 26690481. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4695866. 
  2. Aronson, S.J.; Rehm, H.L. (2015). "Building the foundation for genomics in precision medicine". Nature 526 (7573): 336–42. doi:10.1038/nature15816. PMC PMC5669797. PMID 26469044. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5669797. 
  3. Goldfeder, R.L.; Wall, D.P.; Khoury, M.J. et al. (2017). "Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis". American Journal of Epidemiology 186 (8): 1000–1009. doi:10.1093/aje/kww224. PMC PMC6250075. PMID 29040395. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6250075. 
  4. Prokop, J.W.; May, T.; Strong, K. et al. (2018). "Genome sequencing in the clinic: the past, present, and future of genomic medicine". Physiological Genomics 50 (8): 563–79. doi:10.1152/physiolgenomics.00046.2018. PMC PMC6139636. PMID 29727589. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6139636.
  5. Popejoy, A.B.; Fullerton, S.M. (2016). "Genomics is failing on diversity". Nature 538 (7624): 161–64. doi:10.1038/538161a. PMC PMC5089703. PMID 27734877. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5089703. 
  6. H3Africa Consortium; Rotimi, C.; Abayomi, A. et al. (2014). "Research capacity. Enabling the genomic revolution in Africa". Science 344 (6190): 1346–8. doi:10.1126/science.1251546. PMC PMC4138491. PMID 24948725. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4138491. 
  7. Mulder, N.; Abimiku, A.; Adebamowo, S.N. et al. (2018). "H3Africa: Current perspectives". Pharmacogenomics and Personalized Medicine 11: 59–86. doi:10.2147/PGPM.S141546. PMC PMC5903476. PMID 29692621. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5903476. 
  8. Mulder, N.J.; Adebiyi, E.; Alami, R. et al. (2016). "H3ABioNet, a sustainable pan-African bioinformatics network for human heredity and health in Africa". Genome Research 26 (2): 271–7. doi:10.1101/gr.196295.115. PMC PMC4728379. PMID 26627985. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728379.
  9. "NIH Sharing Policies and Related Guidance on NIH-Funded Research Resources". Grants & Funding. National Institutes of Health. 2019. https://grants.nih.gov/policy/sharing.htm. 
  10. "Data, software and materials management and sharing policy". Funding. The Wellcome Trust. 10 July 2017. https://wellcome.ac.uk/funding/guidance/data-software-materials-management-and-sharing-policy. 
  11. Lappalainen, I.; Almeida-King, J.; Kumanduri, V. et al. (2015). "The European Genome-phenome Archive of human data consented for biomedical research". Nature Genetics 47 (7): 692–5. doi:10.1038/ng.3312. PMC PMC5426533. PMID 26111507. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5426533.
  12. Mulder, N.J.; Adebiyi, E.; Adebiyi, M. et al. (2017). "Development of Bioinformatics Infrastructure for Genomics Research". Global Heart 12 (2): 91–98. doi:10.1016/j.gheart.2017.01.005. PMC PMC5582980. PMID 28302555. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5582980. 
  13. Shabani, M.; Dyke, S.O.; Joly, Y. et al. (2015). "Controlled Access under Review: Improving the Governance of Genomic Data Access". PLoS Biology 13 (12): e1002339. doi:10.1371/journal.pbio.1002339. PMC PMC4697814. PMID 26720729. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4697814. 
  14. Dyke, S.O.M.; Linden, M.; Lappalainen, I. et al. (2018). "Registered access: Authorizing data access". European Journal of Human Genetics 26 (12): 1721–31. doi:10.1038/s41431-018-0219-y. PMC PMC6244209. PMID 30069064. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6244209.
  15. de Vries, J.; Tindana, P.; Littler, K. et al. (2015). "The H3Africa policy framework: negotiating fairness in genomics". Trends in Genetics 31 (3): 117–9. doi:10.1016/j.tig.2014.11.004. PMC PMC4471134. PMID 25601285. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4471134.
  16. Cock, P.J.; Fields, C.J.; Goto, N. et al. (2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC PMC2847217. PMID 20015970. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217. 
  17. Hsi-Yang Fritz, M.; Leinonen, R.; Cochrane, G. et al. (2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research 21 (5): 734–40. doi:10.1101/gr.114819.110. PMC PMC3083090. PMID 21245279. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083090. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original paper listed references alphabetically; this wiki lists them by order of appearance, by design. The two footnotes were turned into inline references for convenience.