Journal:Building infrastructure for African human genomic data management
|Full article title||Building infrastructure for African human genomic data management|
|Journal||Data Science Journal|
|Author(s)||Parker, Ziyaad; Maslamoney, Suresh; Meintjes, Ayton; Botha, Gerrit; Panji, Sumir; Hazelhurst, Scott; Mulder, Nicola|
|Author affiliation(s)||University of Cape Town, University of the Witwatersrand|
|Primary contact||Email: ziyaad dot parker at uct dot ac dot za|
|Volume and issue||18(1)|
|Distribution license||Creative Commons Attribution 4.0 International|
Human genomic data are large and complex, and require adequate infrastructure for secure storage and transfer. The National Institutes of Health (NIH) and The Wellcome Trust have funded multiple projects on genomic research, including the Human Heredity and Health in Africa (H3Africa) initiative, and data are required to be deposited into the public domain. The European Genome-phenome Archive (EGA) is a repository for sequence and genotype data where data access is controlled by access committees. Access is determined by a formal application procedure for the purpose of secure storage and distribution, which must be in line with the informed consent of the study participants. H3Africa researchers based in Africa and generating their own data can benefit tremendously from the data sharing capabilities of the internet by using the appropriate technologies. The H3Africa Data Archive is an effort between the H3Africa data generating projects, H3ABioNet, and the EGA to store and submit genomic data to public repositories. H3ABioNet maintains the security of the H3Africa Data Archive, ensures ethical security compliance, supports users with data submission, and facilitates data transfers. The goal is to ensure efficient data flow between researchers, the archive, and the EGA or other public repositories. To comply with the H3Africa data sharing and release policy, nine months after the data is in secure storage, H3ABioNet converts the data into an Extensible Markup Language (XML) format ready for submission to EGA. This article describes the infrastructure that has been developed for African human genomic data management.
Keywords: genomic data, data archive, H3Africa data, African genomic data
Advances in high-throughput genomic technologies are laying the foundations for the goal of precision medicine to be realized. Decreasing costs and the capacity to generate larger volumes of human genomic data at faster rates are enabling population-level genomics studies to be conducted. However, most of the population-level genomics studies and data generated to date have a significant population representation bias, with the majority of genome sequences derived from individuals of European and North American ancestry, whose regions have been early adopters of genomic technologies. African researchers, in general, have been late adopters of high-throughput technologies for use in population genomics due to more limited resources and funding. To address this critical gap in scientific knowledge about African genomics and population variation, and inspired by the African Society for Human Genetics, the National Institutes of Health (NIH) and The Wellcome Trust, through the Human Heredity and Health in Africa (H3Africa) program, have funded multiple genomics projects led by African investigators. To support the H3Africa projects by provisioning infrastructure for secure data storage, management, and computing, the NIH has also funded a Pan-African Bioinformatics Network for H3Africa (H3ABioNet).
The H3Africa Consortium consists of multiple projects and sites distributed across Africa, most of which are generating genomic data linked to clinical data for specific diseases. The principal H3Africa funders (NIH and the Wellcome Trust) require any project data generated to be deposited into a data repository accessible by the scientific community. In order to facilitate the storage and accessibility of H3Africa genomics data, significant infrastructure, procedures, and policies were established. Part of H3ABioNet’s mandate is to develop processes and implement an infrastructure that will enable the ingestion, validation, annotation, secure storage, and submission of the African genomics data to the controlled access European Genome-phenome Archive (EGA). This has been achieved through the development of the H3Africa Data Archive, which also ensures a copy of the genomic data is securely stored and retained on the African continent. This article describes the infrastructure that has been developed, which to our knowledge, is the first formalized human genomic data archive on the continent.
In order to establish the H3Africa Archive, and to make the submission process seamless, a data storage infrastructure had to be created and new processes and policies developed and adhered to.
Data submission and access policy
Genomic data associated with phenotype data makes re-identification of study participants possible; hence all human genomic data and their accompanying phenotypic data need to be governed by a controlled access policy. H3Africa is distinctive in that biospecimens are also being collected and stored at one of the three H3Africa biorepositories, so researchers can request access to genomic data and/or biospecimens. As a consortium, H3Africa has developed its own data submission and access policy, which takes into account the genomic and phenotype data generated, as well as policies for the access to and transfer of biospecimens. A single H3Africa Data and Biospecimen Access Committee (DBAC) has been established to oversee the secondary use of both the data, which is being deposited in the EGA, and biospecimens in the H3Africa biorepositories. The sharing and access policies and the DBAC guidelines seek to balance protecting the rights of individuals and their data against the need not to act as a barrier to advancing scientific knowledge. A data requester will need to identify the data in the EGA and apply for data access. The data access request is routed to the DBAC, which reviews it to determine whether the intended research use is in line with the H3Africa data and access policy and to ensure the requester is a bona fide researcher. Once the data request has been reviewed, the DBAC will provide a decision to approve or reject it.
Types of data being accepted
The principal data types being collected for submission to the H3Africa Data Archive and the EGA include genomic sequence data, genotype array files, the associated phenotypes and metadata collected along with the samples, and the results of any analysis conducted. Genomic sequence data mainly comprise short DNA sequence reads in FASTQ format. The types of data and associated files for the H3Africa research projects are summarized in Table 1.
During participant recruitment, participants sign consent forms, which give the researcher the right to use the data for research purposes, and may or may not include consent for sharing and secondary use. Data is generated at the project sites or wherever the sequencing or genotyping equipment is located. The project then validates and quality-checks the data to clean it, which usually takes up to two months, though timelines vary depending on the sample size. At this point, the project’s designated data submitter contacts the H3Africa Data Archive Team (HDAT) to begin the process of submitting the data to the archive. The submission process details are described in the results section. Once the data is accepted into the data archive’s cold storage, it is held for an incubation period of nine months, giving the data owner or researcher time to analyze the data and prepare publications (Figure 1). Thereafter, the data undergoes final processing to ensure EGA format compliance and is then submitted to the EGA.
The data validation and transfer process from the data source to the H3Africa Archive can take some time, often from two to four months depending on the availability of storage space, the transfer mechanism used, the speed of internet connections, and availability of technical human resources. From the date the data gets accepted into EGA, there is a 12-month publication embargo period through the DBAC. This means that the project which owns the data has 12 months to write and publish their papers with no threat of impingement on their work. Some journals require accession IDs before papers can be published, so the data needs to be in EGA prior to paper submission. When the 12 months are over, any researcher can access the data in EGA through the DBAC without a publication embargo.
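The two waiting periods described above (nine months of incubation in the archive, then a 12-month publication embargo counted from acceptance into the EGA) reduce to simple date arithmetic. The following is a minimal sketch, assuming calendar-month arithmetic and that the EGA acceptance date is supplied separately, since the text does not fix the delay between submission and acceptance. The function names and the example date are illustrative, not part of the H3Africa tooling.

```python
import calendar
from datetime import date

INCUBATION_MONTHS = 9   # incubation period in archival storage (from the text)
EMBARGO_MONTHS = 12     # publication embargo after EGA acceptance (from the text)


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so 31 May plus nine months gives 28 (or 29) February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_ega_submission(accepted_into_archive: date) -> date:
    """Earliest date the data leaves incubation for submission to the EGA."""
    return add_months(accepted_into_archive, INCUBATION_MONTHS)


def embargo_end(accepted_into_ega: date) -> date:
    """Date from which data can be accessed without a publication embargo."""
    return add_months(accepted_into_ega, EMBARGO_MONTHS)


if __name__ == "__main__":
    archived = date(2018, 5, 31)  # hypothetical archive acceptance date
    print("Earliest EGA submission:", earliest_ega_submission(archived))
```

A project can use this kind of calculation to work backwards from a planned paper submission, since some journals require accession IDs and therefore need the data in the EGA beforehand.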
Submission engagement process
H3Africa projects collect data from their sites or providers and store it at their hub for analysis. Engagement early on with the H3Africa projects and identification of the data managers and individuals who will be submitting data to the H3Africa Data Archive is beneficial in building up key stakeholder relationships. This engagement enables one to gauge how far the projects are with their timelines and provide an estimation of when data will be submitted to the H3Africa Data Archive, enabling the adequate provisioning of resources. A remote meeting is arranged with the data submitters to determine what infrastructure and resources are available in terms of bandwidth, technical expertise, and experience in using data transfer tools. The data submitters are then provided with a Data Submission pack and are encouraged to register their submission on the H3Africa Archive Dashboard in order to keep track of the various submissions and their current status. The HDAT assists the data submitters in preparing and encrypting their data for submission through providing guidelines and a series of meetings (Figure 2).
Data submission request and files
The information collected that is commonly referred to as the Data Submission Request (DSR) includes organization, abstract, dataset name, description, estimated deadline for submission, data type, institutional reference ethics code, phenotype variables, file types, size, number of samples, cases, controls, and link to GitHub code used to generate the data and analysis. Two additional files needed are the blank copy of the Case Report Form (CRF) or Questionnaire and a blank copy of the consent form used to collect the data. Projects also sign the H3Africa Archive Statement Agreement, which gives H3ABioNet the right to validate and submit the data to the EGA.
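The DSR fields listed above can be pictured as a simple record with a completeness check. This is a minimal sketch only: the field names, types, and units are assumptions chosen for illustration and do not reflect the actual form used on the H3Africa Archive Dashboard.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DataSubmissionRequest:
    """Illustrative record of the DSR fields named in the text."""
    organization: str = ""
    abstract: str = ""
    dataset_name: str = ""
    description: str = ""
    estimated_deadline: str = ""        # date format is an assumption
    data_type: str = ""
    ethics_reference_code: str = ""
    phenotype_variables: List[str] = field(default_factory=list)
    file_types: List[str] = field(default_factory=list)
    size_tb: Optional[float] = None     # unit (terabytes) is an assumption
    n_samples: Optional[int] = None
    n_cases: Optional[int] = None
    n_controls: Optional[int] = None
    code_url: str = ""                  # link to the GitHub code


def missing_fields(dsr: DataSubmissionRequest) -> List[str]:
    """Return the names of fields that are still unset.

    None, empty strings, and empty lists count as missing; a numeric zero
    (for example, a study with no controls) counts as a supplied value.
    """
    missing = []
    for f in fields(dsr):
        value = getattr(dsr, f.name)
        if value is None or value == "" or value == []:
            missing.append(f.name)
    return missing


if __name__ == "__main__":
    dsr = DataSubmissionRequest(organization="Example University",
                                dataset_name="Demo genotype set")
    print("Still to be supplied:", ", ".join(missing_fields(dsr)))
```

A check of this kind flags an incomplete request before the submission pack is prepared. A project that collects no phenotype data would, per the text, state this in the DSR rather than leave the field empty.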
The H3ABioNet Archive Team sends a submission pack to the data submitter after receiving the initial intent to submit data. The pack includes a copy of the DSR confirming the information from the project as well as the mapping files. These files vary depending on the kind of data, but they should all include a phenotype data mapping file. The projects should all be collecting phenotype data; if they are not, they specify this in the DSR. Phenotypes are mostly "sex," "ethnicity," and "country," but they are encouraged to include any other phenotype data collected in the Case Report Form (CRF).
As part of its role in H3Africa, H3ABioNet agreed to host the H3Africa Archive to implement the consortium’s Data Sharing, Access, and Release policy. This required the development of data submission policies, guideline documents, and data infrastructure that were both secure and scalable. Below we describe the results of this infrastructure development.
The H3Africa data archive infrastructure
The H3Africa Data Archive physical infrastructure comprises three main components: a Landing Area server, a Vault server, and Cold Storage. Initial scoping exercises were conducted to determine where the H3Africa Data Archive should be situated and to build a proof of concept. During the proof of concept (POC) stage, certain criteria were defined and online interviews conducted across H3ABioNet Consortium Nodes to assess their suitability to host the physical components of the H3Africa Data Archive. The Nodes are physically situated across various countries in Africa such as South Africa, Ghana, Nigeria, Morocco, Egypt, Senegal, Tunisia, Sudan, Kenya, Malawi, Uganda, and Mauritius. These interviews focused on the following criteria:
- Stable in-country electrical supply
- Access to uninterruptible power supply (UPS) equipment
- Access to electrical generator hardware
- Existing IT technical human resources
- Existing IT infrastructure such as networking equipment, and dedicated and secure datacenter room facilities
- Data backup infrastructures
- Ease of procurement
The Landing Area is where all the incoming and outgoing data is stored. All data on this server is always stored in encrypted format, with the public encryption key used for incoming data provided by the HDAT to the projects. The Landing Area currently has 50 terabytes of storage capacity and is located in the institution’s DMZ (demilitarized zone), where the firewall rules are not as strict as those on internal firewalls. We encourage the use of GridFTP for transfers of large data sets to the archive. This protocol allows multiple connections to be open at once, masking TCP latency problems; in our experience this performs well in an African setting, and there are good, free services that provide it. Data transfer from the archive to the EGA uses a UDP-based service.
The Vault is a secure black-box server with tight access control policies in place. Data is validated in the Vault only, as it needs to be decrypted to do so. More details about validation are provided later. The HDAT works with the data submitter to fix any issues identified with the data during the validation phase. The server is also used for creating the XML schemas to submit the data to EGA. This is the only server where data is allowed to be decrypted. The Vault currently has access to 220 terabytes of direct access storage (DAS).
Archival (cold) storage
The archival storage is where data is stored for the nine-month incubation period. A copy of the data encrypted with a separate H3ABioNet public/private key pair is kept in archival storage, while a second copy is encrypted using the EGA public key and submitted to the EGA. All data in archival storage is replicated to off-site secure storage for redundancy purposes. The archival storage is expandable up to 500 terabytes.
EGA deposit box
The EGA has made server space available for the H3Africa Consortium to deposit their data, also known as an EGA deposit box. Every entity submitted, including samples, data sets, analyses, runs, and experiments, is given an accession ID. Both the HDAT and EGA teams have access to the EGA data deposit box. For security purposes, all data submitted to the EGA data deposit box is encrypted using the EGA encryption key before being transferred.
The transfer of data to the EGA can take quite some time, depending on the data size and internet speed. Table 2 shows a typical example of how long a data set takes to reach the EGA.
Data submitted via the internet is ingested into the Vault from the Landing Area (Table 2, step 1). Likewise, data that is due to be submitted to the EGA is moved from the Vault to the Landing Area. Due to network design, the only method of moving data between the Vault and Landing Area is via the network. It is not currently possible to connect external USB storage to the Landing Area. The slowest data transfer speeds are recorded when transferring data from a local server via the internal network to the EGA. This is largely due to the various firewalls and packet inspection tools implemented along the data transfer path. As expected, connecting a high-speed USB storage device directly to the Vault server yields higher transfer rates. The data transfer rate shown in Table 2 (step 3) was measured using a USB 3.0 enabled storage device. The Landing Area is located in the institutional DMZ, as transfers to and from this server have been optimized for internet-based data transfer, which yields the fastest transfer speeds.
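To give a feel for why transfer times dominate planning, the time needed to move a data set follows directly from its size and the sustained throughput of the link. The sketch below is a back-of-the-envelope estimate, assuming decimal units (1 terabyte = 10^12 bytes) and a constant rate. The rates used are hypothetical and are not the measured values from Table 2.

```python
def transfer_hours(size_tb: float, rate_mbit_s: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move `size_tb` terabytes.

    rate_mbit_s: nominal link speed in megabits per second.
    efficiency:  fraction of the nominal rate actually sustained (0 < e <= 1),
                 a crude stand-in for firewall and packet-inspection overhead.
    """
    if rate_mbit_s <= 0:
        raise ValueError("rate_mbit_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 8e12                              # 10**12 bytes * 8 bits
    seconds = bits / (rate_mbit_s * 1e6 * efficiency)  # megabits -> bits
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical link speeds for a 13.2 TB data set.
    for rate in (10, 100, 1000):
        days = transfer_hours(13.2, rate) / 24
        print(f"{rate:>5} Mbit/s: about {days:.1f} days")
```

Even at a sustained 100 Mbit/s, a single terabyte takes roughly 22 hours under these assumptions, which is consistent with the observation that transfers of multi-terabyte data sets can take quite some time.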
As is evident from Table 2, data transfers are a challenge when internet speeds are slow and resources are limited. The data archive’s preferred online data transfer mechanism is “Globus Online,” which uses GridFTP, while EGA uses Aspera. The Aspera and Globus Online (GO) data transfer applications are optimized to efficiently and securely transfer data between two points on a public or private network. Both applications have security and fault recovery built in, which sets them apart from traditional data transfer methods such as FTP (file transfer protocol). The fault recovery measures work by setting checkpoints as data is successfully delivered to its destination. In the event of a network failure, when the transfer is restarted, GO or Aspera will pick up from the last checkpoint, whereas FTP would require restarting the entire copy.
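The checkpoint idea can be illustrated with a local file copy that records how many bytes have been safely written and resumes from that point after an interruption. This is a conceptual sketch only: it does not use or reproduce the Globus Online or Aspera protocols, and the function name, checkpoint file format, and `fail_after` parameter (used to simulate a network failure) are all hypothetical.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; an arbitrary choice


def resumable_copy(src, dst, checkpoint, chunk_size=CHUNK_SIZE, fail_after=None):
    """Copy `src` to `dst` in chunks, recording progress in `checkpoint`.

    If a checkpoint file exists, copying resumes from the recorded offset.
    `fail_after` simulates a network failure after that many chunks.
    Returns the number of bytes copied during this call.
    """
    offset = 0
    if os.path.exists(checkpoint) and os.path.exists(dst):
        with open(checkpoint) as fh:
            offset = int(fh.read().strip() or 0)
    mode = "r+b" if os.path.exists(dst) else "wb"
    copied = 0
    chunks = 0
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # drop anything written past the last checkpoint
        while True:
            if fail_after is not None and chunks >= fail_after:
                raise ConnectionError("simulated network failure")
            data = fin.read(chunk_size)
            if not data:
                break
            fout.write(data)
            fout.flush()
            os.fsync(fout.fileno())  # make sure the chunk is on disk first
            offset += len(data)
            copied += len(data)
            chunks += 1
            with open(checkpoint, "w") as fh:
                fh.write(str(offset))  # then record the new checkpoint
    if os.path.exists(checkpoint):
        os.remove(checkpoint)  # transfer complete
    return copied


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "source.bin")
    dst = os.path.join(workdir, "copy.bin")
    ckpt = os.path.join(workdir, "copy.ckpt")
    with open(src, "wb") as fh:
        fh.write(os.urandom(5 * CHUNK_SIZE))
    try:
        resumable_copy(src, dst, ckpt, fail_after=3)
    except ConnectionError:
        print("Interrupted; resuming from checkpoint")
    print("Bytes copied after restart:", resumable_copy(src, dst, ckpt))
```

The key design point is the ordering: the chunk is flushed to disk before the checkpoint is advanced, so a restart never skips data that was not actually delivered.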
The primary aim of data encryption is to secure the genomic data. Encryption works by encoding data using a public key. The data is only accessible or readable when the matching secret private key is used to decrypt it. Without the private key, the data is inaccessible, making it a suitable method of securing data whilst in transit across public networks such as the internet. Data destined for the EGA is encrypted with a separate public key, which is built into the EGACryptor tool. A major challenge of encryption is that it takes longer to encrypt or decrypt a file compared to using a standard password to protect the file. It also requires additional storage space, up to three times the size of the raw data. This is not much of an issue when working with small data sets, but for larger data sets, such as genomic data, storage space for encryption becomes an important factor. Encrypting a terabyte of data can take approximately an hour and thirty minutes, though this varies depending on the file type and the server resources in use at the time.
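The two figures quoted above (roughly an hour and thirty minutes per terabyte, and up to three times the raw size in storage) allow a rough planning estimate for a given data set. The sketch below assumes that encryption time scales linearly with size and runs serially, and it reads "up to three times" as an upper bound on the working space needed. Both are simplifying assumptions, and the names are illustrative.

```python
HOURS_PER_TB = 1.5          # approximate encryption time per terabyte (from the text)
MAX_STORAGE_FACTOR = 3.0    # working space, as a multiple of raw size (upper bound)


def encryption_estimate(raw_tb: float) -> dict:
    """Return rough time and storage requirements for encrypting `raw_tb` terabytes."""
    if raw_tb < 0:
        raise ValueError("raw_tb must be non-negative")
    return {
        "hours": raw_tb * HOURS_PER_TB,
        "max_storage_tb": raw_tb * MAX_STORAGE_FACTOR,
    }


if __name__ == "__main__":
    estimate = encryption_estimate(10.0)  # hypothetical 10 TB data set
    print(f"About {estimate['hours']:.0f} hours, "
          f"up to {estimate['max_storage_tb']:.0f} TB of working space")
```

For a hypothetical 10 terabyte data set this gives about 15 hours of encryption time and up to 30 terabytes of working space, which shows why storage provisioning has to account for encryption and not only for the raw data.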
The Archive Dashboard is a web application that was built to keep track of data submissions from the data submitters. The dashboard tracks all the progress of the submissions in an intuitive user friendly interface. A user is able to register, log in, and fill in a data submission request form. The HDAT will respond by assisting the data submitter with the submission pack and file formats. Funders or project managers can log in to the dashboard with different access rights to view the progress of data from first engagement to submission to the EGA.
Current status of the Archive
At the time of writing, a total of nine data sets have been submitted to the H3Africa Archive and the EGA. The total size of all the data sets submitted is 118.7 terabytes, with the average data set being 13.2 terabytes. There are currently two studies that have been submitted to the EGA. The AWI-Gen Pilot Study (accession number: EGAS00001002482) is accessible via the EGA, and the H3Africa Chip (accession ID: EGAS00001002976) is currently under embargo and expected to be accessible to the greater scientific community soon. More datasets are expected to be submitted to the H3Africa Data Archive in the near future.
Discussion and conclusions
In order to implement the H3Africa data sharing policies, we developed what to our knowledge is the first human genomic data archive in Africa. There were many challenges encountered in the development of the infrastructure, most notably in data transfers when moving data around the globe and specifically across the African continent. Common challenges include:
- Researchers or data owners not wanting to share their data
- Communication issues
- Technical issues, such as slow internet speeds or expensive bandwidth
- Available server compute resources, available storage space, or familiarity with data transfer technologies
- Data governance which restricts the movement of genomic data across borders
The H3Africa Archive, though developed to address internal needs of the consortium, provides a useful proof of concept for the possibility of establishing local EGA facilities. It was designed based on the EGA architecture, and it ensures data security and conversion into EGA formats. A similar infrastructure could be used for other genomic data, where an archive of files is required with built-in secure storage, off-site replication, and data transfer procedures. Our experience has demonstrated that significant long-term resources are required for such an infrastructure, including both human and computational. We also recognize the value in data sharing initiatives as researchers and funders move increasingly to an open science ethos. Researchers from the project sites can benefit tremendously from the data sharing capabilities of the internet. In addition to having access to international data sets, by submitting their data to public data archives such as the EGA, they expose their research to the greater scientific community, which in itself holds many benefits.
The authors have no competing interests to declare.
- Christensen, K.D.; Dukhovny, D.; Siebert, U. et al. (2015). "Assessing the Costs and Cost-Effectiveness of Genomic Sequencing". Journal of Personalized Medicine 5 (4): 470–86. doi:10.3390/jpm5040470. PMC PMC4695866. PMID 26690481. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4695866.
- Aronson, S.J.; Rehm, H.L. (2015). "Building the foundation for genomics in precision medicine". Nature 526 (7573): 336–42. doi:10.1038/nature15816. PMC PMC5669797. PMID 26469044. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5669797.
- Goldfeder, R.L.; Wall, D.P.; Khoury, M.J. et al. (2017). "Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis". American Journal of Epidemiology 186 (8): 1000–1009. doi:10.1093/aje/kww224. PMC PMC6250075. PMID 29040395. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC6250075.
- Prokop, J.W.; May, T.; Strong, K. et al. (2018). "Genome sequencing in the clinic: the past, present, and future of genomic medicine". Physiological Genomics 50 (8): 563–79. doi:10.1152/physiolgenomics.00046.2018. PMC PMC6139636. PMID 29727589. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC6139636.
- Popejoy, A.B.; Fullerton, S.M. (2016). "Genomics is failing on diversity". Nature 538 (7624): 161–64. doi:10.1038/538161a. PMC PMC5089703. PMID 27734877. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5089703.
- H3Africa Consortium; Rotimi, C.; Abayomi, A. et al. (2014). "Research capacity. Enabling the genomic revolution in Africa". Science 344 (6190): 1346–8. doi:10.1126/science.1251546. PMC PMC4138491. PMID 24948725. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4138491.
- Mulder, N.; Abimiku, A.; Adebamowo, S.N. et al. (2018). "H3Africa: Current perspectives". Pharmacogenomics and Personalized Medicine 11: 59–86. doi:10.2147/PGPM.S141546. PMC PMC5903476. PMID 29692621. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5903476.
- Mulder, N.J.; Adebiyi, E.; Alami, R. et al. (2016). "H3ABioNet, a sustainable pan-African bioinformatics network for human heredity and health in Africa". Genome Research 26 (2): 271–7. doi:10.1101/gr.196295.115. PMC PMC4728379. PMID 26627985. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4728379.
- "NIH Sharing Policies and Related Guidance on NIH-Funded Research Resources". Grants & Funding. National Institutes of Health. 2019. https://grants.nih.gov/policy/sharing.htm.
- "Data, software and materials management and sharing policy". Funding. The Wellcome Trust. 10 July 2017. https://wellcome.ac.uk/funding/guidance/data-software-materials-management-and-sharing-policy.
- Lappalainen, I.; Almeida-King, J.; Kumanduri, V. et al. (2015). "The European Genome-phenome Archive of human data consented for biomedical research". Nature Genetics 47 (7): 692–5. doi:10.1038/ng.3312. PMC PMC5426533. PMID 26111507. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5426533.
- Mulder, N.J.; Adebiyi, E.; Adebiyi, M. et al. (2017). "Development of Bioinformatics Infrastructure for Genomics Research". Global Heart 12 (2): 91–98. doi:10.1016/j.gheart.2017.01.005. PMC PMC5582980. PMID 28302555. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5582980.
- Shabani, M.; Dyke, S.O.; Joly, Y. et al. (2015). "Controlled Access under Review: Improving the Governance of Genomic Data Access". PLoS Biology 13 (12): e1002339. doi:10.1371/journal.pbio.1002339. PMC PMC4697814. PMID 26720729. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4697814.
- Dyke, S.O.M.; Linden, M.; Lappalainen, I. et al. (2018). "Registered access: Authorizing data access". European Journal of Human Genetics 26 (12): 1721–1731. doi:10.1038/s41431-018-0219-y. PMC PMC6244209. PMID 30069064. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC6244209.
- de Vries, J.; Tindana, P.; Littler, K. et al. (2015). "The H3Africa policy framework: negotiating fairness in genomics". Trends in Genetics 31 (3): 117–9. doi:10.1016/j.tig.2014.11.004. PMC PMC4471134. PMID 25601285. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4471134.
- Cock, P.J.; Fields, C.J.; Goto, N. et al. (2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC PMC2847217. PMID 20015970. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2847217.
- Hsi-Yang Fritz, M.; Leinonen, R.; Cochrane, G. et al. (2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research 21 (5): 734–40. doi:10.1101/gr.114819.110. PMC PMC3083090. PMID 21245279. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3083090.
- Ananthakrishnan, R.; Chard, K.; Foster, I. et al. (2015). "Globus Platform-as-a-Service for Collaborative Science Applications". Concurrency and Computation: Practice and Experience 27 (2): 290–305. doi:10.1002/cpe.3262. PMC PMC4309390. PMID 25642152. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4309390.
- Foster, I. (2011). "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services". IEEE Internet Computing 15 (3): 70–73. doi:10.1109/MIC.2011.64.
- Madduri, R.K.; Sulakhe, D.; Lacinski, L. et al. (2014). "Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services". Concurrency and Computation: Practice and Experience 26 (13): 2266–2279. doi:10.1002/cpe.3274. PMC PMC4203657. PMID 25342933. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4203657.
- Mahdi, M.S.R.; Aziz, M.M.A.; Alhadidi, D. et al. (2019). "Secure Similar Patients Query on Encrypted Genomic Data". IEEE Journal of Biomedical and Health Informatics 23 (6): 2611–2618. doi:10.1109/JBHI.2018.2881086. PMID 30442622.
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original paper listed references alphabetically; this wiki lists them by order of appearance, by design. The two footnotes were turned into inline references for convenience.