Journal:A polyglot approach to bioinformatics data integration: A phylogenetic analysis of HIV-1
|Full article title||A polyglot approach to bioinformatics data integration: A phylogenetic analysis of HIV-1|
|Author(s)||Reisman, S.; Hatzopoulos, T.; Läufer, K.; Thiruvathukal, G.K.; Putonti, C.|
|Author affiliation(s)||Loyola University Chicago|
|Primary contact||Email: firstname.lastname@example.org|
|Volume and issue||12|
|Distribution license||Creative Commons Attribution-NonCommercial 3.0 Unported|
As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 sequences. Phylogenetic analyses were conducted for >6,000 HIV-1 sequences, revealing that spatial and temporal factors influence the evolution of the individual genes uniquely. Nevertheless, signatures of origin can be extrapolated despite increased globalization. The approach developed here can easily be customized for any species of interest.
Keywords: polyglot programming, RESTful web service, phylogenetics
The increased throughput, coupled with reduced cost and time, of contemporary sequencing technologies has led to a surge in the number of publicly available, complete, annotated genomic sequences. For smaller viral species, it is now feasible to not only produce a single genome for a species but also capture the diversity present in an ecological niche, the focus of numerous metagenomic studies as well as more targeted investigations. Furthermore, next-generation sequencing technologies have tremendous potential for the future of diagnostics and subsequent treatment choices, particularly for viral infections. The sensitivity of deep sequencing can capture even rare variants in mixed infections as well as quasispecies. Investigation of the viable variations within a viral species not only provides insight into the evolutionary history of a species but also unveils putative avenues for targeted therapies, such as small interfering RNAs and control strategies.
Molecular biology is now plagued with the challenge facing numerous other fields: big data. Cloud-based solutions, e.g. CloudBurst, Atlas2, and Rainbow, have provided much needed leverage to meet these demands, facilitating large-scale sequence analyses, while also introducing new difficulties. Furthermore, NoSQL databases afford a streamlined solution to both manage large datasets and simplify data retrieval and subsequent analysis. The agility and scalability of NoSQL databases are well suited to the rapidly advancing trends in DNA sequencing technologies, and it is thus not surprising that NoSQL databases have been gaining traction in molecular studies.
With the increase in the amount of publicly available genomic sequence data, progress can be stymied by the simple task of collecting sequence data and associated, relevant metadata. In an effort to facilitate the aggregation and management of genomic data for subsequent analyses, we have developed a polyglot approach involving multiple languages (Python and Scala), libraries (Flask and BioJavaX), and persistence mechanisms (text files and MongoDB NoSQL databases). Individual genes or all genes for a given species can be examined beyond just the sequence itself, including information regarding, for instance, the location and date of isolation.
The code developed is agile: it can be customized for any organism or species of interest to the user. The presented solution is developed with downstream evolutionary analyses in mind, as shown by a proof-of-concept study of the evolution of HIV. Our investigation into the three main HIV genes (gag, pol, and env) reveals that spatial and temporal factors influence the evolution of the individual genes uniquely. The web service for the HIV collection (as well as other datasets investigated by the authors) is publicly available at http://hivdb.cs.luc.edu, and the scripts for generating such a data collection are publicly available at https://github.com/LoyolaChicagoCode/hiv-biojava-scala.
Results and discussion
Data pipeline for collecting sequence data
Code has been developed to aggregate genomic sequence data and available sequence metadata for subsequent analyses. Figure 1 summarizes this process. All complete and partial genome sequences were parsed and separated into individual folders for each gene via a Scala parser utilizing the BioJavaX library. Each sequence was stored in its gene folder with any relevant metadata available within the GenBank file. The generated folders for each of the parsed genes were then passed through several Python scripts that carry out three post-processing tasks. First, duplicate gene sequences parsed from the same genome were removed. Second, the gene folders were used to create FASTA-formatted records for each of the gene sequences, with any necessary metadata stored in the resulting record’s FASTA header. Finally, the PyMongo library was used to insert each of the final FASTA records into our publicly exposed MongoDB database.
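The post-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scripts: the function names `dedupe_records` and `to_fasta`, the choice of (accession, gene, sequence) as the duplicate key, and the pipe-delimited header layout are all assumptions made for the example. The field names follow the document keys listed in the "Methods and materials" section.

```python
from typing import Dict, Iterable, List


def dedupe_records(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated gene records, keeping the first occurrence.

    Assumption: two records are duplicates when accession, gene name,
    and sequence all match.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["accession"], rec["gene"], rec["sequence"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def to_fasta(rec: Dict[str, str], width: int = 70) -> str:
    """Render one record as FASTA text with metadata in the header.

    Assumption: the header is '>accession|gene|country|date'; missing
    metadata fields are left empty rather than guessed.
    """
    header = ">" + "|".join(
        rec.get(key, "") for key in ("accession", "gene", "country", "date")
    )
    seq = rec["sequence"]
    # Wrap the sequence at a fixed line width, as is common for FASTA files.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)
```

The deduplicated dictionaries could then be handed to PyMongo, for example with `Collection.insert_many`, to populate the database.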
Genomic sequence data can then be accessed via a RESTful web service located at http://hivdb.cs.luc.edu. This architecture allows our service to be easily and efficiently accessed by any future data consumers via common web protocols. Data can be queried based upon attributes regarding the source of the data. For example, as shown in Figure 2, the gag gene sequences from strains isolated in the USA can be accessed via the web service. The user can specify search criteria including a year (or range of years) of isolation, the location of isolation, and/or accession number. Sequences meeting the user’s search criteria can be returned to the web via the Query button or downloaded. All sequence results are in FASTA format for subsequent analysis, such as sequence alignment, primer development, and phylogenetic analysis.
The pipeline has been developed to help users create repositories for one or more organisms of interest and to choose which aspects of the sequence annotations are queryable. Users need only supply sequences and select attributes and/or genes of interest (otherwise all attributes and genes will be selected). Data are automatically processed. Furthermore, the pipeline is not restricted to publicly available data; any GenBank-formatted file (public or private) can be included. Given the increased throughput of contemporary sequencing technologies and the decreased cost of sequencing runs, whole genome sequencing is being conducted at unprecedented rates. As such, researchers sequencing novel strains or isolates can incorporate their strains into the data repository once GenBank files are generated. Although this pipeline has been employed by the authors for the analysis of several different taxa, the RESTful web service presented here includes publicly available data for HIV-1 sequences.
Case study: Investigation of the evolution of HIV-1
All publicly available complete and near-complete HIV-1 genomic sequences, totaling more than 6,000 sequences, were retrieved from the National Center for Biotechnology Information (NCBI) and processed by our pipeline (see the "Methods and materials" section). Individual gene sequences are publicly available at http://hivdb.cs.luc.edu. Data are accessible in FASTA format to facilitate downstream analyses. To incorporate the metadata collected, including country and date of isolation, this information has been integrated into the FASTA record header. HIV-1 sequences were selected as a proof of concept for this tool as HIV sequences are among the most well curated, thanks in large part to efforts such as those at the Los Alamos National Laboratory’s HIV sequence database.
Previously, phylogenetics has shed light on the origin of HIV and played a key role in identifying recombination events. As previous molecular studies have shown, the evolutionary history of the HIV-1 lineage includes three groups (M, N, and O) representative of separate transfers from chimpanzees. Focusing on the three HIV genes gag, pol, and env, the hivdb data repository was queried for coding regions isolated from the same country as well as globally over a particular time period. Host, immunological, and antiretroviral drug selection pressures have shaped much of the diversity observed within these three genes. For instance, the investigation of the HIV gag gene sequences from the USA (2,048 sequences: 1990–2011) and Thailand (872 sequences: 2000–2011) is shown in Figure 3A and B, respectively. The phylogenetic trees derived for different geographic regions revealed different tree topologies, as expected. Sequences isolated during 2005 in the USA exhibit significant sequence variation, including a number of sequences which are distinctly different from sequences isolated during any other year (Fig. 3A). These two gag trees reveal a general trend observed for other countries and other genes: sequences isolated during the same year do not necessarily group together or exhibit the ladder-like topology frequently observed for intra-host HIV phylogenies; this is in concordance with previous HIV survey findings that multiple lineages coexist at any given time.
In addition to looking at the viral diversity from isolates collected within the same country, we also investigated the variants present globally at a given time. Again, sequences were retrieved for the gag (725 sequences), pol (818 sequences), and env (427 sequences) coding regions. In this example, sequences were retrieved if annotated as being isolated between 2000 and 2005. As shown in Figure 4, there are three main groups within the tree, regardless of the gene being considered. Sequences isolated from Asia are typically found within the same clade, as are sequences from Africa. The third group includes sequences from Europe, North America, and South America. There are, of course, deviations from this trend; these deviations can be the result of multiple introductions, group, or the presence of more than one subtype or recombinant form in circulation, a factor that has been observed by other studies. The presence of sequences isolated from the same geographic region within the same clade, however, suggests that signatures of origin can be extrapolated despite increased globalization.
Genomic studies often must consider not only sequences but also the metadata surrounding those sequences. One barrier to such studies is the lack of a simple method to collect and organize sequences such that their metadata are also easily accessible. We have taken a polyglot approach to develop a tool that automates the process of collecting and organizing genomic data. As a proof of concept, our pipeline has been applied to an evolutionary study of HIV-1. Phylogenetic analysis of the HIV genes gag, pol, and env finds that both spatial and temporal factors uniquely influence the evolution of the individual genes, a finding that is in congruence with prior studies of the evolutionary history of the virus. More importantly, the case study highlights the abilities of the tool. Although utilized for the investigation of a virus here, the approach can be applied to any species of interest.
Methods and materials
All generated FASTA-formatted records are stored within MongoDB, a document-based NoSQL database. Since metadata from publicly accessible genome data are often not uniformly written, such a system allows each record to carry its own attributes as a MongoDB key-value document. As a result, updated information can easily be added to any given FASTA record without needing to change the structure of our database. By default, each document contains keys titled “sequence,” “country,” “accession,” “date,” “gene,” and “note,” which map to their corresponding values. A RESTful web service has been created using the Flask Python microframework; this permits users to query stored documents via several parameters, including country of isolation, date, and accession number. Although the queries can be completed through standard HTTP GET and POST requests, a user interface has also been developed for accessing the data.
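To illustrate how such search parameters could map onto the document keys listed above, the following minimal sketch builds a MongoDB filter document in Python. The function name `build_filter`, its parameter names, and the assumption that “date” holds a year stored as an integer are hypothetical; the source does not describe how the service translates requests into queries. The `$gte` and `$lte` comparison operators are standard MongoDB query operators.

```python
from typing import Any, Dict, Optional


def build_filter(gene: Optional[str] = None,
                 country: Optional[str] = None,
                 accession: Optional[str] = None,
                 year_from: Optional[int] = None,
                 year_to: Optional[int] = None) -> Dict[str, Any]:
    """Translate optional search criteria into a MongoDB filter dict.

    Criteria left as None are omitted, so an empty call matches every
    document in the collection.
    """
    query: Dict[str, Any] = {}
    if gene:
        query["gene"] = gene
    if country:
        query["country"] = country
    if accession:
        query["accession"] = accession

    # Assumption: "date" is stored as an integer year, so a range of years
    # becomes an inclusive $gte/$lte comparison.
    date_range: Dict[str, int] = {}
    if year_from is not None:
        date_range["$gte"] = year_from
    if year_to is not None:
        date_range["$lte"] = year_to
    if date_range:
        query["date"] = date_range
    return query
```

A filter built this way, for example `build_filter(gene="gag", country="USA")`, could be passed directly to PyMongo's `Collection.find`. Because documents are schemaless, records lacking a given key simply do not match a filter on that key.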
Extracting information from GenBank files
Scala parsers were developed to extract metadata from NCBI GenBank files. The parser utilizes each GenBank file’s CDS tags in order to retrieve information about each gene sequence. Using the start and end nucleotide positions found in the tags, the gene sequence is then extracted from the genome within the GenBank file. Post-processing of the records was performed using Python scripts developed in-house; these scripts remove any duplicate records (an artifact of duplicate gene annotations within the GenBank file) and create FASTA-formatted sequence files. The PyMongo library was used to insert the data into MongoDB. All scripts can be found online at https://github.com/LoyolaChicagoCode/hiv-biojava-scala.
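The coordinate-based extraction step can be illustrated with a short Python sketch. The authors' parser is written in Scala with BioJavaX; the code below is not that parser. It is a simplified stand-in with a hypothetical function name, `extract_cds`, and it handles only simple forward-strand locations of the form `start..end`. GenBank feature coordinates are 1-based and inclusive, which is the detail the example is meant to show.

```python
import re

# Matches only simple locations such as "790..2292". Compound locations
# (join(...), complement(...), partial markers like "<1..200") are
# deliberately out of scope for this sketch.
_SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def extract_cds(genome: str, location: str) -> str:
    """Return the subsequence named by a simple GenBank CDS location."""
    match = _SIMPLE_LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if not 1 <= start <= end <= len(genome):
        raise ValueError(f"location {location!r} is outside the genome")
    # Convert 1-based inclusive coordinates to a 0-based half-open slice.
    return genome[start - 1:end]
```

For instance, `extract_cds("AAATGCCCTAGGG", "3..11")` returns the nine nucleotides from position 3 through position 11. A production parser would also need to handle reverse-strand and multi-segment features.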
HIV data collection
HIV-1 genomes were downloaded as GenBank files from the NCBI nucleotide database specifying the following: "Human immunodeficiency virus 1" (porgn:_txid11676) AND (8000:11000[Sequence Length]). This query collects all full-length and near full-length genomic sequences. Data were collected from the NCBI on February 26, 2013, obtaining 4,724 individual sequence records. Records missing country of isolation and/or collection date information, totaling 1,622 records, were manually curated via one of two sources. Records retrieved which are also available via Los Alamos National Laboratory (LANL)’s HIV database were referenced to ascertain whether the isolation/date information was available. In the event that these data were also missing from LANL, publications referenced in the GenBank file were evaluated. The database was updated at a later date (September 25, 2015), further exemplifying the ease of use for the proposed method of data aggregation and exposure in the form of a RESTful web service. This update expanded the sequence database to include an additional 1,342 sequences (6,066 total). Data are stored through figshare and can be retrieved via wget: http://files.figshare.com/2304758/hiv.tar.gz.
Sequences retrieved from the HIV database were examined following one of two strategies. First, sequences were aligned via ClustalW, and neighbor-joining trees with partial deletion (site coverage cutoff, 75%) were computed with the maximum composite likelihood model using the MEGA 5 software tool; trees were visualized using PhyloWidget, and the final figures were produced using Adobe Illustrator. The trees shown in Figures 3 and 4 were created using this strategy. In parallel, a second strategy was employed: sequences were aligned using MUSCLE, and maximum likelihood trees using the Jukes–Cantor and generalized time-reversible models were generated via FastTree within the Geneious tool (Biomatters Ltd.). In deriving these trees, support values were computed. Trees were visualized using Geneious; Supplementary Files 1 and 2 contain the phylogenies (derived using the Jukes–Cantor model and with support values shown) for the same set of sequences as shown in Figure 3A and B, respectively. Newick format files for all five trees (Figs. 3 and 4) derived using this second strategy with the Jukes–Cantor model can be found in Supplementary File 3.
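For readers unfamiliar with the Jukes–Cantor model named above, the following Python sketch computes the classical Jukes–Cantor pairwise distance, d = −(3/4) ln(1 − (4/3)p), where p is the proportion of differing sites between two aligned sequences. This is only an illustration of the substitution model's distance correction; it is not the maximum likelihood procedure that FastTree performs, nor the maximum composite likelihood model used in MEGA. The function names and the choice to skip any site where either sequence has a gap or ambiguous base are assumptions of the example.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a character other than
    A, C, G, or T (gaps, ambiguity codes) are skipped for that pair.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(a: str, b: str) -> float:
    """Jukes-Cantor corrected distance in substitutions per site."""
    p = p_distance(a, b)
    # The correction is undefined once p reaches 3/4 (saturation).
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance always equals or exceeds the raw proportion p, because the model accounts for multiple substitutions at the same site. A matrix of such distances is the kind of input a neighbor-joining method consumes.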
Supplementary files 1–3: Download (10M .zip)
The authors would like to thank Mr. Yousef Aleneze for his preliminary work on the project.
Academic editor: Jike Cui, Associate Editor
Peer review: Four peer reviewers contributed to the peer review report. Reviewers’ reports totaled 1,313 words, excluding any confidential comments to the academic editor.
Funding: SR is partially supported by the College of Arts and Sciences at Loyola University Chicago. GT and CP are partially supported by Loyola University Chicago’s Research Support Grant. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.
Competing interests: Authors disclose no potential conflicts of interest.
Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).
Author contributions: Conceived and designed the experiments: GT, CP. Contributed to the development of the code: SR, TH, KL, GT. Analyzed the data: SR, TH, CP. Wrote the first draft of the manuscript: SR, CP. Contributed to the writing of the manuscript: TH, KL, GT. Agreed with the manuscript results and conclusions: SR, TH, KL, GT, CP. All authors reviewed and approved the final manuscript.
- Fierer, N.; Breitbart, M.; Nulton, J. et al. (2007). "Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil". Applied and Environmental Microbiology 73 (21): 7059-66. doi:10.1128/AEM.00358-07. PMC PMC2074941. PMID 17827313. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2074941.
- Holmfeldt, K.; Solonenko, N.; Shah, M. et al. (2013). "Twelve previously unknown phage genera are ubiquitous in global oceans". Proceedings of the National Academy of Sciences of the United States of America 110 (31): 12798-803. doi:10.1073/pnas.1305956110. PMC PMC3732932. PMID 23858439. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3732932.
- Hurwitz, B.L.; Sullivan, M.B. (2013). "The Pacific Ocean Virome (POV): A marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology". PLoS One 8 (2): e57355. doi:10.1371/journal.pone.0057355. PMC PMC3585363. PMID 23468974. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3585363.
- Deng, L.; Ignacio-Espinoza, J.C.; Gregory, A.C. (2014). "Viral tagging reveals discrete populations in Synechococcus viral genome sequence space". Nature 513 (7517): 242-5. doi:10.1038/nature13459. PMID 25043051.
- Barzon, L.; Lavezzo, E.; Militello, V. et al. (2011). "Applications of next-generation sequencing technologies to diagnostic virology". International Journal of Molecular Sciences 12 (11): 7861-84. doi:10.3390/ijms12117861. PMC PMC3233444. PMID 22174638. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3233444.
- Wang, C.; Mitsuya, Y.; Gharizadeh, B. et al. (2007). "Characterization of mutation spectra with ultra-deep pyrosequencing: Application to HIV-1 drug resistance". Genome Research 17 (8): 1195-201. doi:10.1101/gr.6468307. PMC PMC1933516. PMID 17600086. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC1933516.
- Ramakrishnan, M.A.; Tu, Z.J.; Singh, S. et al. (2009). "The feasibility of using high resolution genome sequencing of influenza A viruses to detect mixed infections and quasispecies". PLoS One 4 (9): e7105. doi:10.1371/journal.pone.0007105. PMC PMC2740821. PMID 19771155. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2740821.
- Solmone, M.; Vincenti, D.; Prosperi, M.C. et al. (2009). "Use of massively parallel ultradeep pyrosequencing to characterize the genetic diversity of hepatitis B virus in drug-resistant and drug-naive patients and to detect minor variants in reverse transcriptase and hepatitis B S antigen". Journal of Virology 83 (4): 1718-26. doi:10.1128/JVI.02011-08. PMC PMC2643754. PMID 19073746. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2643754.
- Abdelrahman, T.; Hughes, J.; Main, J. et al. (2015). "Next-generation sequencing sheds light on the natural history of hepatitis C infection in patients who fail treatment". Hepatology 61 (1): 88–97. doi:10.1002/hep.27192. PMC PMC4303934. PMID 24797101. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4303934.
- Verheyen, J.; Litau, E.; Sing, T. et al. (2006). "Compensatory mutations at the HIV cleavage sites p7/p1 and p1/p6-gag in therapy-naive and therapy-experienced patients". Antiviral Therapy 11 (7): 879-87. PMID 17302250.
- Quiñones-Mateu, M.E.; Avila, S.; Reyes-Teran, G.; Martinez, M.A. (2014). "Deep sequencing: becoming a critical tool in clinical virology". Journal of Clinical Virology 61 (1): 9–19. doi:10.1016/j.jcv.2014.06.013. PMC PMC4119849. PMID 24998424. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4119849.
- Lares, M.R.; Rossi, J.J.; Ouellet, D.L. (2010). "RNAi and small interfering RNAs in human disease therapeutic applications". Trends in Biotechnology 28 (11): 570-9. doi:10.1016/j.tibtech.2010.07.009. PMC PMC2955826. PMID 20833440. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2955826.
- Truong, N.P.; Gu, W.; Prasadam, I. et al. (2013). "An influenza virus-inspired polymer system for the timed release of siRNA". Nature Communications 4: 1902. doi:10.1038/ncomms2905. PMID 23695696.
- Paul, A.M.; Shi, Y.; Acharya, D. et al. (2014). "Delivery of antiviral small interfering RNA with gold nanoparticles inhibits dengue virus infection in vitro". Journal of General Virology 95 (Pt 8): 1712-22. doi:10.1099/vir.0.066084-0. PMC PMC4103068. PMID 24828333. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4103068.
- Jin, F.; Li, S.; Zheng, K. et al. (2014). "Silencing herpes simplex virus type 1 capsid protein encoding genes by siRNA: A promising antiviral therapeutic approach". PLoS One 9 (5): e96623. doi:10.1371/journal.pone.0096623. PMC PMC4008601. PMID 24794394. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4008601.
- Schatz, M.C. (2009). "CloudBurst: Highly sensitive read mapping with MapReduce". Bioinformatics 25 (11): 1363-9. doi:10.1093/bioinformatics/btp236. PMC PMC2682523. PMID 19357099. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2682523.
- Evani, U.S.; Challis, D.; Yu, J. et al. (2012). "Atlas2 Cloud: A framework for personal genome analysis in the cloud". BMC Genomics 13 (Suppl 6): S19. doi:10.1186/1471-2164-13-S6-S19. PMC PMC3481437. PMID 23134663. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3481437.
- Zhao, S.; Prenger, K.; Smith, L. et al. (2013). "Rainbow: A tool for large-scale whole-genome sequencing data analysis using cloud computing". BMC Genomics 14: 425. doi:10.1186/1471-2164-14-425. PMC PMC3698007. PMID 23802613. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3698007.
- Schatz, M.C.; Langmead, B.; Salzberg, S.L. (2010). "Cloud computing and the DNA data race". Nature Biotechnology 28 (7): 691-3. doi:10.1038/nbt0710-691. PMC PMC2904649. PMID 20622843. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2904649.
- Borozan, I.; Wilson, S.; Blanchette, P. (2012). "CaPSID: A bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes". BMC Bioinformatics 13: 206. doi:10.1186/1471-2105-13-206. PMC PMC3464663. PMID 22901030. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3464663.
- Hird, S.M. (2012). "lociNGS: A lightweight alternative for assessing suitability of next-generation loci for evolutionary analysis". PLoS One 7 (10): e46847. doi:10.1371/journal.pone.0046847. PMC PMC3468592. PMID 23071651. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3468592.
- Ningthoujam, S.S.; Choudhury, M.D.; Potsangbam, K.S. et al. (2014). "NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources". Phytochemical Analysis 25 (6): 495-507. doi:10.1002/pca.2520. PMID 24737485.
- Fielding, R.T. (2000). "Architectural Styles and the Design of Network-based Software Architectures". University of California, Irvine. https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm.
- Gao, F.; Bailes, E.; Robertson, D.L. et al. (1999). "Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes". Nature 397 (6718): 436-41. doi:10.1038/17130. PMID 9989410.
- Gao, F.; Yue, L.; Robertson, D.L. et al. (1994). "Genetic diversity of human immunodeficiency virus type 2: Evidence for distinct sequence subtypes with differences in virus biology". Journal of Virology 68 (11): 7433-47. PMC PMC237186. PMID 7933127. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC237186.
- Rambaut, A.; Posada, D.; Crandall, K.A.; Holmes, E.C. (2004). "The causes and consequences of HIV evolution". Nature Reviews Genetics 5 (1): 52–61. doi:10.1038/nrg1246. PMID 14708016.
- Brenner, B.; Wainberg, M.A.; Roger, M. (2013). "Phylogenetic inferences on HIV-1 transmission: Implications for the design of prevention and treatment interventions". AIDS 27 (7): 1045-57. doi:10.1097/QAD.0b013e32835cffd9. PMC PMC3786580. PMID 23902920. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3786580.
- Shankarappa, R.; Margolick, J.B.; Gange, S.J. et al. (1999). "Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection". Journal of Virology 73 (12): 10489-502. PMC PMC113104. PMID 10559367. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC113104.
- Castro-Nallar, E.; Pérez-Losada, M.; Burton, G.F.; Crandall, K.A. (2012). "The evolution of HIV: Inferences using phylogenetics". Molecular Phylogenetics and Evolution 62 (2): 777-92. doi:10.1016/j.ympev.2011.11.019. PMC PMC3258026. PMID 22138161. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3258026.
- Ahumada-Ruiz, S.; Flores-Figueroa, D.; Toala-González, I.; Thomson, M.M. (2009). "Analysis of HIV-1 pol sequences from Panama: Identification of phylogenetic clusters within subtype B and detection of antiretroviral drug resistance mutations". Infection, Genetics and Evolution 9 (5): 933-40. doi:10.1016/j.meegid.2009.06.013. PMID 19559103.
- Jung, M.; Leye, N.; Vidal, N. (2012). "The origin and evolutionary history of HIV-1 subtype C in Senegal". PLoS One 7 (3): e33579. doi:10.1371/journal.pone.0033579. PMC PMC3314668. PMID 22470456. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3314668.
- Abubakar, Y.F.; Meng, Z.; Zhang, X.; Xu, J. (2013). "Multiple independent introductions of HIV-1 CRF01_AE identified in China: What are the implications for prevention?". PLoS One 8 (11): e80487. doi:10.1371/journal.pone.0080487. PMC PMC3839914. PMID 24282546. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3839914.
- Tamura, K.; Peterson, D.; Peterson, N. et al. (2011). "MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods". Molecular Biology and Evolution 28 (10): 2731-9. doi:10.1093/molbev/msr121. PMC PMC3203626. PMID 21546353. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3203626.
- Jordan, G.E.; Piel, W.H. (2008). "PhyloWidget: Web-based visualizations for the tree of life". Bioinformatics 24 (14): 1641-2. doi:10.1093/bioinformatics/btn235. PMID 18487241.
- Price, M.N.; Dehal, P.S.; Arkin, A.P. (2010). "FastTree 2: Approximately maximum-likelihood trees for large alignments". PLoS One 5 (3): e9490. doi:10.1371/journal.pone.0009490. PMC PMC2835736. PMID 20224823. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2835736.
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.