Difference between revisions of "Journal:Eleven quick tips for architecting biomedical informatics workflows with cloud computing"

Full article title	Eleven quick tips for architecting biomedical informatics workflows with cloud computing
Journal	PLoS Computational Biology
Author(s)	Cole, Brian S.; Moore, Jason H.
Author affiliation(s)	University of Pennsylvania
Primary contact	Email: colebr at upenn dot edu
Editors	Ouellette, Francis
Year published	2018
Volume and issue	14(3)
Page(s)	e1005994
DOI	10.1371/journal.pcbi.1005994
ISSN	1553-7358
Distribution license	Creative Commons Attribution 4.0 International
Website	http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005994
Download	http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005994&type=printable (PDF)

Revision as of 19:20, 26 June 2018

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

Cloud computing has revolutionized the development and operations of hardware and software across diverse technological arenas, yet academic biomedical research has lagged behind despite the numerous and weighty advantages that cloud computing offers. Biomedical researchers who embrace cloud computing can reap rewards in cost reduction, decreased development and maintenance workload, increased reproducibility, ease of sharing data and software, enhanced security, horizontal and vertical scalability, high availability, a thriving technology partner ecosystem, and much more. Despite these advantages that cloud-based workflows offer, the majority of scientific software developed in academia does not utilize cloud computing and must be migrated to the cloud by the user. In this article, we present 11 quick tips for designing biomedical informatics workflows on compute clouds, distilling knowledge gained from experience developing, operating, maintaining, and distributing software and virtualized appliances on the world’s largest cloud. Researchers who follow these tips stand to benefit immediately by migrating their workflows to cloud computing and embracing the paradigm of abstraction.

Introduction

Cloud computing is the on-demand use of computational hardware, software, and networks provided by a third party.^[1] The rise of the internet allowed companies to offer fully internet-based file storage services, including Amazon Web Services’ Simple Storage Service, which launched in 2006.^[2] Throughout the past decade, cloud computing has expanded from simple file and object storage to a comprehensive array of on-demand services ranging from bare metal servers and networks to fully managed databases and clusters of computers capable of data processing at a massive scale.^[3]^[4]

Modern cloud computing providers and the customers that utilize their services share responsibility for computer systems, with the cloud provider managing the physical hardware and virtualization software and the consumer utilizing the cloud services to architect workflows which may include applications, databases, systems and networks, storage, web servers, and much more.^[5]^[6] In this way, cloud computing allows users to offload the burden of managing physical systems and focus on building and operating solutions.

Cloud computing has revolutionized the way businesses operate. By using a cloud provider instead of operating private data centers, companies can reduce costs by paying for only the hardware they use and only when they use it. In addition, cloud-based technological solutions offer many important advantages when compared to conventional enterprise data centers, including the ability to dynamically scale up under increased load, recover from disaster incidents automatically, remotely monitor application states, automate hardware and software deployments, and manage security through code. In addition, many cloud providers operate multiple data centers across continents, providing redundancy across different locations in the world to increase fault tolerance and reduce latency. Finally, cloud computing has evolved a new paradigm of microservice-centric application design, wherein the traditional monolithic software stack is replaced with loosely coupled components which can each be scaled individually, updated individually, and even replaced with fully managed cloud services such as message passing services, serverless function execution services, managed databases and data lakes, and even container management services. Businesses have exploited these advantages of cloud computing to gain an edge in a competitive landscape, ushering in a new era of computing that emphasizes abstraction, agility, and virtualization.

Scientific computing in academic research environments still mostly utilizes in-house enterprise compute systems such as high-performance computing (HPC) clusters.^[7] In these systems, all software, hardware, data storage, networking, and security are the responsibility of the institution, including compliance with applicable state and federal laws such as HIPAA and other regulations which govern data storage for protected health information and human genetic data. The fact that scientific institutions manage their own separate compute systems poses serious problems for reproducibility due to differences in hardware and software across institutions.^[8]^[9]^[10] Additionally, the HPC model fails to allow researchers to capitalize on the innovations offered by cloud computing. For these reasons, we have compiled a set of eleven quick tips to help biomedical researchers and their teams design solutions using cloud computing. We provide a high-level overview of some best practices for cloud computing with an emphasis on reproducibility, cost reduction, efficiency of development and operations, and ease of implementation.

1. Templatize infrastructure with version control

Cloud computing providers such as Microsoft Azure, Google Cloud Platform, Amazon Web Services, and others have developed templating systems that allow users to describe a set of cloud infrastructure components in a declarative manner. These templates, written in a language such as JSON or YAML (both human-readable data formats), can be used to create a virtualized compute system in the cloud.^[11] Templates allow developers to manage infrastructure such as web servers, data storage, and fully configured networks and firewalls as code. These templates may be version-controlled and shared, allowing lateral transfer of full compute systems between academic institutions. Templatized infrastructure makes it easy to reproduce the exact same system at any point in time, and this provides an important benefit to researchers who wish to implement generalizable solutions instead of simply sharing source code. Templates allow researchers to develop virtual applications that provide a degree of control over hardware and networking that is difficult or impossible to achieve when researchers use their institutional HPC systems. Additionally, templates themselves are lightweight documents that are amenable to version control, providing additional utility. Finally, templates can be modified programmatically and without instantiating the computational stack they describe, allowing developers to modify and improve templates without incurring costs.
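As a minimal sketch of that last point, a template can be treated as plain data and edited offline. The example assumes a CloudFormation-style JSON layout. The resource name AnnotationServer, the image ID placeholder, and the helper functions are hypothetical and chosen only for illustration. No cloud API is called, so nothing is instantiated and no cost is incurred.

```python
import copy
import json

# Hypothetical template in a CloudFormation-style layout (illustrative only).
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Single compute node for an annotation workflow (example)",
    "Resources": {
        "AnnotationServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": "t2.micro",
                "ImageId": "ami-EXAMPLE",  # placeholder, not a real image ID
            },
        }
    },
}


def set_instance_type(template, resource_name, instance_type):
    """Return a modified copy of the template; the input is left unchanged."""
    updated = copy.deepcopy(template)
    updated["Resources"][resource_name]["Properties"]["InstanceType"] = instance_type
    return updated


def to_json(template):
    """Serialize with sorted keys so version-control diffs stay small and stable."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Edit the description of the stack offline, then print the result.
    bigger = set_instance_type(TEMPLATE, "AnnotationServer", "m4.large")
    print(to_json(bigger))
```

Serializing with sorted keys is a design choice rather than a requirement: it keeps the committed file stable, so a version-control diff shows only the property that actually changed.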

Version-control systems such as Git give developers immense control over software changes, including branching and forking mechanisms, which allow developers to safely implement new features and make modifications.^[8] Additionally, repository hosting services such as GitHub allow researchers to share workflows and source code, aiding in reproducibility and lateral transfer of software.

In cloud computing, infrastructure of entire complex systems can be templatized. These templates can then be version-controlled, allowing researchers and developers to keep a record of prior versions of their software and providing a mechanism to roll back to an earlier version in the event of a failure such as an automated test failure. Version control therefore plays a vital role in designing workflows on cloud computing because it applies not only to the software, but also to templates that describe virtualized hardware and networks.
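The rollback idea can be sketched with a toy in-memory history. This is an illustration under stated assumptions, not a replacement for Git: the class TemplateHistory, the function passes_checks, and the rule it enforces (every resource declares a Type) are all hypothetical stand-ins for a real repository and a real automated test suite.

```python
import hashlib
import json


def template_digest(template):
    """Stable content hash of a template, analogous to a commit identifier."""
    canonical = json.dumps(template, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def passes_checks(template):
    """Hypothetical automated test: every resource must declare a Type."""
    resources = template.get("Resources", {})
    return bool(resources) and all("Type" in r for r in resources.values())


class TemplateHistory:
    """Toy in-memory history; a real project would rely on Git instead."""

    def __init__(self):
        self._versions = []  # (digest, template) pairs in commit order

    def commit(self, template):
        """Record a template version and return its digest."""
        digest = template_digest(template)
        self._versions.append((digest, template))
        return digest

    def latest_passing(self):
        """Walk back from the newest version to the first one that passes."""
        for digest, template in reversed(self._versions):
            if passes_checks(template):
                return digest, template
        raise LookupError("no committed template passes the checks")


if __name__ == "__main__":
    history = TemplateHistory()
    good = {"Resources": {"Server": {"Type": "AWS::EC2::Instance"}}}
    broken = {"Resources": {"Server": {}}}  # Type omitted by mistake
    good_id = history.commit(good)
    history.commit(broken)
    digest, _ = history.latest_passing()
    print("rolled back to:", digest[:12], "matches good version:", digest == good_id)
```

Because the template is only a document, the failing version never needs to be deployed for the check to run, and the earlier version remains available to redeploy.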

Academic scientists who work in isolated compute environments such as institutional HPC clusters might not employ version control at all, instead opting to develop and operate applications and workflows entirely within the cluster. This practice is undesirable in that it fails to keep a record of code changes, fails to provide a mechanism for distribution of source code to other researchers, and fails to provide a mechanism by which collections of code can be migrated to other systems. We strongly encourage that every piece of code and every infrastructure template be version-controlled, and further, that version control become the first step in all bioinformatics workflow development. Cloud computing providers often offer fully managed services for version-control hosting, allowing researchers, teams, and even whole institutions to maintain private collections of repositories without the need to manage a version-control server or use a third-party version-control service like GitHub.

An example of a cloud-based virtual appliance which uses a version-controlled template to recreate infrastructure is EVE.^[12] EVE is a cloud application that utilizes snapshots of software and reference data to perform reproducible annotation of human genetic variants. The appliance’s infrastructure is declared in a CloudFormation template which can be shared, modified offline, and used to instantiate an exact copy of the same hardware–software stack for annotation. Annotation is a bioinformatics workflow which is difficult to reproduce across compute environments that are not controlled for software and reference data versions across space and time. EVE is an example of how templatized infrastructure, together with imaged software and reference data, allows cloud computing to enhance reproducibility of biomedical informatics workflows.
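One way to make the pinning of software and reference data checkable is to scan a template for resources that do not name an image or snapshot. The sketch below is not EVE's actual template or tooling. It assumes a CloudFormation-style layout in which instances are pinned through ImageId and volumes through SnapshotId, and the resource names and identifier placeholders are hypothetical.

```python
# Assumed mapping from resource type to the property that pins its content.
PIN_PROPERTY = {
    "AWS::EC2::Instance": "ImageId",
    "AWS::EC2::Volume": "SnapshotId",
}


def unpinned_resources(template):
    """Return sorted names of resources that lack their pinning property."""
    missing = []
    for name, resource in template.get("Resources", {}).items():
        required = PIN_PROPERTY.get(resource.get("Type"))
        if required and not resource.get("Properties", {}).get(required):
            missing.append(name)
    return sorted(missing)


# Hypothetical template: the reference-data volume forgets its snapshot.
EXAMPLE = {
    "Resources": {
        "AnnotationServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"ImageId": "ami-EXAMPLE"},  # placeholder ID
        },
        "ReferenceData": {
            "Type": "AWS::EC2::Volume",
            "Properties": {"Size": 100},  # no SnapshotId, so content is unpinned
        },
    }
}

if __name__ == "__main__":
    print(unpinned_resources(EXAMPLE))
```

A check of this kind could run as an automated test on every commit, so that a template which silently drops a pinned snapshot is caught before anyone deploys it.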

References

↑ Charlebois, K.; Palmour, N.; Knoppers, B.M. (2016). "The Adoption of Cloud Computing in the Field of Genomics Research: The Influence of Ethical and Legal Issues". PLoS One 11 (10): e0164347. doi:10.1371/journal.pone.0164347. PMC PMC5068798. PMID 27755563. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5068798.
↑ Fusaro, V.A.; Patil, P.; Gafni, E. et al. (2011). "Biomedical cloud computing with Amazon Web Services". PLoS Computational Biology 7 (8): e1002147. doi:10.1371/journal.pcbi.1002147. PMC PMC3161908. PMID 21901085. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161908.
↑ Schadt, E.E.; Linderman, M.D.; Sorenson, J. et al. (2011). "Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology". Nature Reviews Genetics 12 (3): 224. doi:10.1038/nrg2857-c2. PMID 21301474.
↑ Muth, T.; Peters, J.; Blackburn, J. et al. (2013). "ProteoCloud: a full-featured open source proteomics cloud computing pipeline". Journal of Proteomics 88: 104–8. doi:10.1016/j.jprot.2012.12.026. PMID 23305951.
↑ Grossman, R.L.; White, K.P. (2012). "A vision for a biomedical cloud". Journal of Internal Medicine 271 (2): 122–30. doi:10.1111/j.1365-2796.2011.02491.x. PMID 22142244.
↑ Stein, L.D.; Knoppers, B.M.; Campbell, P. (2015). "Data analysis: Create a cloud commons". Nature 523 (7559): 149–51. doi:10.1038/523149a. PMID 26156357.
↑ Jackson, K.R.; Ramakrishnan, L.; Muriki, K. et al. (2010). "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud". IEEE Second International Conference on Cloud Computing Technology and Science: 159-168. doi:10.1109/CloudCom.2010.69.
↑ Sandve, G.K.; Nekrutenko, A.; Taylor, J. et al. (2013). "Ten simple rules for reproducible computational research". PLoS Computational Biology 9 (10): e1003285. doi:10.1371/journal.pcbi.1003285. PMC PMC3812051. PMID 24204232. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3812051.
↑ Begley, C.G.; Ioannidis, J.P. (2015). "Reproducibility in science: improving the standard for basic and preclinical research". Circulation Research 116 (1): 116–26. doi:10.1161/CIRCRESAHA.114.303819. PMID 25552691.
↑ Peng, R.D. (2011). "Reproducible research in computational science". Science 334 (6060): 1226–7. doi:10.1126/science.1213847. PMC PMC3383002. PMID 22144613. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383002.
↑ Yamato, Y.; Muroi, M.; Tanaka, K. et al. (2014). "Development of template management technology for easy deployment of virtual resources on OpenStack". Journal of Cloud Computing 3: 7. doi:10.1186/s13677-014-0007-3.
↑ Cole, B.S.; Moore, J.H. (2017). "EVE: Cloud-Based Annotation of Human Genetic Variants". In Squillero, G.; Sim, K.. Applications of Evolutionary Computation: EvoApplications 2017. Lecture Notes in Computer Science. 10199. Springer. doi:10.1007/978-3-319-55849-3_6. ISBN 9783319558493.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Grammar has been updated for clarity. In some cases important information was missing from the references, and that information was added. The original title uses "architecting" as a verb; we've kept it in the title to reference the original article, but references in in-line text have been changed to "designing."

