Journal:Eleven quick tips for architecting biomedical informatics workflows with cloud computing

Full article title Eleven quick tips for architecting biomedical informatics workflows with cloud computing
Journal PLoS Computational Biology
Author(s) Cole, Brian S.; Moore, Jason H.
Author affiliation(s) University of Pennsylvania
Primary contact Email: colebr at upenn dot edu
Editors Ouellette, Francis
Year published 2018
Volume and issue 14(3)
Page(s) e1005994
DOI 10.1371/journal.pcbi.1005994
ISSN 1553-7358
Distribution license Creative Commons Attribution 4.0 International
Website http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005994
Download http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005994&type=printable (PDF)

Abstract

Cloud computing has revolutionized the development and operations of hardware and software across diverse technological arenas, yet academic biomedical research has lagged behind despite the numerous and weighty advantages that cloud computing offers. Biomedical researchers who embrace cloud computing can reap rewards in cost reduction, decreased development and maintenance workload, increased reproducibility, ease of sharing data and software, enhanced security, horizontal and vertical scalability, high availability, a thriving technology partner ecosystem, and much more. Despite these advantages that cloud-based workflows offer, the majority of scientific software developed in academia does not utilize cloud computing and must be migrated to the cloud by the user. In this article, we present 11 quick tips for designing biomedical informatics workflows on compute clouds, distilling knowledge gained from experience developing, operating, maintaining, and distributing software and virtualized appliances on the world’s largest cloud. Researchers who follow these tips stand to benefit immediately by migrating their workflows to cloud computing and embracing the paradigm of abstraction.

Introduction

Cloud computing is the on-demand use of computational hardware, software, and networks provided by a third party.[1] The rise of the internet allowed companies to offer fully internet-based file storage services, including Amazon Web Services’ Simple Storage Service, which launched in 2006.[2] Throughout the past decade, cloud computing has expanded from simple file and object storage to a comprehensive array of on-demand services ranging from bare metal servers and networks to fully managed databases and clusters of computers capable of data processing at a massive scale.[3][4]

Modern cloud computing providers and the customers that utilize their services share responsibility for computer systems, with the cloud provider managing the physical hardware and virtualization software and the consumer utilizing the cloud services to architect workflows which may include applications, databases, systems and networks, storage, web servers, and much more.[5][6] In this way, cloud computing allows users to offload the burden of managing physical systems and focus on building and operating solutions.

Cloud computing has revolutionized the way businesses operate. By using a cloud provider instead of operating private data centers, companies can reduce costs by paying for only the hardware they use and only when they use it. In addition, cloud-based technological solutions offer many important advantages when compared to conventional enterprise data centers, including the ability to dynamically scale up under increased load, recover from disaster incidents automatically, remotely monitor application states, automate hardware and software deployments, and manage security through code. In addition, many cloud providers operate multiple data centers across continents, providing redundancy across different locations in the world to increase fault tolerance and reduce latency. Finally, cloud computing has evolved a new paradigm of microservice-centric application design, wherein the traditional monolithic software stack is replaced with loosely coupled components which can each be scaled individually, updated individually, and even replaced with fully managed cloud services such as message passing services, serverless function execution services, managed databases and data lakes, and even container management services. Businesses have exploited these advantages of cloud computing to gain an edge in a competitive landscape, ushering in a new era of computing that emphasizes abstraction, agility, and virtualization.

Scientific computing in academic research environments still mostly utilizes in-house enterprise compute systems such as high-performance computing (HPC) clusters.[7] In these systems, all software, hardware, data storage, networking, and security are the responsibility of the institution, including compliance with applicable state and federal laws such as HIPAA and other regulations which govern data storage for protected health information and human genetic data. The fact that scientific institutions manage their own separate compute systems poses serious problems for reproducibility due to differences in hardware and software across institutions.[8][9][10] Additionally, the HPC model fails to allow researchers to capitalize on the innovations offered by cloud computing. For these reasons, we have compiled a set of eleven quick tips to help biomedical researchers and their teams design solutions using cloud computing. We provide a high-level overview of some best practices for cloud computing with an emphasis on reproducibility, cost reduction, efficiency of development and operations, and ease of implementation.

1. Templatize infrastructure with version control

Cloud computing providers such as Microsoft Azure, Google Cloud Platform, Amazon Web Services, and others have developed templating systems that allow users to describe a set of cloud infrastructure components in a declarative manner. These templates can be used to create a virtualized compute system in the cloud using a language such as JSON or YAML, both of which are human-readable data formats.[11] Templates allow developers to manage infrastructure such as web servers, data storage, and fully configured networks and firewalls as code. These templates may be version-controlled and shared, allowing lateral transfer of full compute systems between academic institutions. Templatized infrastructure makes it easy to reproduce the exact same system at any point in time, and this provides an important benefit to researchers who wish to implement generalizable solutions instead of simply sharing source code. Templates allow researchers to develop virtual applications that provide control over hardware and networking that is difficult or impossible to achieve when researchers use their institutional HPC systems. Additionally, templates themselves are lightweight documents that are amenable to version control, providing additional utility. Finally, templates can be modified programmatically and without instantiating the computational stack they describe, allowing developers to modify and improve templates without incurring costs.
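
As a brief illustration of this approach (not drawn from the original article), the following sketch assumes an AWS account, appropriate credentials, and the boto3 Python SDK; it declares a minimal CloudFormation-style template as JSON and instantiates it as a stack. The stack and resource names are hypothetical placeholders.

# Minimal sketch: declaring infrastructure as a template and instantiating it.
# Assumes an AWS account with valid credentials and the boto3 SDK installed;
# the stack name and bucket resource below are hypothetical placeholders.
import json
import boto3

# A tiny CloudFormation template describing a single versioned storage bucket.
# In practice this document would live in version control alongside the workflow code.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example workflow storage declared as code",
    "Resources": {
        "WorkflowBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        }
    },
}

cloudformation = boto3.client("cloudformation")

# Create (instantiate) the stack described by the template.
cloudformation.create_stack(
    StackName="example-workflow-stack",  # hypothetical name
    TemplateBody=json.dumps(template),
)

# Because the template is plain text, it can be diffed, reviewed, and rolled back
# with ordinary version-control tooling before it is ever deployed.

Because the template is just a document, modifying or reviewing it costs nothing until a stack is actually created from it.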

Version-control systems such as Git give developers immense control over software changes, including branching and forking mechanisms, which allow developers to safely implement new features and make modifications.[8] Additionally, repository hosting services such as GitHub allow researchers to share workflows and source code, aiding in reproducibility and lateral transfer of software.

In cloud computing, the infrastructure of entire complex systems can be templatized. These templates can then be version-controlled, allowing researchers and developers to keep a record of prior versions of their software and providing a mechanism to roll back to an earlier version in the event of a failure, such as an automated test failure. Version control therefore plays a vital role in designing workflows on cloud computing because it applies not only to the software, but also to the templates that describe virtualized hardware and networks.

Academic scientists who work in isolated compute environments such as institutional HPC clusters might not employ version control at all, instead opting to develop and operate applications and workflows entirely within the cluster. This practice is undesirable in that it fails to keep a record of code changes, fails to provide a mechanism for distributing source code to other researchers, and fails to provide a mechanism by which collections of code can be migrated to other systems. It is strongly encouraged that absolutely every piece of code and infrastructure template be version-controlled, and further, that version control become the first step in all bioinformatics workflow development. Cloud computing providers often offer fully managed services for version-control hosting, allowing researchers, teams, and even whole institutions to maintain private collections of repositories without the need to manage a version-control server or use a third-party version-control service like GitHub.
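
As one hedged example of such a managed offering (the original article names no specific service), the sketch below assumes AWS CodeCommit and the boto3 Python SDK; other clouds offer equivalent managed Git hosting, and the availability of any given service may vary. The repository name is a hypothetical placeholder.

# Minimal sketch: creating a private, provider-managed Git repository.
# AWS CodeCommit is used here purely as an example of a managed version-control
# service; other providers offer equivalents, and availability can vary.
import boto3

codecommit = boto3.client("codecommit")

# Create a private repository to hold workflow code and infrastructure templates.
response = codecommit.create_repository(
    repositoryName="bioinformatics-workflow",  # hypothetical name
    repositoryDescription="Templates and analysis code for a cloud workflow",
)

# The returned clone URL can be used with ordinary git tooling.
print(response["repositoryMetadata"]["cloneUrlHttp"])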

An example of a cloud-based virtual appliance that uses a version-controlled template to recreate infrastructure is EVE.[12] EVE is a cloud application that utilizes snapshots of software and reference data to perform reproducible annotation of human genetic variants. The appliance’s infrastructure is declared in a CloudFormation template which can be shared, modified offline, and used to instantiate an exact copy of the same hardware–software stack for annotation, a bioinformatics workflow that is difficult to reproduce across compute environments whose software and reference data versions are not controlled across space and time. EVE is an example of how templatized infrastructure and imaged software and reference data allow cloud computing to enhance the reproducibility of biomedical informatics workflows.

2. Embrace ephemerality: Image data and software

The on-demand nature of cloud computing has driven innovation in imaging technology as well as templating technology. In contrast to local data centers, cloud computing encourages users to expand computational capacity when needed, and users do not need to leave a server running all the time. Instead, users can instantiate the hardware they need only when they need it and shut it down afterwards, thus ending the operational expense. This ephemeral approach to computing has spurred development of imaging and snapshotting services.
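
To make the ephemeral pattern concrete, the following sketch (an illustration, not part of the original article) assumes boto3 and valid AWS credentials; it launches a virtual server only for the duration of a job and terminates it afterward. The image ID and instance type are placeholders.

# Minimal sketch: ephemeral compute. Launch a server only for the duration of a
# job, then terminate it so the operational expense ends with the work.
# Assumes boto3 and valid credentials; the AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2")

# Launch a single instance from an existing machine image.
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical image ID
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
)
instance_id = reservation["Instances"][0]["InstanceId"]

# ... run the workflow on the instance (e.g., via SSH or a user-data script) ...

# Terminate the instance once the job completes; no further charges accrue.
ec2.terminate_instances(InstanceIds=[instance_id])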

An important feature of cloud providers is the ability to take snapshots and images of data storage volumes, which can later be used to recreate the internal state of a server. A user can install software and data onto a virtual server and then create an image of the block storage devices that server uses, including the operating system, file system, partitions, user accounts, and all data. The ability to image data and software provides tremendous utility to biomedical researchers who wish to develop reproducible workflows. External data sources upon which biomedical workflows depend may change over time; for example, databases of genetic polymorphisms are updated regularly, and genome assemblies are revised as more genotype data are accrued. Imaging the reference data used in a particular biomedical workflow is an excellent way to provide a snapshot in time that will not mutate, yielding a reproducible workflow by controlling both software and data. When combined with templatized infrastructure, snapshots and images can fully recreate the state of a virtual appliance without requiring the end user to copy data or install and configure any software whatsoever.
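
A minimal sketch of imaging a configured server, assuming boto3 and an existing instance (the instance ID and image name below are hypothetical):

# Minimal sketch: image a configured server so its software and reference data
# can be recreated exactly, at any time, on new hardware.
# Assumes boto3; the instance ID below is a hypothetical placeholder.
import boto3

ec2 = boto3.client("ec2")

# Create a machine image capturing the instance's block storage, including the
# operating system, installed software, and any reference data on its volumes.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",             # hypothetical instance ID
    Name="variant-annotation-reference-2018-06",  # snapshot-in-time label
    Description="Frozen software and reference data for a reproducible workflow",
)

# The image ID can later be referenced from an infrastructure template so that
# collaborators launch servers with identical software and data.
print(image["ImageId"])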

3. Use containers

Containers are software systems that provide the ability to wrap software and data in an isolated logical unit that can be deployed stably in a variety of computing environments.[13] Containers play an important role in the development of distributed systems by allowing tasks to be broken up into isolated units that can be scaled by increasing the number of containers running simultaneously. Additionally, containers can be leveraged for reproducible computational analysis.[14] Importantly, cloud providers often offer integration with containers such as Docker, allowing developers to manage and scale a containerized application across a cluster of servers.
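
A minimal sketch of running one analysis step in a container, assuming Docker is installed locally and the docker Python SDK is available; the image tag and command are hypothetical placeholders:

# Minimal sketch: run an analysis step inside a container so the same isolated
# software environment can be reproduced on a laptop, an HPC node, or a managed
# container service. Assumes Docker and the `docker` Python SDK are installed;
# the image tag and command are hypothetical.
import docker

client = docker.from_env()

# Pull a versioned image and run a single command inside it.
output = client.containers.run(
    image="example/variant-tools:1.0",  # hypothetical, version-pinned image
    command="vcftools --version",       # hypothetical tool invocation
    remove=True,                        # discard the container after it exits
)
print(output.decode())

Pinning the image to a specific version tag is what makes the step repeatable; the same image can then be scheduled across a cluster by a managed container service.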

A compelling example of containerized applications for biomedical informatics workflows is presented by Polański et al., who implement 14 useful bioinformatics workflows as isolated Docker images that are provided both directly and integrated into the CyVerse Discovery Environment[15], which is a National Science Foundation-funded cyberinfrastructure initiative formerly known as iPlant.[16] These images, shared on both GitHub and DockerHub, are useful not only within the CyVerse Discovery Environment but also via managed Docker services, including Amazon Web Services (AWS) Elastic Container Service, Microsoft Azure Container Service, Google Kubernetes Engine, and others.

4. Manage security and privacy as code

Cloud providers often operate under a shared responsibility model for security, in which the cloud provider is responsible for the physical security of the cloud platform and users are responsible for the security of their applications, configurations, and networks.[17] While this imposes new responsibilities on users who would otherwise operate entirely within an institutional compute system such as an HPC cluster, it also creates opportunities to take control of security as code. Much like servers and storage volumes, firewalls and account controls in cloud computing are expressed as code, which may be version-controlled and updated continuously. Cloud computing and the infrastructure-as-code paradigm allow developers to configure and deploy firewalls, logical networks, and authentication/authorization mechanisms in a declarative manner. This allows developers to treat security in the same way as hardware and software and pushes security into a central position in the development and operations of cloud applications. Cloud computing also allows automated security testing, an important component of agile software development.
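
As a hedged illustration of firewalls expressed as code (using boto3 and AWS security groups as one possible mechanism; the VPC ID and address range below are placeholders):

# Minimal sketch: express a firewall rule as code rather than configuring it by hand.
# Assumes boto3; the VPC ID and allowed CIDR range are hypothetical.
import boto3

ec2 = boto3.client("ec2")

# Create a security group (a virtual firewall) for the workflow's web front end.
group = ec2.create_security_group(
    GroupName="workflow-web",  # hypothetical name
    Description="HTTPS access for a workflow web front end",
    VpcId="vpc-0123456789abcdef0",  # hypothetical VPC ID
)

# Allow inbound HTTPS only, from a restricted institutional address range.
ec2.authorize_security_group_ingress(
    GroupId=group["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # documentation/example range
    }],
)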

Privacy settings are likewise amenable to programmatic and automated management in cloud computing. Access to specific cloud resources is controlled by provider-specific mechanisms, including role-based account management and resource-specific access control. Users are encouraged to manage privacy according to the principle of least privilege, complying with all applicable regulations. Cloud computing providers make it easy to control which users can access which resources, including sensitive datasets. In addition, access logs for cloud-based data storage and built-in encryption mechanisms offer fine-grained auditing capabilities that researchers can use to demonstrate compliance.
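
A minimal sketch of programmatic privacy controls, assuming boto3 and Amazon S3 as the storage service; the bucket names are hypothetical, and this is illustrative rather than compliance guidance:

# Minimal sketch: manage privacy controls programmatically. Enable default
# encryption and access logging for a bucket holding sensitive data.
# Assumes boto3; bucket names are hypothetical, and this is not compliance advice.
import boto3

s3 = boto3.client("s3")

# Encrypt all new objects in the data bucket by default.
s3.put_bucket_encryption(
    Bucket="example-sensitive-data",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Record every access to the data bucket in a separate audit-log bucket.
s3.put_bucket_logging(
    Bucket="example-sensitive-data",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-access-logs",  # hypothetical log bucket
            "TargetPrefix": "sensitive-data/",
        }
    },
)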

5. Use managed services instead of reinventing them

Cloud providers compete with each other to offer convenient and cost-saving managed services that perform common tasks without the user having to implement them.[18] These include message passing, email, notification services, monitoring and logging, authentication, managed databases and data lakes, cluster management for frameworks such as Apache Spark and Hadoop, and much more. Utilizing these services is not only cost-effective but also offloads the burden of development and maintenance. Additionally, these services are often implemented in a distributed and highly available manner, utilizing redundancy and cross-data center replication. All of this is provided and maintained by the cloud service provider, and effective utilization of managed services can yield tremendous gains for very little investment.
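
As one example of leaning on a managed service rather than operating it yourself, the sketch below assumes boto3 and Amazon SQS for message passing; the queue name and message contents are hypothetical:

# Minimal sketch: use a fully managed message-passing service instead of running
# and maintaining a message broker. Assumes boto3; the queue name and message
# contents are hypothetical.
import boto3

sqs = boto3.client("sqs")

# Create a managed queue; the provider handles durability, scaling, and replication.
queue_url = sqs.create_queue(QueueName="workflow-jobs")["QueueUrl"]

# A producer enqueues a unit of work (here, a sample identifier to process).
sqs.send_message(QueueUrl=queue_url, MessageBody="sample-0001")

# A worker elsewhere receives and processes messages from the same queue.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for message in messages.get("Messages", []):
    print("processing", message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])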

References

  1. Charlebois, K.; Palmour, N.; Knoppers, B.M. (2016). "The Adoption of Cloud Computing in the Field of Genomics Research: The Influence of Ethical and Legal Issues". PLoS One 11 (10): e0164347. doi:10.1371/journal.pone.0164347. PMC 5068798. PMID 27755563. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5068798. 
  2. Fusaro, V.A.; Patil, P.; Gafni, E. et al. (2011). "Biomedical cloud computing with Amazon Web Services". PLoS Computational Biology 7 (8): e1002147. doi:10.1371/journal.pcbi.1002147. PMC 3161908. PMID 21901085. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161908. 
  3. Schadt, E.E.; Linderman, M.D.; Sorenson, J. et al. (2011). "Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology". Nature Reviews Genetics 12 (3): 224. doi:10.1038/nrg2857-c2. PMID 21301474. 
  4. Muth, T.; Peters, J.; Blackburn, J. et al. (2013). "ProteoCloud: a full-featured open source proteomics cloud computing pipeline". Journal of Proteomics 88: 104–8. doi:10.1016/j.jprot.2012.12.026. PMID 23305951. 
  5. Grossman, R.L.; White, K.P. (2012). "A vision for a biomedical cloud". Journal of Internal Medicine 271 (2): 122–30. doi:10.1111/j.1365-2796.2011.02491.x. PMID 22142244. 
  6. Stein, L.D.; Knoppers, B.M.; Campbell, P. (2015). "Data analysis: Create a cloud commons". Nature 523 (7559): 149–51. doi:10.1038/523149a. PMID 26156357. 
  7. Jackson, K.R.; Ramakrishnan, L.; Muriki, K. et al. (2010). "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud". IEEE Second International Conference on Cloud Computing Technology and Science: 159-168. doi:10.1109/CloudCom.2010.69. 
  8. Sandve, G.K.; Nekrutenko, A.; Taylor, J. et al. (2013). "Ten simple rules for reproducible computational research". PLoS Computational Biology 9 (10): e1003285. doi:10.1371/journal.pcbi.1003285. PMC 3812051. PMID 24204232. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3812051. 
  9. Begley, C.G.; Ioannidis, J.P. (2015). "Reproducibility in science: improving the standard for basic and preclinical research". Circulation Research 116 (1): 116–26. doi:10.1161/CIRCRESAHA.114.303819. PMID 25552691. 
  10. Peng, R.D. (2011). "Reproducible research in computational science". Science 334 (6060): 1226–7. doi:10.1126/science.1213847. PMC 3383002. PMID 22144613. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383002. 
  11. Yamato, Y.; Muroi, M.; Tanaka, K. et al. (2014). "Development of template management technology for easy deployment of virtual resources on OpenStack". Journal of Cloud Computing 3: 7. doi:10.1186/s13677-014-0007-3. 
  12. Cole, B.S.; Moore, J.H. (2017). "EVE: Cloud-Based Annotation of Human Genetic Variants". In Squillero, G.; Sim, K.. Applications of Evolutionary Computation: EvoApplications 2017. Lecture Notes in Computer Science. 10199. Springer. doi:10.1007/978-3-319-55849-3_6. ISBN 9783319558493. 
  13. Boettiger, C. (2015). "An introduction to Docker for reproducible research". ACM SIGOPS Operating Systems Review 49 (1): 71–9. doi:10.1145/2723872.2723882. 
  14. Beaulieu-Jones, B.K.; Greene, C.S. (2017). "Reproducibility of computational workflows is automated using continuous analysis". Nature Biotechnology 35: 342–46. doi:10.1038/nbt.3780. 
  15. Polański, K.; Gao, B.; Mason, S.A. et al. (2018). "Bringing numerous methods for expression and promoter analysis to a public cloud computing service". Bioinformatics 34 (5): 884-886. doi:10.1093/bioinformatics/btx692. PMID 29126246. 
  16. Merchant, N.; Lyons, E.; Goff, S. et al. (2016). "The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences". PLoS Biology 14 (1): e1002342. doi:10.1371/journal.pbio.1002342. PMC 4709069. PMID 26752627. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4709069. 
  17. Sabahi, F. (2011). "Cloud computing security threats and responses". IEEE 3rd International Conference on Communication Software and Networks: 245-9. doi:10.1109/ICCSN.2011.6014715. 
  18. Grossman, R.L. (2009). "The Case for Cloud Computing". IT Professional 11 (2): 23-7. doi:10.1109/MITP.2009.40. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In a few cases important information was missing from the references, and that information was added. The original title uses "architecting" as a verb; we've kept it in the title to reference the original article, but in-text references have been changed to "designing." The original Beaulieu-Jones and Greene reference referred to the bioRxiv version; the published, peer-reviewed version is cited in this version instead.