Difference between revisions of "Journal:Named data networking for genomics data management and integrated workflows"

From LIMSWiki

Revision as of 02:03, 19 March 2021

Full article title: Named data networking for genomics data management and integrated workflows
Journal: Frontiers in Big Data
Author(s): Ogle, Cameron; Reddick, David; McKnight, Coleman; Biggs, Tyler; Pauly, Rini; Ficklin, Stephen P.; Feltus, F. Alex; Shannigrahi, Susmit
Author affiliation(s): Clemson University, Tennessee Tech University, Washington State University
Primary contact: Email: sshannigrahi at tntech dot edu
Year published: 2021
Volume and issue: 4
Article #: 582468
DOI: 10.3389/fdata.2021.582468
ISSN: 2624-909X
Distribution license: Creative Commons Attribution 4.0 International
Website: https://www.frontiersin.org/articles/10.3389/fdata.2021.582468/full
Download: https://www.frontiersin.org/articles/10.3389/fdata.2021.582468/pdf (PDF)

Abstract

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in various geographically distributed repositories such as those hosted by the NCBI, as well as in the DNA Data Bank of Japan (DDBJ), European Bioinformatics Institute (EBI), and NASA’s GeneLab.

In this work, we first systematically point out the data management challenges of the genomics community. We then introduce named data networking (NDN), a novel but well-researched internet architecture capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths), all while eliminating the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as federating content repositories, retrieving from multiple sources, and remote data subsetting. Using name-based operations also streamlines deployment and integration of workflows with various cloud platforms.

We make four significant contributions with this work. First, we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate. Second, we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements. The preliminary evaluation shows a sixfold speedup in data insertion into the workflow. Third, as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in the "Method" section) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP).

The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN’s properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in a future work.

Keywords: genomics data, genomics workflows, large science data, cloud computing, named data networking

Introduction

Scientific communities are entering a new era of exploration and discovery in many fields, driven by high-density data accumulation. Examples include climate science[1], high-energy particle physics (HEP)[2], astrophysics[3][4], genomics[5], seismology[6], and biomedical research[7]. These communities, whose work is often referred to as “data-intensive” science, utilize and generate extremely large volumes of data, often reaching into the petabytes[8] and soon projected to reach into the exabytes.

Data-intensive science has created radically new opportunities. Take, for example, high-throughput DNA sequencing (HTDS). Until recently, HTDS was slow and expensive, and only a few institutes were capable of performing it at scale.[9] With advances in supercomputers, specialized DNA sequencers, and better bioinformatics algorithms, sequencing has become more effective, and its cost has dropped considerably and continues to drop. For example, sequencing the first reference human genome cost around $2.7 billion over 15 years, while it currently costs under $1,000 to resequence a human genome.[10] With commercial incentives, several companies are offering fragmented genome resequencing for under $100, performed in only a few days. This massive drop in cost and improvement in speed supports more advanced scientific discovery. For example, scientists could previously test their hypotheses on only a small number of genomes or gene expression conditions within or between species. With more publicly available datasets[5], scientists can test their hypotheses against a larger number of genomes, potentially enabling them to identify rare mutations, precisely classify diseases for a specific patient, and thus treat the disease more accurately.[11]

While the growth of DNA sequencing is encouraging, it has also created difficulty in genomics data management. For example, the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) database hosts 42 petabytes of publicly accessible DNA sequence data.[12] Scientists desiring to use public data must discover (or locate) the data and move it from globally distributed sites to on-premise clusters and distributed computing platforms, including public and commercial clouds. Public repositories such as the NCBI SRA contain a subset of all available genomics data.[13] Similar repositories are hosted by NASA, the National Institutes of Health (NIH), and other organizations. Even though these datasets are highly curated, each public repository uses its own standards for data naming, retrieval, and discovery, which makes locating and utilizing these datasets difficult.

Moreover, data management problems require the community to build and the scientists to spend time learning complex infrastructures (e.g., cloud platforms, grids) and creating tools, scripts, and workflows that can (semi-) automate their research. The current trend of moving from localized institutional storage and computing to an on-demand cloud computing model adds another layer of complexity to the workflows. The next generation of scientific breakthroughs may require massive data. Our ability to manage, distribute, and utilize these types of extreme-scale datasets and securely integrate them with computational platforms may dictate our success (or failure) in future scientific research.

Our experience in designing and deploying protocols for "big data" science[8][14][15][16][17][18] suggests that using hierarchical and community-developed names for storing, discovering, and accessing data can dramatically simplify scientific data management systems (SDMSs), and that the network is the ideal place for integrating domain workflows with distributed services. In this work, we propose a named ecosystem over an evolving but well-researched future internet architecture: named data networking (NDN). NDN utilizes content names for all data management operations such as content addressing, content discovery, and retrieval. Utilizing content names for all network operations massively simplifies data management infrastructure. Users simply ask for the content by name (e.g., “/ncbi/homo/sapiens/hg38”) and the network delivers the content to the user.

Using content names that are understood by the end-user over an NDN network provides multiple advantages: natural caching of popular content near the users, unified access mechanisms, and location-agnostic publication of data and services. For example, a properly named dataset can be downloaded from either NCBI or GeneLab at NASA, whichever is closer to the researcher. Additionally, the derived data (results, annotations, publications) are easily publishable into the network (possibly after vetting and quality control by NCBI or NASA) and immediately discoverable if appropriate naming conventions are agreed upon and followed. Finally, NDN shifts the trust to the content itself; each piece of content is cryptographically signed by the data producer and verifiable by anyone for provenance.

In this work, we first introduce NDN and the architectural constructs that make it attractive for the genomics community. We then discuss the data management and cyberinfrastructure challenges faced by the genomics community and how NDN can help alleviate them. Next, we present our pilot study applying NDN to a contemporary genomics workflow, GEMmaker[19], and evaluate the integration. Finally, we discuss future research directions and an integration roadmap with cloud computing services.

Named data networking

NDN[20] is a new networking paradigm that adopts a drastically different communication model from that of the current IP model. In NDN, data is accessed by content names (e.g., “/Human/DNA/Genome/hg38”) rather than through the host where it resides (e.g., ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz). Naming the data allows the network to participate in operations that were not feasible before. Specifically, the network can take part in discovering and locally caching the data, merging similar requests, retrieving data from multiple distributed sources, and more. In NDN, the communication primitive is straightforward (Figure 1A): the consumer asks for the content by content name (an “Interest” in NDN terminology), and the network forwards the request toward the publisher.
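
The caching and request-merging behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the NDN forwarder itself: the class ToyNdnRouter, its method names, and the placeholder payload are hypothetical, and real NDN routers also handle signatures, timeouts, and forwarding strategies that are omitted here.

```python
from collections import defaultdict


class ToyNdnRouter:
    """Hypothetical, simplified model of one NDN router (illustration only)."""

    def __init__(self):
        self.content_store = {}            # name -> data cached at this router
        self.pending = defaultdict(list)   # name -> consumers waiting for the data
        self.forwarded = []                # names actually sent toward the publisher

    def on_interest(self, name, consumer):
        """Handle a request (an "Interest") for a content name."""
        # Local caching: answer directly if the data is already stored here.
        if name in self.content_store:
            consumer(name, self.content_store[name])
            return
        # Merging similar requests: only the first request for a name is
        # forwarded upstream; later ones simply wait for the same answer.
        already_requested = name in self.pending
        self.pending[name].append(consumer)
        if not already_requested:
            self.forwarded.append(name)

    def on_data(self, name, data):
        """Handle data returning from the publisher on the reverse path."""
        self.content_store[name] = data
        for consumer in self.pending.pop(name, []):
            consumer(name, data)


# Usage: three requests for the same name, one upstream fetch.
router = ToyNdnRouter()
received = []


def deliver(name, data):
    received.append((name, data))


router.on_interest("/ncbi/homo/sapiens/hg38", deliver)
router.on_interest("/ncbi/homo/sapiens/hg38", deliver)  # merged with the first
router.on_data("/ncbi/homo/sapiens/hg38", b"ACGT")      # placeholder payload
router.on_interest("/ncbi/homo/sapiens/hg38", deliver)  # served from the cache

print(len(router.forwarded), len(received))
```

In this sketch, all three consumers receive the data even though only one request leaves the router, which is the effect the text attributes to in-network caching and request merging.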


Figure 1. NDN Forwarding. The two servers on the right announce a namespace (/google) for the data they serve. The routers make a note of this incoming announcement. When the laptops ask for /google/index.html, the routers forward the requests on the appropriate interfaces (31, 32, or both, depending on configuration). Data follows the reverse path.[20]
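
The forwarding step in Figure 1 can be sketched as a lookup table keyed by name prefix. The example below is a minimal illustration, assuming component-wise longest-prefix matching on hierarchical names; the Fib class and its methods are hypothetical names chosen for this sketch, and the interface numbers 31 and 32 are taken from the figure caption.

```python
def name_components(name):
    """Split a hierarchical name such as "/google/index.html" into components."""
    return tuple(part for part in name.split("/") if part)


class Fib:
    """Hypothetical forwarding table: announced name prefix -> outgoing interfaces."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, interface):
        """Record that data under `prefix` is reachable through `interface`."""
        self.routes.setdefault(name_components(prefix), set()).add(interface)

    def lookup(self, name):
        """Return the interfaces for the longest announced prefix of `name`."""
        parts = name_components(name)
        # Try the full name first, then progressively shorter prefixes.
        for length in range(len(parts), -1, -1):
            interfaces = self.routes.get(parts[:length])
            if interfaces:
                return sorted(interfaces)
        return []  # no matching announcement


# Usage: both servers announce /google, on interfaces 31 and 32.
fib = Fib()
fib.announce("/google", 31)
fib.announce("/google", 32)

print(fib.lookup("/google/index.html"))      # [31, 32]
print(fib.lookup("/googleplex/index.html"))  # []
```

Matching on whole name components rather than on raw string prefixes keeps "/googleplex" from being treated as part of the "/google" namespace. Whether a router then uses interface 31, 32, or both is a configuration choice, as the caption notes, and is left out of this sketch.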




References

  1. Cinquini, L.; Crichton, D.; Mattmann, C. et al. (2014). "The Earth System Grid Federation: An open infrastructure for access to distributed geospatial data". Future Generation Computer Systems 36: 400–17. doi:10.1016/j.future.2013.07.002.
  2. ATLAS Collaboration; Aad, G.; Abat, E. et al. (2008). "The ATLAS Experiment at the CERN Large Hadron Collider". Journal of Instrumentation 3: S08003. doi:10.1088/1748-0221/3/08/S08003. 
  3. Dewdney, P.E.; Hall, P.J.; Schilizzi, R.T. et al. (2009). "The Square Kilometre Array". Proceedings of the IEEE 97 (8): 1482–1496. doi:10.1109/JPROC.2009.2021005.
  4. LSST Dark Energy Science Collaboration; Abate, A.; Aldering, G. et al. (2012). "Large Synoptic Survey Telescope Dark Energy Science Collaboration". arXiv: 1–133. https://arxiv.org/abs/1211.0310v1. 
  5. Sayers, E.W.; Beck, J.; Brister, J.R. et al. (2020). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research 48 (D1): D9–D16. doi:10.1093/nar/gkz899. PMC PMC6943063. PMID 31602479. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6943063.
  6. Tsuchiya, S.; Sakamoto, Y.; Tsuchimoto, Y. et al. (2012). "Big Data Processing in Cloud Environments" (PDF). Fujitsu Scientific & Technical Journal 48 (2): 159–68. https://www.fujitsu.com/downloads/MAG/vol48-2/paper09.pdf. 
  7. Luo, J.; Wu, M.; Gopukumar, D. et al. (2016). "Big Data Application in Biomedical Research and Health Care: A Literature Review". Biomedical Informatics Insights 8: 1–10. doi:10.4137/BII.S31559. PMC PMC4720168. PMID 26843812. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720168. 
  8. Shannigrahi, S.; Fan, C.; Papadopoulos, C. et al. (2018). "NDN-SCI for managing large scale genomics data". Proceedings of the 5th ACM Conference on Information-Centric Networking: 204–05. doi:10.1145/3267955.3269022.
  9. McCombie, W.R.; McPherson, J.D.; Mardis, E.R. (2016). "Next-Generation Sequencing Technologies". Cold Spring Harbor Perspectives in Medicine 9 (11): 1–10. doi:10.4137/BII.S31559. PMC PMC4720168. PMID 26843812. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720168. 
  10. National Human Genome Research Institute (2020). "The Cost of Sequencing a Human Genome". National Institutes of Health. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost. Retrieved 16 March 2020. 
  11. Lowy-Gallego, E.; Fairley, S.; Zheng-Bradley, X. et al. (2019). "Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project". Wellcome Open Research 4: 50. doi:10.12688/wellcomeopenres.15126.2. PMC PMC7059836. PMID 32175479. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7059836. 
  12. NCBI (2020). "Sequence Read Archive". NCBI. https://trace.ncbi.nlm.nih.gov/Traces/sra/. Retrieved 04 February 2020. 
  13. Stephens, Z.D.; Lee, S.Y.; Faghri, F. et al. (2015). "Big Data: Astronomical or Genomical?". PLoS Biology 13 (7): e1002195. doi:10.1371/journal.pbio.1002195. PMC PMC4494865. PMID 26151137. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494865. 
  14. Olschanowsky, C.; Shannigrahi, S.; Papadopoulos, C. et al. (2014). "Supporting climate research using named data networking". Proceedings of the IEEE 20th International Workshop on Local & Metropolitan Area Networks: 1–6. doi:10.1109/LANMAN.2014.7028640.
  15. Fan, C.; Shannigrahi, S.; DiBenedetto, S. et al. (2015). "Managing scientific data with named data networking". Proceedings of the Fifth International Workshop on Network-Aware Data Management: 1–7. doi:10.1145/2832099.2832100.
  16. Shannigrahi, S.; Papadopoulos, C.; Yeh, E. et al. (2015). "Named Data Networking in Climate Research and HEP Applications". Journal of Physics: Conference Series 664: 052033. doi:10.1088/1742-6596/664/5/052033.
  17. Shannigrahi, S.; Fan, C.; Papadopoulos, C. (2017). "Request aggregation, caching, and forwarding strategies for improving large climate data distribution with NDN: a case study". Proceedings of the 4th ACM Conference on Information-Centric Networking: 54–65. doi:10.1145/3125719.3125722.
  18. Shannigrahi, S.; Fan, C.; Papadopoulos, C. et al. (2018). "Named Data Networking Strategies for Improving Large Scientific Data Transfers". Proceedings of the 2018 IEEE International Conference on Communications Workshops. doi:10.1109/ICCW.2018.8403576.
  19. Hadish, J.; Biggs, T.; Shealy, B. et al. (22 January 2020). "SystemsGenetics/GEMmaker: Release v1.1". Zenodo. doi:10.5281/zenodo.3620945. https://zenodo.org/record/3620945. 
  20. Zhang, L.; Afanasyev, A.; Burke, J. et al. (2014). "Named data networking". ACM SIGCOMM Computer Communication Review 44 (3): 66–73. doi:10.1145/2656877.2656887.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original paper listed references alphabetically; this wiki lists them by order of appearance, by design. The two footnotes were turned into inline references for convenience.