<div>{{Infobox journal article<br />
|name = <br />
|image = <br />
|alt = <!-- Alternative text for images --><br />
|caption = <br />
|title_full = MaPSeq, a service-oriented architecture for genomics research within an academic biomedical research institution<br />
|journal = ''Informatics''<br />
|authors = Reilly, J.; Ahalt, S.; McGee, J.; Owen, P.; Schmitt, C.; Wilhelmsen, K.<br />
|affiliations = University of North Carolina at Chapel Hill<br />
|contact = Phone: +1 919-445-9619 (Wilhelmsen)<br />
|editors = Bryant, A.<br />
|pub_year = 2015<br />
|vol_iss = '''2''' (3)<br />
|pages = 20–30<br />
|doi = [http://doi.org/10.3390/informatics2030020 10.3390/informatics2030020]<br />
|issn = 2227-9709<br />
|license = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]<br />
|website = [http://www.mdpi.com/2227-9709/2/3/20/htm http://www.mdpi.com/2227-9709/2/3/20/htm]<br />
|download = [http://www.mdpi.com/2227-9709/2/3/20/pdf http://www.mdpi.com/2227-9709/2/3/20/pdf] (PDF)<br />
}}<br />
<br />
==Abstract==<br />
[[Genomics]] research presents technical, computational, and analytical challenges that are well recognized. Less recognized are the complex sociological, psychological, cultural, and political challenges that arise when genomics research takes place within a large, decentralized academic institution. In this paper, we describe a Service-Oriented Architecture (SOA) — MaPSeq — that was conceptualized and designed to meet the diverse and evolving computational workflow needs of genomics researchers at our large, [[hospital]]-affiliated, academic research institution. We present the institutional challenges that motivated the design of MaPSeq before describing the architecture and functionality of MaPSeq. We then discuss SOA solutions and conclude that approaches such as MaPSeq enable efficient and effective computational workflow execution for genomics research and for any type of academic biomedical research that requires complex, computationally-intense workflows. <br />
<br />
'''Keywords''': service-oriented architecture; genomics; massively parallel sequencing; computational workflow; academic biomedical research; decentralized organization; distributed decision-making<br />
<br />
==Introduction==<br />
Genomics research presents well-recognized technical, computational, and analytical challenges.<ref name="KoboldtChal10">{{cite journal |title=Challenges of sequencing human genomes |journal=Briefings in Bioinformatics |author=Koboldt, D.C.; Ding, L.; Mardis, E.R.; Wilson, R.K. |volume=11 |issue=5 |pages=484-498 |year=2010 |doi=10.1093/bib/bbq016 |pmid=20519329 |pmc=PMC2980933}}</ref><ref name="KahnOn11">{{cite journal |title=On the future of genomic data |journal=Science |author=Kahn, S.D. |volume=331 |issue=6018 |pages=728-729 |year=2011 |doi=10.1126/science.1197891 |pmid=21311016}}</ref><ref name="GreenClin14">{{cite book |chapter=Chapter 9: Clinical genome sequencing |title=Genomic and Personalized Medicine |author=Green, R.C.; Rehm, H.L.; Kohane, I.S. |editor=Willard, H.F.; Ginsburg, G.S. |publisher=Academic Press |location=Oxford, UK |edition=2nd |pages=102–122 |year=2014 |isbn=9780123822277}}</ref><ref name="DeweyClin14">{{cite journal |title=Clinical interpretation and implications of whole-genome sequencing |journal=JAMA |author=Dewey, F.E.; Grove, M.E.; Pan, C. et al. |volume=311 |issue=10 |pages=1035-1045 |year=2014 |doi=10.1001/jama.2014.1717 |pmid=24618965 |pmc=PMC4119063}}</ref> For example, while the technology for massively parallel genomic [[sequencing]] has progressed to the point where large amounts of data can be generated at a rapid pace and for a reasonable cost, the analytical burden presented by this massive amount of data can quickly overwhelm the genomic analyst. Indeed, the analysis and interpretation of genetic findings is generally considered the rate-limiting step in the translation of genomic sequencing data into clinical practice and patient care.<ref name="DeweyClin14" /><br />
<br />
Less recognized challenges to research in genomics and any biomedical field are the sociological, psychological, cultural, and political barriers, many of which arise from the organizational structure within which the research takes place. Indeed, research organizations tend to fall somewhere on a continuum between completely centralized and completely decentralized.<ref name="OrlikowskiTech01">{{cite journal |title=Technology and institutions: What can research on information technology and research on organizations learn from each other? |journal=MIS Quarterly |author=Orlikowski, W.J.; Barley, S.R. |volume=25 |issue=2 |pages=145-165 |year=2001 |doi=10.2307/3250927}}</ref><ref name="HeidenCent07">{{cite web |url=http://www.clomedia.com/articles/centralization_versus_decentralization_a_closer_look_at_how_to_blend_both |title=Centralization Versus Decentralization: A Closer Look at How to Blend Both |author=Heiden, S. |work=Chief Learning Officer |publisher=MediaTec Publishing, Inc |date=10 December 2007 |accessdate=16 April 2015}}</ref><ref name="JainToCent13">{{cite web |url=http://www.forbes.com/sites/piyankajain/2013/02/15/to-centralize-analytics-or-not/ |title=To Centralize Analytics or Not, That is the Question |author=Jain, P. |work=Forbes |publisher=Forbes.com, LLC |date=15 February 2013 |accessdate=16 April 2015}}</ref><ref name="IngramCent15">{{cite web |url=http://smallbusiness.chron.com/centralized-vs-decentralized-organizational-design-11476.html |title=Centralized Vs. Decentralized Organizational Design |author=Ingram, D. |work=Houston Chronicle |publisher=Hearst Newspapers, LLC |date=2015 |accessdate=13 July 2015}}</ref> Each of these extremes has advantages and disadvantages. 
Centralized organizations traditionally function within a simple organizational design, with singular decision-making, top-level operational control, a consolidated budget, strong/clear communication channels, uniform culture and politics, and a high degree of efficiency, but at the expense of flexibility. Decentralized organizations, in contrast, generally operate within a complex organizational design, with distributed decision-making, local operational control, regionalized budgets, numerous weak or broken communication channels, inconsistent (and sometimes conflicting) culture and politics, and a high degree of flexibility, but at the expense of efficiency. The conceptualization, design, development, and implementation of information technology (IT) solutions for research in genomics and any biomedical field must therefore involve careful consideration of not only the needs of the user base, but also the organizational structure within which the research takes place.<br />
<br />
Herein, we present a Service-Oriented Architecture (SOA) application — termed MaPSeq — that was conceptualized and designed to address the organizational challenges of computation-intensive biomedical research within a decentralized academic institution. In this article, we first describe the challenges that contributed to the conceptualization and design of MaPSeq. We then provide an overview of the technical architecture and capabilities of MaPSeq. Finally, we provide a discussion of service-oriented solutions such as MaPSeq.<br />
<br />
==Challenges driving the conceptualization and SOA design of MaPSeq==<br />
The design of MaPSeq was motivated by challenges that arose during the implementation of a genomic sequencing project titled “North Carolina Clinical Genomic Evaluation by NextGen Exome Sequencing” (NCGENES). This project, which is funded by the National Human Genome Research Institute, aims to conduct whole exome sequencing of 500 patient samples drawn from multiple disease categories. NCGENES is a complex project, with both research and clinical arms. Soon after the project was initiated, the research and clinical teams realized that there were numerous barriers and roadblocks that needed to be overcome in order to achieve the analytical goals of the project. (See Table 1 for overview.)<br />
<br />
{| <br />
| STYLE="vertical-align:top;"|<br />
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="60%"<br />
|-<br />
| style="background-color:white; padding-left:10px; padding-right:10px;" colspan="4"|'''Table 1.''' An overview of the challenges that contributed to the architectural design of MaPSeq<br />
|-<br />
! style="padding-left:10px; padding-right:10px;"|Challenge<br />
! style="padding-left:10px; padding-right:10px;"|Description<br />
! style="padding-left:10px; padding-right:10px;"|MaPSeq SOA Solution<br />
! style="padding-left:10px; padding-right:10px;"|Benefits<br />
|-<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Challenge 1<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Diverse and evolving computational workflow needs; expanding complexity of workflows<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Different services designed to address different needs<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Flexibility; scalability; extensibility<br />
|- <br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Challenge 2<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Silos of distributed, uncoordinated compute resources; network idiosyncrasies<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Opportunistic use of distributed compute resources without need for a cloud-based software stack<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Interoperability; extensibility; generalizability<br />
|- <br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Challenge 3<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Political and cultural resistance to change; human roadblocks in the automation of workflow pipelines<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Reusable automated attributes to gradually replace human workflow processes<br />
| style="background-color:white; padding-left:10px; padding-right:10px;"|Achievability; accessibility; functionality<br />
|- <br />
|}<br />
|}<br />
<br />
===Challenge 1===<br />
Academic institutions face the challenge of balancing the needs of large, funded, research projects that typically support the development of an [[Informatics (academic field)|informatics]] infrastructure with the needs of smaller, often unfunded, research projects that cannot afford significant development costs. Furthermore, few research projects are sufficiently funded to support future development needs. Our institution faced these challenges when trying to balance the needs of the NCGENES investigative team with those of other investigative teams and anticipate future needs. The scale, general applicability, and complexity of massively parallel sequencing favored the development of an SOA approach to support both current and future needs related to genomic and non-genomic computationally-intense serial workflows.<br />
<br />
===Challenge 2===<br />
As is typical for an academic institution, our genomics infrastructure developed in an ''ad hoc'' manner, with multiple investigative teams working independently across the university campus. The result was a burgeoning, uncoordinated cluster of distributed compute resources. Compounding this challenge were the numerous network idiosyncrasies that prevented administrators within one network from accessing compute resources within a different network; thus, access privileges to campus compute resources were determined locally and required on-site (rather than remote) access.<br />
<br />
===Challenge 3===<br />
Decision-making at large academic institutions tends to be decentralized, with numerous decision makers enforcing different (and often conflicting) policies and procedures. This organizational structure inevitably leads to political and cultural conflicts and resistance to change, particularly when “external” IT teams attempt to change the processes in place among “central” investigative teams. Political and cultural resistance to the NCGENES project was encountered early on as the investigative team identified many barriers to the automation of human user-controlled workflow processes. While the existing human user-run workflows met the needs of small genomic sequencing projects and user groups, these workflows were inefficient for the computationally-demanding, whole-exome sequencing needs of NCGENES. Moreover, the use of a human contact as the point of access to an existing workflow created a roadblock to the execution of NCGENES, reduced the efficiency of genomic analysis, and threatened the security of sensitive patient data.<br />
<br />
==Existing solutions==<br />
Numerous Workflow Management Systems and workflow pipelines for genomic analysis exist, including COSMOS<ref name="GafniCOS14">{{cite journal |title=COSMOS: Python library for massively parallel workflows |journal=Bioinformatics |author=Gafni, E.; Luquette, L.J.; Lancaster, A.K. et al. |volume=30 |issue=20 |pages=2956-2958 |year=2014 |doi=10.1093/bioinformatics/btu385 |pmid=24982428 |pmc=PMC4184253}}</ref>, Ergatis<ref name="OrvisErg10">{{cite journal |title=Ergatis: A web interface and scalable software system for bioinformatics workflows |journal=Bioinformatics |author=Orvis, J.; Crabtree, J.; Galens, K. et al. |volume=26 |issue=12 |pages=1488-1492 |year=2010 |doi=10.1093/bioinformatics/btq167 |pmid=20413634 |pmc=PMC2881353}}</ref>, i2b2<ref name="KohaneATrans11">{{cite journal |title=A translational engine at the national scale: Informatics for integrating biology and the bedside |journal=Journal of the American Medical Informatics Association |author=Kohane, I.S.; Churchill, S.E.; Murphy, S.N. |volume=19 |issue=2 |pages=181-185 |year=2011 |doi=10.1136/amiajnl-2011-000492 |pmid=22081225 |pmc=PMC3277623}}</ref>, LONI<ref name="DinovApp11">{{cite journal |title=Applications of the pipeline environment for visual informatics and genomics computations |journal=BMC Bioinformatics |author=Dinov, I.D.; Torri, F.; Macciardi, F. et al. |volume=12 |pages=304 |year=2011 |doi=10.1186/1471-2105-12-304 |pmid=21791102 |pmc=PMC3199760}}</ref>, NG6<ref name="MarietteInt12">{{cite journal |title=Integrated next generation sequencing storage and processing environment |journal=BMC Genomics |author=Mariette, J.; Escudié, F.; Allias, N. et al. |volume=13 |pages=462 |year=2012 |doi=10.1186/1471-2164-13-462 |pmid=22958229 |pmc=PMC3444930}}</ref>, NGSANE<ref name="BuskeNG14">{{cite journal |title=NGSANE: A lightweight production informatics framework for high-throughput data analysis |journal=Bioinformatics |author=Buske, F.A.; French, H.J.; Smith, M.A. et al. 
|volume=30 |issue=10 |pages=1471-1472 |year=2014 |doi=10.1093/bioinformatics/btu036 |pmid=24470576 |pmc=PMC4016703}}</ref>, Orione<ref name="CuccuruOrione14">{{cite journal |title=Orione, a web-based framework for NGS analysis in microbiology |journal=Bioinformatics |author=Cuccuru, G.; Orsini, M.; Pinna, A. et al. |volume=30 |issue=13 |pages=1928-1929 |year=2014 |doi=10.1093/bioinformatics/btu135 |pmid=24618473 |pmc=PMC4071203}}</ref>, RUbioSeq<ref name="Rubio-CamarilloRU13">{{cite journal |title=RUbioSeq: A suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses |journal=Bioinformatics |author=Rubio-Camarillo, M.; Gómez-López, G.; Fernández, J.M. et al. |volume=29 |issue=13 |pages=1687-1689 |year=2013 |doi=10.1093/bioinformatics/btt203 |pmid=23630175 |pmc=PMC3694642}}</ref>, SeqInCloud<ref name="MohamedAcc13">{{cite web |url=http://synergy.cs.vt.edu/pubs/papers/nabeel-bicob13-genome-analysis-cloud.pdf |format=PDF |title=Accelerating Data-Intensive Genome Analysis in the Cloud |author=Mohamed, N.M.; Lin, H.; Feng, W.C. |publisher=Virginia Tech |date=2013 |accessdate=16 April 2015}}</ref>, STATegra EMS<ref name="DeDiegoStat14">{{cite journal |title=STATegra EMS: An experiment management system for complex next-generation omics experiments |journal=BMC Systems Biology |author=De Diego, R.H.; Boix-Chova, N.; Gómez-Cabrero, D. et al. |volume=8 |issue=Suppl 2 |pages=S9 |year=2014 |doi=10.1186/1752-0509-8-S2-S9 |pmid=25033091 |pmc=PMC4101697}}</ref>, TREVA<ref name="LiBio14">{{cite journal |title=Bioinformatics pipelines for targeted resequencing and whole-exome sequencing of human and mouse genomes: A virtual appliance approach for instant deployment |journal=PLOS ONE |author=Li, J.; Doyle, M.A.; Saeed, I. et al. 
|volume=9 |issue=4 |pages=e95217 |year=2014 |doi=10.1371/journal.pone.0095217 |pmid=24752294 |pmc=PMC3994043}}</ref>, and Pegasus.<ref name="DeelmanPeg15">{{cite journal |title=Pegasus: A workflow management system for science automation |journal=Future Generation Computer Systems |author=Deelman, E.; Vahi, K.; Juve, G. et al. |volume=46 |issue=5 |pages=17–35 |year=2015 |doi=10.1016/j.future.2014.10.008}}</ref><br />
<br />
Our team evaluated each of these systems for its ability to overcome the challenges described above. We found that existing solutions could address some, but not all, of the roadblocks and barriers that were hindering progress on the NCGENES project and that a new solution was needed. While all of the existing workflow systems and pipelines have proven to be effective, each has limitations [21]. MaPSeq is not unique in this regard, but it is responsive to the key features of a decentralized research organization. Specifically, as an SOA, MaPSeq allows for integration with multiple clients and distributed systems, whether local, open source, or commercial, and provides tailored, reusable, automated service solutions that address the varying and evolving needs and preferences of decentralized decision-makers. MaPSeq is scalable and can support both small- and large-scale projects and thus is responsive to the computational needs of all investigators. MaPSeq is efficient and allows for seamless, opportunistic use of distributed compute resources. Finally, the service-oriented, automated approach requires little coordination or communication among individual user groups and thus avoids local nuances in politics and culture. <br />
<br />
==References==<br />
{{Reflist|colwidth=30em}}<br />
<br />
==Notes==<br />
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. <br />
<br />
<!--Place all category tags here--><br />
[[Category:LIMSwiki journal articles (added in 2016)]]<br />
[[Category:LIMSwiki journal articles (all)]]<br />
[[Category:LIMSwiki journal articles on bioinformatics]]</div>