Journal:Popularity and performance of bioinformatics software: The case of gene set analysis

From LIMSWiki
Revision as of 19:20, 19 April 2021 by Shawndouglas (talk | contribs) (Saving and adding more.)
Full article title Popularity and performance of bioinformatics software: The case of gene set analysis
Journal BMC Bioinformatics
Author(s) Xie, Chengshu; Jauhari, Shaurya; Mora, Antonio
Author affiliation(s) Guangzhou Medical University, Guangzhou Institutes of Biomedicine and Health
Primary contact Online form
Year published 2021
Volume and issue 22
Article # 191
DOI 10.1186/s12859-021-04124-5
ISSN 1471-2105
Distribution license Creative Commons Attribution 4.0 International
Website https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04124-5
Download https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-021-04124-5.pdf (PDF)

Abstract

Background: Gene set analysis (GSA) is arguably the method of choice for the functional interpretation of omics results. This work explores the popularity and the performance of all the GSA methodologies and software published during the 20 years since its inception. "Popularity" is estimated according to each paper's citation counts, while "performance" is based on a comprehensive evaluation of the validation strategies used by papers in the field, as well as the consolidated results from the existing benchmark studies.

Results: Regarding popularity, data is collected into an online open database ("GSARefDB") which allows browsing bibliographic and method-descriptive information from 503 GSA paper references; regarding performance, we introduce a repository of Jupyter Notebook workflows and Shiny apps for automated benchmarking of GSA methods ("GSA-BenchmarKING"). After comparing popularity versus performance, results show discrepancies between the most popular and the best performing GSA methods.

Conclusions: The above-mentioned results draw attention to the nature of the tool selection procedures followed by researchers and raise doubts regarding the quality of the functional interpretation of biological datasets in current biomedical studies. Suggestions for the future of the functional interpretation field are made, including strategies for education and discussion of GSA tools, better validation and benchmarking practices, reproducibility, and functional re-analysis of previously reported data.

Keywords: bioinformatics, pathway analysis, gene set analysis, benchmark, GSEA

Background

Bioinformatics method and software selection is an important problem in biomedical research, due to the possible consequences of choosing the wrong methods from the myriad of computational methods and software available. Errors in software selection may include the use of outdated or suboptimal methods (or reference databases) or misunderstanding the parameters and assumptions behind the chosen methods. Such errors may affect the conclusions of the entire research project and nullify the efforts made on the rest of the experimental and computational pipeline.[1]

This paper discusses two main factors that motivate researchers to make method or software choices: popularity (defined as the perceived frequency of use of a tool among members of the community) and performance (defined as a quantitative quality indicator measured and compared to alternative tools). This study is focused on the field of "gene set analysis" (GSA), where the popularity and performance of bioinformatics software show discrepancies, and therefore the question arises whether the biomedical sciences have been using the best available GSA methods or not.

GSA is arguably the most common procedure for functional interpretation of omics data, and, for the purposes of this paper, we define it as the comparison of a query gene set (a list or a ranking of differentially expressed genes, for example) to a reference database, using a particular statistical method, in order to interpret it as a ranking of significant pathways, functionally related gene sets, or ontology terms. This definition includes the categories that have been traditionally called "gene set analysis," "pathway analysis," "ontology analysis," and "enrichment analysis." All GSA methods have a common goal, which is the interpretation of biomolecular data in terms of annotated gene sets, while they differ depending on characteristics of the computational approach (for more details, see the "Methods" section, as well as Fig. 1 of Mora[2]).

GSA has reached 20 years of existence since the original paper of Tavazoie et al.[3], and many statistical methods and software tools have been developed during this time. A popular review paper listed 68 GSA tools[4], while a second review reported an additional 33 tools[5], and a third paper 22 tools.[6] We have built the most comprehensive list of references to date (503 papers), and we have quantified each paper's influence according to its current number of citations (see Additional file 1 and Mora's GSARefDB[7]). The most common GSA methods include Over-Representation Analysis (ORA), such as that found with DAVID[8]; Functional Class Scoring (FCS), such as that found with GSEA[9]; and Pathway-Topology-based (PT) methods, such as that found with SPIA.[10] All have been extensively reviewed. To learn more about them, the reader may consult any of the 62 published reviews documented in Additional file 1. We have also recently reviewed other types of GSA methods.[2]
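
To make the definition of GSA above more concrete, the following is a minimal sketch of the ORA idea: a query gene list is compared to each gene set in a reference collection, and the sets are ranked by an enrichment p-value. It assumes the commonly used one-sided hypergeometric test as the statistical method; this choice, the function names, and the toy gene identifiers are illustrative assumptions, not the implementation of DAVID or of any other tool cited here.

```python
from math import comb


def ora_p_value(query, gene_set, background):
    """Upper-tail hypergeometric p-value for the overlap of query and gene_set.

    Assumption: all genes are drawn from a shared background (the "universe").
    """
    background = set(background)
    query = set(query) & background
    gene_set = set(gene_set) & background

    N = len(background)        # genes in the universe
    K = len(gene_set)          # genes annotated to this set
    n = len(query)             # genes in the query list
    k = len(query & gene_set)  # observed overlap

    # P(X >= k) when drawing n genes without replacement from N, K of which
    # belong to the gene set. Integer arithmetic keeps the sum exact.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def ora(query, reference, background):
    """Rank every gene set in `reference` (name -> genes) by p-value."""
    results = [(name, ora_p_value(query, genes, background))
               for name, genes in reference.items()]
    return sorted(results, key=lambda item: item[1])


# Hypothetical toy data: 20 background genes and two non-overlapping gene sets.
background = [f"g{i}" for i in range(1, 21)]
reference = {
    "set_a": ["g1", "g2", "g3", "g4", "g5"],
    "set_b": ["g6", "g7", "g8", "g9", "g10"],
}
query = ["g1", "g2", "g3", "g4", "g11"]

ranking = ora(query, reference, background)
for name, p in ranking:
    print(f"{name}\t{p:.6f}")
```

In this toy case, four of the five query genes fall in "set_a", so it ranks first, while "set_b" shares no genes with the query. The sketch omits steps that a real analysis would need, such as multiple-testing correction across gene sets and a careful choice of the background gene list.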


References

  1. Dixson, L.; Walter, H.; Schneider, M. et al. (2014). "Retraction for Dixson et al., Identification of gene ontologies linked to prefrontal-hippocampal functional coupling in the human brain". Proceedings of the National Academy of Sciences of the United States of America 111 (37): 13582. doi:10.1073/pnas.1414905111. PMC PMC4169929. PMID 25197092. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4169929.
  2. Mora, A. (2020). "Gene set analysis methods for the functional interpretation of non-mRNA data-Genomic range and ncRNA data". Briefings in Bioinformatics 21 (5): 1495-1508. doi:10.1093/bib/bbz090. PMID 31612220.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. In some cases important information was missing from the references, and that information was added.