Journal:Analyzing the field of bioinformatics with the multi-faceted topic modeling technique

Full article title: Analyzing the field of bioinformatics with the multi-faceted topic modeling technique
Journal: BMC Bioinformatics
Author(s): Heo, Go Eun; Kang, Keun Young; Song, Min; Lee, Jeong-Hoon
Author affiliation(s): Yonsei University, POSTECH
Primary contact: Email: min dot song at yonsei dot ac dot kr
Year published: 2017
Volume and issue: 18 (Suppl 7)
Page(s): 251
DOI: 10.1186/s12859-017-1640-x
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International


Background: Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as a convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying the topic structure of subject areas. However, LDA reveals only a simple topic structure; it does not relate that structure to metadata such as authors, publication dates, and journals.

Methods: In this paper, we adopt the Author-Conference-Topic (ACT) model of Tang et al. to study the field of bioinformatics from the perspective of keyphrases, authors, and journals. The ACT model is capable of incorporating the paper, author, and conference into the topic distribution simultaneously. To obtain more meaningful results, we used journals and keyphrases instead of conferences and the bag-of-words. For analysis, we used PubMed to collect forty-six bioinformatics journals from the MEDLINE database. We conducted time series topic analysis over four periods from 1996 to 2015 to further examine the interdisciplinary nature of bioinformatics.

Results: We analyzed the ACT model results in each period. Additionally, for further integrated analysis, we conducted a time series analysis among the top-ranked keyphrases, journals, and authors according to their frequency. We also examined the patterns in the top journals by simultaneously identifying the topical probability in each period, as well as the top authors and keyphrases. The results indicate that in recent years diversified topics have become more prevalent, and convergent topics have become more clearly represented.

Conclusion: The results of our analysis imply that the field of bioinformatics has become more interdisciplinary over time, with a steady increase in peripheral fields such as conceptual, mathematical, and systems biology. These results are confirmed by an integrated analysis of topic distribution as well as top-ranked keyphrases, authors, and journals.

Keywords: bioinformatics, text mining, topic modeling, ACT model, keyphrase extraction


Over the years, academic subject areas have converged to form a variety of new, interdisciplinary fields. Bioinformatics is one example. Research domains from molecular biology to machine learning are used in conjunction to better understand complex biological systems such as cells, tissues, and the human body. Due to the complexity and broadness of the field, bibliometric analysis is often adopted to assess the current knowledge structure of a subject area, specify the current research themes, and identify the core literature of that area.[1]

Bibliometrics identifies research trends using quantitative measures such as a researcher’s number of publications and citations, journal impact factors, and other indices that can measure the impact or productivity of an author or journal.[2][3][4][5] In addition, other factors such as the affiliation of authors, collaborations, and citation data are often incorporated into bibliometric analysis.[6][7][8][9]

Previous studies rely mainly on quantitative measures and lack content analysis. To incorporate content analysis into bibliometrics, text-mining techniques are applied. Topic-modeling techniques in particular are widely adopted to identify the topics of a subject area and to analyze that area in greater depth.[10][11][12][13] These techniques allow for enriched content analysis. As an extension of Latent Dirichlet Allocation (LDA), the most widely adopted topic-modeling technique, Steyvers et al.[14] proposed the author-topic modeling technique, which analyzes authors and topics simultaneously. It has been used to identify the impact or productivity of researchers in a given subject area.[15][16] By adding multiple conditions to LDA, Tang et al.[17] proposed the Author-Conference-Topic (ACT) model, which analyzes the author, conference, and topic in one model to understand a subject area in an integrated manner.
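
As a rough illustration of how such models tie topics to authors and venues, the Python sketch below simulates a generative process in the spirit of the author-topic model, extended with a venue draw as in ACT-style models. It is a toy under stated assumptions, not the authors' implementation: the function names, the probability tables, and the choice to draw the journal from the sampled topic are all hypothetical, and in practice the distributions are estimated from data rather than fixed by hand.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_paper(authors, theta, phi, psi, n_tokens, rng):
    """Simulate one paper as a list of (author, topic, keyphrase, journal) draws.

    theta[author] : topic distribution of that author (assumed given)
    phi[topic]    : keyphrase distribution of that topic (assumed given)
    psi[topic]    : journal distribution of that topic (assumed given)
    """
    tokens = []
    for _ in range(n_tokens):
        author = rng.choice(authors)                     # uniform over the paper's authors
        topic = sample_categorical(theta[author], rng)   # topic conditioned on the author
        keyphrase = sample_categorical(phi[topic], rng)  # keyphrase conditioned on the topic
        journal = sample_categorical(psi[topic], rng)    # journal conditioned on the topic
        tokens.append((author, topic, keyphrase, journal))
    return tokens


# Hypothetical toy parameters: 2 authors, 2 topics, 3 keyphrases, 2 journals.
THETA = {"author_a": [0.9, 0.1], "author_b": [0.2, 0.8]}
PHI = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
PSI = [[0.8, 0.2], [0.3, 0.7]]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that repeated runs give the same draws
    for draw in generate_paper(["author_a", "author_b"], THETA, PHI, PSI, 5, rng):
        print(draw)
```

Fitting a model of this kind reverses the process: given the observed keyphrases, authors, and journals, the hidden topic assignments and the three distributions are inferred together, which is what allows topics to be read from the perspective of authors and journals rather than words alone.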


  1. Dong, D.; Chen, M.-L. (2015). "Publication trends and co-citation mapping of translation studies between 2000 and 2015". Scientometrics 105 (2): 1111–1128. doi:10.1007/s11192-015-1769-1. 
  2. Chen, H.; Wan, Y.; Jiang, S.; Cheng, Y. (2014). "Alzheimer’s disease research in the future: bibliometric analysis of cholinesterase inhibitors from 1993 to 2012". Scientometrics 98 (3): 1865–1877. doi:10.1007/s11192-013-1132-3. 
  3. Soteriades, E.S.; Falagas, M.E. (2006). "A bibliometric analysis in the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health". BMC Public Health 6: 301. doi:10.1186/1471-2458-6-301. PMC PMC1766930. PMID 17173665. 
  4. Ugolini, D.; Puntoni, R.; Perera, F.P. et al. (2007). "A bibliometric analysis of scientific production in cancer molecular epidemiology". Carcinogenesis 28 (8). doi:10.1093/carcin/bgm129. PMID 17548902. 
  5. Wang, L.; Chen, X.; Bao, A. et al. (2015). "A bibliometric analysis of research on Central Asia during 1990–2014". Scientometrics 105 (2): 1223–1237. doi:10.1007/s11192-015-1727-y. 
  6. Bornmann, L.; Mutz, R. (2015). "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references". Journal of the Association for Information Science and Technology 66 (11): 2215–2222. doi:10.1002/asi.23329. 
  7. Geaney, F.; Scutaru, C.; Kelly, C. et al. (2015). "Type 2 Diabetes Research Yield, 1951-2012: Bibliometrics Analysis and Density-Equalizing Mapping". PLoS One 10 (7): e0133009. doi:10.1371/journal.pone.0133009. PMC PMC4514795. PMID 26208117. 
  8. Macías-Chapula, C.A.; Mijangos-Nolasco, A. (2002). "Bibliometric analysis of AIDS literature in Central Africa". Scientometrics 54 (2): 309–317. doi:10.1023/A:1016074230843. 
  9. Seglen, P.O.; Aksnes, D.W. (2000). "Scientific Productivity and Group Size: A Bibliometric Analysis of Norwegian Microbiological Research". Scientometrics 49 (1): 125–143. doi:10.1023/A:1005665309719. 
  10. Jeong, D.-H.; Song, M. (2014). "Time gap analysis by the topic model-based temporal technique". Journal of Informetrics 8 (3): 776–790. doi:10.1016/j.joi.2014.07.005. 
  11. Song, M.; Kim, S.Y. (2013). "Detecting the knowledge structure of bioinformatics by mining full-text collections". Scientometrics 96 (1): 183–201. doi:10.1007/s11192-012-0900-9. 
  12. Song, M.; Kim, S.Y.; Zhang, G. et al. (2014). "Productivity and influence in bioinformatics: A bibliometric analysis using PubMed central". Journal of the Association for Information Science and Technology 65 (2): 352–371. doi:10.1002/asi.22970. 
  13. Yan, E. (2015). "Research dynamics, impact, and dissemination: A topic-level analysis". Journal of the Association for Information Science and Technology 66 (11): 2357–2372. doi:10.1002/asi.23324. 
  14. Steyvers, M.; Smyth, P.; Rosen-Zvi, M.; Griffiths, T. (2004). "Probabilistic author-topic models for information discovery". Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004: 306–315. doi:10.1145/1014052.1014087. 
  15. Li, D.; Okamoto, J.; Liu, H.; Leischow, S. (2015). "A bibliometric analysis on tobacco regulation investigators". BioData Mining 8: 11. doi:10.1186/s13040-015-0043-7. PMC PMC4432889. PMID 25984237. 
  16. Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; Smyth, P. (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence 2004: 487–494. ISBN 0974903906. 
  17. Tang, J.; Zhang, J.; Yao, L. et al. (2008). "ArnetMiner: Extraction and mining of academic social networks". Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2008: 990–998. doi:10.1145/1401890.1402008. 


This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Grammar and word usage were updated to make the text easier to read.