Journal:Analyzing the field of bioinformatics with the multi-faceted topic modeling technique

Full article title: Analyzing the field of bioinformatics with the multi-faceted topic modeling technique
Journal: BMC Bioinformatics
Author(s): Heo, Go Eun; Kang, Keun Young; Song, Min; Lee, Jeong-Hoon
Author affiliation(s): Yonsei University, POSTECH
Primary contact: Email: min dot song at yonsei dot ac dot kr
Year published: 2017
Volume and issue: 18 (Suppl 7)
Page(s): 251
DOI: 10.1186/s12859-017-1640-x
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International


Background: Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as a convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying the topic structure of subject areas. However, LDA reveals only a simple topic structure; it does not relate topics to metadata such as authors, publication dates, and journals.

Methods: In this paper, we adopt the Author-Conference-Topic (ACT) model of Tang et al. to study the field of bioinformatics from the perspective of keyphrases, authors, and journals. The ACT model is capable of incorporating the paper, author, and conference into the topic distribution simultaneously. To obtain more meaningful results, we used journals and keyphrases instead of conferences and the bag-of-words. For analysis, we used PubMed to collect forty-six bioinformatics journals from the MEDLINE database. We conducted time series topic analysis over four periods from 1996 to 2015 to further examine the interdisciplinary nature of bioinformatics.

Results: We analyzed the ACT model results in each period. For further integrated analysis, we also conducted a time series analysis among the top-ranked keyphrases, journals, and authors according to their frequency. In addition, we examined the patterns in the top journals by simultaneously identifying the topical probability in each period, as well as the top authors and keyphrases. The results indicate that in recent years diversified topics have become more prevalent, and convergent topics have become more clearly represented.

Conclusion: The results of our analysis imply that the field of bioinformatics has become more interdisciplinary over time, with a steady increase in peripheral fields such as conceptual, mathematical, and systems biology. These results are confirmed by integrated analysis of topic distribution as well as top-ranked keyphrases, authors, and journals.

Keywords: bioinformatics, text mining, topic modeling, ACT model, keyphrase extraction


Over the years, academic subject areas have converged to form a variety of new, interdisciplinary fields. Bioinformatics is one example. Research domains from molecular biology to machine learning are used in conjunction to better understand complex biological systems such as cells, tissues, and the human body. Because of the complexity and breadth of the field, bibliometric analysis is often adopted to assess the current knowledge structure of a subject area, specify the current research themes, and identify the core literature of that area.[1]

Bibliometrics identifies research trends using quantitative measures such as a researcher’s number of publications and citations, journal impact factors, and other indices that measure the impact or productivity of an author or journal.[2][3][4][5] In addition, other factors such as the affiliation of authors, collaborations, and citation data are often incorporated into bibliometric analysis.[6][7][8][9]

Previous studies rely mainly on quantitative measures and lack content analysis. To incorporate content analysis into bibliometrics, text-mining techniques are applied. Topic-modeling techniques in particular are widely adopted to identify the topics of a subject area and to analyze that area in greater depth.[10][11][12][13] These techniques allow for enriched content analysis. As an extension of Latent Dirichlet Allocation (LDA), the most widely accepted topic-modeling technique, Steyvers et al.[14] proposed the author-topic model, which analyzes authors and topics simultaneously. This technique identifies the impact or productivity of researchers in a given subject area.[15][16] By adding multiple conditions to LDA, Tang et al.[17] proposed a new methodology, the Author-Conference-Topic (ACT) model, which analyzes the author, conference, and topic in one model to understand the subject area in an integrated manner.
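To make the idea of a joint author, venue, and topic model more concrete, the following toy sketch samples data from a simplified ACT-style generative process: for each token, an author of the paper is chosen, a topic is drawn from that author's topic distribution, and a keyphrase and a journal are then drawn from that topic. This is only an illustration under stated assumptions. All distributions, author names, topic labels, and journal names are hypothetical, the function names are invented for this example, and parameter inference (for instance, by Gibbs sampling) is not shown.

```python
import random

# Hypothetical toy parameters (not taken from the paper).
# Each inner dict is a probability distribution that sums to 1.
THETA = {  # author -> distribution over topics
    "author_a": {"sequence analysis": 0.8, "systems biology": 0.2},
    "author_b": {"sequence analysis": 0.3, "systems biology": 0.7},
}
PHI = {  # topic -> distribution over keyphrases
    "sequence analysis": {"sequence alignment": 0.6, "hidden markov model": 0.4},
    "systems biology": {"gene regulatory network": 0.5, "metabolic pathway": 0.5},
}
PSI = {  # topic -> distribution over journals
    "sequence analysis": {"journal_x": 0.9, "journal_y": 0.1},
    "systems biology": {"journal_x": 0.2, "journal_y": 0.8},
}


def draw(distribution, rng):
    """Draw one item from a {item: probability} dict."""
    items = list(distribution)
    weights = [distribution[item] for item in items]
    return rng.choices(items, weights=weights, k=1)[0]


def generate_paper(authors, n_tokens, rng):
    """Sample (author, topic, keyphrase, journal) tuples for one paper."""
    samples = []
    for _ in range(n_tokens):
        author = rng.choice(authors)       # pick a co-author uniformly
        topic = draw(THETA[author], rng)   # topic from that author's mixture
        keyphrase = draw(PHI[topic], rng)  # keyphrase from the topic
        journal = draw(PSI[topic], rng)    # journal stamp from the same topic
        samples.append((author, topic, keyphrase, journal))
    return samples


if __name__ == "__main__":
    for row in generate_paper(["author_a", "author_b"], 5, random.Random(0)):
        print(row)
```

Because the keyphrase and the journal are both drawn from the same sampled topic, a fitted model of this shape can rank authors, journals, and keyphrases per topic, which is the integrated view the ACT model is described as providing.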

In this paper, we apply the ACT model to examine the interdisciplinary nature of bioinformatics. Unlike studies that use extended versions of LDA for topic analysis, the ACT model enables us to analyze topic, author, and journal at one time, providing an integrated view for understanding bioinformatics. The research questions that we investigate in this paper are: 1) What are the topical trends of bioinformatics over time? 2) Who are the key contributors in major topics of bioinformatics? 3) Which journal is leading which topic?

To address these questions, we collect PubMed articles in XML format and extract metadata and content such as the PMID, author, year, journal, title, and abstract. From the title and abstract, we extract keyphrases, which provide more meaningful interpretations than single words, as input to the ACT model. We also divide the collected datasets into four time periods to examine the topic changes over time. The results of ACT model–based analysis show that various topics begin to appear and mixed subject topics become more apparent over time.
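The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes records in the standard PubMed XML layout (`PubmedArticle`, `MedlineCitation`, `Article`, and so on), the function names are invented for this example, the sample record is made up, and the four equal five-year windows are an assumption, since the text states only that the 1996 to 2015 span was split into four periods.

```python
import xml.etree.ElementTree as ET

# Assumed period boundaries (equal five-year windows); the source gives
# only "four periods from 1996 to 2015".
PERIODS = [(1996, 2000), (2001, 2005), (2006, 2010), (2011, 2015)]


def period_of(year):
    """Return the index of the period containing year, or None."""
    for index, (start, end) in enumerate(PERIODS):
        if start <= year <= end:
            return index
    return None


def full_text(node):
    """Concatenate all text inside a node, including inline markup."""
    return "".join(node.itertext()).strip() if node is not None else ""


def parse_records(xml_text):
    """Extract PMID, authors, year, journal, title, and abstract."""
    records = []
    for entry in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = entry.find("MedlineCitation")
        article = citation.find("Article")
        year_text = article.findtext("Journal/JournalIssue/PubDate/Year")
        # Some records carry a free-text MedlineDate instead of Year.
        year = int(year_text) if year_text and year_text.isdigit() else None
        authors = []
        for author in article.iterfind("AuthorList/Author"):
            last = author.findtext("LastName")
            fore = author.findtext("ForeName")
            if last:
                authors.append(f"{last}, {fore}" if fore else last)
        abstract = " ".join(
            full_text(part) for part in article.iterfind("Abstract/AbstractText")
        )
        records.append({
            "pmid": citation.findtext("PMID"),
            "authors": authors,
            "year": year,
            "period": period_of(year) if year is not None else None,
            "journal": article.findtext("Journal/Title"),
            "title": full_text(article.find("ArticleTitle")),
            "abstract": abstract,
        })
    return records


# A made-up record used only to exercise the parser.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2003</Year></PubDate></JournalIssue>
          <Title>Hypothetical Journal</Title>
        </Journal>
        <ArticleTitle>A made-up title</ArticleTitle>
        <Abstract><AbstractText>A made-up abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

if __name__ == "__main__":
    print(parse_records(SAMPLE))
```

Keyphrase extraction from the title and abstract would follow as a separate step and is not shown here.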

The rest of the paper is organized as follows. In the "Related work" section, we discuss work related to bibliometric analysis and topic modeling. We then describe the proposed method in the "Methods" section. We analyze and discuss the results of leading topics, authors, and journals in the "Results" and "Discussion" sections. Finally, we conclude the paper and suggest future lines of inquiry in the "Conclusion" section.

Related work

Bibliometric analysis

Bibliometric analysis identifies the research trends in a given subject area, along with its core journals or documents, and it supports comparative analysis. Many bibliometric studies use the number of published articles or journal impact factors to measure research productivity or to identify core journals in a specific field. Soteriades and Falagas[3] applied quantitative and qualitative measurements to analyze the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health using the number of articles and impact factor. Ugolini et al.[4] measured research productivity and evaluated the publication trends in the field of cancer molecular epidemiology. To quantify productivity, they used the number of articles and the average and sum of impact factors. To evaluate publication trends, they collected keywords from the MeSH terms of the publications and divided them into six groups. Ramos et al.[18] measured national research activity in the tuberculosis field, using impact factor and the first author’s address. Claude et al.[19] examined research productivity using the distribution of publications on artificial neural networks (ANN) in medicine and biology. They used the number of publications, impact factor, and journal category compared with national gross domestic product (GDP). In the bioinformatics field, Patra and Mishra[20] used the number of articles, publications per journal, publication type, and the impact factor of journals to understand the growth of bioinformatics. They also identified the core journals in the bioinformatics field. Using author affiliation, they applied Lotka’s law to assess the distribution of each author’s productivity. Chen et al.[2] identified research trends using statistical methods based on the type of publication, language, and distribution by nation or institution. They also measured the h-index, supplementing these statistics with citation counts. Through this, they analyzed research productivity by topic, institution, and journal. In addition, they conducted a keyword analysis to understand the research trend from a macroscopic view.
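Two of the measures mentioned above, the h-index and Lotka's law, are simple enough to state in a few lines. The sketch below uses the standard definitions: the h-index is the largest h such that h papers have at least h citations each, and Lotka's law estimates the number of authors with n papers as the number of single-paper authors divided by n raised to an exponent near 2. The function names, the default exponent of 2.0, and the sample numbers are illustrative assumptions, not values from the cited studies.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def lotka_expected(authors_with_one_paper, n_papers, exponent=2.0):
    """Expected number of authors with n_papers under Lotka's law.

    The exponent is an empirical parameter; 2.0 is the classic value
    and is used here only as an assumed default.
    """
    return authors_with_one_paper / (n_papers ** exponent)


if __name__ == "__main__":
    print(h_index([10, 8, 5, 4, 3]))   # 4
    print(lotka_expected(100, 2))      # 25.0
```

In practice the exponent is fitted to the observed author productivity distribution rather than fixed in advance.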

Mainstream bibliometrics research focuses on identifying the knowledge structure of a given field with quantitative measures. In addition, some studies use author information or collaboration patterns among authors to understand a field. Seglen and Aksnes[9] used the size and the productivity of research groups in the microbiology field in Norway as a measurement for bibliometric analysis. Geaney et al.[7] performed bibliometric analysis and density-equalizing mapping on scientific publications related to type 2 diabetes mellitus. They collected citation data and used various citation-oriented measures such as the number of citations, the average number of citations per journal, the total number of publications, impact factor, and Eigenfactor score. To conduct content analysis and study the collaboration pattern between authors and the core sub-field of AIDS, Macías-Chapula and Mijangos-Nolasco[8] analyzed the MeSH thesaurus using check tags, main headings, and subheadings of each MeSH term hierarchy. In addition, to measure national research productivity, they used the authors’ address information. Bornmann and Mutz[6] recently traced the development of modern science through bibliometric analysis. They divided the data into three time periods to analyze how fields changed over time.

Text mining applied to bibliometrics

Recently, there have been many attempts to apply text-mining techniques to bibliometric analysis to identify the knowledge structure of a field or to measure its productivity and its influence on other researchers and fields. Song and Kim[11] collected full-text articles from PubMed Central and computed their citation relations to infer the knowledge structure and understand trends in the bioinformatics field. In a similar vein, Song et al.[12] measured the influence and productivity of bioinformatics by mining full-text articles retrieved from PubMed Central. To calculate the field’s productivity, they identified the most productive author, nation, institution, and topic word; to calculate its influence, they identified the most-cited paper, author, and rising researcher. Song et al.[21] analyzed topic evolution in the bioinformatics field using DBLP conference data from the field of computer science. To identify topic trends over time, they divided a 12-year span (2000–2011) into four periods and applied Markov Random Field-based topic clustering. For automatic cluster labeling, they calculated topic similarity based on Within-Period Cluster Similarity (WPCS) and Between-Period Cluster Similarity (BPCS). Their approach created topic graphs that show the interaction among topics over time. Lee et al.[22] mapped the Alzheimer’s disease field from three different perspectives: indexer, author, and citer. They applied entity-metrics,[23] an extended notion of bibliometrics, to analyze the field by constructing four kinds of networks that convey these three perspectives.


  1. Dong, D.; Chen, M.-L. (2015). "Publication trends and co-citation mapping of translation studies between 2000 and 2015". Scientometrics 105 (2): 1111–1128. doi:10.1007/s11192-015-1769-1. 
  2. Chen, H.; Wan, Y.; Jiang, S.; Cheng, Y. (2014). "Alzheimer’s disease research in the future: bibliometric analysis of cholinesterase inhibitors from 1993 to 2012". Scientometrics 98 (3): 1865–1877. doi:10.1007/s11192-013-1132-3. 
  3. Soteriades, E.S.; Falagas, M.E. (2006). "A bibliometric analysis in the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health". BMC Public Health 6: 301. doi:10.1186/1471-2458-6-301. PMC PMC1766930. PMID 17173665. 
  4. Ugolini, D.; Puntoni, R.; Perera, F.P. et al. (2007). "A bibliometric analysis of scientific production in cancer molecular epidemiology". Carcinogenesis 28: 8. doi:10.1093/carcin/bgm129. PMID 17548902. 
  5. Wang, L.; Chen, X.; Bao, A. et al. (2015). "A bibliometric analysis of research on Central Asia during 1990–2014". Scientometrics 105 (2): 1223–1237. doi:10.1007/s11192-015-1727-y. 
  6. Bornmann, L.; Mutz, R. (2015). "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references". Journal of the Association for Information Science and Technology 66 (11): 2215–2222. doi:10.1002/asi.23329. 
  7. Geaney, F.; Scutaru, C.; Kelly, C. et al. (2015). "Type 2 Diabetes Research Yield, 1951-2012: Bibliometrics Analysis and Density-Equalizing Mapping". PLoS One 10: 7. doi:10.1371/journal.pone.0133009. PMC PMC4514795. PMID 26208117. 
  8. Macías-Chapula, C.A.; Mijangos-Nolasco, A. (2002). "Bibliometric analysis of AIDS literature in Central Africa". Scientometrics 54 (2): 309–317. doi:10.1023/A:1016074230843. 
  9. Seglen, P.O.; Aksnes, D.W. (2000). "Scientific Productivity and Group Size: A Bibliometric Analysis of Norwegian Microbiological Research". Scientometrics 49 (1): 125–143. doi:10.1023/A:1005665309719. 
  10. Jeong, D.-H.; Song, M. (2014). "Time gap analysis by the topic model-based temporal technique". Journal of Informetrics 8 (3): 776–790. doi:10.1016/j.joi.2014.07.005. 
  11. Song, M.; Kim, S.Y. (2013). "Detecting the knowledge structure of bioinformatics by mining full-text collections". Scientometrics 96 (1): 183–201. doi:10.1007/s11192-012-0900-9. 
  12. Song, M.; Kim, S.Y.; Zhang, G. et al. (2014). "Productivity and influence in bioinformatics: A bibliometric analysis using PubMed central". Journal of the Association for Information Science and Technology 65 (2): 352–371. doi:10.1002/asi.22970. 
  13. Yan, E. (2015). "Research dynamics, impact, and dissemination: A topic-level analysis". Journal of the Association for Information Science and Technology 66 (11): 2357–2372. doi:10.1002/asi.23324. 
  14. Steyvers, M.; Smyth, P.; Rosen-Zvi, M.; Griffiths, T. (2004). "Probabilistic author-topic models for information discovery". Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004: 306–315. doi:10.1145/1014052.1014087. 
  15. Li, D.; Okamoto, J.; Liu, H.; Leischow, S. (2015). "A bibliometric analysis on tobacco regulation investigators". BioData Mining 8: 11. doi:10.1186/s13040-015-0043-7. PMC PMC4432889. PMID 25984237. 
  16. Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; Smyth, P. (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence 2004: 487–494. ISBN 0974903906. 
  17. Tang, J.; Zhang, J.; Yao, L. et al. (2008). "ArnetMiner: Extraction and mining of academic social networks". Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2008: 990-998. doi:10.1145/1401890.1402008. 
  18. Ramos, J.M.; Padilla, S.; Masiá, M.; Gutiérrez, F. (2008). "A bibliometric analysis of tuberculosis research indexed in PubMed, 1997-2006". International Journal of Tuberculosis and Lung Disease 2008: 1461–8. PMID 19017458. 
  19. Claude, R.; Charles-Daniel, A.; Jean, A.; Jean-Francois, G. (2004). "Bibliometric overview of the utilization of artificial neural networks in medicine and biology". Scientometrics 59 (1): 117–130. doi:10.1023/B:SCIE.0000013302.59845.34. 
  20. Patra, S.K.; Mishra, S. (2006). "Bibliometric study of bioinformatics literature". Scientometrics 67 (3): 477–489. doi:10.1556/Scient.67.2006.3.9. 
  21. Song, M.; Heo, G.E.; Kim, S.Y. (2014). "Analyzing topic evolution in bioinformatics: Investigation of dynamics of the field with conference data in DBLP". Scientometrics 101 (1): 397–428. doi:10.1007/s11192-014-1246-2. 
  22. Lee, D.; Kim, W.C.; Charidimou, A.; Song, M. (2015). "A Bird's-Eye View of Alzheimer's Disease Research: Reflecting Different Perspectives of Indexers, Authors, or Citers in Mapping the Field". Journal of Alzheimer's Disease 45 (4): 1207-22. doi:10.3233/JAD-142688. PMID 25697702. 
  23. Ding, Y.; Song, M.; Han, J. et al. (2013). "Entitymetrics: measuring the impact of entities". PLoS One 8 (8): e71416. doi:10.1371/journal.pone.0071416. PMC PMC3756961. PMID 24009660. 


This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Grammar and word usage were updated to make the text easier to read.