Analyzing the field of bioinformatics with the multi-faceted topic modeling technique

From LIMSWiki
Revision as of 23:59, 12 June 2017 by Shawndouglas (talk | contribs) (Saving and adding more.)
Full article title: Analyzing the field of bioinformatics with the multi-faceted topic modeling technique
Journal: BMC Bioinformatics
Author(s): Heo, Go Eun; Kang, Keun Young; Song, Min; Lee, Jeong-Hoon
Author affiliation(s): Yonsei University, POSTECH
Primary contact: Email: min dot song at yonsei dot ac dot kr
Year published: 2017
Volume and issue: 18 (Suppl 7)
Page(s): 251
DOI: 10.1186/s12859-017-1640-x
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International


Abstract

Background: Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as a convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying the topic structure of subject areas. However, LDA reveals only a simple topic structure; it does not relate topics to metadata such as authors, publication dates, and journals.

Methods: In this paper, we adopt the Author-Conference-Topic (ACT) model of Tang et al. to study the field of bioinformatics from the perspective of keyphrases, authors, and journals. The ACT model is capable of incorporating the paper, author, and conference into the topic distribution simultaneously. To obtain more meaningful results, we used journals and keyphrases instead of conferences and the bag-of-words. For analysis, we used PubMed to collect forty-six bioinformatics journals from the MEDLINE database. We conducted time series topic analysis over four periods from 1996 to 2015 to further examine the interdisciplinary nature of bioinformatics.

Results: We analyzed the ACT Model results in each period. Additionally, for further integrated analysis, we conducted a time series analysis among the top-ranked keyphrases, journals, and authors according to their frequency. We also examined the patterns in the top journals by simultaneously identifying the topical probability in each period, as well as the top authors and keyphrases. The results indicate that in recent years diversified topics have become more prevalent, and convergent topics have become more clearly represented.

Conclusion: The results of our analysis imply that the field of bioinformatics has become more interdisciplinary over time, with a steady increase in peripheral fields such as conceptual, mathematical, and systems biology. These results are confirmed by integrated analysis of topic distribution as well as top-ranked keyphrases, authors, and journals.

Keywords: bioinformatics, text mining, topic modeling, ACT model, keyphrase extraction


Over the years, academic subject areas have converged to form a variety of new, interdisciplinary fields. Bioinformatics is one example. Research domains from molecular biology to machine learning are used in conjunction to better understand complex biological systems such as cells, tissues, and the human body. Because of the complexity and breadth of the field, bibliometric analysis is often adopted to assess the current knowledge structure of a subject area, specify its current research themes, and identify its core literature.[1]

Bibliometrics identifies research trends using quantitative measures such as a researcher’s number of publications and citations, journal impact factors, and other indices that measure the impact or productivity of an author or journal.[2][3][4][5] In addition, other factors such as author affiliations, collaborations, and citation data are often incorporated into bibliometric analysis.[6][7][8][9]

Previous studies rely mainly on quantitative measures and lack content analysis. To incorporate content analysis into bibliometrics, text-mining techniques are applied. In particular, topic-modeling techniques are widely adopted to identify the topics of a subject area and to analyze that area in greater depth.[10][11][12][13] These techniques allow for enriched content analysis. As an extension of Latent Dirichlet Allocation (LDA), the most widely adopted topic-modeling technique, Steyvers et al.[14] proposed the author-topic model, which analyzes authors and topics simultaneously. This model has been used to identify the impact or productivity of researchers in a given subject area.[15][16] By adding further conditions to LDA, Tang et al.[17] proposed the Author-Conference-Topic (ACT) model, which analyzes author, conference, and topic in one model to understand a subject area in an integrated manner.

In this paper, we apply the ACT model to examine the interdisciplinary nature of bioinformatics. Unlike studies that use extended versions of LDA for topic analysis, the ACT model enables us to analyze topic, author, and journal at one time, providing an integrated view for understanding bioinformatics. The research questions that we investigate in this paper are: 1) What are the topical trends of bioinformatics over time? 2) Who are the key contributors in major topics of bioinformatics? 3) Which journal is leading which topic?

To address these questions, we collect PubMed articles in XML format and extract metadata and content such as the PMID, author, year, journal, title, and abstract. From the title and abstract, we extract keyphrases, which provide more meaningful interpretations than single words, as input to the ACT model. We also divide the collected dataset into four time periods to examine topic changes over time. The results of the ACT model–based analysis show that more varied topics appear and mixed-subject topics become more apparent over time.
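The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the sample record is invented, the function name `parse_records` is hypothetical, and the element paths assume the standard PubMed XML layout (`PubmedArticle`, `MedlineCitation`, `Article`, and so on).

```python
import xml.etree.ElementTree as ET

# Invented sample record; element names assume the standard PubMed XML layout.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2004</Year></PubDate></JournalIssue>
          <Title>Hypothetical Journal of Bioinformatics</Title>
        </Journal>
        <ArticleTitle>A made-up title about sequence alignment</ArticleTitle>
        <Abstract><AbstractText>A made-up abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Return one dict per article with PMID, authors, year, journal, title, abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        # Build "LastName, ForeName" strings; missing parts become empty.
        authors = [
            f"{a.findtext('LastName', '')}, {a.findtext('ForeName', '')}".strip(", ")
            for a in citation.iter("Author")
        ]
        # An abstract may be split into several AbstractText elements.
        abstract = " ".join(
            "".join(t.itertext()).strip() for t in citation.iter("AbstractText")
        )
        year = citation.findtext("Article/Journal/JournalIssue/PubDate/Year")
        records.append({
            "pmid": citation.findtext("PMID"),
            "authors": authors,
            "year": int(year) if year else None,
            "journal": citation.findtext("Article/Journal/Title"),
            "title": citation.findtext("Article/ArticleTitle"),
            "abstract": abstract,
        })
    return records


records = parse_records(SAMPLE_XML)
print(records[0]["pmid"], records[0]["year"], records[0]["authors"])
```

Real PubMed records vary (for example, some give the publication date as free text rather than a `Year` element), so a production parser would need more fallbacks than this sketch has.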

The rest of the paper is organized as follows. In the "Related work" section, we discuss work related to bibliometric analysis and topic modeling. We then describe the proposed method in the "Methods" section. We analyze and discuss the results of leading topics, authors, and journals in the "Results" and "Discussion" sections. Finally, we conclude the paper and suggest future lines of inquiry in the "Conclusion" section.

Related work

Bibliometric analysis

Bibliometric analysis identifies the research trends and the core journals or documents in a given subject area, and it supports comparative analysis. Many bibliometric studies use the number of published articles or journal impact factors to measure research productivity or to identify core journals in a specific field. Soteriades and Falagas[3] applied quantitative and qualitative measurements to analyze the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health using the number of articles and impact factor. Ugolini et al.[4] measured research productivity and evaluated the publication trends in the field of cancer molecular epidemiology. To quantify productivity, they used the number of articles and the average and sum of impact factors. To evaluate publication trends, they collected the MeSH keywords of the publications and divided them into six groups. Ramos et al.[18] measured national research activity in the tuberculosis field, using impact factor and the first author’s address. Claude et al.[19] examined research productivity using the distribution of publications on artificial neural networks (ANN) in medicine and biology. They used the number of publications, impact factor, and journal category, compared with national gross domestic product (GDP). In the bioinformatics field, Patra and Mishra[20] used the number of articles, the publications of each journal, publication type, and the impact factor of journals to understand the growth of bioinformatics. They also identified the core journals in the bioinformatics field. Using author affiliation, they applied Lotka’s law to assess the distribution of author productivity. Chen et al.[2] identified research trends using statistical methods based on the type of publication, language, and distribution by nation or institution. They measured the h-index and supplemented it with citation counts. Through this, they analyzed research productivity by topic, institution, and journal. In addition, they conducted a keyword analysis to understand the research trend from a macroscopic view.
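As a concrete illustration of one of the measures mentioned above, the h-index is the largest number h such that h of an author's papers each have at least h citations. The following is a minimal sketch with invented citation counts; the function name `h_index` is ours and is not taken from any of the cited studies.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here, so stop
    return h


# Invented citation counts for a hypothetical author.
print(h_index([10, 8, 5, 4, 3]))
```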

Mainstream bibliometrics research focuses on identifying the knowledge structure of a field with quantitative measures. In addition, some studies use author information or the collaboration pattern among authors to understand a given field. Seglen and Aksnes[9] used the size and productivity of research groups in the microbiology field in Norway as a measurement for bibliometric analysis. Geaney et al.[7] performed bibliometric analysis and density-equalizing mapping on scientific publications related to type 2 diabetes mellitus. They collected citation data and used various citation-oriented measures such as the number of citations, the average number of citations per journal, the total number of publications, impact factor, and Eigenfactor score. To conduct content analysis and study the collaboration pattern between authors and the core sub-field of AIDS, Macías-Chapula and Mijangos-Nolasco[8] analyzed the MeSH thesaurus using check tags, main headings, and subheadings of each MeSH term hierarchy. In addition, to measure national research productivity, they used the authors’ address information. More recently, Bornmann and Mutz[6] traced the development of modern science through bibliometric analysis. They divided the data into three time periods to analyze how fields change over time.

Text mining applied to bibliometrics

Recently, there have been many attempts to apply text-mining techniques to bibliometric analysis to identify the knowledge structure of a field or to measure its productivity and its influence on other researchers and fields. Song and Kim[11] collected full-text articles from PubMed Central and computed their citation relations. From these, they inferred the knowledge structure and trends of the bioinformatics field. In a similar vein, Song et al.[12] measured the influence and productivity of bioinformatics by mining full-text articles retrieved from PubMed Central. To calculate the field’s productivity, they identified the most productive author, nation, institution, and topic word; to calculate its influence, they identified the most-cited paper, author, and rising researcher. Song et al.[21] analyzed topic evolution in the bioinformatics field using conference data from DBLP, a computer science bibliography. To identify topic trends over time, they divided a twelve-year span (2000–2011) into four periods and applied Markov Random Field-based topic clustering. For automatic cluster labeling, they calculated topic similarity based on Within-Period Cluster Similarity (WPCS) and Between-Period Cluster Similarity (BPCS). Their approach created topic graphs that show the interaction among topics over a period of time. Lee et al.[22] mapped the Alzheimer’s disease field from three different perspectives: indexer, author, and citer. They applied entitymetrics,[23] an extended notion of bibliometrics, to analyze the field by constructing four kinds of networks that convey these three perspectives.

These studies identify the knowledge structure of a field by constructing bibliometric networks or databases with text-mining techniques. The most prevalent approach is to apply topic modeling to content analysis as a part of bibliometrics. Building on the probabilistic Latent Semantic Indexing (pLSI)[24] model, Latent Dirichlet Allocation (LDA)[25] has become the most widely accepted topic modeling technique for bibliometrics. In pLSI, each document is represented as a mixture of topics; LDA extends this by placing Dirichlet priors on the topic distributions. Yan[13] used the LDA model to measure the influence and popularity of library and information science. He also identified the most-cited area and the patterns in this field. Jeong and Song[10] measured the time gap among three different resources (web, patent, and scientific publication) in two research domains by applying the LDA model. The basic input unit for LDA is a set of documents. To incorporate author information into topics, Rosen-Zvi et al.[16] and Steyvers et al.[14] proposed the author-topic model with different theoretical backgrounds. Li et al.[15] identified the relations between authors and topics by using the author-topic model. They analyzed the topic distribution to examine how many authors are associated with a certain topic, and they used the number of authors to identify topics studied by many researchers. Tang et al.[17] proposed the ACT model, which models paper, author, and conference simultaneously. Additionally, they developed the ArnetMiner system for mining academic research social networks. Tang et al.[26] also extended ArnetMiner to support topic-level expertise search over heterogeneous networks using the ACT model. It identifies the most-discussed topics, authors’ interests, and experts in a specific field, and it supports paper search and academic recommendation. Kim et al.[27] adopted the ACT model for citation analysis. They collected their dataset in the field of oncology from PubMed Central, which provides full-text articles in the biomedical field. They applied the ACT model to citation sentences and journals instead of abstracts and conferences.
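To make the idea of modeling author, venue, and topic together more concrete, the following minimal sketch simulates one commonly described generative story for an ACT-style model: for each token in a paper, pick one of the paper's authors, draw a topic from that author's topic distribution, then draw both a term and a venue from that topic. All parameters here are invented toy values, the names `theta`, `phi`, and `psi` are illustrative, and the sketch covers generation only; it is not the inference procedure and may differ from the exact variant used in the cited work.

```python
import random


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_paper(author_ids, n_tokens, theta, phi, psi, rng):
    """Simulate tokens as (author, topic, term, venue) tuples for one paper."""
    tokens = []
    for _ in range(n_tokens):
        x = rng.choice(author_ids)        # author responsible for this token
        z = sample_index(theta[x], rng)   # topic drawn from that author's distribution
        w = sample_index(phi[z], rng)     # term (e.g., keyphrase) drawn from the topic
        v = sample_index(psi[z], rng)     # venue (conference or journal) drawn from the topic
        tokens.append((x, z, w, v))
    return tokens


# Toy parameters (assumptions for illustration only): 2 authors, 2 topics, 3 terms, 2 venues.
theta = {0: [0.9, 0.1], 1: [0.2, 0.8]}      # author -> topic distribution
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]    # topic -> term distribution
psi = [[0.8, 0.2], [0.3, 0.7]]              # topic -> venue distribution

rng = random.Random(0)
tokens = generate_paper([0, 1], 5, theta, phi, psi, rng)
print(tokens)
```

Fitting such a model means inferring `theta`, `phi`, and `psi` from observed papers, which is what lets topics be ranked jointly by term, author, and venue.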

In summary, most previous studies identified knowledge structures by combining bibliometric analysis with text-mining techniques such as the LDA model, which adds content analysis to bibliometrics. However, the main limitation of the LDA model, the representative method for trend analysis, is that it explains topical trends through a single variable: the topical terms derived from the bag-of-words representation of documents. This is not sufficient for a comprehensive analysis of a knowledge domain. Therefore, in this paper, we apply the ACT model to the bioinformatics field for integrated analysis. By applying the ACT model, we aim to explore the importance of authors and journals in relation to topics. We divided the collected dataset into four periods to trace the changes in topic, author, and journal rankings over time, combining the results with bibliometric analysis.
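The division into periods can be sketched as below. The text states only that 1996 to 2015 is split into four periods; the equal five-year buckets used here are an assumption, and `period_of` and the sample records are hypothetical.

```python
from collections import defaultdict


def period_of(year, start=1996, end=2015, n_periods=4):
    """Return a 0-based period index, or None if the year is out of range.

    Assumes equal-width periods, which the text does not confirm.
    """
    if not start <= year <= end:
        return None
    width = (end - start + 1) // n_periods   # 5 years per period for 1996-2015
    return min((year - start) // width, n_periods - 1)


# Invented (PMID, year) pairs for illustration.
papers = [("p1", 1996), ("p2", 2000), ("p3", 2001), ("p4", 2010), ("p5", 2015)]

by_period = defaultdict(list)
for pmid, year in papers:
    by_period[period_of(year)].append(pmid)

print(dict(by_period))
```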


Methods

In this section, we describe the data collection, preprocessing, and keyphrase extraction steps that produce the input to the ACT model. Figure 1 illustrates the overall workflow of our approach; detailed descriptions of each component are provided in the following subsections.

Figure 1. Research workflow. Our approach consists of data collection, preprocessing, keyphrase extraction, ACT model application, and topic analysis.

Data collection

For analysis, we collected the 48 journals in the bioinformatics field as used by Song and Kim.[11] Forty-six out of the 48 journals were found via the advanced search tool provided by PubMed. Two journals, Advanced Bioinformatics and Genome Integration, were not retrieved from PubMed. We downloaded the 46 PubMed-listed journals in XML format (Table 1). The total number of papers indexed in these journals was 241,569; Biochemistry had the greatest number of papers with 62,270, accounting for 25.78% of the collected publications.

Table 1. Statistics of collected publications

Ranking | Journal name | Number of papers | Ratio (%)
1 | Biochemistry | 62,270 | 25.78
2 | Journal of Molecular Biology | 29,968 | 12.41
3 | The EMBO Journal | 17,296 | 7.16
4 | Journal of Theoretical Biology | 12,200 | 5.05
5 | Bioinformatics | 9,847 | 4.08
6 | Human Molecular Genetics | 9,347 | 3.87
7 | Genomics | 8,316 | 3.44
8 | BMC Genomics | 7,741 | 3.20
9 | BMC Bioinformatics | 6,780 | 2.81
10 | Protein Science: A publication of the Protein Society | 6,047 | 2.50
11 | Journal of Proteome Research | 5,575 | 2.31
12 | Proteomics | 5,545 | 2.30
13 | Journal of Biotechnology | 5,204 | 2.15
14 | PLOS Genetics | 5,139 | 2.13
15 | PLOS Computational Biology | 3,852 | 1.59
16 | BMC Research Notes | 3,743 | 1.55
17 | Mammalian Genome | 3,499 | 1.45
18 | Genome Biology | 3,411 | 1.41
19 | PLOS Biology | 3,280 | 1.36
20 | Trends in Biochemical Sciences | 3,171 | 1.31
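
The Ratio (%) column is each journal's paper count divided by the 241,569 collected papers. A short check of the first three rows, using only figures from the table and the text above (the helper name `ratio_percent` is ours):

```python
TOTAL_PAPERS = 241_569  # total number of collected papers stated in the text

# Paper counts for the first three rows of Table 1.
counts = {
    "Biochemistry": 62_270,
    "Journal of Molecular Biology": 29_968,
    "The EMBO Journal": 17_296,
}


def ratio_percent(n_papers, total=TOTAL_PAPERS):
    """Share of the collection, as a percentage rounded to two decimals."""
    return round(100 * n_papers / total, 2)


ratios = {journal: ratio_percent(n) for journal, n in counts.items()}
for journal, ratio in ratios.items():
    print(f"{journal}: {ratio:.2f}%")
```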


References

  1. Dong, D.; Chen, M.-L. (2015). "Publication trends and co-citation mapping of translation studies between 2000 and 2015". Scientometrics 105 (2): 1111–1128. doi:10.1007/s11192-015-1769-1.
  2. Chen, H.; Wan, Y.; Jiang, S.; Cheng, Y. (2014). "Alzheimer’s disease research in the future: Bibliometric analysis of cholinesterase inhibitors from 1993 to 2012". Scientometrics 98 (3): 1865–1877. doi:10.1007/s11192-013-1132-3.
  3. Soteriades, E.S.; Falagas, M.E. (2006). "A bibliometric analysis in the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health". BMC Public Health 6: 301. doi:10.1186/1471-2458-6-301. PMC1766930. PMID 17173665.
  4. Ugolini, D.; Puntoni, R.; Perera, F.P. et al. (2007). "A bibliometric analysis of scientific production in cancer molecular epidemiology". Carcinogenesis 28 (8). doi:10.1093/carcin/bgm129. PMID 17548902.
  5. Wang, L.; Chen, X.; Bao, A. et al. (2015). "A bibliometric analysis of research on Central Asia during 1990–2014". Scientometrics 105 (2): 1223–1237. doi:10.1007/s11192-015-1727-y.
  6. Bornmann, L.; Mutz, R. (2015). "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references". Journal of the Association for Information Science and Technology 66 (11): 2215–2222. doi:10.1002/asi.23329.
  7. Geaney, F.; Scutaru, C.; Kelly, C. et al. (2015). "Type 2 Diabetes Research Yield, 1951–2012: Bibliometrics Analysis and Density-Equalizing Mapping". PLoS One 10 (7): e0133009. doi:10.1371/journal.pone.0133009. PMC4514795. PMID 26208117.
  8. Macías-Chapula, C.A.; Mijangos-Nolasco, A. (2002). "Bibliometric analysis of AIDS literature in Central Africa". Scientometrics 54 (2): 309–317. doi:10.1023/A:1016074230843.
  9. Seglen, P.O.; Aksnes, D.W. (2000). "Scientific Productivity and Group Size: A Bibliometric Analysis of Norwegian Microbiological Research". Scientometrics 49 (1): 125–143. doi:10.1023/A:1005665309719.
  10. Jeong, D.-H.; Song, M. (2014). "Time gap analysis by the topic model-based temporal technique". Journal of Informetrics 8 (3): 776–790. doi:10.1016/j.joi.2014.07.005.
  11. Song, M.; Kim, S.Y. (2013). "Detecting the knowledge structure of bioinformatics by mining full-text collections". Scientometrics 96 (1): 183–201. doi:10.1007/s11192-012-0900-9.
  12. Song, M.; Kim, S.Y.; Zhang, G. et al. (2014). "Productivity and influence in bioinformatics: A bibliometric analysis using PubMed Central". Journal of the Association for Information Science and Technology 65 (2): 352–371. doi:10.1002/asi.22970.
  13. Yan, E. (2015). "Research dynamics, impact, and dissemination: A topic-level analysis". Journal of the Association for Information Science and Technology 66 (11): 2357–2372. doi:10.1002/asi.23324.
  14. Steyvers, M.; Smyth, P.; Rosen-Zvi, M.; Griffiths, T. (2004). "Probabilistic author-topic models for information discovery". Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004: 306–315. doi:10.1145/1014052.1014087.
  15. Li, D.; Okamoto, J.; Liu, H.; Leischow, S. (2015). "A bibliometric analysis on tobacco regulation investigators". BioData Mining 8: 11. doi:10.1186/s13040-015-0043-7. PMC4432889. PMID 25984237.
  16. Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; Smyth, P. (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence 2004: 487–494. ISBN 0974903906.
  17. Tang, J.; Zhang, J.; Yao, L. et al. (2008). "ArnetMiner: Extraction and mining of academic social networks". Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2008: 990–998. doi:10.1145/1401890.1402008.
  18. Ramos, J.M.; Padilla, S.; Masiá, M.; Gutiérrez, F. (2008). "A bibliometric analysis of tuberculosis research indexed in PubMed, 1997–2006". International Journal of Tuberculosis and Lung Disease 2008: 1461–8. PMID 19017458.
  19. Claude, R.; Charles-Daniel, A.; Jean, A.; Jean-Francois, G. (2004). "Bibliometric overview of the utilization of artificial neural networks in medicine and biology". Scientometrics 59 (1): 117–130. doi:10.1023/B:SCIE.0000013302.59845.34.
  20. Patra, S.K.; Mishra, S. (2006). "Bibliometric study of bioinformatics literature". Scientometrics 67 (3): 477–489. doi:10.1556/Scient.67.2006.3.9.
  21. Song, M.; Heo, G.E.; Kim, S.Y. (2014). "Analyzing topic evolution in bioinformatics: Investigation of dynamics of the field with conference data in DBLP". Scientometrics 101 (1): 397–428. doi:10.1007/s11192-014-1246-2.
  22. Lee, D.; Kim, W.C.; Charidimou, A.; Song, M. (2015). "A Bird's-Eye View of Alzheimer's Disease Research: Reflecting Different Perspectives of Indexers, Authors, or Citers in Mapping the Field". Journal of Alzheimer's Disease 45 (4): 1207–22. doi:10.3233/JAD-142688. PMID 25697702.
  23. Ding, Y.; Song, M.; Han, J. et al. (2013). "Entitymetrics: Measuring the impact of entities". PLoS One 8 (8): e71416. doi:10.1371/journal.pone.0071416. PMC3756961. PMID 24009660.
  24. Hofmann, T. (1999). "Probabilistic latent semantic indexing". Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 1999: 50–57. doi:10.1145/312624.312649.
  25. Blei, D.M.; Ng, A.Y.; Jordan, M.I. (2003). "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (1): 993–1022.
  26. Tang, J.; Zhang, J.; Jin, R. et al. (2011). "Topic level expertise search over heterogeneous networks". Machine Learning 82 (2): 211–237. doi:10.1007/s10994-010-5212-9.
  27. Kim, H.J.; An, J.; Jeong, Y.K.; Song, M. (2016). "Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling". Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016: 42–50.


This presentation is faithful to the original, with only a few minor changes to presentation. In some cases, important information was missing from the references, and that information was added. Grammar and word usage were updated to make the text easier to read.