Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering

Davagdorj, Khishigsuren; Wang, Ling; Li, Meijing; Van-Huy, Pham; Ryu, Keun Ho; Theera-Umpon, Nipon

Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering

Khishigsuren Davagdorj, Ling Wang, Meijing Li, Pham Van-Huy, Keun Ho Ryu and Nipon Theera-Umpon
Additional contact information
Khishigsuren Davagdorj: School of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
Ling Wang: School of Computer Science, Northeast Electric Power University, Jilin 132013, China
Meijing Li: College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
Pham Van-Huy: Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
Keun Ho Ryu: Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
Nipon Theera-Umpon: Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand

IJERPH, 2022, vol. 19, issue 10, 1-21

Abstract: The increasing expansion of biomedical documents has increased the number of natural language textual resources related to the current applications. Meanwhile, there has been a great interest in extracting useful information from meaningful coherent groupings of textual content documents in the last decade. However, it is challenging to discover informative representations and define relevant articles from the rapidly growing biomedical literature due to the unsupervised nature of document clustering. Moreover, empirical investigations demonstrated that traditional text clustering methods produce unsatisfactory results in terms of non-contextualized vector space representations because that neglect the semantic relationship between biomedical texts. Recently, pre-trained language models have emerged as successful in a wide range of natural language processing applications. In this paper, we propose the Gaussian Mixture Model-based efficient clustering framework that incorporates substantially pre-trained (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) BioBERT domain-specific language representations to enhance the clustering accuracy. Our proposed framework consists of main three phases. First, classic text pre-processing techniques are used biomedical document data, which crawled from the PubMed repository. Second, representative vectors are extracted from a pre-trained BioBERT language model for biomedical text mining. Third, we employ the Gaussian Mixture Model as a clustering algorithm, which allows us to assign labels for each biomedical document. In order to prove the efficiency of our proposed model, we conducted a comprehensive experimental analysis utilizing several clustering algorithms while combining diverse embedding techniques. Consequently, the experimental results show that the proposed model outperforms the benchmark models by reaching performance measures of Fowlkes mallows score, silhouette coefficient, adjusted rand index, Davies-Bouldin score of 0.7817, 0.3765, 0.4478, 1.6849, respectively. We expect the outcomes of this study will assist domain specialists in comprehending thematically cohesive documents in the healthcare field.

Keywords: natural language processing; pre-trained language representation model; document clustering (search for similar items in EconPapers)
JEL-codes: I I1 I3 Q Q5 (search for similar items in EconPapers)
Date: 2022
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://www.mdpi.com/1660-4601/19/10/5893/pdf (application/pdf)
https://www.mdpi.com/1660-4601/19/10/5893/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:19:y:2022:i:10:p:5893-:d:814188

Access Statistics for this article

IJERPH is currently edited by Ms. Jenna Liu

More articles in IJERPH from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().