A Bayesian Semisupervised Approach to Keyword Extraction with Only Positive and Unlabeled Data

Wang, Guanshen; Cheng, Yichen; Xia, Yusen; Ling, Qiang; Wang, Xinlei

Guanshen Wang, Yichen Cheng, Yusen Xia, Qiang Ling and Xinlei Wang
Additional contact information
Guanshen Wang: Department of Statistical Science, Southern Methodist University, Dallas, Texas 75205
Yichen Cheng: Institute for Insight, Georgia State University, Atlanta, Georgia 30303
Yusen Xia: Institute for Insight, Georgia State University, Atlanta, Georgia 30303
Qiang Ling: Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China
Xinlei Wang: Department of Statistical Science, Southern Methodist University, Dallas, Texas 75205; Department of Mathematics, University of Texas at Arlington, Arlington, Texas 76019; Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, Texas 76019

INFORMS Journal on Computing, 2023, vol. 35, issue 3, 675-691

Abstract: In the era of big data, people benefit from tremendous amounts of available information. That abundance also poses challenges, one of which is how to extract useful yet succinct information in an automated fashion. Keyword extraction methods address this challenge by summarizing an article with a list of keywords. Many existing keyword extraction methods focus on the unsupervised setting, with all keywords assumed unknown. In reality, a (small) subset of the keywords may be available for a particular article. To use such information, we propose a rigorous probabilistic model based on a semisupervised setup. Our method incorporates the graph-based information of an article into a Bayesian framework via an informative prior so that our model facilitates formal statistical inference, which is often absent from existing methods. To overcome the difficulty arising from high-dimensional posterior sampling, we develop two Markov chain Monte Carlo algorithms based on Gibbs samplers and compare their performance using benchmark data. We use a false discovery rate (FDR)-based approach for selecting the number of keywords, whereas existing methods use ad hoc threshold values. Our numerical results show that the proposed method compares favorably with state-of-the-art methods for keyword extraction.
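
As an illustration of the FDR-based selection mentioned in the abstract, the sketch below applies a generic Bayesian FDR rule to posterior keyword probabilities. It is not taken from the article and may differ from the authors' procedure. The function name `select_by_bayesian_fdr`, the input `post_probs` (one posterior probability per candidate word, for example averaged over Gibbs sampler draws) and the level `alpha` are all assumptions made for this example.

```python
from typing import List


def select_by_bayesian_fdr(post_probs: List[float], alpha: float = 0.1) -> List[int]:
    """Return indices of candidates kept at an estimated FDR of at most alpha.

    Hypothetical helper: post_probs[i] is assumed to be the posterior
    probability that candidate word i is a keyword.
    """
    # Rank candidates from most to least probable keyword.
    order = sorted(range(len(post_probs)), key=lambda i: post_probs[i], reverse=True)

    selected: List[int] = []
    expected_false = 0.0  # running sum of (1 - p) over the candidates kept so far
    for rank, idx in enumerate(order, start=1):
        expected_false += 1.0 - post_probs[idx]
        # Estimated FDR of the top-`rank` list is the mean of (1 - p) within it.
        # That mean never decreases as rank grows, so we can stop at the first
        # rank that exceeds alpha.
        if expected_false / rank > alpha:
            break
        selected = order[:rank]
    return selected


if __name__ == "__main__":
    probs = [0.99, 0.2, 0.95, 0.6, 0.9]
    print(select_by_bayesian_fdr(probs, alpha=0.1))
```

With this rule the number of keywords is determined by the chosen error level rather than by a fixed cutoff on the scores, which is the contrast the abstract draws with ad hoc thresholds.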

Keywords: Gibbs sampler; graph-based prior; high-dimensional posterior sampling; semi-supervised learning; TextRank
Date: 2023

Downloads: (external link)
http://dx.doi.org/10.1287/ijoc.2023.1283 (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:inm:orijoc:v:35:y:2023:i:3:p:675-691
