A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model
Kai Hu,
Huayi Wu,
Kunlun Qi,
Jingmin Yu,
Siluo Yang,
Tianxing Yu,
Jie Zheng and
Bo Liu
Additional contact information
Kai Hu: Wuhan University
Huayi Wu: Wuhan University
Kunlun Qi: China University of Geosciences (Wuhan)
Jingmin Yu: Changjiang Spatial Information Technology Engineering CO., LTD
Siluo Yang: Wuhan University
Tianxing Yu: Wuhan University
Jie Zheng: Wuhan University
Bo Liu: East China Institute of Technology
Scientometrics, 2018, vol. 114, issue 3, 1031-1068
Abstract: In bibliometric research, keyword analysis of publications provides an effective way not only to investigate the knowledge structure of research domains, but also to explore the developing trends within domains. Many approaches have been proposed to identify the most representative keywords. Most of them use statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for the domain analysis. In this paper, we argue that domain knowledge is reflected by the semantic meanings behind keywords rather than by the keywords themselves. We apply the Google Word2Vec model, which learns distributed word representations with a neural network, to represent the semantic meanings of the keywords. Based on this work, we propose a new domain knowledge approach, the Semantic Frequency-Semantic Active Index, similar to Term Frequency-Inverse Document Frequency, to link domain and background information and identify infrequent but important keywords. We adopt a semantic similarity measuring process before statistical computation, so that we compute the frequencies of “semantic units” rather than keyword frequencies. Semantic units are generated by clustering word vectors, and the Inverse Document Frequency is extended to a semantic inverse document frequency, so that only words in the inverse documents with a certain similarity are counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare and discuss the advantages and disadvantages of the proposed method in relation to existing methods, finding that by introducing the semantic meaning of the keywords, our method supports more effective domain knowledge analysis.
Keywords: Keyword extraction; Word2Vec; Semantic clustering; Semantic similarity; Frequency; Domain knowledge
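The pipeline outlined in the abstract (cluster keyword vectors into semantic units, count unit frequencies in the domain, then discount units that are also common in the background) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's method: the two-dimensional vectors are hand-made stand-ins for Word2Vec embeddings, the greedy clustering, the 0.95 cosine threshold, the add-one smoothed logarithmic weight, and all function names are hypothetical, and the score is a TF-IDF-style stand-in rather than the published Semantic Frequency-Semantic Active Index formula.

```python
import math

# Hypothetical toy embeddings standing in for Word2Vec vectors (assumption).
VECTORS = {
    "flood": (1.0, 0.1),
    "flooding": (0.95, 0.15),
    "inundation": (0.9, 0.2),
    "earthquake": (0.1, 1.0),
    "seismic": (0.15, 0.95),
    "gis": (-0.7, 0.7),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similar(w1, w2, threshold):
    return cosine(VECTORS[w1], VECTORS[w2]) >= threshold

def semantic_units(keywords, threshold=0.95):
    """Greedy one-pass clustering: a keyword joins the first unit whose
    seed (first member) is similar enough, otherwise it starts a new unit."""
    units = []
    for kw in keywords:
        for unit in units:
            if similar(kw, unit[0], threshold):
                unit.append(kw)
                break
        else:
            units.append([kw])
    return units

def semantic_frequency(unit, docs):
    """Number of keyword occurrences in docs that belong to the unit."""
    members = set(unit)
    return sum(1 for doc in docs for kw in doc if kw in members)

def semantic_idf(unit, background_docs, threshold=0.95):
    """IDF-style weight (assumed form, add-one smoothing): a background
    document counts as a hit if any of its keywords is similar to the seed."""
    hits = sum(
        1 for doc in background_docs
        if any(similar(kw, unit[0], threshold) for kw in doc)
    )
    return math.log((1 + len(background_docs)) / (1 + hits))

# Each document is represented by its keyword list (toy data, assumption).
domain_docs = [["flood", "gis"], ["flooding", "gis"], ["inundation", "earthquake"]]
background_docs = [["flood"], ["earthquake"], ["seismic", "flooding"],
                   ["earthquake", "seismic"]]

# Unique domain keywords in order of first appearance.
keywords = list(dict.fromkeys(kw for doc in domain_docs for kw in doc))
units = semantic_units(keywords)

# Rank units by semantic frequency times the semantic IDF-style weight.
ranked = sorted(
    ((unit, semantic_frequency(unit, domain_docs) * semantic_idf(unit, background_docs))
     for unit in units),
    key=lambda pair: pair[1],
    reverse=True,
)
for unit, score in ranked:
    print(unit, round(score, 3))
```

In this toy data the "gis" unit ranks first even though the flood-related unit is more frequent in the domain, because no background document contains a similar keyword. That is the kind of infrequent but domain-specific signal the abstract describes, though the real method may weight it differently.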
Downloads: http://link.springer.com/10.1007/s11192-017-2574-9 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:114:y:2018:i:3:d:10.1007_s11192-017-2574-9
Scientometrics is currently edited by Wolfgang Glänzel
More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla.