A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model

Hu, Kai; Wu, Huayi; Qi, Kunlun; Yu, Jingmin; Yang, Siluo; Yu, Tianxing; Zheng, Jie; Liu, Bo

Kai Hu, Huayi Wu, Kunlun Qi, Jingmin Yu, Siluo Yang, Tianxing Yu, Jie Zheng and Bo Liu
Additional contact information
Kai Hu: Wuhan University
Huayi Wu: Wuhan University
Kunlun Qi: China University of Geosciences (Wuhan)
Jingmin Yu: Changjiang Spatial Information Technology Engineering CO., LTD
Siluo Yang: Wuhan University
Tianxing Yu: Wuhan University
Jie Zheng: Wuhan University
Bo Liu: East China Institute of Technology

Scientometrics, 2018, vol. 114, issue 3, No 14, 1068 pages

Abstract: In bibliometric research, keyword analysis of publications provides an effective way not only to investigate the knowledge structure of research domains, but also to explore the developing trends within domains. To identify the most representative keywords, many approaches have been proposed. Most of them focus on using statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for the domain analysis. In this paper, we argue that the domain knowledge is reflected by the semantic meanings behind keywords rather than the keywords themselves. We apply the Google Word2Vec model, a model of a word distribution using deep learning, to represent the semantic meanings of the keywords. Based on this work, we propose a new domain knowledge approach, the Semantic Frequency-Semantic Active Index, similar to Term Frequency-Inverse Document Frequency, to link domain and background information and identify infrequent but important keywords. We adopt a semantic similarity measuring process before statistical computation to compute the frequencies of “semantic units” rather than keyword frequencies. Semantic units are generated by word vector clustering, while the Inverse Document Frequency is extended to include the semantic inverse document frequency; thus only words in the inverse documents with a certain similarity will be counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare and discuss the advantages and disadvantages of the proposed method in relation to existing methods, finding that by introducing the semantic meaning of the keywords, our method supports more effective domain knowledge analysis.
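The idea of counting "semantic units" instead of raw keywords can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors stand in for real Word2Vec embeddings, the greedy threshold grouping stands in for whatever clustering the paper uses, and the names `TOY_VECTORS`, `semantic_units`, `semantic_frequency`, `semantic_document_frequency` and the 0.95 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-D toy embeddings standing in for real Word2Vec vectors.
TOY_VECTORS = {
    "flood": (1.0, 0.1),
    "flooding": (0.95, 0.15),
    "inundation": (0.9, 0.2),
    "earthquake": (0.1, 1.0),
    "seismic": (0.15, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def semantic_units(words, vectors, threshold=0.95):
    """Greedy grouping (an assumption, not the paper's clustering):
    a word joins the first unit whose seed word is similar enough,
    otherwise it starts a new unit with itself as the seed."""
    units = []
    for word in words:
        for unit in units:
            if cosine(vectors[word], vectors[unit[0]]) >= threshold:
                unit.append(word)
                break
        else:
            units.append([word])
    return units


def semantic_frequency(keywords, units):
    """Count keyword occurrences per semantic unit, keyed by the seed word."""
    counts = Counter(keywords)
    return {unit[0]: sum(counts[w] for w in unit) for unit in units}


def semantic_document_frequency(unit, documents):
    """Number of background documents containing any member of the unit."""
    members = set(unit)
    return sum(1 for doc in documents if members & set(doc))


units = semantic_units(list(TOY_VECTORS), TOY_VECTORS)
domain_keywords = ["flood", "flooding", "inundation", "seismic"]
background_docs = [["flood"], ["earthquake", "seismic"], ["inundation"], ["seismic"]]

print(units)
print(semantic_frequency(domain_keywords, units))
print([semantic_document_frequency(u, background_docs) for u in units])
```

In this toy data each of "flood", "flooding" and "inundation" occurs only once among the domain keywords, but together they form a single unit with frequency 3, which is the kind of infrequent-but-related keyword grouping the abstract describes.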

Keywords: Keyword extraction; Word2Vec; Semantic clustering; Semantic similarity; Frequency; Domain knowledge
Date: 2018

Downloads: (external link)
http://link.springer.com/10.1007/s11192-017-2574-9 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:114:y:2018:i:3:d:10.1007_s11192-017-2574-9

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192

DOI: 10.1007/s11192-017-2574-9

Scientometrics is currently edited by Wolfgang Glänzel

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.