Word Embeddings as Statistical Estimators
Neil Dey,
Matthew Singer,
Jonathan P. Williams and
Srijan Sengupta
Additional contact information
Neil Dey: North Carolina State University
Matthew Singer: North Carolina State University
Jonathan P. Williams: North Carolina State University
Srijan Sengupta: North Carolina State University
Sankhya B: The Indian Journal of Statistics, 2024, vol. 86, issue 2, No 4, 415-441
Abstract:
Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical method for estimating the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon that of the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177–2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
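As a rough illustration of the quantities the abstract refers to, the sketch below computes an empirical PMI from windowed co-occurrence counts and then applies a positive-PMI truncation, which is one common reading of a truncation-based approach. It is not the copula model or the missing value-based estimator from the paper; the toy corpus, the window size, and the function names (`cooccurrence_counts`, `empirical_pmi`, `positive_pmi`) are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric (word, context) pairs within a fixed window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def empirical_pmi(counts):
    """PMI(w, c) = log( p(w, c) / (p(w) p(c)) ) for observed pairs only.

    Unobserved pairs have empirical PMI of minus infinity, so they are
    simply absent from the returned dictionary.
    """
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    return {
        (w, c): math.log(n * total / (word_totals[w] * context_totals[c]))
        for (w, c), n in counts.items()
    }


def positive_pmi(pmi):
    """Truncate negative PMI values to zero (assumed truncation scheme)."""
    return {pair: max(value, 0.0) for pair, value in pmi.items()}


# Hypothetical toy corpus, for illustration only.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
]
counts = cooccurrence_counts(sentences, window=2)
pmi = empirical_pmi(counts)
ppmi = positive_pmi(pmi)
```

In a truncation-based pipeline, entries missing from `ppmi` would be treated as zeros before factorizing the resulting matrix; the abstract's missing value-based alternative instead treats them as missing, which this sketch does not attempt to reproduce.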
Keywords: Copula; Word2Vec; distributed representation; statistical linguistics; language modeling; missing values SVD; Primary 68T50; Secondary 62M99
Date: 2024
Downloads: http://link.springer.com/10.1007/s13571-024-00331-1 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:sankhb:v:86:y:2024:i:2:d:10.1007_s13571-024-00331-1
Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/13571
DOI: 10.1007/s13571-024-00331-1
Sankhya B: The Indian Journal of Statistics is currently edited by Dipak Dey