
Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

Sergio Bolívar, Alicia Nieto-Reyes and Heather L. Rogers
Additional contact information
Sergio Bolívar: Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain
Alicia Nieto-Reyes: Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain
Heather L. Rogers: Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain

Mathematics, 2023, vol. 11, issue 1, 1-20

Abstract: This manuscript introduces a new statistical depth function: the compositional D-depth. It is the first data depth developed specifically for text data, in particular for data vectorized according to a frequency-based criterion such as the tf-idf (term frequency–inverse document frequency) statistic, which leaves most vector entries equal to zero. The proposed depth is computed by taking the inverse discrete Fourier transform of the vectorized text fragments and then applying a statistical depth for functional data, D. It is intended to address the sparsity of the numerical features that results from transforming qualitative text data into quantitative data, a common step in most natural language processing frameworks. This sparsity hinders the use of traditional statistical depths and machine learning techniques for classification. To demonstrate the potential value of the proposal, it is applied to a real-world case study that maps Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. When combined with the newly defined compositional D-depth, the DD^G-classifier yields competitive results and outperforms all the traditional machine learning techniques studied (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines).
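
The pipeline described in the abstract (frequency-based vectorization, inverse discrete Fourier transform, functional depth, depth-based classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the tf-idf weighting is a simple tf * log(N / df); only the real part of the inverse DFT is kept; a Fraiman-Muniz-style integrated depth stands in for the functional depth D; and a maximum-depth rule stands in for the DD^G-classifier, which instead fits a classifier on the depth-versus-depth coordinates. The function names and the toy sentences are hypothetical.

```python
import cmath
import math
from collections import Counter


def tfidf(docs):
    """Tiny tf-idf vectorizer; the weighting tf * log(N / df) is an assumption."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    df = Counter(w for t in tokens for w in set(t))  # document frequency per word
    rows = []
    for t in tokens:
        tf = Counter(t)
        rows.append([tf[w] / len(t) * math.log(len(docs) / df[w]) for w in vocab])
    return rows, vocab


def inverse_dft_curve(v):
    """Real part (an assumption) of the inverse DFT of v, read as a curve on n grid points."""
    n = len(v)
    return [
        sum(v[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def integrated_depth(x, sample):
    """Fraiman-Muniz-style integrated depth of curve x within a sample of curves."""
    total = 0.0
    for t, xt in enumerate(x):
        # Empirical distribution function of the sample at grid point t, evaluated at x(t).
        F = sum(s[t] <= xt for s in sample) / len(sample)
        total += 1.0 - abs(0.5 - F)  # pointwise univariate depth in [0.5, 1]
    return total / len(x)


def max_depth_label(x, curves_by_class):
    """Assign x to the class in which it is deepest (stand-in for the DD^G-classifier)."""
    depths = {c: integrated_depth(x, S) for c, S in curves_by_class.items()}
    return max(depths, key=depths.get), depths


# Hypothetical toy data, not taken from the paper.
docs = {
    "barrier": ["staff lack time", "lack of funding and time"],
    "facilitator": ["leadership support helps", "strong team support"],
}
new_doc = "no time and no funding"

# For brevity the new fragment is vectorized together with the labelled ones.
all_docs = [d for group in docs.values() for d in group] + [new_doc]
X, vocab = tfidf(all_docs)
curves = [inverse_dft_curve(row) for row in X]
curves_by_class = {"barrier": curves[0:2], "facilitator": curves[2:4]}

label, depths = max_depth_label(curves[4], curves_by_class)
print(label, depths)
```

The sparse tf-idf rows become dense curves after the inverse transform, because each nonzero entry contributes a full sinusoid. That is what makes a functional depth usable where a multivariate depth on the raw sparse vectors would struggle.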

Keywords: compositional depth; multivariate data; natural language processing; qualitative data; statistical depth; supervised classification; text mining
JEL-codes: C
Date: 2023
Citations in EconPapers: 1

Downloads: (external link)
https://www.mdpi.com/2227-7390/11/1/228/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/1/228/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:1:p:228-:d:1022691

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:11:y:2023:i:1:p:228-:d:1022691