Document analysis via combined vectorization and machine learning approaches
Dinara Kaibassova,
Bigul Mukhametzhanova,
Dinara Tokseit,
Aigul Kubegenova and
Murad Kozhanov
International Journal of Innovative Research and Scientific Studies, 2025, vol. 8, issue 4, 2195-2204
Abstract:
The purpose of this study is to develop an effective hybrid model for automatic document classification by combining statistical and semantic text vectorization techniques with machine learning algorithms. The methodology integrates Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec embeddings with classifiers such as Support Vector Machine (SVM) and Random Forest. The proposed approach includes data preprocessing (tokenization, normalization, stop word removal, and lemmatization), feature extraction, model training, and evaluation using classification metrics such as accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. Experimental results demonstrate that the Word2Vec + SVM model outperforms other configurations, achieving 90.2% accuracy and an F1-score of 82.52%, thus highlighting the advantage of incorporating semantic context into vector representation. The study concludes that hybrid methods combining TF-IDF and Word2Vec with robust classifiers improve both the precision and generalizability of document analysis models. Practical implications include potential applications in sentiment analysis, topic modeling, text classification for legal and healthcare domains, and multilingual contexts. This research provides a foundation for developing high-performance text analysis systems applicable to various real-world natural language processing tasks.
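The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary (0/1) labelling for simplicity; the abstract does not state the number of classes, the function name `binary_metrics` is hypothetical, and the toy labels are made up for illustration rather than taken from the study.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1, MCC and Cohen's kappa for 0/1 labels.

    Assumption: binary classification with 1 as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four confusion-matrix cells; defined as 0 when a margin is empty.
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    # Cohen's kappa corrects observed agreement for agreement expected by chance.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


# Toy labels, invented for illustration only.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(binary_metrics(toy_true, toy_pred))
```

Accuracy and F1 can look favourable on imbalanced data, which is one reason chance-corrected measures such as MCC and Cohen's kappa are often reported alongside them.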
Keywords: Automatic document analysis; Contextual word embedding; Machine learning; Natural language processing; Semantic matching.
Date: 2025
Downloads: (external link)
https://ijirss.com/index.php/ijirss/article/view/8356/1874 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:aac:ijirss:v:8:y:2025:i:4:p:2195-2204:id:8356
International Journal of Innovative Research and Scientific Studies is currently edited by Natalie Jean
More articles in International Journal of Innovative Research and Scientific Studies from Innovative Research Publishing
Bibliographic data for series maintained by Natalie Jean.