Investigating the impact of preprocessing on document embedding: an empirical comparison
Nourelhouda Yahi, Hacene Belhadef and Mathieu Roche
International Journal of Data Mining, Modelling and Management, 2021, vol. 13, issue 4, 351-363
Abstract:
Digital representation of text documents is a crucial task in machine learning and natural language processing (NLP). It aims to transform unstructured text documents into mathematically computable elements. In recent years, several methods have been proposed and implemented to encode text documents into fixed-length feature vectors. This operation is known as document embedding, and it has become an active and open area of research. Paragraph vector (Doc2vec) is one of the most widely used document embedding methods and has earned a good reputation for its results. To overcome its limitations, Doc2vec was extended by the document through corruption (Doc2vecC) technique. To gain deeper insight into these two methods, this work studies the impact of morphosyntactic text preprocessing on them. We carried out this analysis by applying the most commonly used text preprocessing techniques, such as cleaning, stemming and lemmatisation, and their different combinations. The experimental analysis on the Microsoft Research Paraphrase (MSRP) dataset reveals that the preprocessing techniques improve classifier accuracy and that stemming outperforms the other techniques.
Keywords: natural language preprocessing; document embedding; paragraph vector; document through corruption; text preprocessing; semantic similarity.
Date: 2021
Downloads:
http://www.inderscience.com/link.php?id=119631 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:351-363
More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.