Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Skondras, Panagiotis; Zervas, Panagiotis; Tzimas, Giannis

Panagiotis Skondras, Panagiotis Zervas and Giannis Tzimas
Additional contact information
Panagiotis Skondras: Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece
Panagiotis Zervas: Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece
Giannis Tzimas: Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece

Future Internet, 2023, vol. 15, issue 11, 1-12

Abstract: In this article, we investigate the potential of synthetic resumes as a means of rapidly generating training data and their effectiveness for data augmentation, especially in categories with sparse samples. The widespread adoption of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is equally crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To address it, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass resume classification task. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy, while the FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data are limited. The suitability of the BERT model for such classification tasks further supports this conclusion.
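
The 60/40 composition of the augmented training set can be illustrated with a minimal Python sketch. The abstract does not describe how the mix was assembled, so everything here is an assumption made for illustration: the function name `mix_datasets`, the `real_percent` parameter, the (text, label) tuple layout, and the choice to sample without replacement from each pool.

```python
import random


def mix_datasets(real, synthetic, real_percent=60, seed=0):
    """Build a shuffled training set with a fixed share of real samples.

    `real` and `synthetic` are lists of (text, label) tuples. This is a
    hypothetical helper, not the procedure used in the article.
    Returns a list of (text, label, source) tuples.
    """
    if not 0 < real_percent < 100:
        raise ValueError("real_percent must be between 1 and 99")

    # Largest total size that both pools can support at the requested ratio.
    total = min(
        len(real) * 100 // real_percent,
        len(synthetic) * 100 // (100 - real_percent),
    )
    n_real = total * real_percent // 100
    n_synthetic = total - n_real

    rng = random.Random(seed)
    # Sample without replacement from each pool, tagging the origin.
    chosen = [(t, y, "real") for t, y in rng.sample(real, n_real)]
    chosen += [(t, y, "synthetic") for t, y in rng.sample(synthetic, n_synthetic)]
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    # Placeholder records standing in for scraped and generated resumes.
    real_pool = [(f"real resume {i}", "nurse") for i in range(90)]
    synthetic_pool = [(f"synthetic resume {i}", "nurse") for i in range(40)]
    mixed = mix_datasets(real_pool, synthetic_pool)
    share = sum(1 for _, _, src in mixed if src == "real") / len(mixed)
    print(len(mixed), share)
```

Integer percentages are used instead of a floating-point fraction so that the sample counts are exact. In practice the mix would likely be applied per category, since the abstract emphasizes sparsely populated classes.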

Keywords: metadata extraction; resumes; CV; big data; multiclass classification; ChatGPT; large language models; deep learning; embeddings; labor market analysis
JEL-codes: O3
Date: 2023

Downloads:
https://www.mdpi.com/1999-5903/15/11/363/pdf (application/pdf)
https://www.mdpi.com/1999-5903/15/11/363/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:15:y:2023:i:11:p:363-:d:1277030

Future Internet is currently edited by Ms. Grace You
