SCC-GPT: Source Code Classification Based on Generative Pre-Trained Transformers

Alahmadi, Mohammad D.; Alshangiti, Moayad; Alsubhi, Jumana

SCC-GPT: Source Code Classification Based on Generative Pre-Trained Transformers

Mohammad D. Alahmadi (), Moayad Alshangiti and Jumana Alsubhi
Additional contact information
Mohammad D. Alahmadi: Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia
Moayad Alshangiti: Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia
Jumana Alsubhi: School of Computing, University of Georgia, Athens, GA 30602, USA

Mathematics, 2024, vol. 12, issue 13, 1-12

Abstract: Developers often rely on online resources, such as Stack Overflow (SO), to seek assistance for programming tasks. To facilitate effective search and resource discovery, manual tagging of questions and posts with the appropriate programming language is essential. However, accurate tagging is not consistently achieved, leading to the need for the automated classification of code snippets into the correct programming language as a tag. In this study, we introduce a novel approach to automated classification of code snippets from Stack Overflow (SO) posts into programming languages using generative pre-trained transformers (GPT). Our method, which does not require additional training on labeled data or dependency on pre-existing labels, classifies 224,107 code snippets into 19 programming languages. We employ the text-davinci-003 model of ChatGPT-3.5 and postprocess its responses to accurately identify the programming language. Our empirical evaluation demonstrates that our GPT-based model (SCC-GPT) significantly outperforms existing methods, achieving a median F1-score improvement that ranges from +6% to +31%. These findings underscore the effectiveness of SCC-GPT in enhancing code snippet classification, offering a cost-effective and efficient solution for developers who rely on SO for programming assistance.

Keywords: SCC (Source Code Classification); NLP (Natural Language Processing); Large Lagnuage Model (LLM) (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/13/2128/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/13/2128/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:13:p:2128-:d:1430346

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().