Commit-Level Software Change Intent Classification Using a Pre-Trained Transformer-Based Code Model
Tjaša Heričko (),
Boštjan Šumak and
Sašo Karakatič
Additional contact information
Tjaša Heričko: Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46, 2000 Maribor, Slovenia
Boštjan Šumak: Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46, 2000 Maribor, Slovenia
Sašo Karakatič: Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46, 2000 Maribor, Slovenia
Mathematics, 2024, vol. 12, issue 7, 1-38
Abstract:
Software evolution is driven by changes made during software development and maintenance. While source control systems effectively manage these changes at the commit level, the intent behind them are often inadequately documented, making understanding their rationale challenging. Existing commit intent classification approaches, largely reliant on commit messages, only partially capture the underlying intent, predominantly due to the messages’ inadequate content and neglect of the semantic nuances in code changes. This paper presents a novel method for extracting semantic features from commits based on modifications in the source code, where each commit is represented by one or more fine-grained conjoint code changes, e.g., file-level or hunk-level changes. To address the unstructured nature of code, the method leverages a pre-trained transformer-based code model, further trained through task-adaptive pre-training and fine-tuning on the downstream task of intent classification. This fine-tuned task-adapted pre-trained code model is then utilized to embed fine-grained conjoint changes in a commit, which are aggregated into a unified commit-level vector representation. The proposed method was evaluated using two BERT-based code models, i.e., CodeBERT and GraphCodeBERT, and various aggregation techniques on data from open-source Java software projects. The results show that the proposed method can be used to effectively extract commit embeddings as features for commit intent classification and outperform current state-of-the-art methods of code commit representation for intent categorization in terms of software maintenance activities undertaken by commits.
Keywords: software maintenance; code commit; mining software repositories; adaptive pre-training; fine-tuning; semantic code embedding; CodeBERT; GraphCodeBERT; classification; code intelligence (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
https://www.mdpi.com/2227-7390/12/7/1012/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/7/1012/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:7:p:1012-:d:1365754
Access Statistics for this article
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().