JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs)

Jorge Cisneros-González, Natalia Gordo-Herrera, Iván Barcia-Santos and Javier Sánchez-Soriano
Additional contact information
All authors: Advanced Artificial Intelligence Group (A²IG), Escuela Politécnica Superior, Universidad Francisco de Vitoria, 28223 Pozuelo de Alarcón, Madrid, Spain

Future Internet, 2025, vol. 17, issue 6, 1-21

Abstract: This paper explores the application of large language models (LLMs) to automate the evaluation of programming assignments in an undergraduate “Introduction to Programming” course. The study addresses the challenges of manual grading, including time constraints and potential inconsistencies, by proposing a system that integrates several LLMs to streamline the assessment process. The system provides a graphical interface for processing student submissions, allowing instructors to select an LLM and customize the grading rubric. A comparative analysis, using LLMs from OpenAI, Google, DeepSeek, and Alibaba to evaluate student code submissions, revealed a strong correlation between LLM-generated grades and those assigned by human instructors. Specifically, a reduced model using only statistically significant variables demonstrates high explanatory power, with an adjusted R² of 0.9156 and a mean absolute error of 0.4579, indicating that LLMs can effectively replicate human grading. The findings suggest that LLMs, paired with human oversight, can automate grading and drastically reduce instructor workload: a task estimated to take more than 300 hours of manual work becomes less than 15 minutes of automated processing, improving the efficiency and consistency of assessment in computer science education.
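The abstract reports agreement between LLM-assigned and instructor-assigned grades via an adjusted R² and a mean absolute error. A minimal sketch of how such agreement statistics are computed is shown below; the grade values are purely illustrative, not data from the study.

```python
# Sketch: quantifying agreement between LLM grades and human grades.
# The grade lists are hypothetical examples, not data from the paper.

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between paired grades."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def adjusted_r2(y_true, y_pred, n_predictors):
    """R^2 penalized for the number of predictors in the model."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical grades on a 0-10 scale
human = [7.0, 5.5, 9.0, 6.0, 8.5, 4.0]
llm   = [7.5, 5.0, 8.5, 6.5, 8.0, 4.5]

print(mean_absolute_error(human, llm))          # 0.5
print(adjusted_r2(human, llm, n_predictors=1))
```

An adjusted R² near 1 and a small MAE (as in the paper's reported 0.9156 and 0.4579) indicate that the LLM grades track the human grades closely.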

Keywords: academic assessment; automated assessment; generative artificial intelligence; large language models; automated code assessment; AI-helped feedback; code grading
JEL-codes: O3
Date: 2025

Downloads: (external link)
https://www.mdpi.com/1999-5903/17/6/265/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/6/265/ (text/html)


Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:6:p:265-:d:1681439

