GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Fachada, Nuno; Fernandes, Daniel; Fernandes, Carlos M.; Ferreira-Saraiva, Bruno D.; Matos-Carvalho, João P.

Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, Bruno D. Ferreira-Saraiva and João P. Matos-Carvalho
Additional contact information
Nuno Fachada: Copelabs, Lusófona University, Campo Grande, 376, 1749-024 Lisboa, Portugal
Daniel Fernandes: Copelabs, Lusófona University, Campo Grande, 376, 1749-024 Lisboa, Portugal
Carlos M. Fernandes: Copelabs, Lusófona University, Campo Grande, 376, 1749-024 Lisboa, Portugal
Bruno D. Ferreira-Saraiva: Copelabs, Lusófona University, Campo Grande, 376, 1749-024 Lisboa, Portugal
João P. Matos-Carvalho: Center of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal

Future Internet, 2025, vol. 17, issue 9, 1-28

Abstract: Large language models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code. GPT-4.1 achieved a 100% success rate across all runs in both experimental tasks, whereas most other models succeeded in fewer than half of the runs, with only Grok-3 and Mistral-Large approaching comparable performance. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
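
To make the quantitative evaluation concrete, the following is a minimal sketch of how a per-model success rate over multiple runs might be computed. It is not the authors' evaluation harness: the helper names (run_candidate, success_rate), the pass/fail check, and the three stand-in scripts are all hypothetical and chosen only for illustration.

```python
from typing import Callable


def run_candidate(source: str, check: Callable[[dict], bool]) -> bool:
    """Execute one generated script; True only if it runs and passes the check.

    Assumption: `check` inspects the script's resulting namespace. A real
    harness would run untrusted code in a sandboxed subprocess, not exec().
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return bool(check(namespace))
    except Exception:
        # Any failure (syntax error, missing name, bad API call) counts as a failed run.
        return False


def success_rate(sources: list[str], check: Callable[[dict], bool]) -> float:
    """Fraction of runs that executed and satisfied the check."""
    if not sources:
        raise ValueError("at least one run is required")
    passed = sum(run_candidate(src, check) for src in sources)
    return passed / len(sources)


# Hypothetical stand-ins for three runs of one model on the same prompt.
runs = [
    "result = sum(range(5))",       # runs and is correct
    "result = sum(range(4))",       # runs but violates the requirement
    "result = undefined_name + 1",  # fails at execution time
]
rate = success_rate(runs, lambda ns: ns.get("result") == 10)
print(f"success rate: {rate:.2f}")
```

Separating "did it execute" from "did it satisfy the requirement" mirrors the abstract's distinction between functional correctness and prompt compliance, although here both are folded into a single boolean for brevity.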

Keywords: large language models; code generation; Python libraries
JEL-codes: O3
Date: 2025

Downloads:
https://www.mdpi.com/1999-5903/17/9/412/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/9/412/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:9:p:412-:d:1745033

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.