TY - GEN
T1 - Una evaluacion comparativa de ChatGPT, DeepSeek y Gemini en la generacion automatica de pruebas unitarias
T2 - 8th International Congress on Environmental Intelligence, Software Engineering, and Electronic and Mobile Health, AmITIC 2025
AU - Trevino-Villalobos, Marlen
AU - Quesada-Lopez, Christian
AU - Jimenez-Delgado, Efren
AU - Quiros-Oviedo, Rocio
AU - Diaz-Oreiro, Ignacio
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The advancement of large language models (LLMs) has opened up new possibilities for automating unit test generation, a traditionally manual and expensive task. This quantitative study evaluates the performance of three LLMs (ChatGPT 4o mini, DeepSeek v3, and Gemini 2.5 Flash Pro) in generating test cases for C# methods developed in Unity. The execution success rate of the generated tests was measured using real and synthetic data. The synthetic data was intentionally created to represent common structures, while the real data came from existing project functions. The controlled experimental design included two factors (LLM and data type) and two blocks (cyclomatic complexity and contextual memory), with four replicates per combination, for a total of 96 experimental treatments. The results show that LLMs have a high potential to support the automatic generation of unit tests. Furthermore, the choice of model had a significant effect on the success rate of the generated tests.
AB - The advancement of large language models (LLMs) has opened up new possibilities for automating unit test generation, a traditionally manual and expensive task. This quantitative study evaluates the performance of three LLMs (ChatGPT 4o mini, DeepSeek v3, and Gemini 2.5 Flash Pro) in generating test cases for C# methods developed in Unity. The execution success rate of the generated tests was measured using real and synthetic data. The synthetic data was intentionally created to represent common structures, while the real data came from existing project functions. The controlled experimental design included two factors (LLM and data type) and two blocks (cyclomatic complexity and contextual memory), with four replicates per combination, for a total of 96 experimental treatments. The results show that LLMs have a high potential to support the automatic generation of unit tests. Furthermore, the choice of model had a significant effect on the success rate of the generated tests.
KW - automatic testing
KW - LLM
KW - prompt
KW - unit testing
KW - Unity
UR - https://www.scopus.com/pages/publications/105025029539
U2 - 10.1109/AmITIC68284.2025.11214621
DO - 10.1109/AmITIC68284.2025.11214621
M3 - Conference contribution
AN - SCOPUS:105025029539
T3 - 8th Congreso Internacional en Inteligencia Ambiental, Ingenieria de Software y Salud Electronica y Movil, AmITIC 2025
BT - 8th Congreso Internacional en Inteligencia Ambiental, Ingenieria de Software y Salud Electronica y Movil, AmITIC 2025
A2 - Villarreal, Vladimir
A2 - Munoz, Lilia
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 September 2025 through 26 September 2025
ER -