TY - GEN
T1 - Performance Evaluation of Deep Learning Formats
T2 - 7th IEEE International Conference on BioInspired Processing, BIP 2025
AU - Mercado, Alfredo
AU - Villalobos, Johansell
AU - Meneses, Esteban
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - The high computational cost of Large Language Models (LLMs) requires post-training optimization for efficient deployment. This paper presents a rigorous empirical comparison of inference performance between the native PyTorch framework and the optimized ONNX format for contemporary LLMs, including Gemma-7B, Phi-3-Medium, and Llama-3.1-8B, on NVIDIA GPU hardware. Using low-level profiling with NVIDIA's Nsight Compute suite, we found that ONNX conversion yields a substantial and statistically significant performance advantage. Our results show a latency reduction of up to 27.6% and an increase in computational throughput (GFLOP/s) of up to 48.3%. We conclude that adopting an ONNX-based strategy is not merely a portability choice but a critical optimization step for deploying high-performance, efficient LLMs in production environments.
KW - Deep Learning
KW - Large Language Models
KW - Performance Profiling
KW - Weight Formats
UR - https://www.scopus.com/pages/publications/105038772548
U2 - 10.1109/BIP68491.2025.11489132
DO - 10.1109/BIP68491.2025.11489132
M3 - Conference contribution
AN - SCOPUS:105038772548
T3 - 2025 IEEE 7th International Conference on BioInspired Processing, BIP 2025
BT - 2025 IEEE 7th International Conference on BioInspired Processing, BIP 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 December 2025 through 5 December 2025
ER -