TY - GEN
T1 - Uncertainty Quantification for Speech-To-Text in Spanish
AU - Rodriguez-Rivas, Daniel
AU - Calderon-Ramirez, Saul
AU - Solis, Martin
AU - Morales-Munoz, Walter
AU - Perez-Hidalgo, J. Esteban
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Speech-to-text is a task whose performance in practical applications has recently been boosted by the advent of deep neural network architectures. Moreover, recent speech-to-text models such as Whisper have benefited from extensive pre-training on very large datasets. However, their usage in different target scenarios (less-represented languages, speech recorded in noisy environments, etc.) might yield unsatisfactory results. Uncertainty quantification in this context is important for the safe usage of speech-to-text models. This study evaluates three methods for uncertainty quantification: Monte-Carlo dropout, temperature scaling, and feature density estimation. The analysis is conducted using Spanish audio datasets to assess these methods in the context of a less-represented language. A novel metric, analogous to the expected calibration error, is introduced to measure the correlation between predicted uncertainty and word error rate. We provide a detailed description of the dataset construction and experimental parameters. The findings indicate that Whisper demonstrates strong performance with Monte-Carlo dropout and temperature scaling, while the feature density estimation method shows comparatively lower efficacy. Finally, we propose enhancements to the evaluation procedures to further reduce prediction uncertainty.
AB - Speech-to-text is a task whose performance in practical applications has recently been boosted by the advent of deep neural network architectures. Moreover, recent speech-to-text models such as Whisper have benefited from extensive pre-training on very large datasets. However, their usage in different target scenarios (less-represented languages, speech recorded in noisy environments, etc.) might yield unsatisfactory results. Uncertainty quantification in this context is important for the safe usage of speech-to-text models. This study evaluates three methods for uncertainty quantification: Monte-Carlo dropout, temperature scaling, and feature density estimation. The analysis is conducted using Spanish audio datasets to assess these methods in the context of a less-represented language. A novel metric, analogous to the expected calibration error, is introduced to measure the correlation between predicted uncertainty and word error rate. We provide a detailed description of the dataset construction and experimental parameters. The findings indicate that Whisper demonstrates strong performance with Monte-Carlo dropout and temperature scaling, while the feature density estimation method shows comparatively lower efficacy. Finally, we propose enhancements to the evaluation procedures to further reduce prediction uncertainty.
KW - BERT
KW - Deep Learning
KW - Safe Artificial Intelligence
KW - Text complex prediction
KW - Transformers
KW - Uncertainty Estimation
UR - http://www.scopus.com/inward/record.url?scp=86000002304&partnerID=8YFLogxK
U2 - 10.1109/BIP63158.2024.10885385
DO - 10.1109/BIP63158.2024.10885385
M3 - Conference contribution
AN - SCOPUS:86000002304
T3 - 6th IEEE International Conference on BioInspired Processing, BIP 2024
BT - 6th IEEE International Conference on BioInspired Processing, BIP 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on BioInspired Processing, BIP 2024
Y2 - 4 December 2024 through 6 December 2024
ER -