TY - GEN
T1 - Reference-Based Metric Analysis for Evaluating Spanish Text Simplification
AU - Pérez-Rojas, Nelson
AU - Moncada, Santiago Castrillo
AU - Arias, Ana Laura Mora
AU - Calderón-Ramírez, Saúl
AU - Solís, Martín
AU - Castro, Monserrat Ramírez
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Reliable evaluation remains a bottleneck for Spanish text simplification. We examine how five reference-based automatic metrics (Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR), System output Against References and the Input sentence (SARI), and Bidirectional Encoder Representations from Transformers Score (BERTScore)) respond to diverse, linguistically valid edits in human simplifications. Using the FEINA-test corpus (financial education texts simplified for visually impaired readers, with four human simplifications per segment and attribute annotations), we conduct a two-stage analysis. First, we compute each metric per segment: for BLEU, ROUGE, METEOR, and BERTScore, each simplification is compared to the complex source and the scores are averaged across annotators; for SARI, we rotate the four simplifications as hypotheses and use the remaining ones as references. Second, we introduce the Attribute Diversity Index (ADI), defined as the number of distinct linguistic attributes modified in the references for each segment, and assess metric sensitivity via Pearson correlation with ADI. All metrics show a negative association with edit diversity; BERTScore and ROUGE are the most sensitive, while SARI is comparatively more tolerant. METEOR and BERTScore yield higher mean scores overall, yet they also decline as diversity increases. These findings provide empirical evidence that commonly used reference-based metrics can penalize valid transformations in Spanish, particularly when multiple edit types are present, and suggest the potential for combining overlap-based metrics with SARI in accessibility-focused evaluations.
AB - Reliable evaluation remains a bottleneck for Spanish text simplification. We examine how five reference-based automatic metrics (Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR), System output Against References and the Input sentence (SARI), and Bidirectional Encoder Representations from Transformers Score (BERTScore)) respond to diverse, linguistically valid edits in human simplifications. Using the FEINA-test corpus (financial education texts simplified for visually impaired readers, with four human simplifications per segment and attribute annotations), we conduct a two-stage analysis. First, we compute each metric per segment: for BLEU, ROUGE, METEOR, and BERTScore, each simplification is compared to the complex source and the scores are averaged across annotators; for SARI, we rotate the four simplifications as hypotheses and use the remaining ones as references. Second, we introduce the Attribute Diversity Index (ADI), defined as the number of distinct linguistic attributes modified in the references for each segment, and assess metric sensitivity via Pearson correlation with ADI. All metrics show a negative association with edit diversity; BERTScore and ROUGE are the most sensitive, while SARI is comparatively more tolerant. METEOR and BERTScore yield higher mean scores overall, yet they also decline as diversity increases. These findings provide empirical evidence that commonly used reference-based metrics can penalize valid transformations in Spanish, particularly when multiple edit types are present, and suggest the potential for combining overlap-based metrics with SARI in accessibility-focused evaluations.
KW - automatic evaluation
KW - BERTScore
KW - BLEU
KW - METEOR
KW - reference-based metrics
KW - ROUGE
KW - SARI
KW - text simplification
UR - https://www.scopus.com/pages/publications/105038743970
U2 - 10.1109/BIP68491.2025.11489141
DO - 10.1109/BIP68491.2025.11489141
M3 - Conference contribution
AN - SCOPUS:105038743970
T3 - 2025 IEEE 7th International Conference on BioInspired Processing, BIP 2025
BT - 2025 IEEE 7th International Conference on BioInspired Processing, BIP 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th IEEE International Conference on BioInspired Processing, BIP 2025
Y2 - 3 December 2025 through 5 December 2025
ER -