Investigating the Robustness of End-to-End Speech Recognition Systems Against Adversarial Audio Perturbations in Noisy Environments
Keywords:
Automatic Speech Recognition (ASR), adversarial audio attacks, noise robustness, deep learning, end-to-end speech systems, environmental noise, Carlini & Wagner attack

Abstract
End-to-end automatic speech recognition (ASR) systems have advanced significantly through the adoption of deep learning architectures such as Connectionist Temporal Classification (CTC), sequence-to-sequence models, and Transformer-based approaches. However, these systems remain susceptible to adversarial audio perturbations: imperceptible modifications that can mislead recognition models. This study investigates the robustness of state-of-the-art ASR systems under adversarial conditions, particularly when exposed to additive environmental noise. Using attacks such as Carlini & Wagner (C&W) and the Fast Gradient Sign Method (FGSM), we evaluate performance degradation across a range of noise levels. Our findings reveal that environmental noise can exacerbate the impact of adversarial perturbations but, paradoxically, sometimes mitigates it. The results underscore the need for more resilient ASR architectures and training methodologies in adversarial and noisy settings.
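To make the evaluation setup concrete, the following is a minimal sketch, not the study's code, of the two waveform manipulations the abstract describes: a one-step FGSM perturbation of the input audio, and additive environmental noise scaled to a target signal-to-noise ratio (the C&W attack, an iterative optimization, is not reproduced here). PyTorch is assumed, and `model`, `loss_fn`, `target`, and `epsilon` are illustrative placeholders rather than the paper's actual configuration.

```python
# Hypothetical sketch: FGSM on a raw waveform plus SNR-controlled noise mixing.
# `model` and `loss_fn` stand in for any differentiable end-to-end ASR model
# (e.g., a CTC loss over waveform input); they are assumptions, not the paper's setup.
import torch

def fgsm_perturb(model, loss_fn, waveform, target, epsilon=0.002):
    """One-step FGSM: nudge the waveform along the sign of the loss gradient."""
    waveform = waveform.clone().detach().requires_grad_(True)
    loss = loss_fn(model(waveform), target)
    loss.backward()
    # Untargeted attack: ascend the loss, then clamp to the valid audio range.
    adversarial = waveform + epsilon * waveform.grad.sign()
    return adversarial.clamp(-1.0, 1.0).detach()

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` (same length as `clean`) to achieve the requested SNR in dB."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Solve SNR_dB = 10 * log10(clean_power / (gain^2 * noise_power)) for gain.
    gain = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Sweeping `snr_db` while holding `epsilon` fixed mirrors, in spirit, the noise-level sweep the abstract reports: at low SNRs the added noise can either compound the perturbation's effect or partially mask it.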
References
Carlini, Nicholas, and David Wagner. "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text." 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 1–7.
Yakura, Hiromu, and Jun Sakuma. "Robust Audio Adversarial Example for a Physical Attack." Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 5334–5341.
Rajaratnam, Krishan, and Jugal Kalita. "Noise Flooding for Detecting Audio Adversarial Examples Against Automatic Speech Recognition." arXiv preprint arXiv:1811.03609, 2018.
Schönherr, Lea, et al. "Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems." 2020 IEEE Symposium on Security and Privacy (SP), IEEE, 2020, pp. 804–819.
Hannun, Awni, et al. "Deep Speech: Scaling Up End-to-End Speech Recognition." arXiv preprint arXiv:1412.5567, 2014.
Baevski, Alexei, et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 12449–12460.
Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. "Adversarial Examples in the Physical World." arXiv preprint arXiv:1607.02533, 2016.
Cisse, Moustapha, et al. "Houdini: Fooling Deep Structured Prediction Models." arXiv preprint arXiv:1707.05373, 2017.
Zhang, Chiyuan, et al. "Understanding Deep Learning Requires Rethinking Generalization." International Conference on Learning Representations (ICLR), 2017.
Wang, Chengyue, et al. "Adversarial Examples for Automatic Speech Recognition: Attacks and Countermeasures." IEEE Communications Magazine, vol. 57, no. 10, 2019, pp. 105–111.
Ko, Tom, et al. "Audio Augmentation for Speech Recognition." Proceedings of Interspeech, 2015, pp. 3586–3589.
Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and Harnessing Adversarial Examples." International Conference on Learning Representations (ICLR), 2015.
Madry, Aleksander, et al. "Towards Deep Learning Models Resistant to Adversarial Attacks." International Conference on Learning Representations (ICLR), 2018.
Das, Dipjyoti Paul, et al. "Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey." ACM Transactions on Internet Technology (TOIT), vol. 21, no. 4, 2021, pp. 1–27.
Yuan, Xiaoyong, et al. "Adversarial Examples: Attacks and Defenses for Deep Learning." IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, 2019, pp. 2805–2824.
License
Copyright (c) 2021 Ingrid Svensson (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.