Developing Resilient Speech Emotion Recognition Systems through Deep Learning and Audio Augmentation for Enhanced Emotion Detection

Authors

  • Irfan Chaugule, Satish R Sankaye

Keywords

Speech Emotion Recognition (SER); Deep Learning; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM); Audio Data Augmentation; Gaussian Noise; Pitch Shifting; Time Stretching; Time Shifting; Robustness to Noise; Human-Computer Interaction (HCI); Emotion-Aware Systems; Hybrid CNN-RNN Model

Abstract

Speech Emotion Recognition (SER) has emerged as a critical area of human-computer interaction, aiming to enable systems to recognize and respond to human emotions expressed through speech. This research applies deep learning techniques to improve the performance of SER systems, particularly under noisy and variable conditions. We present a comprehensive approach that begins with the preparation of audio datasets, followed by the application of augmentation techniques (Gaussian noise injection, pitch shifting, time stretching, and time shifting) that simulate real-world distortions. These augmentations, implemented with the audiomentations library, diversify the training data and thereby improve the robustness of machine learning models.
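The four augmentations named above can be sketched in plain NumPy. This is not the paper's actual audiomentations pipeline; the function names and parameter values here are illustrative assumptions, and each transform is a deliberately crude approximation of what the library does:

```python
import numpy as np

def add_gaussian_noise(x, amplitude=0.01):
    # Add zero-mean Gaussian noise to mimic background/sensor noise.
    return x + np.random.normal(0.0, amplitude, size=x.shape)

def time_shift(x, shift_fraction=0.1):
    # Circularly shift the waveform by a fraction of its length.
    return np.roll(x, int(len(x) * shift_fraction))

def time_stretch(x, rate=1.1):
    # Crude time stretch via linear resampling; rate > 1 shortens
    # the signal (a real implementation would preserve pitch).
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

def pitch_shift(x, semitones=2):
    # Crude pitch shift by resampling; note this also changes duration
    # (a true pitch shifter would compensate with a time stretch).
    rate = 2.0 ** (semitones / 12.0)
    return time_stretch(x, rate=rate)
```

Applying each transform with some probability to every training clip, as the audiomentations `Compose` wrapper does, yields a different distorted variant on each epoch.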

We further explore the efficacy of deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), in recognizing emotional states across different speech patterns. Initial results demonstrate significant improvements in model generalization, particularly in handling diverse audio conditions. This study contributes to the growing body of work on SER by improving model robustness through data augmentation, with promising results that lay the groundwork for more adaptive and emotion-aware systems.
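The keywords mention a hybrid CNN-RNN model, but the abstract does not describe its architecture. As a purely illustrative sketch (layer sizes, feature dimensions, and class count are assumptions, not the authors' design), such a hybrid typically uses convolutional layers to extract local spectral patterns and an LSTM to model their temporal dependencies:

```python
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    """Hypothetical hybrid CNN-LSTM emotion classifier."""

    def __init__(self, n_mels=40, n_classes=7):
        super().__init__()
        # 1D convolution over time, treating mel bands as channels.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # LSTM consumes the conv features frame by frame.
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):           # x: (batch, n_mels, frames)
        h = self.conv(x)            # (batch, 64, frames // 2)
        h = h.transpose(1, 2)       # (batch, frames // 2, 64)
        _, (hn, _) = self.lstm(h)   # hn: (1, batch, 128)
        return self.head(hn[-1])    # (batch, n_classes) emotion logits
```

The final hidden state of the LSTM summarizes the whole utterance, so a single linear head suffices for classification.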

DOI: https://doi.org/10.17762/ijisae.v12i23s.7219


References

Haq, N., et al. (2020). Temporal Dependencies in Speech Emotion Recognition Using LSTM. IEEE Transactions on Neural Networks.

Yang, L., & Li, M. (2019). Impact of Data Augmentation on Robust SER. Proceedings of the International Conference on Audio Signal Processing.

Zhao, X., et al. (2021). CNN Architectures for Emotion Detection in Speech. Journal of Audio Engineering.

El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the International Joint Conference on Neural Networks, 2005. IJCNN'05. (Vol. 4, pp. 2047–2052). IEEE.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.

Chollet, F. (2017). Deep learning with Python. Manning Publications.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.


Published

25.12.2024

How to Cite

Irfan Chaugule, & Satish R Sankaye. (2024). Developing Resilient Speech Emotion Recognition Systems through Deep Learning and Audio Augmentation for Enhanced Emotion Detection. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 1999–2003. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7219

Section

Research Article