Developing Resilient Speech Emotion Recognition Systems through Deep Learning and Audio Augmentation for Enhanced Emotion Detection

Authors

  • Irfan Chaugule, Satish R Sankaye

Keywords

Speech Emotion Recognition (SER); Deep Learning; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM); Audio Data Augmentation; Gaussian Noise; Pitch Shifting; Time Stretching; Time Shifting; Robustness to Noise; Human-Computer Interaction (HCI); Emotion-Aware Systems; Hybrid CNN-RNN Model

Abstract

Speech Emotion Recognition (SER) has emerged as a critical area of human-computer interaction, aiming to enable systems to recognize and respond to human emotions expressed through speech. This research applies deep learning techniques to improve the performance of SER systems, particularly under noisy and variable conditions. We present a comprehensive approach that begins with the preparation of audio datasets, followed by the application of augmentation techniques (Gaussian noise injection, pitch shifting, time stretching, and time shifting) that simulate real-world distortions. These augmentations, implemented with the audiomentations library, diversify the training data and thereby improve the robustness of machine learning models.
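The four augmentations named above can be sketched in plain NumPy. This is not the paper's actual audiomentations pipeline; the function names and parameter values here are illustrative assumptions, and each transform is a deliberately crude approximation of what the library does:

```python
import numpy as np

def add_gaussian_noise(x, amplitude=0.01):
    # Add zero-mean Gaussian noise to mimic background/sensor noise.
    return x + np.random.normal(0.0, amplitude, size=x.shape)

def time_shift(x, shift_fraction=0.1):
    # Circularly shift the waveform by a fraction of its length.
    return np.roll(x, int(len(x) * shift_fraction))

def time_stretch(x, rate=1.1):
    # Crude time stretch via linear resampling; rate > 1 shortens
    # the signal (a real implementation would preserve pitch).
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

def pitch_shift(x, semitones=2):
    # Crude pitch shift by resampling; note this also changes duration
    # (a true pitch shifter would compensate with a time stretch).
    rate = 2.0 ** (semitones / 12.0)
    return time_stretch(x, rate=rate)
```

Applying each transform with some probability to every training clip, as the audiomentations `Compose` wrapper does, yields a different distorted variant on each epoch.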

We further explore the efficacy of deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), in recognizing emotional states across different speech patterns. Initial results demonstrate significant improvements in model generalization, particularly in handling diverse audio conditions. This study contributes to the growing body of work on SER by improving model robustness through data augmentation, with promising results that lay the groundwork for more adaptive and emotion-aware systems.
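The keywords mention a hybrid CNN-RNN model, but the abstract does not describe its architecture. As a purely illustrative sketch (layer sizes, feature dimensions, and class count are assumptions, not the authors' design), such a hybrid typically uses convolutional layers to extract local spectral patterns and an LSTM to model their temporal dependencies:

```python
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    """Hypothetical hybrid CNN-LSTM emotion classifier."""

    def __init__(self, n_mels=40, n_classes=7):
        super().__init__()
        # 1D convolution over time, treating mel bands as channels.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # LSTM consumes the conv features frame by frame.
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):           # x: (batch, n_mels, frames)
        h = self.conv(x)            # (batch, 64, frames // 2)
        h = h.transpose(1, 2)       # (batch, frames // 2, 64)
        _, (hn, _) = self.lstm(h)   # hn: (1, batch, 128)
        return self.head(hn[-1])    # (batch, n_classes) emotion logits
```

The final hidden state of the LSTM summarizes the whole utterance, so a single linear head suffices for classification.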

DOI: https://doi.org/10.17762/ijisae.v12i23s.7219


References

Haq, N., et al. (2020). Temporal Dependencies in Speech Emotion Recognition Using LSTM. IEEE Transactions on Neural Networks.

Yang, L., & Li, M. (2019). Impact of Data Augmentation on Robust SER. Proceedings of the International Conference on Audio Signal Processing.

Zhao, X., et al. (2021). CNN Architectures for Emotion Detection in Speech. Journal of Audio Engineering.

El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the International Joint Conference on Neural Networks, 2005. IJCNN'05. (Vol. 4, pp. 2047–2052). IEEE.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.

Chollet, F. (2017). Deep learning with Python. Manning Publications.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.


Published

25.12.2024

How to Cite

Irfan Chaugule, & Satish R Sankaye. (2024). Developing Resilient Speech Emotion Recognition Systems through Deep Learning and Audio Augmentation for Enhanced Emotion Detection. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 1999–2003. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7219

Section

Research Article