Automatic Speech Emotion Recognition Using Hybrid Deep Learning Techniques


  • Bilal Hikmat Rasheed Department of Computer Science, Cihan University-Duhok, Iraq.
  • D. Yuvaraj Department of Computer Science, Cihan University-Duhok, Iraq.
  • Saif Saad Alnuaimi Department of Computer Science, Cihan University-Duhok, Iraq.
  • S. Shanmuga Priya Department of Computer Science Engineering, SRM Institute of Science and Technology, Trichy, India.


Automatic Speech Emotion Recognition, Deep Learning, Human-Computer Interaction, Convolutional Neural Network, Long Short-Term Memory


The advancement of deep learning techniques for speech emotion recognition is an emerging field of research. Speech recognition technologies are significantly reshaping human-computer interaction, where one of the most crucial challenges is developing an interface that can sense and react as accurately as a human. Automatic Speech Emotion Recognition (ASER) systems address this need by extracting salient information from voice signals and classifying it into emotional categories. Recent advances in deep learning have also led to major improvements in ASER performance. The ASER literature has applied numerous methods, including well-known speech analysis and classification approaches, to derive emotions from signals, and deep learning methods have recently been proposed as alternatives to these conventional techniques. The main goal of this research is to analyze different emotions from speech using deep learning techniques. Because deep networks learn feature representations directly from the data, they are frequently preferred for emotion classification over traditional machine learning systems, which depend on manual feature extraction before classifying the emotional state. To extract features and identify different emotions from the input data, the authors implemented an efficient hybrid deep learning architecture combining a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM) layers. By training and testing the proposed network on a standard dataset, the authors achieved the highest accuracy.
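The hybrid pipeline the abstract describes — a convolutional stage that extracts local patterns from a frame sequence, followed by a recurrent LSTM stage that models their temporal dynamics — can be sketched in miniature as follows. This is an illustrative toy in plain Python, not the authors' actual architecture: the scalar states, the smoothing kernel, and the random weights are all assumptions made for the sketch.

```python
# Toy sketch of a CNN+LSTM hybrid: a 1-D convolution extracts local
# features from a frame sequence, then an LSTM cell models their
# temporal dynamics. Shapes and weights here are illustrative
# assumptions, not the configuration used in the paper.
import math
import random

random.seed(0)

def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) over a feature sequence."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One LSTM cell step with scalar input and state; gates i, f, o, g."""
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])   # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])   # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])   # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"]) # candidate
    c = f * c + i * g            # updated cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# Toy "speech" features (e.g. per-frame energies) and a smoothing kernel.
frames = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3]
kernel = [0.25, 0.5, 0.25]

features = conv1d(frames, kernel)            # CNN stage: local patterns

weights = {k: random.uniform(-0.5, 0.5) for k in
           ("wi", "ui", "bi", "wf", "uf", "bf",
            "wo", "uo", "bo", "wg", "ug", "bg")}
h = c = 0.0
for x in features:                           # LSTM stage: temporal dynamics
    h, c = lstm_step(x, h, c, weights)

print(len(features))    # 5 convolved frames (7 - 3 + 1)
print(-1.0 < h < 1.0)   # final hidden state, bounded by tanh * sigmoid
```

In a full system, the final hidden state would feed a softmax classifier over the emotional categories; here it simply demonstrates how the convolutional output becomes the recurrent input.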




How to Cite

Rasheed, B. H. ., Yuvaraj, D. ., Alnuaimi, S. S. ., & Priya, S. S. . (2024). Automatic Speech Emotion Recognition Using Hybrid Deep Learning Techniques. International Journal of Intelligent Systems and Applications in Engineering, 12(15s), 87–96. Retrieved from



Research Article