A Comparative Analysis of CNN-LSTM and MFCC-LSTM for Sentiment Recognition from Speech Signals

Authors

  • Suman Lata, Neha Kishore, Pardeep Sangwan

Keywords

Convolutional neural network, Long short-term memory network, Sentiment recognition, Deep learning, IoT (Internet of Things), Hearing aids.

Abstract

Sentiment analysis of speech is a rapidly evolving field with immense potential for Human-Computer Interaction (HCI). As technology improves and addresses current challenges, we can expect a future where computers interact with us on a deeper emotional level, creating a more natural and intuitive user experience. Sentiment analysis of speech allows computers to understand the emotional tone behind a user's words, offering a powerful tool for designing more natural and empathetic HCI systems. While sentiment analysis often focuses on written text, speech offers a richer sentimental landscape: voice tone (for example, sarcasm, frustration, or excitement), speech patterns such as speaking speed, hesitation, and emphasis, and non-verbal cues like laughs, sighs, or grunts add emotional context that text misses. This research proposes a hybrid architecture that combines a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM) and leverages a linear stack of deep stride layers to improve the accuracy of sentiment recognition from speech signals. Convolutional neural networks capture spatial features efficiently from spectrograms, while LSTM networks excel at modeling temporal dependencies. The system classifies seven sentiments from the speaker's utterances: happiness, disgust, sadness, anger, neutral, fear, and pleasant surprise. The proposed work utilizes the Toronto Emotional Speech Set (TESS) dataset. Experimental results demonstrate that the hybrid CNN-LSTM architecture achieves a high accuracy of 98%, a slight improvement over our previous work using MFCC+LSTM, which reached 96%, outperforming other state-of-the-art methods. Notably, the model achieved these results with a relatively small size (1.8 MB), highlighting its computational efficiency.
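To make the pipeline concrete, the sketch below shows how such a spectrogram-fed CNN-LSTM classifier could be assembled. This is a minimal sketch, not the authors' implementation: the log-mel front end, layer widths, kernel sizes, and library choices (librosa, tensorflow.keras) are assumptions; only the strided convolutional front end over spectrograms, the LSTM for temporal modeling, and the seven output classes come from the abstract.

# Minimal sketch of a spectrogram-fed CNN-LSTM classifier for the seven
# TESS sentiment classes. Layer widths, kernel sizes, and the log-mel
# front end are illustrative assumptions, not the paper's exact design.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # happiness, disgust, sadness, anger, neutral, fear, pleasant surprise

def log_mel_spectrogram(path, sr=16000, n_mels=64, max_frames=128):
    """Load a clip and convert it to a fixed-size log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Pad or truncate along the time axis so every clip has the same shape.
    if log_mel.shape[1] < max_frames:
        log_mel = np.pad(log_mel, ((0, 0), (0, max_frames - log_mel.shape[1])))
    return log_mel[:, :max_frames]

def build_cnn_lstm(n_mels=64, max_frames=128):
    """Strided conv layers extract spatial features; an LSTM models time."""
    inputs = layers.Input(shape=(n_mels, max_frames, 1))
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    # Reshape to (time steps, features) so the LSTM sees a sequence:
    # move the downsampled time axis first, then flatten mel bins x channels.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((max_frames // 4, (n_mels // 4) * 64))(x)
    x = layers.LSTM(128)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (hypothetical paths and labels):
# X = np.stack([log_mel_spectrogram(p) for p in wav_paths])[..., np.newaxis]
# model = build_cnn_lstm()
# model.fit(X, y_labels, epochs=30, validation_split=0.2)

Keeping the network to two strided convolutions and a single LSTM layer also illustrates how such a model can stay within the small footprint (1.8 MB) the abstract emphasizes.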

Published

26.03.2024

How to Cite

Lata, S., Kishore, N., & Sangwan, P. (2024). A Comparative Analysis of CNN-LSTM and MFCC-LSTM for Sentiment Recognition from Speech Signals. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 4392 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6295

Issue

Vol. 12 No. 21s (2024)

Section

Research Article