A Comparative Analysis of CNN-LSTM and MFCC-LSTM for Sentiment Recognition from Speech Signals
Keywords:
Convolutional neural network, Long short-term memory network, Sentiment recognition, Deep learning, IoT (Internet of Things), Hearing aids
Abstract
Sentiment analysis of speech is a rapidly evolving field with considerable potential for human-computer interaction (HCI). As the technology matures and current challenges are addressed, computers can be expected to interact with users on a deeper emotional level, creating a more natural and intuitive user experience. By allowing computers to recognize the emotional tone behind a user's words, speech sentiment analysis supports the design of more natural and empathetic HCI systems. While sentiment analysis often focuses on written text, speech offers a richer emotional landscape: tone of voice (for example, sarcasm, frustration, or excitement), speech patterns (speaking speed, hesitation, and emphasis), and non-verbal cues such as laughs, sighs, or grunts add emotional context that text misses. This research proposes a hybrid architecture that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network and uses a linear stack of deep stride layers to improve the accuracy of sentiment recognition from speech signals. The CNN efficiently captures spatial features from spectrograms, while the LSTM models temporal dependencies. The system classifies speakers' utterances into seven sentiments: happiness, disgust, sadness, anger, neutral, fear, and pleasant surprise. The proposed work uses the Toronto Emotional Speech Set (TESS) dataset. Experimental results show that the hybrid CNN-LSTM architecture achieves 98% accuracy, a slight improvement over the 96% accuracy of our previous MFCC+LSTM work, and outperforms other state-of-the-art methods. Notably, the model achieves these results with a relatively small size (1.8 MB), highlighting its computational efficiency.
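The abstract does not list layer sizes, so the following Python sketch only illustrates how a stack of strided convolutions can turn a spectrogram into a short sequence for an LSTM, and how the resulting parameter count relates to a 1.8 MB model. Every number in it except the seven output classes and the 1.8 MB figure is a hypothetical assumption: the 64 x 128 spectrogram, the three 3 x 3 stride-2 layers, the 64-unit LSTM, unpadded convolutions, and 32-bit float storage.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    """One strided 2-D convolution. Sizes are hypothetical, not from the paper."""
    out_channels: int
    kernel: int
    stride: int


def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(in_channels: int, layer: ConvLayer) -> int:
    """Kernel weights plus one bias per output channel."""
    weights = layer.kernel * layer.kernel * in_channels * layer.out_channels
    return weights + layer.out_channels


def lstm_params(input_dim: int, hidden: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (hidden * (input_dim + hidden) + hidden)


def summarize(n_freq, n_frames, convs, lstm_hidden, n_classes):
    """Trace tensor shapes and count parameters for a CNN -> LSTM -> dense stack."""
    freq, time, channels, total = n_freq, n_frames, 1, 0
    for layer in convs:
        total += conv_params(channels, layer)
        freq = conv_out(freq, layer.kernel, layer.stride)
        time = conv_out(time, layer.kernel, layer.stride)
        channels = layer.out_channels
    # Each remaining time step becomes one LSTM input vector
    # (frequency bins and channels flattened together).
    features = freq * channels
    total += lstm_params(features, lstm_hidden)
    total += lstm_hidden * n_classes + n_classes  # final classification layer
    return {
        "time_steps": time,
        "features": features,
        "params": total,
        "size_mb": total * 4 / 1e6,  # assumes 4 bytes per parameter
    }


# Hypothetical configuration: 64 frequency bins x 128 frames, three stride-2 layers.
convs = [ConvLayer(16, 3, 2), ConvLayer(32, 3, 2), ConvLayer(64, 3, 2)]
summary = summarize(64, 128, convs, lstm_hidden=64, n_classes=7)

# Rough parameter budget implied by a 1.8 MB model stored as 32-bit floats.
budget = int(1.8e6 // 4)

print(summary)
print("fits in 1.8 MB budget:", summary["params"] <= budget)
```

In this sketch the strides shorten the time axis from 128 frames to 15 steps, so the LSTM processes a much shorter sequence than the raw spectrogram. The LSTM also accounts for most of the parameters, which is why a hybrid of this kind can stay within a budget of a few hundred thousand weights.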
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers may share and adapt the material provided that they give appropriate credit, provide a link to the license, and indicate if changes were made. If they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.