Multimodal Emotion Recognition: Integrating Audio and Visual Features Using Enhanced Deep Learning Techniques

Authors

  • Archna Kirar, Sumeet Gill, Vikas Jangra, Binny Sharma

Keywords:

Multimodal Emotion Recognition, Bi-Directional LSTM with Self-Attention Mechanism, Bi-Directional LSTM, Autoencoder, ResNet, CNN.

Abstract

Emotion recognition is a pivotal area in human-computer interaction, crucial for enhancing system responsiveness and adaptability. Human emotion is expressed through a variety of verbal and non-verbal cues, so emotion recognition is better suited to multimodal than to single-modal learning. This study introduces a multimodal framework that integrates speech (audio) and facial features to recognize three primary emotions (happiness, sadness, and surprise) from a video dataset (MELD). For audio feature extraction, an autoencoder is used, which improves the model's capacity to identify subtle emotional nuances in speech signals. Concurrently, ResNet is used to extract image features via transfer learning, using pre-trained weights to identify intricate visual patterns in summary images. The Improved Zebra Algorithm (IZA) is used for feature selection to maximize discriminative feature subsets. Our proposed Bi-Directional LSTM with self-attention mechanism is evaluated against two baseline models, a Bi-Directional LSTM and a Convolutional Neural Network (CNN). Our method achieves state-of-the-art results on MELD: the Bi-LSTM self-attention model obtained the highest accuracy at 89.83%, compared with 86.87% for the CNN and 85.15% for the Bi-LSTM. These findings demonstrate the efficiency of the Bi-LSTM-SA model for multimodal emotion recognition.
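The classification head described in the abstract can be illustrated with a minimal sketch: a bi-directional LSTM over the fused audio-visual feature sequence, followed by self-attention and a three-way classifier. This is not the authors' code; all layer sizes, the fused-feature dimension, and the mean-pooling step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMSelfAttention(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_classes=3):
        super().__init__()
        # Bi-directional LSTM over the fused audio+visual feature sequence
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Single-head self-attention over the LSTM outputs
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden,
                                          num_heads=1, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):             # x: (batch, seq_len, feat_dim)
        h, _ = self.bilstm(x)         # (batch, seq_len, 2*hidden)
        a, _ = self.attn(h, h, h)     # self-attention: Q = K = V = h
        pooled = a.mean(dim=1)        # average over time steps
        return self.fc(pooled)        # logits for happy / sad / surprise

model = BiLSTMSelfAttention()
logits = model(torch.randn(4, 10, 128))   # 4 clips, 10 time steps each
print(logits.shape)                       # torch.Size([4, 3])
```

In this sketch, the per-frame ResNet image features and the autoencoder audio features are assumed to have been concatenated into a single 128-dimensional vector per time step before reaching the model.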


References

P. V. Rouast, M. Adam, and R. Chiong, Deep learning for human affect recognition: Insights and new developments, IEEE Transactions on Affective Computing, 12 (2019), 524-543, DOI: 10.1109/TAFFC.2018.2890471.

T. Baltrušaitis, C. Ahuja, and L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 41 (2019), 423–443, DOI: 10.1109/TPAMI.2018.2798607.

Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, A survey of affect recognition methods: Audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31 (2009), 39–58, DOI: 10.1145/1322192.1322216.

C.-H. Wu, Y.-M. Huang, and J.-P. Hwang, Review of affective computing in education/learning: Trends and challenges, British Journal of Educational Technology, 47 (2016), 1304–1323, DOI: 10.1111/bjet.12324.

B. G. Lee, T. W. Chong, and B. Kim, Detecting driving stress in physiological signals based on multimodal feature analysis and kernel classifiers, IEEE Transactions on Human-Machine Systems, (2017), DOI: 10.1016/j.eswa.2017.01.040.

S. Cosentino, E. I. Randria, J.-Y. Lin, T. Pellegrini, S. Sessa, and A. Takanishi, Group emotion recognition strategies for entertainment robots, International Conference on Intelligent Robots and Systems (IROS), (2018), 813–818, DOI: 10.1016/j.eswa.2017.01.040.

L. Y. Mano, B. S. Faiçal, L. H. Nakamura, P. H. Gomes, G. L. Libralon, R. I. Meneguete, P. Geraldo Filho, G. T. Giancristofaro, G. Pessin, B. Krishnamachari et al., Exploiting IoT technologies for enhancing health smart homes through patient identification and emotion recognition, Computer Communications, 89 (2016), 178–190, DOI: 10.1016/j.comcom.2016.03.010.

Y. Zong, H. Lian, H. Chang, C. Lu, and C. Tang, Adapting Multiple Distributions for Bridging Emotions from Different Speech Corpora, Entropy, 24 (2022), 1–14, DOI: 10.3390/e24091250.

H. Fu, Z. Zhuang, Y. Wang, C. Huang, and W. Duan, Cross-Corpus Speech Emotion Recognition Based on Multi-Task Learning and Subdomain Adaptation, Entropy, 25 (2023), 1–10, DOI: 10.3390/e25010124.

S. Shaheen, W. El-Hajj, H. Hajj, and S. Elbassuoni, Emotion Recognition from Text Based on Automatically Generated Rules, International Conference on Data Mining Workshop, (2014), 383–392, DOI: 10.1109/ICDMW.2014.80.

C. H. Wu, Z. J. Chuang, and Y. C. Lin, Emotion recognition from text using semantic labels and separable mixture models, ACM Transactions on Asian Language Information Processing., 5 (2006), 165–182, DOI: 10.1145/1165255.1165259.

S. Li and W. Deng, Deep Facial Expression Recognition: A Survey, IEEE Transactions on Affective Computing, 13 (2022), 1195–1215, DOI: 10.1109/TAFFC.2020.2981446.

H. Yang, L. Xie, H. Pan, C. Li, Z. Wang, and J. Zhong, Multimodal Attention Dynamic Fusion Network for Facial Micro-Expression Recognition, Entropy, 25 (2023), DOI: 10.3390/e25091246.

J. Zeng, T. Liu, and J. Zhou, Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities, Association for Computing Machinery, 1 (2022), DOI: 10.1145/3477495.3532064.

Y. Li, Y. Wang, and Z. Cui, Decoupled Multimodal Distilling for Emotion Recognition, Proc. IEEE Comput. Soc. Conf. Computer Vision Pattern Recognition, (2023), 6631–6640, DOI: 10.1109/CVPR52729.2023.00641.

S. E. Kahou et al., Combining modality specific deep neural networks for emotion recognition in video, ICMI 2013 International Conference on Multimodal Interaction, (2013), 543–550, DOI: 10.1145/2522848.2531745.

S. Lee, D. K. Han, H. Ko, Multimodal emotion recognition fusion analysis adapting bert with heterogeneous feature unification, IEEE Access, 9 (2021), 94557–94572, DOI: 10.1109/ACCESS.2021.3092735.

S. Dobrisek, R. Gajsek, F. Mihelic, N. Pavesic, V. Struc, Towards efficient multi-modal emotion recognition, Int J Adv Robotic Syst, 10 (2013), 1–10, DOI: 10.5772/54002.

Z. Sara, A. Zahid, C. E. Cigdem, Multimodal emotion recognition based on peak frame selection from video, Signal Image Video Process., 10 (2016), 827–843, DOI: 10.1007/s11760-015-0822-0.

S. K. Phooi, A. Li-Minn, O. C. Shing, A combined rule-based & machine learning audio-visual emotion recognition approach, IEEE Trans Affect Comput, 9 (2018), 3–13, DOI: 10.1109/TAFFC.2016.2588488.

A. Shrestha, A. Mahmood, Review of deep learning algorithms and architectures, IEEE Access, 7 (2019), 53040–53065, DOI: 10.1109/ACCESS.2019.2912200.

Z. Farhoudi, S. Setayeshib, Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition, Speech Commun, 127 (2020), 92–123, DOI: 10.1016/j.specom.2020.12.001.

E. Avots, T. Sapinski, M. Bachmann, D. Kaminska, Audiovisual emotion recognition in wild, Mach Vis Appl, 30 (2019), 975–985, DOI: 10.1007/s00138-018-0960-9.

F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, G. Anbarjafari, Audio-visual emotion recognition in video clips, IEEE Trans Affect Comput, 10 (2019), 60–75, DOI: 10.1109/TAFFC.2017.2713783.

T. Hussain, W. Wang, N. Bouaynaya, H. Fathallah-Shaykh, L. Mihaylova, Deep learning for audio visual emotion recognition, 2022 25th International Conference on Information Fusion (FUSION), (2022), 1–8, DOI: 10.23919/FUSION49751.2022.9841342.

W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, H. Li, Audio and face video emotion recognition in the wild using deep neural networks and small datasets, 2016 18th ACM International Conference on Multimodal Interaction (ICMI), (2016), 505–513, DOI: 10.1145/2993148.2997637.

G. P. Rajasekar, W. C. Melo, N. Ullah, H. Aslam, O. Zeeshan, T. Denorme, M. Pedersoli, A. Koerich, S. Bacon, P. Cardinal, E. Granger, A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition, arXiv, (2022), DOI: 10.48550/ARXIV.2203.14779.

S. Zhang, S. Zhang, T. Huang, Learning affective features with a hybrid deep model for audio-visual emotion recognition, IEEE Trans Circ Syst Video Technol, 28 (2018), 3030–3043, DOI: 10.1109/TCSVT.2017.2719043.

M. S. Hossain, G. Muhammad, Emotion recognition using deep learning approach from audio-visual emotional big data, Inf Fusion, 49 (2019), 69–78, DOI: 10.1016/j.inffus.2018.09.008.

Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Košir, Audio-visual emotion fusion (avef): A deep efficient weighted approach, Inf Fusion, 46 (2019), 184–192, DOI: 10.1016/j.inffus.2018.06.003.

E. Ghaleb, J. Niehues, S. Asteriadis, Joint modelling of audio-visual cues using attention mechanisms for emotion recognition, Multimed Tools Appl, 82 (2023), 11239–11264, DOI: 10.1007/s11042-022-13557-w.

S. Zhao et al., A two-stage 3D CNN based learning method for spontaneous micro-expression recognition, Neurocomputing, 448 (2021), 276–289, DOI: 10.1016/j.neucom.2021.03.058.

S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, arXiv preprint arXiv:1810.02508, (2018), DOI: 10.48550/arXiv.1810.02508.

J. He, A Multimodal Approach for Emotion Recognition in Conversations Using the MELD Dataset, 2025 Asia-Europe Conference on Cybersecurity, Internet of Things and Soft Computing (CITSC), (2025), 54-58, DOI: 10.1109/CITSC64390.2025.00016.

H. F. T. Alsaadawı and R. Daş, Multimodal Emotion Recognition Using Bi-LG-GCN for MELD Dataset, Balkan Journal of Electrical and Computer Engineering, 12 (2024), 36–46, DOI: 10.17694/bajece.1372107.


Published

30.12.2024

How to Cite

Archna Kirar. (2024). Multimodal Emotion Recognition: Integrating Audio and Visual Features Using Enhanced Deep Learning Techniques. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 3870–3884. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7940

Section

Research Article