Enhanced Emotion Recognition for Women and Children Safety Prediction using Deep Network

Authors

  • Nanda R. Wagh, Research Scholar, Dr. Babasaheb Ambedkar Technological University, Lonere (India)
  • Sanjay R. Sutar, Professor and Head of Department, Dr. Babasaheb Ambedkar Technological University, Lonere (India)

Keywords:

Facial Expression Recognition, deep learning, multimodal, women safety, audio-visual media, fusion

Abstract

Ensuring the safety of women and children is a challenging research problem, and multimodal emotion recognition is itself a demanding task. Multimodal data, which includes audio, video, text, facial expressions, body motion, bio-signals, and physiological data, is one of the most important and widely used research domains in human-computer interaction (HCI), and such data can be used to forecast the safety of women and children. Considerable research has been proposed in this context. To build the best multimodal model for emotion recognition combining the image, text, audio, and video modalities, a novel deep learning model is developed, and a thorough analysis of data-level, feature-level, and model-level fusion is undertaken. Separate feature extractor networks are proposed for the image, text, audio, and video data. An optimal multimodal emotion recognition model is then constructed at the model level by combining information from images, text, speech, and video. The performance of the proposed models is evaluated on three benchmark multimodal datasets: IEMOCAP, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database. The proposed models achieve high prediction accuracies of 96%, 97%, and 97% on the IEMOCAP, SAVEE, and RAVDESS datasets, respectively. The efficacy and optimality of the models are further confirmed by comparing the results with those of existing emotion recognition models. The resulting enhanced multimodal emotion recognition is then applied to women and children safety prediction.
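
The abstract describes separate feature extractor networks per modality whose outputs are combined at the model level, but the network details are not reproduced on this page. The PyTorch sketch below is only a minimal illustration of that model-level fusion pattern, assuming pre-extracted feature vectors for the image, text, audio, and video modalities; the input dimensions, layer sizes, four-class output, and the names ModalityEncoder and ModelLevelFusion are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of model-level fusion for multimodal emotion recognition.
# All dimensions, layer sizes, and class counts below are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Maps one modality's pre-extracted feature vector to a shared embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ModelLevelFusion(nn.Module):
    """Concatenates per-modality embeddings and classifies the fused vector."""
    def __init__(self, in_dims, num_classes: int = 4, emb_dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(d, emb_dim) for d in in_dims)
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim * len(in_dims), 128), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, inputs):
        # 'inputs' is a list with one feature tensor per modality.
        fused = torch.cat([enc(x) for enc, x in zip(self.encoders, inputs)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    # Assumed (hypothetical) feature sizes for image, text, audio, and video.
    in_dims = [512, 300, 40, 1024]
    model = ModelLevelFusion(in_dims)
    batch = [torch.randn(8, d) for d in in_dims]  # dummy batch of 8 samples
    logits = model(batch)
    print(logits.shape)  # torch.Size([8, 4])

In a full pipeline the fused classifier would be trained with a standard cross-entropy loop on labels from IEMOCAP, RAVDESS, or SAVEE; data-level or feature-level fusion, which the paper also analyses, would instead combine the modalities before or inside the encoders.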

 

References

Puri, T., Soni, M., Dhiman, G., Ibrahim Khalaf, O. and Raza Khan, I., 2022. Detection of emotion of speech for RAVDESS audio using hybrid convolution neural network. Journal of Healthcare Engineering, 2022.

Singh, P., Srivastava, R., Rana, K.P.S. and Kumar, V., 2021. A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems, 229, p.107316.

Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S. and Narayanan, S.S., 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42, pp.335-359.

S.E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R.C. Ferrari, et al., Combining modality specific deep neural networks for emotion recognition in video, in: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, 2013, pp. 543–550.

N. Srivastava, R. Salakhutdinov, et al., Multimodal learning with deep Boltzmann machines, in: NIPS, Vol. 1, Citeseer, 2012, p. 2.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Ng, Multimodal deep learning, in: International Conference on Machine Learning (ICML), Bellevue, WA, 2011, pp. 689–696.

Y. Wang, L. Guan, Recognizing human emotional state from audiovisual signals, IEEE Trans. Multimed. 10 (5) (2008) 936–946.

C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan, Analysis of emotion recognition using facial expressions, speech and multimodal information, in: Proceedings of the 6th International Conference on Multimodal Interfaces, 2004, pp. 205–211.

Y. Yoshitomi, S.-I. Kim, T. Kawano, T. Kitazoe, Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face, in: Proceedings 9th IEEE International Workshop on Robot and Human Interactive Communication. IEEE RO-MAN 2000 (Cat. No. 00TH8499), IEEE, 2000, pp. 178–183.

Z. Wang, S.-B. Ho, E. Cambria, A review of emotion sensing: categorization models and algorithms, Multimedia Tools Appl. 79 (47–48) (2020) 35553–35582, http://dx.doi.org/10.1007/s11042-019-08328-z.

A. Ortony, T.J. Turner, What’s basic about basic emotions? Psychol. Rev. 97 (3) (1990) 315.

B.R. Steunebrink, M. Dastani, J.J.C. Meyer, The OCC model revisited, in: Proceedings of the 4th Workshop on Emotion and Computing, 2009.

Y. Li, J. Tao, L. Chao, W. Bao, Y. Liu, CHEAVD: a Chinese natural emotional audio–visual database, J. Ambient Intell. Humaniz. Comput. 8 (6) (2017) 913–924.

O. Martin, I. Kotsia, B. Macq, I. Pitas, The eNTERFACE'05 audio-visual emotion database, in: 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006, p. 8, http://dx.doi.org/10.1109/ICDEW.2006.145.

E. Patterson, S. Gurbuz, Z. Tufekci, J. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 2002, pp. II–2017–II–2020, http://dx.doi.org/10.1109/ICASSP.2002.5745028.

I. Matthews, T.F. Cootes, J.A. Bangham, S. Cox, R. Harvey, Extraction of visual features for lipreading, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2) (2002) 198–213.

A. Dhall, R. Goecke, J. Joshi, M. Wagner, T. Gedeon, Emotion recognition in the wild challenge 2013, in: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, 2013, pp. 509–516.

J. Huang, J. Tao, B. Liu, Z. Lian, M. Niu, Multimodal transformer fusion for continuous emotion recognition, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, http://dx.doi.org/10.1109/icassp40776.2020.9053762.

J.-B. Delbrouck, N. Tits, M. Brousmiche, S. Dupont, A transformer-based joint-encoding for emotion recognition and sentiment analysis, 2020, arXiv preprint arXiv:2006.15955.

Y. Susanto, A.G. Livingstone, B.C. Ng, E. Cambria, The hourglass model revisited, IEEE Intell. Syst. 35 (5) (2020) 96–102.

R. Plutchik, The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice, Am. Sci. 89 (4) (2001) 344–350.

K.R. Bagadi, A comprehensive analysis of multimodal speech emotion recognition, in: Journal of Physics: Conference Series, Vol. 1917, IOP Publishing, 2021, 012009.

M. Abdullah, M. Ahmad, D. Han, Facial expression recognition in videos: An CNN-LSTM based model for video classification, in: 2020 International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2020, pp. 1–3.

D. Krishna, A. Patil, Multimodal emotion recognition using cross-modal attention and 1d convolutional neural networks, in: Interspeech, 2020, pp. 4243–4247.

A. Jaratrotkamjorn, A. Choksuriwong, Bimodal emotion recognition using deep belief network, in: 2019 23rd International Computer Science and Engineering Conference (ICSEC), IEEE, 2019, pp. 103–109.

G. Sahu, Multimodal speech emotion recognition and ambiguity resolution, 2019, arXiv preprint arXiv:1904.06022.

K.P. Rao, M.C.S. Rao, N.H. Chowdary, An integrated approach to emotion recognition and gender classification, J. Vis. Commun. Image Represent. 60 (2019) 339–345.

S. Yoon, S. Byun, K. Jung, Multimodal speech emotion recognition using audio and text, in: 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2018, pp. 112–118.

D. Nguyen, K. Nguyen, S. Sridharan, D. Dean, C. Fookes, Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition, Comput. Vis. Image Underst. 174 (2018) 33–42.

H. Miao, Y. Zhang, W. Li, H. Zhang, D. Wang, S. Feng, Chinese multimodal emotion recognition in deep and traditional machine learning approaches, in: 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), IEEE, 2018, pp. 1–6.

F. Xu, Z. Wang, Emotion recognition research based on integration of facial expression and voice, in: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), IEEE, 2018, pp. 1–6.

J. Yan, W. Zheng, Q. Xu, G. Lu, H. Li, B. Wang, Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech, IEEE Trans. Multimed. 18 (7) (2016) 1319–1329.

Y. Soon, S.N. Koh, C.K. Yeo, Noisy speech enhancement using discrete cosine transform, Speech Commun. 24 (3) (1998) 249–257.

D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, L.-H. Cai, Music type classification by spectral contrast feature, in: Proceedings. IEEE International Conference on Multimedia and Expo, Vol. 1, IEEE, 2002, pp. 113–116.

Y. Li, J. Yang, J. Wen, Entropy-based redundancy analysis and information screening, Digit. Commun. Netw. (2021) http://dx.doi.org/10.1016/j.dcan.2021.12.001.

I. Kandel, M. Castelli, The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset, ICT Express 6 (4) (2020) 312–315.

E. Ghaleb, M. Popa, S. Asteriadis, Multimodal and temporal perception of audio-visual cues for emotion recognition, in: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2019, pp. 552–558.

Y. Zeng, H. Mao, D. Peng, Z. Yi, Spectrogram based multi-task audio classification, Multimedia Tools Appl. 78 (3) (2019) 3705–3722.

Z. Fu, F. Liu, H. Wang, J. Qi, X. Fu, A. Zhou, Z. Li, A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition, 2021, arXiv:2111.02172.

F. Rahdari, E. Rashedi, M. Eftekhari, A multimodal emotion recognition system using facial landmark analysis, Iran. J. Sci. Technol. Trans. Electr. Eng. 43 (1) (2019) 171–189.

L. Chen, K. Wang, M. Wu, W. Pedrycz, K. Hirota, K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition, IFAC-PapersOnLine 53 (2) (2020) 10250–10254.

K. Zhang, Y. Li, J. Wang, E. Cambria, X. Li, Real-time video emotion recognition based on reinforcement learning and domain knowledge, IEEE Trans. Circuits Syst. Video Technol. (2021) 1, http://dx.doi.org/10.1109/tcsvt.2021.3072412.

K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.

C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.

OpenCV, Open Source Computer Vision Library, 2015.

Priya Metri, Jayshree Ghorpade, Ayesha Butalia, "Facial Emotion Recognition Using Context Based Multimodal Approach", International Journal of Artificial Intelligence and Interactive Multimedia, Vol. 1, No. 4.

Nanda R. Wagh, Sanjay R. Sutar, Abhay E. Wagh, "A Survey on the Recent Advances in the Development of IoT-based Devices for Women Safety", in: Intelligent Computing in Information Technology for Engineering System, Proceedings of the International Conference on Intelligent Computing in Information Technology for Engineering System (ICICITES-2021), 25–26 June 2021, https://www.routledge.com/Intelligent-Computing-in-Information-Technology-for-Engineering-System/Karande-Deshmukh-Mahalle/p/book/9781032270807.

Nanda R. Wagh, Sanjay R. Sutar, "An enhanced security of women and children using machine learning and data mining techniques", in: Data Mining and Machine Learning Applications, pp. 423–446, 28 January 2022, https://doi.org/10.1002/9781119792529.ch16.

Nanda R. Wagh, Sanjay R. Sutar, "A Smart Security Solution for Women's and Children Using Wearable IoT Systems", preprint, SSRN, ISSN: 1556-5068.

Agyei, I. T. (2021). Simulating HRM Technology Operations in Contemporary Retailing. International Journal of New Practices in Management and Engineering, 10(02), 10–14. https://doi.org/10.17762/ijnpme.v10i02.132

Anupong, W., Azhagumurugan, R., Sahay, K. B., Dhabliya, D., Kumar, R., & Vijendra Babu, D. (2022). Towards a high precision in AMI-based smart meters and new technologies in the smart grid. Sustainable Computing: Informatics and Systems, 35 doi:10.1016/j.suscom.2022.100690

Published

16.08.2023

How to Cite

Wagh, N. R., & Sutar, S. R. (2023). Enhanced Emotion Recognition for Women and Children Safety Prediction using Deep Network. International Journal of Intelligent Systems and Applications in Engineering, 11(10s), 500–515. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3305

Issue

Vol. 11 No. 10s (2023)

Section

Research Article