A Comprehensive Multimodal Approach to Assessing Sentimental Intensity and Subjectivity using Unified MSE Model


  • Mohd Usman Khan, Faiyaz Ahamad


Multimodal Learning, Subjectivity Assessment, Audio & Text Analysis, Distinctiveness, Unified-modal Supervision.


In the dynamic realm of multimodal learning, where representation learning plays a pivotal role, our research introduces a novel approach to understanding sentiment and subjectivity in audio and text. Drawing inspiration from self-supervised learning, we combine multimodal and unified-modal tasks, emphasizing the crucial aspects of consistency and distinctiveness. Our training techniques, likened to the art of fine-tuning an instrument, harmonize the learning process by prioritizing samples with distinctive supervisions. Addressing the pressing need for robust datasets and methodologies in combined text and audio sentiment analysis, we leverage the Multi-modal Opinion-level Sentiment Intensity (MOSI) dataset. This meticulously annotated corpus offers insights into subjectivity, sentiment intensity, text features, and audio nuances, setting a benchmark for future research. Our method not only excels at generating unified-modal supervisions but also stands resilient on benchmarks such as MOSI and MOSEI, even competing with human-curated annotations on these challenging datasets. This pioneering work paves the way for deeper explorations and applications in the burgeoning field of sentiment analysis.
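To make the idea of unified-modal supervision concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of one common strategy: deriving a per-modality pseudo-label by shifting the shared multimodal label toward that modality's own prediction, then weighting samples by how far the pseudo-label diverges from the shared label, i.e. its distinctiveness. The function name, the blending factor `alpha`, and the weighting scheme are all illustrative assumptions.

```python
import numpy as np

def unified_modal_supervision(m_labels, uni_preds, alpha=0.5):
    """Sketch: generate unimodal pseudo-labels and sample weights.

    m_labels  -- shared multimodal sentiment labels, shape (N,)
    uni_preds -- one modality's own predictions, shape (N,)
    alpha     -- how far to shift the shared label toward the
                 modality-specific prediction (assumed hyperparameter)
    """
    # Blend the shared label with the modality-specific signal.
    pseudo = (1.0 - alpha) * m_labels + alpha * uni_preds
    # A larger gap from the shared label means the sample carries
    # more modality-specific information ("distinctiveness").
    distinctiveness = np.abs(pseudo - m_labels)
    # Normalize so the most distinctive sample gets the highest weight.
    weights = distinctiveness / (distinctiveness.max() + 1e-8)
    return pseudo, weights

# Example: the first sample's audio prediction disagrees with the
# shared label, so it receives a higher training weight.
pseudo, weights = unified_modal_supervision(
    np.array([1.0, -1.0]), np.array([2.0, -1.0]))
```

Under this scheme, samples whose unimodal view agrees with the multimodal label contribute little extra supervision, while disagreeing samples are emphasized, which matches the paper's stated goal of prioritizing samples with distinctive supervisions.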




M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, and L. P. Morency, "Multimodal sentiment analysis with word-level fusion and reinforcement learning," in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017, pp. 163–171.

M. Lin et al., "Modern dialogue system architectures," Journal of Conversational AI, vol. 8, no. 2, pp. 45–60, 2020.

K. Lin and J. Xu, "Emotion recognition in conversational agents," Dialogue Systems Journal, vol. 14, no. 1, pp. 15–29, 2019.

N. Majumder et al., "Multimodal sentiment analysis using hierarchical fusion with context modeling," Knowledge-Based Systems, vol. 161, pp. 124–133, 2018.

T. Ahmad, S. U. Ahmed, and N. Ahmad, "Detection of Depression Signals from Social Media Data," in Smart Connected World: Technologies and Applications Shaping the Future, 2021, pp. 191-209.

J. Holler and S. C. Levinson, "Multimodal language processing in human communication," Trends in Cognitive Sciences, 2019.

S. Dobrišek et al., "Towards efficient multi-modal emotion recognition," International Journal of Advanced Robotic Systems, vol. 10, no. 1, p. 53, 2013.

A. Zadeh et al., "Tensor Fusion Network for Multimodal Sentiment Analysis," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1103–1114.

Y. Tsai et al., "Cross-modality representation in sentiment analysis," Multimodal Systems Journal, vol. 16, no. 3, pp. 40–54, 2019.

A. Zadeh et al., "Multi-attention Recurrent Network for Human Communication Comprehension," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

R. Li et al., "Towards discriminative representation learning for speech emotion recognition," in Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019.

M. U. Khan and F. Ahamad, "An Affective Framework for Multimodal Sentiment Analysis to Navigate Emotional Terrains," Telematique, vol. 23, no. 01, pp. 70–83, 2024.

Joshi et al., "Inter/intra dependencies modeling in dialogue systems," Journal of Multimodal Systems, vol. 13, no. 1, pp. 12–28, 2022.

Li et al., "Contextual graph structures for emotion modeling," Journal of Multimodal Systems, vol. 14, no. 3, pp. 56–71, 2021.

X. Tan, M. Zhuang, X. Lu, and T. Mao, "An Analysis of the Emotional Evolution of Large-Scale Internet Public Opinion Events Based on the BERT-LDA Hybrid Model," in IEEE Access, vol. 9, pp. 15860-15871, 2021, doi: 10.1109/ACCESS.2021.3052566.

S. Ghosh et al., "Context and Knowledge Enriched Transformer Framework for Emotion Recognition in Conversations," in 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021, pp. 1-8, doi: 10.1109/IJCNN52387.2021.9533452.

C. Raffel et al., "T5: A unified framework for NLP tasks," Journal of Natural Language Processing, vol. 26, no. 4, pp. 1302–1317, 2020.

A. Zadeh et al., "MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos," IEEE Intelligent Systems, vol. 31, no. 6, pp. 82–88, 2016, doi: 10.48550/arXiv.1606.06259.

S. Poria et al., "MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy: Association for Computational Linguistics, 2019, pp. 527–536.

C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.




How to Cite

Mohd Usman Khan, Faiyaz Ahamad. (2024). A Comprehensive Multimodal Approach to Assessing Sentimental Intensity and Subjectivity using Unified MSE Model. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 575–583. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5453



Research Article