Enhancing Multimodal Sentiment Prediction with Cross-Modal Attention and Adaptive Feature Weighting

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12 Issue 5
Year of Publication: 2025
Authors: Prashant Adakane, Amit Gaikwad
How to Cite?

Prashant Adakane, Amit Gaikwad, "Enhancing Multimodal Sentiment Prediction with Cross-Modal Attention and Adaptive Feature Weighting," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 5, pp. 363-378, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I5P130

Abstract:

This study introduces Contextual Adaptive Cross-Modal Attention Fusion (CA-CMAF), a novel framework designed to address the challenges of multimodal sentiment analysis. The framework leverages dynamic modality fusion and cross-modal attention mechanisms to effectively integrate textual and visual data, enabling a more nuanced understanding of sentiment in heterogeneous datasets. By focusing on the interplay between modalities, CA-CMAF aims to improve both the accuracy and the interpretability of sentiment prediction. The proposed approach combines textual features extracted with a pre-trained and fine-tuned BERT model and visual features derived from VGG-16. A cross-modal attention mechanism aligns and fuses these modalities, capturing fine-grained interactions between text and images; the attention scores prioritize the most relevant aspects of each modality, conditioned on the context provided by the other. Additionally, an adaptive learning mechanism dynamically adjusts the contribution of each modality, ensuring optimal fusion for sentiment classification. The model is optimized with a blended loss: a cross-entropy term for sentiment classification and a regularization term that encourages balanced modality contributions. Experimental results on the MVSA-Single and MVSA-Multiple benchmark datasets show that CA-CMAF achieves state-of-the-art performance, surpassing current baseline methods. The framework delivers notable gains in accuracy, F1-score, precision, and recall, reaching 91%, 89%, 90%, and 90%, respectively, particularly in scenarios where one modality is more informative than the other.
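As a minimal illustrative sketch (not the authors' published implementation), the following PyTorch snippet shows how such a pipeline could be wired together: cross-modal attention between BERT-style text features and VGG-16-style image features, a learned gate for adaptive modality weighting, and a blended loss that combines cross-entropy with a balance regularizer. All dimensions, layer choices, and the exact form of the regularizer are assumptions made for illustration.

# Sketch of CA-CMAF-style fusion: cross-modal attention + adaptive modality weights
# + blended loss. Feature dimensions (768 for BERT, 4096 for VGG-16 fc features)
# and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=4096, hidden_dim=512, num_classes=3):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Text attends to image features and vice versa.
        self.text_to_image = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Adaptive weighting: a small gate predicts each modality's contribution.
        self.gate = nn.Linear(2 * hidden_dim, 2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, text_dim) token features; image_feats: (B, R, image_dim) region features.
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)
        # Cross-modal attention in both directions.
        t_attended, _ = self.text_to_image(query=t, key=v, value=v)
        v_attended, _ = self.image_to_text(query=v, key=t, value=t)
        # Pool each modality to a single vector.
        t_vec = t_attended.mean(dim=1)
        v_vec = v_attended.mean(dim=1)
        # Adaptive weighting: softmax over two scalar gates, then a weighted sum.
        weights = F.softmax(self.gate(torch.cat([t_vec, v_vec], dim=-1)), dim=-1)  # (B, 2)
        fused = weights[:, 0:1] * t_vec + weights[:, 1:2] * v_vec
        return self.classifier(fused), weights

def blended_loss(logits, labels, modality_weights, lam=0.1):
    # Cross-entropy for sentiment classification plus a regularizer that discourages
    # one modality from dominating (here: squared deviation from equal weights).
    ce = F.cross_entropy(logits, labels)
    balance = ((modality_weights - 0.5) ** 2).sum(dim=-1).mean()
    return ce + lam * balance

# Example usage with random tensors standing in for BERT / VGG-16 outputs:
# model = CrossModalAttentionFusion()
# logits, w = model(torch.randn(4, 32, 768), torch.randn(4, 49, 4096))
# loss = blended_loss(logits, torch.randint(0, 3, (4,)), w)

The gating design is one simple way to realize "adaptive feature weighting": because the gate is computed per sample, the fusion can lean on whichever modality is more informative for that input, which is the scenario the abstract highlights.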

Keywords:

Cross-Modal Attention, Adaptive Learning, BERT, VGG-16, Dynamic Modality Integration.
