VersaModal: A Unified Multi-Modal Sentiment Analysis System for Text, Image, and Video

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12, Issue 6
Year of Publication: 2025
Authors: Siddhi Kadu, Bharti Joshi, Pratik Agrawal
How to Cite?
Siddhi Kadu, Bharti Joshi, Pratik Agrawal, "VersaModal: A Unified Multi-Modal Sentiment Analysis System for Text, Image, and Video," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 6, pp. 264-278, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I6P121
Abstract:
With the growing use of the internet and smartphones, users increasingly express their opinions and experiences in the form of reviews. These user-generated reviews appear on platforms such as Amazon, TripAdvisor, and Yelp in several formats, including Text (T), Image (I), and Video (V), and together they offer a comprehensive view of customer experiences. Sentiment analysis is important for understanding consumer opinion; however, traditional methodologies focus primarily on unimodal systems that handle only one of T, I, or V. In real-world scenarios, users post images or videos along with text to review products, restaurants, hotels, services, travel experiences, or movies. Analyzing and fusing these modalities therefore becomes essential for a more accurate and contextually aware interpretation of sentiment. Recognizing the significance of these multi-modal review formats, this paper proposes VersaModal, a unified multi-modal sentiment analysis model that processes inputs in either unimodal (T, I, V) form or multi-modal (T+I, T+V, I+V, or T+I+V) form. The model employs a late fusion strategy that integrates the outputs of the individual unimodal classifiers to generate a unified sentiment score, ensuring a complete analysis of consumer sentiment. The experimental results verify that the model significantly improves sentiment prediction accuracy by combining T, I, and V in a single analysis, making it well suited to applications in marketing, customer service, and product development.
Keywords:
Images, Late fusion, Multi-modal, Sentiment analysis, Text, Unimodal, Videos.
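
To make the abstract's late-fusion strategy concrete, the following is a minimal sketch of weighted-average late fusion over unimodal classifier outputs. It assumes each unimodal classifier emits a probability vector over sentiment classes; all names, signatures, and the averaging rule here are illustrative assumptions, not VersaModal's actual implementation, whose combination rule may differ.

# Minimal late-fusion sketch: each unimodal classifier returns class
# probabilities over (negative, neutral, positive); the fusion step
# averages whichever modalities are present in the review.
# All names here are illustrative, not VersaModal's actual API.
from typing import Optional, Sequence, Tuple
import numpy as np

CLASSES = ("negative", "neutral", "positive")

def late_fusion(
    text_probs: Optional[Sequence[float]] = None,
    image_probs: Optional[Sequence[float]] = None,
    video_probs: Optional[Sequence[float]] = None,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Tuple[str, np.ndarray]:
    """Fuse unimodal probability vectors with a weighted average.

    Any subset of T/I/V may be supplied, so the same routine covers
    unimodal (T, I, V) and multi-modal (T+I, ..., T+I+V) inputs.
    """
    probs, wts = [], []
    for p, w in zip((text_probs, image_probs, video_probs), weights):
        if p is not None:  # skip modalities absent from this review
            probs.append(np.asarray(p, dtype=float))
            wts.append(w)
    if not probs:
        raise ValueError("at least one modality is required")
    fused = np.average(np.stack(probs), axis=0, weights=wts)
    return CLASSES[int(np.argmax(fused))], fused

# Example: a text+image review (T+I) with no video.
label, score = late_fusion(
    text_probs=[0.10, 0.20, 0.70],   # from the text classifier
    image_probs=[0.05, 0.35, 0.60],  # from the image classifier
)
print(label, score)  # -> positive [0.075 0.275 0.65]

A uniform weighted average is only one simple late-fusion rule; because fusion happens at the decision level, missing modalities can be skipped without retraining any classifier, which is what lets a single model serve all seven input combinations.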