Enhancing Speech Emotion Recognition with Multi-Modal Hybrid Features and CNN

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12 Issue 7
Year of Publication: 2025
Authors: Rashmi Rani, Manoj Kumar Ramaiya
How to Cite?

Rashmi Rani, Manoj Kumar Ramaiya, "Enhancing Speech Emotion Recognition with Multi-Modal Hybrid Features and CNN," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 7, pp. 35-46, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I7P104

Abstract:

Speech emotion recognition is pivotal in human-computer interaction, enabling machines to understand and respond to human emotions. While traditional methods often rely on low-level feature extraction, this paper proposes a language-independent, deep learning-based framework for accurate speech emotion classification. By combining the strengths of hybrid feature extraction techniques and 3D Convolutional Neural Networks (CNNs), the proposed framework effectively captures both local and global information from speech signals. The model is evaluated on the RAVDESS, CREMA-D, SAVEE, and TESS datasets, achieving an accuracy of 98.48%. To validate the proposed work, the model is also compared against LSTM and Bi-LSTM baselines. Experimental results demonstrate the superiority of the proposed framework over state-of-the-art methods, paving the way for more sophisticated and empathetic human-computer interactions.
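
The keywords below point at frame-level descriptors (ZCR, RMSE, MFCC) as the hybrid feature set. As a rough illustration of such a front end, the sketch below stacks the three descriptors into a single feature matrix using librosa; the frame length, hop length, and number of MFCCs are illustrative assumptions, not the authors' reported configuration.

```python
# Hypothetical sketch of a ZCR + RMSE + MFCC hybrid feature extractor.
# Frame/hop sizes and n_mfcc are illustrative, not the paper's settings.
import numpy as np
import librosa

def hybrid_features(path, sr=16000, n_mfcc=13,
                    frame_length=2048, hop_length=512):
    """Return a (1 + 1 + n_mfcc, n_frames) matrix of frame-level features."""
    y, sr = librosa.load(path, sr=sr)

    # Zero-crossing rate: coarse voicing/noisiness cue, shape (1, T).
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=frame_length, hop_length=hop_length)

    # Root-mean-square energy: loudness contour, shape (1, T).
    rmse = librosa.feature.rms(
        y=y, frame_length=frame_length, hop_length=hop_length)

    # MFCCs: spectral-envelope descriptors, shape (n_mfcc, T).
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)

    # Stack all descriptors along the feature axis into one hybrid matrix.
    return np.vstack([zcr, rmse, mfcc]).astype(np.float32)
```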

Keywords:

Emotion recognition, Speech, Convolutional Neural Networks, ZCR, RMSE, MFCC, LSTM, Bi-LSTM.
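
For the classification stage, a minimal 3D CNN sketch in Keras is given below, assuming utterances are pre-segmented into fixed-size feature volumes; the input shape, layer counts, and filter sizes are assumptions for illustration, since the paper's exact architecture is not reproduced on this page.

```python
# Hypothetical 3D CNN classifier sketch (Keras); layer sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # e.g., the eight RAVDESS emotion labels

def build_3d_cnn(input_shape=(16, 15, 32, 1)):
    """Input: an assumed (segments, features, frames, 1) volume per utterance."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Local patterns: small 3D kernels over segment/feature/frame axes.
        layers.Conv3D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.Conv3D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        # Global aggregation: flatten pooled maps into one utterance vector.
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Stacking 3D convolution and pooling this way lets early layers capture local spectro-temporal patterns while the dense head aggregates them globally, which matches the local/global framing in the abstract.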
