AER-Net: A Deep Attention-Residual Model for Imbalanced Speech Emotion Recognition

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12 Issue 6
Year of Publication : 2025
Authors : Saurabh Singh, Rajesh Singh, Chandradeep Bhatt, Dev Baloni, Anita Gehlot
How to Cite:

Saurabh Singh, Rajesh Singh, Chandradeep Bhatt, Dev Baloni, Anita Gehlot, "AER-Net: A Deep Attention-Residual Model for Imbalanced Speech Emotion Recognition," SSRG International Journal of Electronics and Communication Engineering, vol. 12,  no. 6, pp. 340-351, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I6P127

Abstract:

Effective deployment of Speech Emotion Recognition (SER) for affective computing requires solutions to several persistent problems, including class imbalance, insufficiently deep architectures, and multi-level acoustic feature learning. This research presents the Attention-Enhanced Residual Network (AER-Net), a deep learning architecture designed specifically to address these issues. AER-Net combines residual CNN blocks with Bidirectional LSTMs and a custom attention mechanism to process normalized Mel spectrograms and extract temporal and spectral emotional cues. The model employs adaptive pooling, regularization mechanisms, and a class-weighted loss strategy to mitigate class imbalance. AER-Net outperforms traditional machine learning baselines on both the CREMA-D and Dataverse speech emotion evaluations, achieving 87.7% across all metrics, including accuracy, F1-score, recall, and precision. It is particularly effective at detecting minority emotion classes such as "disgust" and "fear", which traditional models typically misclassify. Through the integrated attention mechanism, the system automatically focuses on the speech segments that carry the most emotional information. The scalable and generalizable AER-Net advances emotional intelligence in AI systems through its applicability to real-world SER platforms. The research demonstrates that stronger feature extraction methods, combined with robust architectural techniques, are essential for future emotion recognition systems.
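
To make the architectural description above concrete, the following is a minimal PyTorch sketch of an AER-Net-style model: residual convolutional blocks over a normalized Mel spectrogram, a Bidirectional LSTM over the resulting frame sequence, an attention-weighted temporal summary, and a class-weighted cross-entropy loss. It is illustrative only; the layer sizes, dropout rate, attention form, and class counts are assumptions rather than values published by the authors, and the names ResidualBlock and AERNetSketch are hypothetical.

    # Minimal sketch of an AER-Net-style model (assumed hyperparameters).
    # Input: normalized Mel spectrograms shaped (batch, 1, n_mels, time).
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Two conv layers with a skip connection over spectrogram features."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # residual (skip) connection

    class AERNetSketch(nn.Module):
        def __init__(self, n_classes=6, hidden=128):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32), nn.ReLU())
            self.res = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
            self.pool = nn.AdaptiveAvgPool2d((8, None))  # adaptive pooling over the Mel axis
            self.bilstm = nn.LSTM(input_size=32 * 8, hidden_size=hidden,
                                  batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)  # scores each frame's emotional salience
            self.dropout = nn.Dropout(0.3)        # regularization (assumed rate)
            self.fc = nn.Linear(2 * hidden, n_classes)

        def forward(self, x):                                # x: (B, 1, n_mels, T)
            h = self.pool(self.res(self.stem(x)))            # (B, 32, 8, T)
            B, C, F, T = h.shape
            h = h.permute(0, 3, 1, 2).reshape(B, T, C * F)   # frame sequence for the BiLSTM
            h, _ = self.bilstm(h)                            # (B, T, 2*hidden)
            w = torch.softmax(self.attn(h), dim=1)           # attention over time steps
            ctx = (w * h).sum(dim=1)                         # attention-weighted summary
            return self.fc(self.dropout(ctx))

    # Class-weighted cross-entropy to counter class imbalance; weights are taken
    # as inverse class frequency, and the per-class counts here are placeholders.
    counts = torch.tensor([1271., 1271., 1271., 1087., 1271., 1271.])
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)

    logits = AERNetSketch()(torch.randn(4, 1, 64, 200))  # shape check: (4, 6)

The sketch only illustrates the flow of information (residual CNN features, BiLSTM, attention pooling, weighted loss); it is not the authors' reference implementation.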

Keywords:

Acoustic feature extraction, Attention mechanism, Emotion recognition, Imbalanced data handling, Speech Emotion Recognition (SER), Residual networks.
