Research Article | Open Access
Volume 13 | Issue 5 | Year 2026 | Article Id. IJECE-V13I5P124 | DOI : https://doi.org/10.14445/23488549/IJECE-V13I5P124

Deep Convolutional Framework for Speaker Identification from Emotional Speech Using Multi-Domain Acoustic Feature Fusion


Rupali Khaklary, Nabankur Pathak

Received: 20 Feb 2026 | Revised: 20 Mar 2026 | Accepted: 22 Apr 2026 | Published: 27 May 2026

Citation :

Rupali Khaklary, Nabankur Pathak, "Deep Convolutional Framework for Speaker Identification from Emotional Speech Using Multi-Domain Acoustic Feature Fusion," International Journal of Electronics and Communication Engineering, vol. 13, no. 5, pp. 301-311, 2026. Crossref, https://doi.org/10.14445/23488549/IJECE-V13I5P124

Abstract

Speaker identification from emotional speech is challenging because emotional states alter acoustic characteristics and can mask speaker-specific information. This paper introduces a deep convolutional network for speaker identification using multi-domain acoustic features extracted from a custom Bodo dataset. Complementary time–frequency representations, including Mel-Frequency Cepstral Coefficients (MFCC), Log-Mel spectrograms, and chroma features, are combined to capture spectral, perceptual, and harmonic characteristics of speech signals and thereby enhance speaker discrimination under emotional variability. A 2-dimensional Convolutional Neural Network learns hierarchical feature representations for multi-class speaker classification. A baseline system using MFCC features alone is built first, and the effectiveness of feature fusion is then measured across alternative feature combinations. Data augmentation methods, such as additive white noise and time stretching, are applied to the training data to improve the generalization and robustness of the models. Experimental findings show that the multi-feature fusion approach performs best when MFCC and Log-Mel spectrogram representations are fused, achieving an identification accuracy of 94.53%. The results suggest that incorporating complementary acoustic features greatly enhances speaker separability and that the convolutional architecture remains computationally efficient.
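
As an illustration only, two of the steps named in the abstract, white-noise augmentation and feature fusion, can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names `add_white_noise` and `fuse_features`, the 20 dB signal-to-noise ratio, and the 13-row and 40-row feature shapes are all assumptions chosen for the example. Fusion is assumed here to mean stacking feature maps along the feature axis; the abstract does not state the fusion rule, the noise level, or the feature dimensions.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with Gaussian white noise added.

    The noise variance is chosen so the result has roughly `snr_db` dB
    signal-to-noise ratio (hypothetical helper, not from the paper).
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def fuse_features(*feature_maps):
    """Stack feature maps along the feature axis.

    Each map is a list of rows, one row per coefficient, and every row
    holds one value per frame. All maps must share the frame count.
    Stacking is an assumed fusion rule for this sketch.
    """
    n_frames = len(feature_maps[0][0])
    for fmap in feature_maps:
        for row in fmap:
            if len(row) != n_frames:
                raise ValueError("all feature maps must share the same number of frames")
    return [row for fmap in feature_maps for row in fmap]


# Toy usage: a 5 Hz sine sampled at 1 kHz, augmented at an assumed 20 dB SNR.
clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
noisy = add_white_noise(clean, snr_db=20)

# Placeholder feature maps standing in for MFCC (13 rows) and Log-Mel (40 rows)
# over 5 frames; real values would come from a feature extractor.
mfcc = [[0.0] * 5 for _ in range(13)]
log_mel = [[0.0] * 5 for _ in range(40)]
fused = fuse_features(mfcc, log_mel)  # 53 rows, each with 5 frames
```

The fused map would then be fed to a 2D CNN as a single-channel input. Time stretching is omitted from the sketch because a pitch-preserving stretch needs a phase vocoder or similar routine, which is normally taken from an audio library.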

Keywords

Convolutional Neural Network, Deep Learning, Feature Fusion, Spectral Features, Speaker Identification.
