Advancing Audio Processing and Emotion Recognition through Deep Learning Techniques

International Journal of Electrical and Electronics Engineering
© 2025 by SSRG - IJEEE Journal
Volume 12 Issue 4
Year of Publication: 2025
Authors: N. Venkata Sailaja, CH V K N S N Moorthy, Chiranjiva Rao Atluri, Gaddam Shiva Kumar Reddy, Gorugantula V S J Karthik, C S N V Ram Sree Santhosh
How to Cite?

N. Venkata Sailaja, CH V K N S N Moorthy, Chiranjiva Rao Atluri, Gaddam Shiva Kumar Reddy, Gorugantula V S J Karthik, C S N V Ram Sree Santhosh, "Advancing Audio Processing and Emotion Recognition through Deep Learning Techniques," SSRG International Journal of Electrical and Electronics Engineering, vol. 12, no. 4, pp. 298-307, 2025. Crossref, https://doi.org/10.14445/23488379/IJEEE-V12I4P124

Abstract:

Speaker diarization is essential in audio processing: it distinguishes speech segments and attributes them to individual speakers in applications such as speech recognition, customer service analytics, social media monitoring, and broadcast transcription. Despite recent advances, challenges such as overlapping speech, varying acoustic environments, and background noise persist. This work proposes a deep learning-based approach that uses a pre-trained transformer model to handle multiple tasks, including feature extraction, Voice Activity Detection (VAD), segmentation, and speaker change detection. The system employs the ECAPA-TDNN model for speaker embeddings, followed by Agglomerative Hierarchical Clustering (AHC) for speaker labeling. It also performs sentiment analysis by converting segmented speech into text and applying a lexicon-based model. The implementation supports real-time audio recording, speaker segmentation, sentiment analysis, and visualization. Tested on multi-speaker recordings, it achieved 93.4% transcription accuracy while reducing computational cost, making it feasible on standard hardware compared with traditional resource-intensive methods. By integrating speaker diarization and sentiment analysis within a unified framework, this research contributes to real-time, speaker-aware applications across various domains.
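To make the pipeline concrete, the following Python sketch strings the stages described above together. It is a minimal illustration under stated assumptions, not the authors' implementation: it loads the public speechbrain/spkrec-ecapa-voxceleb checkpoint for ECAPA-TDNN embeddings, substitutes uniform two-second windows for the paper's VAD-driven segmentation, clusters with SciPy's agglomerative hierarchical routines under an arbitrary cosine-distance threshold, and uses NLTK's VADER as the lexicon-based sentiment model (the model cited in reference [24]); the input file name and per-segment transcripts are placeholders.

import torch
import torchaudio
from scipy.cluster.hierarchy import linkage, fcluster
from speechbrain.pretrained import EncoderClassifier
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

# Load the recording and convert to the 16 kHz mono input the encoder expects.
signal, sr = torchaudio.load("meeting.wav")  # hypothetical multi-speaker recording
if sr != 16000:
    signal = torchaudio.transforms.Resample(sr, 16000)(signal)
signal = signal.mean(dim=0)

# Uniform two-second windows stand in for VAD-driven segment boundaries.
win = 2 * 16000
segments = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]

# One ECAPA-TDNN speaker embedding per segment.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
with torch.no_grad():
    embeddings = torch.cat(
        [encoder.encode_batch(seg.unsqueeze(0)).squeeze(0) for seg in segments]
    ).numpy()

# Agglomerative hierarchical clustering on cosine distance assigns speaker labels;
# the 0.7 stopping threshold is an assumed tuning value, not taken from the paper.
tree = linkage(embeddings, method="average", metric="cosine")
speaker_labels = fcluster(tree, t=0.7, criterion="distance")

# Lexicon-based sentiment on each segment's transcript (transcripts would come
# from the speech-to-text stage; a placeholder stands in here).
sia = SentimentIntensityAnalyzer()
for idx, label in enumerate(speaker_labels):
    transcript = "..."  # output of the ASR stage for this segment
    print(f"segment {idx}: speaker {label}, sentiment={sia.polarity_scores(transcript)}")

In a deployed system, the clustering threshold would be tuned on held-out data, or the number of clusters fixed in advance when the speaker count is known.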

Keywords:

Speaker Diarization, Deep Learning, Audio Signal Processing, Sentiment Analysis, Speech-to-Text, Transformer Models, ECAPA-TDNN, Agglomerative Hierarchical Clustering, Real-Time Speech Processing.

References:

[1] Alec Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision,” Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, vol. 202, 2023.
[Google Scholar] [Publisher Link]
[2] Ouassila Kenai et al., “A New Architecture Based VAD for Speaker Diarization/Detection Systems,” Journal of Signal Processing Systems, vol. 91, no. 11, pp. 827-840, 2019.
[CrossRef] [Google Scholar] [Publisher Link] 
[3] Federico Landini et al., “Bayesian HMM Clustering of X-Vector Sequences (VBx) in Speaker Diarization: Theory, Implementation and Analysis on Standard Tasks,” Computer Speech & Language, vol. 71, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[4] Liang He et al., “Latent Class Model with Application to Speaker Diarization,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, no. 1, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[5] Daniel Garcia-Romero et al., “Speaker Diarization Using Deep Neural Network Embeddings,” IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 2017.
[CrossRef] [Google Scholar] [Publisher Link]
[6] Zili Huang et al., “Speaker Diarization with Region Proposal Network,” IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[7] Yusuke Fujita et al., “End-To-End Neural Speaker Diarization with Permutation-Free Objectives,” arXiv, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Prachi Singh, and Sriram Ganapathy, “Deep Self-Supervised Hierarchical Clustering for Speaker Diarization,” arXiv, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[9] Amitrajit Sarkar et al., “Says Who? Deep Learning Models for Joint Speech Recognition, Segmentation and Diarization,” IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, pp. 5229-5233, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[10] Laurent El Shafey, Hagen Soltau, and Izhak Shafran, “Joint Speech Recognition and Speaker Diarization Via Sequence Transduction,” arXiv, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[11] Ismail Shahin et al., “Novel Hybrid DNN Approaches for Speaker Verification in Emotional and Stressful Talking Environments,” Neural Computing and Applications, vol. 33, no. 23, pp. 16033-16055, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[12] Farshad Teimoori, and Farbod Razzazi, “Unsupervised Help-Trained LS-SVR-Based Segmentation in Speaker Diarization System,” Multimedia Tools and Applications, vol. 78, no. 9, pp. 11743-11777, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[13] Marie Kunesova et al., “Detection of Overlapping Speech for the Purposes of Speaker Diarization,” Speech and Computer: 21st International Conference, Istanbul, Turkey, pp. 247-257, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[14] Marek Hrúz, and Zbyněk Zajíc, “Convolutional Neural Network for Speaker Change Detection in Telephone Speaker Diarization System,” IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, pp. 4945-4949, 2017.
[CrossRef] [Google Scholar] [Publisher Link]
[15] Federico Landini et al., “BUT System Description for DIHARD Speech Diarization Challenge 2019,” arXiv, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[16] Lei Sun et al., “Speaker Diarization with Enhancing Speech for the First DIHARD Challenge,” Proceedings of Interspeech, Hyderabad, India, pp. 2793-2797, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[17] David Snyder et al., “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, pp. 5329-5333, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[18] India Massana et al., “LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling,” Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 2834-2838, 2017.
[Google Scholar] [Publisher Link]
[19] Neeraj Sajjan et al., “Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings,” IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, pp. 5249-5253, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[20] Mourad Jbene et al., “User Sentiment Analysis in Conversational Systems Based on Augmentation and Attention-Based BiLSTM,” Procedia Computer Science, vol. 207, pp. 4106-4112, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[21] María Teresa García-Ordás et al., “Sentiment Analysis in Non-Fixed Length Audios Using a Fully Convolutional Neural Network,” Biomedical Signal Processing and Control, vol. 69, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[22] Manas Jain et al., “Speech Emotion Recognition Using a Support Vector Machine,” arXiv, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[23] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” Proceedings of Interspeech, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[24] C.J. Hutto, and Eric Gilbert, “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text,” Proceedings of the International AAAI Conference on Web and Social Media, vol. 8, no. 1, pp. 216-225, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[25] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” arXiv, 2017.
[CrossRef] [Google Scholar] [Publisher Link]
[26] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep Speaker Recognition,” arXiv, 2018.
[CrossRef] [Google Scholar] [Publisher Link]