Speaker Diarization: A Comprehensive Survey of Clustering Methods, Neural Networks and Hybrid Frameworks for Robust Speaker Segmentation and Identification

International Journal of Electronics and Communication Engineering
© 2026 by SSRG - IJECE Journal
Volume 13 Issue 1
Year of Publication: 2026
Authors: Merlin Revathy S, Kumar S S
How to Cite?

Merlin Revathy S, Kumar S S, "Speaker Diarization: A Comprehensive Survey of Clustering Methods, Neural Networks and Hybrid Frameworks for Robust Speaker Segmentation and Identification," SSRG International Journal of Electronics and Communication Engineering, vol. 13, no. 1, pp. 207-240, 2026. Crossref, https://doi.org/10.14445/23488549/IJECE-V13I1P117

Abstract:

Speaker diarization involves splitting an audio recording into segments according to who is speaking and when, answering the question, “Who spoke when?” Sorting voices in this way matters in many real-world applications, including automatic meeting notes, video subtitles, voice assistants, and legal or medical recordings, because attributing each utterance to the right person makes a conversation far easier to follow and understand. Early systems depended on handcrafted features, which performed well under controlled conditions but struggled in real-world settings, especially with background noise, overlapping speech, and speakers with similar voices. This survey provides an extensive review of 95 research papers published between 2023 and 2025, categorizing existing diarization methods into three groups: supervised learning-based techniques, unsupervised learning-based techniques, and hybrid learning frameworks. Furthermore, integrating speaker diarization with voice recognition applications is shown to enhance diarization performance. By combining the most recent advancements in neural techniques, this paper offers a useful review that promotes further progress toward more effective speaker diarization.
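
To make the task concrete, here is a minimal “who spoke when” sketch using the open-source pyannote.audio toolkit. This is an illustrative assumption on our part, not a system evaluated in the survey; the model name, input file, and access-token placeholder are likewise assumed for demonstration.

```python
# Minimal speaker diarization sketch with the open-source pyannote.audio toolkit.
# Illustrative only: the model name, audio file, and token placeholder are
# assumptions for demonstration, not a system surveyed in this paper.
from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN_PLACEHOLDER",  # hypothetical placeholder
)

# Run diarization on a recording; the output is a timeline of speaker turns.
diarization = pipeline("meeting.wav")  # hypothetical input file

# Print each segment as "start-end: speaker label", i.e. who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```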

Keywords:

Clustering Techniques, Deep Learning, Hybrid Frameworks, Multimodal Analysis, Speaker Diarization.

References:

[1] Gongfan Chen et al., “Meet2Mitigate: An LLM-Powered Framework for Real-Time Issue Identification and Mitigation from Construction Meeting Discourse,” Advanced Engineering Informatics, vol. 64, pp. 1-42, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[2] Józef Kotus, and Grzegorz Szwoch, “Separation of Simultaneous Speakers with Acoustic Vector Sensor,” Sensors, vol. 25, no. 5, pp. 1-20, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[3] Linzhao Jia et al., “Enhanced Speaker-Turn Aware Hierarchical Model for Automated Classroom Dialogue Act Classification,” Expert Systems with Applications, vol. 296, 2026.
[CrossRef] [Google Scholar] [Publisher Link]
[4] Giulio Bertamini et al., “Automated Segmentation of Child-Clinician Speech in Naturalistic Clinical Contexts,” Research in Developmental Disabilities, vol. 157, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[5] Dmitrii Korzh et al., “Certification of Speaker Recognition Models to Additive Perturbations,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 17, pp. 17947-17956, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[6] Rana Zeeshan, John Bogue, and Mamoona Naveed Asghar, “Relative Applicability of Diverse Automatic Speech Recognition Platforms for Transcription of Psychiatric Treatment Sessions,” IEEE Access, vol. 13, pp. 117343-117354, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[7] Ryan Duke, and Alex Doboli, “Dialogic: A Multi-Modal Framework for Automated Team Behavior Modeling Based on Speech Acquisition,” Multimodal Technologies and Interaction, vol. 9, no. 3, pp. 1-26, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Nassim Asbai, Hadjer Bounazou, and Sihem Zitouni, “A Novel Approach to Deriving Adaboost Classifier Weights using Squared Loss Function for Overlapping Speech Detection,” Multimedia Tools and Applications, vol. 84, pp. 38545-38572, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[9] Yueran Pan, “Assessing the Expressive Language Levels of Autistic Children in Home Intervention,” IEEE Transactions on Computational Social Systems, vol. 12, no. 5, pp. 3647-3659, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[10] Eric Ettore et al., “Childhood Trauma Affects Speech and Language Measures in Patients with Major Depressive Disorder during Clinical Interviews,” Journal of Affective Disorders, vol. 388, pp. 1-8, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[11] Joonas Kalda et al., “Enhancing PixIT: Robust Methods for Real-World Speech Separation and Applications,” SSRN, pp. 1-36, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[12] Mitchell A. Klusty et al., “Toward Automated Clinical Transcriptions,” AMIA Summits on Translational Science Proceedings, 2025.
[Google Scholar] [Publisher Link]
[13] Hasan Almgotir Kadhim, Lok Woo, and Satnam Dlay, “Novel Algorithm for Speech Segregation by Optimized K-Means of Statistical Properties of Clustered Features,” 2015 IEEE International Conference on Progress in Informatics and Computing (PIC), Nanjing, China, pp. 286-291, 2015.
[CrossRef] [Google Scholar] [Publisher Link]
[14] Kalidindi Lakshmi Divya et al., “Query-Facilitating Multidomain Summarization with BART Model and Accessible Query Interface,” Hybrid and Advanced Technologies, 1st ed., CRC Press, pp. 321-326, 2025.
[Google Scholar] [Publisher Link]
[15] Jule Pohlhausen, Francesco Nespoli, and Jörg Bitzer, “Towards Privacy-Preserving Conversation Analysis in Everyday Life: Exploring the Privacy-Utility Trade-Off,” Computer Speech & Language, vol. 95, pp. 1-15, 2026.
[CrossRef] [Google Scholar] [Publisher Link]
[16] Igor Abramovski et al., “Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings,” Computer Speech & Language, vol. 93, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[17] Ian Kwok et al., “New Frontiers in Artificial Intelligence: A Multimodal Communication Model,” Journal of Pain and Symptom Management, vol. 69, no. 5, pp. e454-e455, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[18] Naela Fauzul Muna, and Mukhammad Andri Setiawan, “Automated Framework for Communication Development in Autism Spectrum Disorder Using Whisper ASR and GPT-4o LLM,” JTP - Journal of Educational Technology, vol. 27, no. 1, pp. 137-149, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[19] Sadhana Singh, Lotika Singh, and Nandita Satsangee, “Automated Assessment of Classroom Interaction Based on Verbal Dynamics: A Deep Learning Approach,” SN Computer Science, vol. 6, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[20] Liqaa Fadil, Alia K. Abdul Hassan, and Hiba B. Alwan, “Employing Chroma-Gram Techniques for Audio Source Separation in Human-Computer Interaction,” AIP Conference Proceedings, vol. 3264, no. 1, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[21] Tong Li et al., “Trimodal-Persona: Leveraging Text, Audio, and Video for Big Five Personality Scoring in Psychological Interviews,” ResearchGate, pp. 1-8, 2025.
[Google Scholar]
[22] S. Mhammad, and S. Molodyakov, “Developing and Analyzing an Algorithm for Separate Speech Recording of Multiple Speakers,” International Journal of Open Information Technologies, vol. 13, no. 5, pp. 41-48, 2025.
[Google Scholar] [Publisher Link]
[23] Thilo von Neumann et al., “Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3174-3188, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[24] Pinyan Li et al., “Enhancing Speaker Recognition with CRET Model: A Fusion of CONV2D, RESNET and ECAPA-TDNN,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2025, pp. 1-15, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[25] Haoyuan Wang et al., “An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications,” NPJ Digital Medicine, vol. 8, pp. 1-13, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[26] Yuta Hirano et al., “Toward Fast Meeting Transcription: NAIST System for CHiME-8 NOTSOFAR-1 Task and Its Analysis,” Computer Speech & Language, vol. 95, pp. 1-13, 2026.
[CrossRef] [Google Scholar] [Publisher Link]
[27] Srikanth Madikeri et al., “Autocrime-Open Multimodal Platform for Combating Organized Crime,” Forensic Science International: Digital Investigation, vol. 54, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[28] Marisha Speights et al., “Transforming Child Speech Data into Clinical-Grade Artificial Intelligence Pipelines for Speech-Language Impairment Detection,” The Journal of the Acoustical Society of America, vol. 157, no. 4, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[29] Kengatharaiyer Sarveswaran et al., “A Brief Overview of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL),” Proceedings of the First Workshop on Challenges in Processing South Asian Languages, Abu Dhabi, UAE, pp. 1-8, 2025.
[Google Scholar] [Publisher Link]
[30] Ho Seok Ahn et al., “Social Human-Robot Interaction of Human-care Service Robots,” Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, pp. 385-386, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[31] Zhenzhen Weng et al., “Artificial Intelligence–Powered 3D Analysis of Video-Based Caregiver-Child Interactions,” Science Advances, vol. 11, no. 8, pp. 1-13, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[32] Padma Jyothi Uppalapati, Madhavi Dabbiru, and Venkata Rao Kasukurthi, “AI-Driven Mock Interview Assessment: Leveraging Generative Language Models for Automated Evaluation,” International Journal of Machine Learning and Cybernetics, vol. 16, pp. 10057-10079, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[33] Hatem Zehir, Toufik Hafs, and Sara Daas, “Unifying Heartbeats and Vocal Waves: An Approach to Multimodal Biometric Identification at the Score Level,” Arabian Journal for Science and Engineering, vol. 50, pp. 19535-19554, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[34] Farhan Samir et al., “A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives,” Transactions of the Association for Computational Linguistics, vol. 13, pp. 595-612, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[35] Lauren Harrington et al., “Variability in Performance Across Four Generations of Automatic Speaker Recognition Systems,” IAFPA 2025, Rotterdam, The Netherlands, pp. 3993-3997, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[36] Jose Luis Medellin-Garibay, and J.C. Cuevas-Tello, “Artificial Neural Networks for Speaker,” Intelligent Sustainable Systems: Selected Papers of WorldS4 2024, Volume 3, pp. 1-511, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[37] Federico Pardo, Óscar Cánovas, and Félix J. García Clemente, “Audio Features in Education: A Systematic Review of Computational Applications and Research Gaps,” Applied Sciences, vol. 15, no. 12, pp. 1-41, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[38] Steffen T. Eberhardt et al., “Development and Validation of Large Language Model Rating Scales for Automatically Transcribed Psychological Therapy Sessions,” Scientific Reports, vol. 15, pp. 1-13, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[39] Rita Francese et al., “DeepTald: A System for Supporting Schizophrenia-Related Language and Thought Disorders Detection with NLP Models and Explanations,” Multimedia Tools and Applications, vol. 84, pp. 46273-46306, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[40] Andrii M. Striuk, and Vladyslav V. Hordiienko, “Research and Development of a Subtitle Management System using Artificial Intelligence,” CEUR Workshop Proceedings, pp. 415-427, 2025.
[Google Scholar] [Publisher Link]
[41] Alymzhan Toleu et al., “Speaker Change Detection with Pre-trained Large Audio Model,” Asian Conference on Intelligent Information and Database Systems, Kitakyushu, Japan, pp. 262-274, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[42] Soubhik Barari, and Tyler Simko, “The Promise of Text, Audio, and Video Data for the Study of US Local Politics and Federalism,” Publius: The Journal of Federalism, vol. 55, no. 2, pp. 223-252, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[43] Bang Zeng, and Ming Li, “USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2110-2124, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[44] Ravi D. Shankar, R. B. Manjula, and Rajashekhar C. Biradar, “Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis,” SN Computer Science, vol. 6, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[45] Defu Chen et al., “Res2Former: Integrating Res2Net and Transformer for a Highly Efficient Speaker Verification System,” Electronics, vol. 14, no. 12, pp. 1-19, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[46] Erfan Loweimi et al., “Speaker Retrieval in the Wild: Challenges, Effectiveness and Robustness,” arXiv preprint, pp. 1-13, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[47] Shuai Wang et al., “Overview of Speaker Modeling and its Applications: From the Lens of Deep Speaker Representation Learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4971-4998, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[48] Qiuyu Zheng et al., “NResNet: Nested Residual Network Based on Channel and Frequency Domain Attention Mechanism for Speaker Verification in Classroom,” Multimedia Tools and Applications, vol. 84, pp. 14235-14251, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[49] Danwei Cai, “Speaker Representation Learning Under Self-Supervised and Knowledge Transfer Setting,” Doctoral Thesis, Duke University, pp. 1-24, 2023.
[Google Scholar]
[50] Qing Gu et al., “A Domain Robust Pre-Training Method with Local Prototypes for Speaker Verification,” Interspeech, pp. 1-5, 2025.
[Google Scholar] [Publisher Link]
[51] Hao Wang, Xiaobing Lin, and Jiashu Zhang, “A Lightweight CNN-Conformer Model for Automatic Speaker Verification,” IEEE Signal Processing Letters, vol. 31, pp. 56-60, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[52] Jin Sob Kim et al., “Universal Pooling Method of Multi-Layer Features from Pretrained Models for Speaker Verification,” arXiv preprint, pp. 1-6, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[53] Min Hyun Han, Sung Hwan Mun, and Nam Soo Kim, “Generalized Score Comparison-Based Learning Objective for Deep Speaker Embedding,” IEEE Access, vol. 13, pp. 51194-51207, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[54] Bei Liu, Zhengyang Chen, and Yanmin Qian, “Depth-First Neural Architecture with Attentive Feature Fusion for Efficient Speaker Verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1825-1838, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[55] Zhiyong Chen et al., “Open-Set Speaker Identification through Efficient Few-shot Tuning with Speaker Reciprocal Points and Unknown Samples,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3347-3362, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[56] Danwei Cai et al., “Self-Supervised Reflective Learning through Self-Distillation and Online Clustering for Speaker Representation Learning,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 1535-1550, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[57] Prachi Singh, and Sriram Ganapathy, “End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 448-457, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[58] Y. Dissen, S. Harpaz, and J. Keshet, “Label-Free Speaker Diarization Using Self-Supervised Speaker Embeddings,” SSRN, 2025.
[Google Scholar]
[59] Prachi Singh, Amrit Kaul, and Sriram Ganapathy, “Supervised Hierarchical Clustering using Graph Neural Networks for Speaker Diarization,” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[60] K.V. Aljinu Khadar, R.K. Sunil Kumar, and V.V. Sameer, “Speaker Diarization based on X Vector Extracted from Time-Delay Neural Networks (TDNN) using Agglomerative Hierarchical Clustering in Noisy Environment,” International Journal of Speech Technology, vol. 28, pp. 13-26, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[61] Mathias Näreaho, “Speaker Diarization in Challenging Environments using Deep Networks: An Evaluation of a State-of-the-Art System,” DiVA Portal, 2023.
[Google Scholar] [Publisher Link]
[62] Vinod K. Pande, Vijay K. Kale, and Sangramsing N. Kayte, “Feature Extraction Using I-Vector and X-Vector Methods for Speaker Diarization,” ICTACT Journal on Soft Computing, vol. 15, no. 4, pp. 3717-3721, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[63] Nikhil Raghav et al., “Self-Tuning Spectral Clustering for Speaker Diarization,” 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, pp. 1-5, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[64] Rahul Dasari, “Speaker Diarization Using Multi-View Contrastive Learning Embeddings,” Technical Report, pp. 1-6, 2025.
[Google Scholar] [Publisher Link]
[65] Joonas Kalda et al., “Design Choices for PixIT-based Speaker-Attributed ASR: Team ToTaTo at the NOTSOFAR-1 Challenge,” Computer Speech & Language, vol. 95, pp. 1-16, 2026.
[CrossRef] [Google Scholar] [Publisher Link]
[66] Federico Landini, “From Modular to End-to-End Speaker Diarization,” arXiv preprint, pp. 1-138, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[67] Weiqing Wang, and Ming Li, “End-to-End Online Speaker Diarization with Target Speaker Tracking,” arXiv preprint, pp. 1-13, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[68] Weiqing Wang, and Ming Li, “Online Neural Speaker Diarization with Target Speaker Tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 5078-5091, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[69] Youngki Kwon et al., “Absolute Decision Corrupts Absolutely: Conservative Online Speaker Diarisation,” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[70] Chin-Yi Cheng et al., “Multi-Target Extractor and Detector for Unknown-Number Speaker Diarization,” IEEE Signal Processing Letters, vol. 30, pp. 638-642, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[71] Yiling Huan et al., “Towards Word-Level End-to-end Neural Speaker Diarization with Auxiliary Network,” arXiv preprint, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[72] Han Yin et al., “SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models,” arXiv preprint, pp. 1-9, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[73] Samuele Cornell et al., “One Model to Rule Them All? Towards End-To-End Joint Speaker Diarization and Speech Recognition,” ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, pp. 11856-11860, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[74] Tianyi Tan et al., “DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features,” ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, pp. 1-5, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[75] Zhao-Ci Liu et al., “PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4199-4210, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[76] Pourya Jafarzadeh, Amir Mohammad Rostami, and Padideh Choobdar, “Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT,” arXiv preprint, pp. 1-9, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[77] Valentin Vielzeuf, “Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: A Focus on HuBERT's Pretraining,” arXiv preprint, pp. 1-5, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[78] Alexander H. Liu et al., “Dinosr: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning,” Advances in Neural Information Processing Systems, pp. 1-17, 2023.
[Google Scholar] [Publisher Link]
[79] Bing Han, Zhengyang Chen, and Yanmin Qian, “Self-Supervised Learning with Cluster-Aware-Dino for High-Performance Robust Speaker Verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 529-541, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[80] Zia-ur-Rehman, Arif Mahmood, and Wenxiong Kang, “Pseudo-Label Refinement for Improving Self-Supervised Learning Systems,” arXiv preprint, pp. 1-11, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[81] Irina-Elena Veliche, and Pascale Fung, “Improving Fairness and Robustness in End-to-End Speech Recognition Through Unsupervised Clustering,” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[82] Faria Nazir et al., “A Computer-Aided Speech Analytics Approach for Pronunciation Feedback using Deep Feature Clustering,” Multimedia Systems, vol. 29, pp. 1699-1715, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[83] Simin Kou et al., “Structure-Aware Subspace Clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 10, pp. 10569-10582, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[84] Anaïs Ollagnier, Elena Cabrio, and Serena Villata, “Unsupervised Fine-Grained Hate Speech Target Community Detection and Characterisation on Social Media,” Social Network Analysis and Mining, vol. 13, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[85] Ziyang Ma et al., “Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation,” arXiv preprint, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[86] S. Merlin Revathy, and S.S. Kumar, “Optimized Deep Embedded Clustering-Based Speaker Diarization with Speech Enhancement,” Circuits, Systems, and Signal Processing, vol. 44, pp. 5044-5074, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[87] Amarendra Jadda, and Inty Santi Prabha, “Adaptive Weiner Filtering with AR-GWO based Optimized Fuzzy Wavelet Neural Network for Enhanced Speech Enhancement,” Multimedia Tools and Applications, vol. 82, pp. 24101-24125, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[88] Mohammad Reza Falahzadeh et al., “Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition,” Circuits, Systems, and Signal Processing, vol. 42, pp. 449-492, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[89] Suwon Shon et al., “Context-Aware Fine-Tuning of Self-Supervised Speech Models,” ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[90] Rongjie Huang et al., “VoiceTuner: Self-Supervised Pre-Training and Efficient Fine-tuning For Voice Generation,” Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10630-10639, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[91] Xianghu Yue et al., “Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters,” Electronics, vol. 13, no. 1, pp. 1-13, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[92] Weixi Lai, “Parameter-Efficient Fine-Tuning for Sarcasm Detection in Speech Using the Self-Supervised Pre-Trained Model WavLM,” Master’s Thesis, University of Groningen, pp. 1-39, 2024.
[Google Scholar] [Publisher Link]
[93] Bryce Irvin et al., “Self-Supervised Learning for Speech Enhancement Through Synthesis,” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1-5, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[94] Muzamil Ahmed et al., “An Enhanced Deep Learning Approach for Speaker Diarization using TitaNet, MarbelNet and Time Delay Network,” Scientific Reports, vol. 15, pp. 1-15, 2025.
[CrossRef] [Google Scholar] [Publisher Link]
[95] Amit Kumar Bhuyan, Hrishikesh Dutta, and Subir Biswas, “Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 9, no. 2, pp. 1934-1946, 2025.
[CrossRef] [Google Scholar] [Publisher Link]