Enhancing Hindi Speech Recognition with Deep CNN-LiGRU through Mel and Gammatone Frequency Cepstral Coefficient Fusion

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12 Issue 8
Year of Publication: 2025
Authors: Hetal Gaudani, Narendra M. Patel
How to Cite?

Hetal Gaudani, Narendra M. Patel, "Enhancing Hindi Speech Recognition with Deep CNN-LiGRU through Mel and Gammatone Frequency Cepstral Coefficient Fusion," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 8, pp. 138-148, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I8P112

Abstract:

In India, Hindi is the most widely spoken language. An ideal communication system in a local language would enable ordinary individuals to engage with machines via speech interfaces for information retrieval and daily tasks. Despite notable advancements in recent decades, achieving natural and robust human-machine speech interaction in regional languages remains challenging, especially in environments characterized by significant noise and reverberation. In such noisy conditions, traditional automatic speech recognition systems employing Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) features have shown subpar performance, whereas research indicates that Gammatone Frequency Cepstral Coefficients (GFCC) perform well. To address this challenge, a novel approach fuses MFCC and GFCC features into a combined representation (MFGF) to harness the strengths of both while mitigating their respective weaknesses. To bolster robustness further, contemporary speech recognition systems commonly leverage acoustic models based on Recurrent Neural Networks (RNNs), which inherently capture extensive temporal contexts and long-term speech modulations. This study introduces a Deep Convolutional-Light Gated Recurrent Unit (CNN-LiGRU) model, pairing a refined variant of the Gated Recurrent Unit (GRU), termed the Light Gated Recurrent Unit (LiGRU), with a Convolutional Neural Network (CNN). The CNN extracts spatial features, capturing relevant patterns from the input spectrogram, while the LiGRU handles the sequential nature of speech, capturing long-term dependencies and enhancing the model's proficiency in understanding and recognizing spoken language. CNN-LiGRU demonstrates a 30% improvement in recognition rate over a CNN-Recurrent Neural Network (CNN-RNN) baseline. Additionally, LiGRU eliminates the reset gate, reducing per-epoch training time compared to conventional RNNs. A comparative analysis with state-of-the-art methods shows that the fusion approach outperforms them, achieving high speech recognition accuracy with minimal loss.
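To make the fusion idea concrete, the following is a minimal Python sketch of frame-level MFCC+GFCC concatenation. It assumes 16 kHz mono audio, uses librosa for the MFCC stream, and approximates the GFCC stream with a hand-rolled frequency-domain gammatone filterbank followed by cube-root compression and a DCT; the filter count, number of coefficients, and framing parameters are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore) in Hz.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_fbank(sr, n_fft, n_filters=40, fmin=50.0):
    # Approximate squared magnitude responses of 4th-order gammatone
    # filters, centre frequencies spaced uniformly on the ERB-rate scale.
    fmax = sr / 2.0
    erb_lo = 21.4 * np.log10(4.37e-3 * fmin + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * fmax + 1.0)
    cfs = (10.0 ** (np.linspace(erb_lo, erb_hi, n_filters) / 21.4) - 1.0) / 4.37e-3
    freqs = np.linspace(0.0, fmax, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, freqs.size))
    for i, fc in enumerate(cfs):
        b = 1.019 * erb(fc)
        fbank[i] = (1.0 + ((freqs - fc) / b) ** 2) ** -4.0  # ~|H(f)|^2, order 4
    return fbank

def mfgf_features(path, n_ceps=13, n_fft=400, hop=160):
    y, sr = librosa.load(path, sr=16000)
    # Stream 1: MFCC (mel filterbank + log + DCT, via librosa).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_ceps,
                                n_fft=n_fft, hop_length=hop)       # (13, T)
    # Stream 2: GFCC (gammatone filterbank + cube-root compression + DCT).
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    gt_energy = gammatone_fbank(sr, n_fft) @ power                 # (40, T)
    gfcc = dct(np.cbrt(gt_energy), axis=0, norm='ortho')[:n_ceps]  # (13, T)
    # Fusion: stack the two streams frame-wise into one MFGF vector.
    return np.vstack([mfcc, gfcc])                                 # (26, T)
```

The design intent is simply that each frame carries both a mel-based and a gammatone-based view of the spectrum, so the downstream acoustic model can rely on whichever cue survives the noise.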
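The LiGRU recurrence itself is compact. Following the formulation in [27], it keeps only the update gate z_t, uses a ReLU candidate state, and applies batch normalisation to the feed-forward connections: h_t = z_t * h_{t-1} + (1 - z_t) * ReLU(BN(W x_t) + U h_{t-1}). A minimal PyTorch sketch of one cell is shown below; layer sizes and the unrolling loop are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LiGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One fused projection each for the input and recurrent paths
        # (update gate z and candidate state computed together).
        self.wx = nn.Linear(input_size, 2 * hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, 2 * hidden_size, bias=False)
        self.bn = nn.BatchNorm1d(2 * hidden_size)  # feed-forward part only
        self.hidden_size = hidden_size

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, hidden_size)
        wx = self.bn(self.wx(x))
        uh = self.uh(h)
        z_in, c_in = (wx + uh).chunk(2, dim=1)
        z = torch.sigmoid(z_in)         # update gate; no reset gate in LiGRU
        c = torch.relu(c_in)            # ReLU candidate state instead of tanh
        return z * h + (1.0 - z) * c    # GRU-style interpolation

# Unrolling over a (batch, time, features) sequence:
cell = LiGRUCell(input_size=26, hidden_size=128)
x = torch.randn(4, 50, 26)              # e.g. 26-dimensional MFGF frames
h = torch.zeros(4, 128)
for t in range(x.size(1)):
    h = cell(x[:, t, :], h)
```

Dropping the reset gate removes one projection pair per step, which is the source of the reduced per-epoch training time noted above.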

Keywords:

CNN, GFCC, LiGRU, MFCC, RNN.

References:

[1] Steven Davis, and Paul Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[CrossRef] [Google Scholar] [Publisher Link]
[2] Hynek Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.
[CrossRef] [Google Scholar] [Publisher Link]
[3] Hynek Hermansky et al., "RASTA-PLP Speech Analysis," Proceedings of ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA, USA, vol. 1, pp. 121-124, 1992.
[CrossRef] [Google Scholar] [Publisher Link]
[4] Harshita Gupta, and Divya Gupta, “LPC and LPCC Method of Feature Extraction in Speech Recognition System,” 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), Noida, India, pp. 498-502, 2016.
[CrossRef] [Google Scholar] [Publisher Link]
[5] D.A. Reynolds, “Experimental Evaluation of Features for Robust Speaker Identification,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 639-643, 1994.
[CrossRef] [Google Scholar] [Publisher Link]
[6] R.K. Aggarwal, and M. Dave, “Performance Evaluation of Sequentially Combined Heterogeneous Feature Streams for Hindi Speech Recognition System,” Telecommunication Systems, vol. 52, pp. 1457-1466, 2013.
[CrossRef] [Google Scholar] [Publisher Link]
[7] Medikonda Jeevan et al., “Robust Speaker Verification using GFCC based I-Vectors,” Proceedings of the International Conference on Signal, Networks, Computing, and Systems, vol. 1, pp. 85-89, 2017.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Lisheng Fan et al., “Secure Multiuser Communications in Multiple Amplify-and-Forward Relay Networks,” IEEE Transactions on Communications, vol. 62, no. 9, pp. 3299-3310, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[9] Alejandro Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic, pp. 1-186, 1993.
[Google Scholar] [Publisher Link]
[10] Alejandro Acero, and Richard M. Stern, “Environmental Robustness in Automatic Speech Recognition,” International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, pp. 849-852, 1990.
[CrossRef] [Google Scholar] [Publisher Link]
[11] Yang Shao et al., "A Computational Auditory Scene Analysis System for Speech Segregation and Robust Speech Recognition," Computer Speech & Language, vol. 24, no. 1, pp. 77-93, 2010.
[CrossRef] [Google Scholar] [Publisher Link]
[12] Lawrence R. Rabiner, and Biing-Hwang Juang, Fundamentals of Speech Recognition, PTR Prentice Hall, pp. 1-507, 1993.
[Google Scholar] [Publisher Link]
[13] Dimitri Palaz, Mathew Magimai-Doss, and Ronan Collobert, "Analysis of CNN-based Speech Recognition System using Raw Speech as Input," Interspeech, pp. 11-15, 2015.
[CrossRef] [Google Scholar] [Publisher Link]
[14] Shuo-Yiin Chang, and Nelson Morgan, “Robust CNN-based Speech Recognition with Gabor Filter Kernels,” Fifteenth Annual Conference of the International Speech Communication Association, pp. 905-909, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[15] Vishal Passricha, and Rajesh Kumar Aggarwal, “A Comparative Analysis of Pooling Strategies for Convolutional Neural Network based Hindi ASR,” Journal of Ambient Intelligence and Humanized Computing, vol. 11, pp. 675-691, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[16] Jeffrey L. Elman, “Finding Structure in Time,” Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[CrossRef] [Google Scholar] [Publisher Link]
[17] Ankit Kumar, and Rajesh Kumar Aggarwal, “Discriminatively Trained Continuous Hindi Speech Recognition using Integrated Acoustic Features and Recurrent Neural Network Language Modelling,” Journal of Intelligent Systems, vol. 30, no. 1, pp. 165-179, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[18] Yongqiang Wang et al., "Transformer-based Acoustic Modeling for Hybrid Speech Recognition," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 6874-6878, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[19] Yajie Miao, Mohammad Gowayyed, and Florian Metze, “EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding,” 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, pp. 167-174, 2015.
[CrossRef] [Google Scholar] [Publisher Link]
[20] Anjuli Kannan et al., "Large-scale Multilingual Speech Recognition with a Streaming End-to-End Model," arXiv Preprint, pp. 1-5, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[21] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts,” Sixteenth Annual Conference of the International Speech Communication Association, vol. 2015, pp. 3214-3218, 2015.
[CrossRef] [Google Scholar] [Publisher Link]
[22] Ankit Kumar, and Rajesh Kumar Aggarwal, “Hindi Speech Recognition using Time Delay Neural Network Acoustic Modeling with i-Vector Adaptation,” International Journal of Speech Technology, vol. 25, pp. 67-78, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[23] Jian Kang et al., “Advanced Recurrent Network-based Hybrid Acoustic Models for Low Resource Speech Recognition,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, pp. 1-15, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[24] Alex Graves, Navdeep Jaitly, and Abdel-Rahman Mohamed, “Hybrid Speech Recognition with Deep Bidirectional LSTM,” 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, pp. 273-278, 2013.
[CrossRef] [Google Scholar] [Publisher Link]
[25] Y. Bengio, P. Simard, and P. Frasconi, “Learning Long-Term Dependencies with Gradient Descent is Difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[CrossRef] [Google Scholar] [Publisher Link]
[26] Kyunghyun Cho et al., “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,” arXiv Preprint, pp. 1-9, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[27] Mirco Ravanelli et al., “Light Gated Recurrent Units for Speech Recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92-102, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[28] Ankit Kumar, and Rajesh Kumar Aggarwal, "A Hybrid CNN-LiGRU Acoustic Modeling using Raw Waveform SincNet for Hindi ASR," Computer Science, vol. 21, no. 4, pp. 397-417, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[29] Kumud Tripathi, M. Kiran Reddy, and K. Sreenivasa Rao, “Multilingual and Multimode Phone Recognition System for Indian Languages,” Speech Communication, vol. 119, pp. 12-23, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[30] Mohit Dua, Rajesh Kumar Aggarwal, and Mantosh Biswas, “Optimizing Integrated Features for Hindi Automatic Speech Recognition System,” Journal of Intelligent Systems, vol. 29, no. 1, pp. 959-976, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[31] O. Farooq, S. Datta, and M.C. Shrotriya, "Wavelet Sub-Band Based Temporal Features for Robust Hindi Phoneme Recognition," International Journal of Wavelets, Multiresolution and Information Processing, vol. 8, no. 6, pp. 847-859, 2010.
[CrossRef] [Google Scholar] [Publisher Link]
[32] Vishal Passricha, and Rajesh Kumar Aggarwal, “A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition,” Journal of Intelligent Systems, vol. 29, no. 1, pp. 1261-1274, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[33] Vishal Passricha, and Rajesh Kumar Aggarwal, “A Comparative Analysis of Pooling Strategies for Convolutional Neural Network based Hindi ASR,” Journal of Ambient Intelligence and Humanized Computing, vol. 11, pp. 675-691, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[34] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, “Learning Precise Timing with LSTM Recurrent Networks,” Journal of Machine Learning Research, vol. 3, pp. 115-143, 2002.
[Google Scholar] [Publisher Link]
[35] Kyunghyun Cho, Bart van Merrienboer, and Dzmitry Bahdanau, “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,” arXiv Preprint, pp. 1-9, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[36] Hetal Gaudani, and Narendra M. Patel, “Comparative Study of Robust Feature Extraction Techniques for ASR for Limited Resource Hindi Language,” Proceedings of Second International Conference on Sustainable Expert Systems, Singapore, pp. 763-775, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[37] Text to Speech Synthesis Systems in Indian Languages, Indic TTS, IIT Madras. [Online]. Available: https://www.iitm.ac.in/donlab/tts/index.php
[38] Anuj Diwan et al., "Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages," arXiv Preprint, pp. 1-6, 2021.
[CrossRef] [Google Scholar] [Publisher Link]