Development of a Kiswahili Text-to-Speech System based on Tacotron 2 and WaveNet Vocoder

International Journal of Electrical and Electronics Engineering
© 2023 by SSRG - IJEEE Journal
Volume 10 Issue 2
Year of Publication: 2023
Authors: Kelvin Kiptoo Rono, Ciira Wa Maina, Elijah Mwangi
How to Cite:

Kelvin Kiptoo Rono, Ciira Wa Maina, Elijah Mwangi, "Development of a Kiswahili Text-to-Speech System based on Tacotron 2 and WaveNet Vocoder," SSRG International Journal of Electrical and Electronics Engineering, vol. 10, no. 2, pp. 75-83, 2023. Crossref, https://doi.org/10.14445/23488379/IJEEE-V10I2P107

Abstract:

A Text-to-Speech (TTS) system converts an input text into a synthetic speech output. This paper provides a detailed description of the development of a Kiswahili TTS system using the Tacotron 2 architecture and the WaveNet vocoder. A Kiswahili TTS system will help visually impaired persons learn, assist persons with communication disorders, and provide a source of input for Voice Alarm and Public Address equipment. Tacotron 2 is a sequence-to-sequence model for building TTS systems. The model consists of three stages: an encoder, a decoder, and a vocoder. The encoder converts the input character sequence into a hidden feature representation. The decoder, guided by an attention mechanism, predicts a sequence of Mel-spectrogram frames from this representation. Finally, the vocoder transforms the Mel-spectrograms into speech waveforms. The Kiswahili TTS system developed with Tacotron 2 obtained a mean opinion score (MOS) of 4.05, showing that the speech generated by the system is comparable to natural human speech. The system implementation is available at https://github.com/Rono8/kiswahili_tts
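To make the three-stage data flow concrete, the Python sketch below traces a short Kiswahili phrase through the pipeline. It is a minimal, hypothetical illustration only: the character vocabulary, frames-per-character count, and hop length are assumed placeholders, and the encoder/decoder and vocoder bodies are stubs rather than the authors' trained models (see the repository linked above for the actual implementation).

import numpy as np

# Hypothetical character vocabulary; the real system's text frontend
# and symbol set live in the linked repository.
VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz '")}

def text_to_sequence(text):
    # Map the input text to a sequence of character IDs (encoder input).
    return np.array([VOCAB[ch] for ch in text.lower() if ch in VOCAB])

def tacotron2_encoder_decoder(char_ids):
    # Placeholder for stages 1-2: the encoder turns character embeddings
    # into hidden features, and the attention-based decoder predicts
    # 80-band Mel-spectrogram frames from them. Random frames here keep
    # the sketch runnable end to end.
    n_frames = 5 * len(char_ids)          # rough frames-per-character guess
    return np.random.randn(80, n_frames)  # shape: (mel_bands, time)

def wavenet_vocoder(mel, hop_length=256):
    # Placeholder for stage 3: WaveNet conditions on the Mel frames and
    # autoregressively generates raw audio samples, one per time step.
    return np.random.uniform(-1.0, 1.0, size=mel.shape[1] * hop_length)

ids = text_to_sequence("habari ya asubuhi")  # Kiswahili: "good morning"
mel = tacotron2_encoder_decoder(ids)
audio = wavenet_vocoder(mel)
print(ids.shape, mel.shape, audio.shape)

In the full system, each stub would be replaced by the corresponding trained network, and the generated samples would be written out as an audio waveform.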

Keywords:

Kiswahili text-to-speech system, Language modeling, Natural language processing, Speech processing, Text-to-speech system.
