Transforming Sanskrit: Natural Text-to-Speech with Optimized Encoders
Keywords:
Text-to-Speech (TTS) synthesis, Grapheme-to-Phoneme, Gated Convolutional Neural Network, Gated Recurrent Unit, Adaptive Cheetah Optimization, HiFi-GAN vocoder

Abstract
Sanskrit is an ancient classical language that has had a significant impact on science, philosophy, and literature. Although it is the root of many Indian languages, the number of skilled speakers continues to decline, which limits access to its rich cultural heritage and has reduced its spoken use today. Addressing this problem calls for technical solutions that enable and promote spoken Sanskrit, and text-to-speech (TTS) synthesis is one way to make the language more accessible in the modern era. To improve synthesis quality and the naturalness of the generated speech, this paper presents an improved Sanskrit TTS system with an optimized transformer encoder. The system employs Grapheme-to-Phoneme (G2P) conversion to map Sanskrit text to phonemes and a transformer-based mel-style speaker encoder to extract the speaker vector. A Gated Convolutional Neural Network (GCNN) captures local features, and a Gated Recurrent Unit (GRU) models temporal features. A transformer encoder, tuned with the Adaptive Cheetah Optimization (ACO) algorithm, processes the extracted features, and its output is used to predict a mel-spectrogram, which is then converted into a high-quality audio waveform by a HiFi-GAN vocoder. The resulting TTS system substantially improves speech synthesis for Sanskrit, producing natural-sounding speech that closely resembles the voice of the target speaker. To demonstrate the effectiveness of the proposed method, we implemented the system in Python and evaluated it on the Vāksañcayaḥ Sanskrit Speech Corpus. The results show a significant improvement in generating speech that resembles the voice of the target speaker.
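To make the described pipeline concrete, the following is a minimal PyTorch sketch of the encoder stack (phoneme embedding, gated convolution for local features, GRU for temporal features, transformer encoder, mel-spectrogram projection). All layer sizes, the class and variable names, the speaker-conditioning step, and the omitted G2P, ACO tuning, and HiFi-GAN stages are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the abstract's pipeline; hyperparameters are assumed.
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """1-D gated convolution: conv(x) * sigmoid(gate(x))."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(x) * torch.sigmoid(self.gate(x))


class SanskritTTSEncoder(nn.Module):
    """GCNN -> GRU -> transformer encoder -> mel-spectrogram frames."""
    def __init__(self, n_phonemes=64, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.gcnn = GatedConvBlock(d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, speaker_vec):
        # phoneme_ids: (B, T) integer indices produced by a G2P front end
        # speaker_vec: (B, d_model) mel-style speaker embedding
        x = self.embed(phoneme_ids)                          # (B, T, D)
        x = self.gcnn(x.transpose(1, 2)).transpose(1, 2)     # local features
        x, _ = self.gru(x)                                   # temporal features
        x = x + speaker_vec.unsqueeze(1)                     # speaker conditioning
        x = self.transformer(x)                              # global context
        return self.to_mel(x)                                # (B, T, n_mels)


if __name__ == "__main__":
    model = SanskritTTSEncoder()
    ids = torch.randint(0, 64, (1, 32))   # dummy phoneme sequence
    spk = torch.randn(1, 256)             # dummy speaker embedding
    mel = model(ids, spk)
    print(mel.shape)                       # torch.Size([1, 32, 80])
    # The predicted mel-spectrogram would then be passed to a pretrained
    # HiFi-GAN vocoder to generate the final waveform.
```

In the full system, the transformer encoder's hyperparameters would be selected by the ACO search rather than fixed as above.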