Transforming Sanskrit: Natural Text-to-Speech with Optimized Encoders

Authors

  • Sabnam Kumari, Amita Malik

Keywords:

Text-to-Speech (TTS) synthesis, Grapheme-to-Phoneme, Gated Convolutional Neural Network, Gated Recurrent Unit, Adaptive Cheetah Optimization, HiFi-GAN vocoder

Abstract

Sanskrit is an ancient classical language that has had a significant impact on science, philosophy, and literature. Although it is the root of many Indian languages, the number of skilled speakers is steadily shrinking, which limits access to its rich cultural heritage and has led to a decline in the spoken use of Sanskrit today. Addressing this problem requires innovative technical solutions that enable and promote the spoken form of the language. Text-to-speech (TTS) synthesis is one such way to produce speech and improve access to the language in the modern era. To improve both synthesis quality and the naturalness of the generated speech, this paper presents an improved Sanskrit TTS system with an optimized transformer encoder. The system employs Grapheme-to-Phoneme (G2P) conversion to map Sanskrit text to phonemes and a transformer-based mel-style speaker encoder to extract the speaker’s embedding vector. A Gated Convolutional Neural Network (GCNN) captures local features, and a Gated Recurrent Unit (GRU) models temporal features. The transformer encoder, optimized by the Adaptive Cheetah Optimization (ACO) algorithm, processes the extracted features, and its output is used to predict a mel-spectrogram, which the HiFi-GAN vocoder then converts into a high-quality audio waveform. The complete pipeline yields a highly effective TTS system that substantially improves speech synthesis for Sanskrit, producing natural-sounding speech that closely resembles the voice of the target speaker. To demonstrate the effectiveness of the proposed method, we implemented the system in Python and evaluated it on the Vāksañcayaḥ Sanskrit Speech Corpus. The results show a significant improvement in generating speech that matches the voice of the target speaker.
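
To make the described architecture concrete, the following is a minimal PyTorch sketch of the pipeline from the abstract: phoneme embedding (standing in for G2P output), a gated convolution for local features, a GRU for temporal features, a transformer encoder, and a linear projection to mel-spectrogram frames. All module names, layer sizes, and the random stand-ins for the G2P output and the mel-style speaker embedding are assumptions for illustration; the paper's actual G2P rules, the ACO-tuned encoder hyperparameters, and the HiFi-GAN vocoder stage are not reproduced here.

import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Gated CNN block: a 1-D convolution whose output is split into value
    and gate halves and combined with a sigmoid gate (GLU-style), capturing
    local phoneme context. Kernel size is an assumed default."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        value, gate = self.conv(x).chunk(2, dim=1)
        return value * torch.sigmoid(gate)

class SanskritTTSSketch(nn.Module):
    """Hypothetical end-to-end acoustic model following the abstract's
    pipeline; not the authors' implementation."""
    def __init__(self, n_phonemes: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)   # G2P output ids
        self.gcnn = GatedConvBlock(d_model)              # local features
        self.gru = nn.GRU(d_model, d_model, batch_first=True)  # temporal features
        # In the paper this encoder is tuned by ACO; here the
        # hyperparameters are fixed to plausible defaults.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mel = nn.Linear(d_model, n_mels)         # mel-spectrogram frames

    def forward(self, phoneme_ids, speaker_vec):
        # phoneme_ids: (batch, time); speaker_vec: (batch, d_model)
        x = self.embed(phoneme_ids)                      # (B, T, D)
        x = self.gcnn(x.transpose(1, 2)).transpose(1, 2) # conv over time axis
        x, _ = self.gru(x)
        x = x + speaker_vec.unsqueeze(1)                 # condition on speaker style
        x = self.encoder(x)
        return self.to_mel(x)  # fed to a HiFi-GAN vocoder to obtain the waveform

# Toy usage: random tensors stand in for the G2P output and the
# mel-style speaker embedding.
model = SanskritTTSSketch(n_phonemes=64)
phonemes = torch.randint(0, 64, (1, 20))
speaker = torch.randn(1, 256)
mel = model(phonemes, speaker)
print(mel.shape)  # torch.Size([1, 20, 80])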

References

Harrison, K. David (2007). When Languages Die: The Extinction of the World’s Languages and the Erosion of Human Knowledge. Oxford University Press.

Local Language Speech Technology Initiative website: http://llsti.org/

I. Goodfellow et al., Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org

Dunaev, A., 2019. A Text-to-Speech system based on Deep Neural Networks. Bachelor Thesis.

Alastalo, A., 2021. Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet.

Rama, G. L. Jayavardhana, A. G. Ramakrishnan, R. Muralishankar, and R. Prathibha, "A complete text-to-speech synthesis system in Tamil," in Proc. 2002 IEEE Workshop on Speech Synthesis, pp. 191-194. IEEE, 2002.

Kumar, H. R. S., J. K. Ashwini, B. S. R. Rajaram, and A. G. Ramakrishnan, "MILE TTS for Tamil and Kannada for Blizzard Challenge 2013," in Blizzard Challenge Workshop, vol. 2013.

Rajaram, B. S. R., H. R. Shiva Kumar, and A. G. Ramakrishnan, "MILE TTS for Tamil for Blizzard Challenge 2014," in Blizzard Challenge Workshop, vol. 2014.

Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in INTERSPEECH, 2017.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al., "Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017.

Li, Y. A., Han, C., & Mesgarani, N. (2022). StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis. arXiv preprint arXiv:2205.15439.

Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural information processing systems, 31.

Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., ... & Wei, F. (2023). Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.

Kumari, R., Dev, A., & Kumar, A. (2021). An efficient adaptive artificial neural network-based text to speech synthesizer for Hindi language. Multimedia Tools and Applications, 80(16), 24669-24695.

Ni, J., Wang, L., Gao, H., Qian, K., Zhang, Y., Chang, S., & Hasegawa-Johnson, M. (2022). Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. arXiv preprint arXiv:2203.15796.

Chen, M., Chen, M., Liang, S., Ma, J., Chen, L., Wang, S., & Xiao, J. (2019, September). Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding. In Interspeech (pp. 2105-2109).

Li, N., Liu, Y., Wu, Y., Liu, S., Zhao, S., & Liu, M. (2020, April). RobuTrans: A robust transformer-based text-to-speech model. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 8228-8235).

Jain, R., Yiwere, M. Y., Bigioi, D., Corcoran, P., & Cucu, H. (2022). A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis. IEEE Access, 10, 47628-47642.

D. Min, D. B. Lee, E. Yang, and S. J. Hwang, "Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation," in International Conference on Machine Learning (PMLR, 2021), pp. 7748-7759.

Khan, A. and Sarfaraz, A., 2019. RNN-LSTM-GRU based language transformation. Soft Computing, 23(24), pp.13007-13024.

J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization. (2016). arXiv preprint arXiv:1607.06450

Shang, W., Chiu, J. and Sohn, K., 2017, February. Exploring normalization in deep residual networks with concatenated rectified linear units. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).

Sait, S.M., Mehta, P., Gürses, D. and Yildiz, A.R., 2023. Cheetah optimization algorithm for optimum design of heat exchangers. Materials Testing, 65(8), pp.1230-1236.

Fan, J., Li, Y. and Wang, T., 2021. An improved African vultures optimization algorithm based on tent chaotic mapping and time-varying mechanism. Plos one, 16(11), p.e0260725.

Kong, J., Kim, J. and Bae, J., 2020. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, pp.17022-17033.

Published

06.11.2024

How to Cite

Sabnam Kumari. (2024). Transforming Sanskrit: Natural Text-to-Speech with Optimized Encoders. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 2637 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7422

Section

Research Article