Real Time Voice Cloning Using Generative Adversarial Network

Authors

  • C Hrishikesava Reddy, D. Rajasekhar, T. Haritha, B. Ranjit Naik, N. Srinivasula Reddy

Keywords:

Voice cloning, Generative Adversarial Networks, Speaker encoder, Synthesizer, Deep learning, Vocoder

Abstract

This research introduces a cutting-edge approach to real-time voice cloning that harnesses the capabilities of Generative Adversarial Networks (GANs). Voice cloning is the creation of a digital reproduction of a person's voice that closely mimics their natural speech. Traditional techniques often require extensive datasets and lengthy processing times, making them impractical for real-time applications. In contrast, the proposed method exploits the strengths of GANs to greatly reduce both the amount of training data and the processing time required, while still producing high-quality output. Our model is trained on a wide variety of speech samples, enabling it to capture and replicate the distinctive features of an individual's voice. The framework is built around two major components: a generator, which synthesizes voice outputs, and a discriminator, which evaluates the authenticity of those outputs. The two components interact in a continuous feedback loop through adversarial training, steadily improving the quality and realism of the generated speech. The system is designed to be highly efficient and runs seamlessly on standard hardware, making it accessible for a wide range of applications such as personalized voice assistants, custom voice-overs, and immersive gaming audio. Experimental evaluations show that the GAN-based approach not only generates highly realistic voice clones but also retains the distinctive characteristics of the target speaker's voice. A comparative analysis further shows that this approach outperforms traditional voice cloning methods in both output quality and computational efficiency. This study marks a significant step forward in voice cloning technology by introducing a faster and more efficient way to generate lifelike voice replicas.
It opens new possibilities for interactive voice-driven solutions and sets the stage for further innovations in personalized audio applications.
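The adversarial feedback loop described in the abstract (a generator producing samples, a discriminator scoring their authenticity, each updated against the other) can be sketched in miniature. The Python example below is a hypothetical illustration, not the paper's model: audio is replaced by a one-dimensional toy distribution and the deep networks by single affine and logistic units, but the alternating update rule is the standard GAN training pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the task: the "target voice" is a 1-D distribution
# N(4, 1.25). Generator and discriminator are deliberately tiny models
# chosen for illustration; the paper's components are deep networks
# operating on audio features.
REAL_MU, REAL_SIGMA = 4.0, 1.25

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Generator params: x_fake = a * z + b, with noise z ~ N(0, 1)
a, b = 1.0, 0.0
# Discriminator params: D(x) = sigmoid(w * x + c)
w, c = 0.1, 0.0

lr, batch, steps = 0.05, 64, 2000
for _ in range(steps):
    z = rng.standard_normal(batch)
    x_real = rng.normal(REAL_MU, REAL_SIGMA, batch)
    x_fake = a * z + b

    # Discriminator step: ascend E[log D(real)] + E[log(1 - D(fake))]
    s_real = sigmoid(w * x_real + c)
    s_fake = sigmoid(w * x_fake + c)
    w += lr * np.mean((1 - s_real) * x_real - s_fake * x_fake)
    c += lr * np.mean((1 - s_real) - s_fake)

    # Generator step: ascend E[log D(fake)] (non-saturating loss)
    s_fake = sigmoid(w * x_fake + c)
    grad_x = (1 - s_fake) * w          # d log D / d x_fake
    a += lr * np.mean(grad_x * z)
    b += lr * np.mean(grad_x)

# After training, generated samples should have drifted toward the
# real distribution's mean (REAL_MU = 4.0).
samples = a * rng.standard_normal(10000) + b
print(float(samples.mean()))
```

The alternating updates mirror the feedback loop in the abstract: the discriminator sharpens its real-versus-fake boundary, which in turn gives the generator a gradient signal that pulls its output toward the target distribution.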

References

Merlijn Blaauw, Jordi Bonada, Ryunosuke Daido, "Data Efficient Voice Cloning for Neural Singing Synthesis," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6840–6844.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," arXiv, September 12, 2016, pp. 1–15.

Giuseppe Ruggiero, Enrico Zovato, Luigi Di Caro, Vincent Pollet, "Voice Cloning: A Multi-Speaker Text-to-Speech Synthesis Approach Based on Transfer Learning," arXiv, February 10, 2021, pp. 1–5.

V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206–5210, doi: 10.1109/ICASSP.2015.7178964.

Sercan Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou, "Deep Voice 2: Multi-Speaker Neural Text-to-Speech," arXiv, October 2017, pp. 1–16.

S. Shirali-Shahreza and G. Penn, "MOS Naturalness and the Quest for Human-Like Speech," 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 346-352, doi: 10.1109/SLT.2018.8639599.

Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie, "Preserving background sound in noise-robust voice conversion via multi-task learning," 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–6.

E. Variani, X. Lei, E. McDermott, I. L. Moreno and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 4052–4056, doi: 10.1109/ICASSP.2014.6854363.

Zeyu Qiu, Jun Tang, Yaxin Zhang, Jiaxin Li, Xishan Bai, "A Voice Cloning Method Based on the Improved HiFi-GAN Model," Computational Intelligence and Neuroscience, 2022, pp. 1–12.

Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li, "Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, pp. 1–5.

Published

19.12.2024

How to Cite

C Hrishikesava Reddy. (2024). Real Time Voice Cloning Using Generative Adversarial Network. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 5133–5139. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7288

Section

Research Article
