Real Time Voice Cloning Using Generative Adversarial Network
Keywords:
Voice cloning, Generative Adversarial Networks, Speaker encoder, Synthesizer, Deep learning, Vocoder

Abstract
This research introduces a cutting-edge approach to real-time voice cloning by harnessing the capabilities of Generative Adversarial Networks (GANs). Voice cloning involves creating a digital reproduction of a person's voice that closely mimics their natural speech. Traditional techniques often require extensive datasets and lengthy processing times, making them less practical for real-time applications. In contrast, the proposed method leverages the strengths of GANs to substantially reduce both the amount of training data and the processing time required while still producing high-quality output. Our model is trained on a wide variety of speech samples, enabling it to capture and replicate the distinctive features of an individual's voice. The framework consists of two major components: a generator, which synthesizes voice outputs, and a discriminator, which evaluates the authenticity of those outputs. The two components interact in a continuous feedback loop through adversarial training, progressively improving the quality and realism of the generated speech. The system is designed to be highly efficient and runs on standard hardware configurations, making it accessible for a wide range of applications such as personalized voice assistants, custom voice-overs, and immersive audio for gaming. Experimental evaluations show that the GAN-based approach not only generates highly realistic voice clones but also retains the distinctive characteristics of the target speaker's voice. A comparative analysis further shows that the approach outperforms traditional voice cloning methods in both output quality and computational efficiency. This study advances voice cloning technology by introducing a faster and more efficient way to generate lifelike voice replicas, opening new possibilities for interactive voice-driven solutions and setting the stage for further innovations in personalized audio applications.
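The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of the generator/discriminator feedback loop it describes. All module architectures, names, and dimensions (Generator, Discriminator, cond_dim, mel_dim, the learning rates) are illustrative assumptions, not the authors' code.

# Minimal sketch of the adversarial training loop described in the abstract.
# Shapes, names, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a conditioning vector (e.g., a speaker embedding) to a mel-spectrogram frame."""
    def __init__(self, cond_dim=256, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, 512), nn.ReLU(),
            nn.Linear(512, mel_dim),
        )
    def forward(self, cond):
        return self.net(cond)

class Discriminator(nn.Module):
    """Scores a mel-spectrogram frame as real (1) or generated (0)."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )
    def forward(self, mel):
        return self.net(mel)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    cond = torch.randn(32, 256)      # speaker/content conditioning (placeholder data)
    real_mel = torch.randn(32, 80)   # real mel frames from the dataset (placeholder data)

    # Discriminator update: push real frames toward 1, generated frames toward 0.
    fake_mel = G(cond).detach()
    loss_d = bce(D(real_mel), torch.ones(32, 1)) + bce(D(fake_mel), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to make the discriminator score fakes as real.
    loss_g = bce(D(G(cond)), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

In a complete system of the kind the keywords suggest, the conditioning vector would come from a speaker encoder, the generator would play the role of the synthesizer, and a vocoder would render the generated mel-spectrograms to waveform audio.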