Advanced Voice Cloning And Transcription Using Deep Learning: Implementation For High-Fidelity Speech Synthesis

Authors

  • Sreenivasa Rao Kakumanu, Jupalli Pushpakumari, Kolli Veena, RamaRao Tandu, Padma TNS

DOI:

https://doi.org/10.64252/f8gk4e23

Keywords:

Speech Synthesis, Tacotron2, HiFi-GAN, Mel-Spectrogram, Text-to-Speech (TTS)

Abstract

Earlier voice cloning methodologies relied on conventional concatenative and parametric synthesis techniques which, though effective and efficient, produced mechanical and somewhat constrained speech. Advances in deep learning, however, have allowed modern TTS systems to employ neural networks to generate richer, more natural, and more communicative speech. The approach presented here is a voice-cloning pipeline that integrates NVIDIA's Tacotron2 with HiFi-GAN to achieve highly natural speech synthesis. Tacotron2 uses a sequence-to-sequence architecture with an attention mechanism to convert text into mel-spectrograms; these spectrograms are then converted into audio waveforms by HiFi-GAN, a GAN-based vocoder. The pipeline also incorporates denoising and super-resolution techniques to improve the clarity and naturalness of the audio output. The system's performance is further evaluated using the RMS loss measured during text-to-speech conversion. The resulting system shows marked improvements over state-of-the-art methods, achieving better quality and efficiency for practical applications in voice reproduction and digital communications.
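The abstract reports evaluation via RMS loss between synthesized and reference signals. As a minimal illustrative sketch (the function name and the toy 80-band mel-spectrogram shapes below are assumptions, not the paper's actual evaluation code), the RMS loss between a predicted and a reference spectrogram can be computed as:

```python
import numpy as np

def rms_loss(predicted: np.ndarray, reference: np.ndarray) -> float:
    """Root-mean-square error between two equally shaped arrays,
    e.g. a predicted and a reference mel-spectrogram."""
    if predicted.shape != reference.shape:
        raise ValueError("spectrograms must have the same shape")
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# Toy example: two 80-band x 100-frame "mel-spectrograms",
# where the prediction deviates from the reference by small noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal((80, 100))
predicted = reference + 0.1 * rng.standard_normal((80, 100))
print(rms_loss(predicted, reference))
```

A lower RMS value indicates that the generated spectrogram (or waveform) is numerically closer to the ground-truth recording, which is why it serves as a convenient objective proxy alongside listening tests.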

Downloads

Download data is not yet available.

Published

2025-05-05

How to Cite

Advanced Voice Cloning And Transcription Using Deep Learning: Implementation For High-Fidelity Speech Synthesis. (2025). International Journal of Environmental Sciences, 11(3s), 1246-1253. https://doi.org/10.64252/f8gk4e23