Text-to-Speech System Using Python and the Xtts Model – Text Limitations and Solution – Part 3
Text-to-Speech (TTS) systems have become increasingly essential in the world of AI, powering everything from virtual assistants to accessibility tools. In this article, we will walk through the process of setting up and using a TTS system based on Python, leveraging the Xtts model for high-quality speech synthesis. This guide is designed for developers who want to build their own TTS system and integrate it into their projects.
Prerequisites
Before we begin, make sure you have Python 3.9.6 installed on your machine. You can check your Python version using the command:
python --version
Additionally, we will use a virtual environment to manage dependencies. To create and activate a virtual environment, follow these steps (the activation command shown is for Windows; on macOS/Linux use source venv/bin/activate):
python -m venv venv
venv/Scripts/activate
Once the environment is activated, it’s a good idea to upgrade the pip package installer to the latest version:
python.exe -m pip install --upgrade pip
Installing Required Packages
We will install the TTS package and other dependencies, using a local cache to speed up installations. Here’s how you can install the necessary packages:
pip install TTS --cache-dir "D:/internship/tts_project/.cache"
pip uninstall torch torchvision torchaudio
pip install transformers datasets torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --cache-dir "D:/internship/tts_project/.cache"
Additionally, if you want GPU acceleration, install the NVIDIA CUDA toolkit. You can download it from NVIDIA’s website (https://developer.nvidia.com/cuda-downloads).
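If you want to confirm that PyTorch can actually see your GPU after installing the toolkit, a quick check from a Python shell looks like this (a minimal sketch, assuming the torch build installed above):
import torch
# Reports whether a CUDA-capable GPU is visible to this PyTorch build
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Name of the first detected GPU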
Initializing the TTS Model
With all the dependencies in place, we can now start setting up the TTS model. The following code will help load the model configuration and initialize it:
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf # To save the output as a wav file
# Step 1: Load the model configuration
config = XttsConfig()
config.load_json("D:/internship/tts/assets/tts_configs/config.json")
# Step 2: Initialize the model
model = Xtts.init_from_config(config)
# Step 3: Load the pre-trained weights
model.load_checkpoint(config, checkpoint_dir="D:/internship/tts/assets/tts_configs", eval=True)
# Optional: If you have CUDA installed and want to use GPU, uncomment the line below
# model.cuda()
Synthesizing Speech
To create a function for converting text into speech, use the createTTS function below. This function takes a text input and an input audio file for speaker characteristics, and writes the synthesized speech to a .wav file:
def createTTS(text, input_audio, output_audio):
    # Step 4: Synthesize the output
    outputs = model.synthesize(
        text,
        config,
        speaker_wav=input_audio,  # Path to a reference recording of the target speaker
        gpt_cond_len=3,
        language="en",
    )
    # Step 5: Save the synthesized speech to a wav file
    output_wav = outputs['wav']
    sf.write(output_audio, output_wav, config.audio.sample_rate)
    print(f"Speech synthesis complete and saved to {output_audio}")
This function uses an input .wav file to capture the speaker’s voice characteristics and outputs the generated speech to a new .wav file.
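As a quick sanity check, you can call the function directly with a short sentence. The file names below are only placeholders; any short reference recording of the target speaker will do:
# "speaker_sample.wav" is a hypothetical reference recording of the target voice
createTTS("Hello, this is a test of the XTTS model.", "speaker_sample.wav", "output.wav")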
Managing Long Text
When working with lengthy text, it’s often necessary to break it into smaller chunks before feeding it into the TTS model. The following function helps break down text based on punctuation, ensuring that each chunk is under a specified length:
import re
def break_text_by_punctuation(text, max_chunk_size=250):
    # Split the text on punctuation marks, keeping each mark via the capturing group
    sentences = re.split(r'([.,!?])', text)
    chunks = []
    current_chunk = ""
    for i in range(0, len(sentences) - 1, 2):
        sentence = sentences[i] + sentences[i + 1]
        # If adding the sentence to the current chunk exceeds the limit, store the chunk and start a new one
        if len(current_chunk) + len(sentence) > max_chunk_size:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += sentence
    # Include any trailing text that is not followed by a punctuation mark
    if sentences[-1].strip():
        current_chunk += sentences[-1]
    # Add the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
This function ensures that the text is properly split into smaller chunks based on punctuation, allowing the TTS model to process each chunk independently.
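To see how the splitting behaves, you can run the function on a short sample string; the max_chunk_size of 40 below is only for illustration:
sample = "This is the first sentence. Here is another one, slightly longer. And a third!"
for chunk in break_text_by_punctuation(sample, max_chunk_size=40):
    print(repr(chunk))
# Each chunk stays within the limit as long as no single sentence exceeds it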
Reading Data and Generating Speech
Now, let’s bring everything together. The following code reads a text file, splits the content into chunks, and converts each chunk into speech:
def getData():
    with open('data.txt', 'r') as f:
        return f.read()
data = break_text_by_punctuation(getData())
count = 1
for d in data:
    print(d)
    createTTS(d, "input.wav", str(count) + ".wav")
    count = count + 1
Complete Code
# python version 3.9.6
# python --version
# python -m venv venv
# venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install TTS --cache-dir "D:/internship/tts_project/.cache"
# pip uninstall torch torchvision torchaudio
# pip install transformers datasets torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --cache-dir "D:/internship/tts_project/.cache"
# https://developer.nvidia.com/cuda-downloads
# pip install soundfile --cache-dir "D:/internship/tts_project/.cache"
# Optional: pip install deepspeed==0.10.3 --cache-dir "D:/internship/tts_project/.cache"
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf # To save the output as a wav file
# Step 1: Load the model configuration
config = XttsConfig()
config.load_json("D:/internship/tts/assets/tts_configs/config.json")
# Step 2: Initialize the model
model = Xtts.init_from_config(config)
# Step 3: Load the pre-trained weights
model.load_checkpoint(config, checkpoint_dir="D:/internship/tts/assets/tts_configs", eval=True)
# Optional: If you have CUDA installed and want to use GPU, uncomment the line below
# model.cuda()
def createTTS(text, input_audio, output_audio):
    # Step 4: Synthesize the output
    outputs = model.synthesize(
        text,
        config,
        speaker_wav=input_audio,  # Path to a reference recording of the target speaker
        gpt_cond_len=3,
        language="en",
    )
    # Step 5: Save the synthesized speech to a wav file
    output_wav = outputs['wav']
    sf.write(output_audio, output_wav, config.audio.sample_rate)
    print(f"Speech synthesis complete and saved to {output_audio}")
def getData():
    with open('data.txt', 'r') as f:
        return f.read()
import re
def break_text_by_punctuation(text, max_chunk_size=250):
    # Split the text on punctuation marks, keeping each mark via the capturing group
    sentences = re.split(r'([.,!?])', text)
    chunks = []
    current_chunk = ""
    for i in range(0, len(sentences) - 1, 2):
        sentence = sentences[i] + sentences[i + 1]
        # If adding the sentence to the current chunk exceeds the limit, store the chunk and start a new one
        if len(current_chunk) + len(sentence) > max_chunk_size:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += sentence
    # Include any trailing text that is not followed by a punctuation mark
    if sentences[-1].strip():
        current_chunk += sentences[-1]
    # Add the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
data = break_text_by_punctuation(getData())
count = 1
for d in data:
    print(d)
    createTTS(d, "input.wav", str(count) + ".wav")
    count = count + 1
Here, the getData function reads the text from a file, break_text_by_punctuation splits it into manageable chunks, and createTTS generates a numbered .wav file for each chunk.
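If you prefer a single continuous recording instead of numbered chunk files, the pieces can be concatenated afterwards. The helper below is a minimal sketch, not part of the original script; it assumes numpy is available and that every chunk was written at the same sample rate:
import numpy as np
import soundfile as sf

def combine_chunks(num_chunks, combined_path="combined.wav"):
    # Read each numbered chunk (1.wav, 2.wav, ...) produced by the loop above
    pieces = []
    sample_rate = None
    for i in range(1, num_chunks + 1):
        audio, sr = sf.read(str(i) + ".wav")
        sample_rate = sr  # Assumes every chunk shares the same sample rate
        pieces.append(audio)
    # Join the chunks into one array and write a single wav file
    sf.write(combined_path, np.concatenate(pieces), sample_rate)

# Example: combine_chunks(len(data))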
Conclusion
With this step-by-step guide, you can now build a powerful Text-to-Speech system using Python and the Xtts model. You can further enhance this system by exploring other language models or optimizing performance with GPU acceleration. This TTS system can be integrated into various applications, from voice assistants to content creation tools, offering flexibility and customization options tailored to your project’s needs.