How to Set Up a Text-to-Speech Project with XTTS Model

Shubham Gupta 7 Comments September 4, 2024

This guide walks you through setting up a text-to-speech (TTS) project using the XTTS model from the Coqui TTS Python package (imported as TTS.tts). We will cover installation, configuration, and speech synthesis with a pre-trained model. Let’s dive in!

Step 1: Setting Up the Environment

First, you’ll need to create a Python environment for the project. Using a virtual environment will help isolate the dependencies for this specific project.

Create a Virtual Environment

If you’re using venv, follow these steps:

# Create a virtual environment
python -m venv tts_project

# Activate the virtual environment
# On Windows:
tts_project\Scripts\activate
# On Linux/macOS:
source tts_project/bin/activate

Once the virtual environment is activated, you can proceed to install the necessary dependencies.
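
If you want to confirm that the activation worked before installing anything, a quick check from Python can help. This is a minimal sketch; the helper name in_virtualenv is made up for illustration. It relies on the fact that inside a venv, sys.prefix points at the environment while sys.base_prefix points at the Python installation it was created from.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Outside a venv both prefixes are identical; inside one they differ.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
```

If this reports False, the packages in the next steps would be installed into your global Python instead of tts_project.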

Step 2: Install PyTorch with CUDA Support (Optional)

For faster inference, you may want to run the model on a GPU. To do this, install the correct version of PyTorch with CUDA support. Visit the official PyTorch website and choose the correct CUDA version based on your system. Here’s an example installation for CUDA 11.7:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

If you don’t have a GPU or don’t want CUDA support, use the default install instead of the command above (on Windows and macOS this is the CPU build):

pip install torch torchvision torchaudio
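
After either install, it can be useful to check which PyTorch build you actually got. The sketch below is one way to do that; describe_torch is a hypothetical helper name, and it only uses torch.__version__, torch.version.cuda, and torch.cuda.is_available(). The import is wrapped so the script still runs if PyTorch is missing.

```python
def describe_torch() -> dict:
    """Collect basic facts about the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {
            "installed": False,
            "version": None,
            "cuda_build": None,
            "cuda_available": False,
        }
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a machine with an NVIDIA GPU, the installed build or the driver is the likely place to look.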

Step 3: Install Other Dependencies

Next, install the libraries needed for the XTTS model and for saving audio as .wav files:

pip install TTS soundfile

The TTS package contains the XTTS model, while soundfile is used to save the synthesized output as a .wav file.

Step 4: Download the XTTS Model Configuration and Pre-trained Weights

Before you can synthesize speech, you need the configuration file and pre-trained weights for the XTTS model.

Configuration File: The configuration file defines the model architecture and synthesis parameters.
Pre-trained Weights: The weights represent the learned parameters of the model for a specific speaker or set of speakers.

Download link -> https://huggingface.co/coqui/XTTS-v2/tree/main

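Download config.json, model.pth, vocab.json, and speakers_xtts.pth, and place them together in one folder in your project directory, for example:

/path/to/xtts/config.json
/path/to/xtts/model.pth
/path/to/xtts/vocab.json
/path/to/xtts/speakers_xtts.pth

Keep the original file names. The script in Step 5 passes only the folder to load_checkpoint(), which then looks for the files by name.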
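
Before loading the model, you may want to check that the download is complete. The sketch below is an illustration, not part of the TTS package: check_model_dir and REQUIRED_FILES are made-up names, and the file list is an assumption based on the XTTS-v2 repository listing, so adjust it if your files differ. It also flags files that are only Git LFS pointers (small text files starting with "version https://git-lfs"), which can happen when a repository is cloned without Git LFS.

```python
from pathlib import Path

# Assumed file names (from the XTTS-v2 repository listing); adjust if needed.
REQUIRED_FILES = ("config.json", "model.pth", "vocab.json", "speakers_xtts.pth")
LFS_POINTER_PREFIX = b"version https://git-lfs"


def check_model_dir(model_dir: str) -> list:
    """Return a list of problems found in model_dir (empty if none)."""
    problems = []
    for name in REQUIRED_FILES:
        path = Path(model_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        # A Git LFS pointer is a tiny text file, not the real binary.
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            problems.append(f"{name}: Git LFS pointer, not the real file")
    return problems
```

Calling check_model_dir("/path/to/xtts") should return an empty list when all the assumed files are present and look like real downloads.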

Step 5: Write the Python Script

Here is the code to set up and run the XTTS model for synthesizing text into speech:

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf  # To save the output as a wav file

# Step 1: Load the model configuration
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")

# Step 2: Initialize the model
model = Xtts.init_from_config(config)

# Step 3: Load the pre-trained weights
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)

# Optional: If you have CUDA installed and want to use GPU, uncomment the line below
# model.cuda()

# Step 4: Synthesize the output
outputs = model.synthesize(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    config,
    speaker_wav="/data/TTS-public/_refclips/3.wav",  # Replace with the path to a short reference clip of the target voice
    gpt_cond_len=3,
    language="en",
)

# Step 5: Save the synthesized speech to a wav file
output_wav = outputs['wav']
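# XTTS generates audio at config.audio.output_sample_rate (24 kHz by default), not config.audio.sample_rate
sf.write('output.wav', output_wav, config.audio.output_sample_rate)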

print("Speech synthesis complete and saved to output.wav")

Explanation:

Loading the Config: The XttsConfig class loads the model configuration from a JSON file, which includes information about the model architecture and parameters.
Initializing the Model: The Xtts.init_from_config(config) initializes the XTTS model based on the loaded configuration.
Loading the Pre-trained Model Weights: The model’s pre-trained weights are loaded from the checkpoint directory. This step is essential to enable the model to perform speech synthesis.
Synthesis Process: The model.synthesize() function takes the text, the configuration, a reference audio clip of the target voice (speaker_wav), the language code, and other parameters, and generates the speech waveform.
Saving the Output: The generated audio waveform is saved as a .wav file using the soundfile package.

Step 6: Run the Python Script

Save the script as synthesize_speech.py, then run it from the command line:

python synthesize_speech.py

If everything is set up correctly, this script will generate a .wav file containing the synthesized speech and print the message: Speech synthesis complete and saved to output.wav
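
To sanity-check the result without opening an audio player, you could read the file header back. The sketch below uses only Python’s standard wave module; wav_summary is a hypothetical helper name, and it assumes output.wav is a 16-bit PCM file, which should be what soundfile writes by default for .wav.

```python
import wave


def wav_summary(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            # Duration is the number of frames divided by frames per second.
            "duration_seconds": frames / rate,
        }
```

Calling wav_summary("output.wav") after the script finishes shows the sample rate and duration. A duration of zero, or one far from what the sentence should take to speak, suggests something went wrong during synthesis or saving.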

Step 7: Troubleshooting Common Issues

AssertionError: “Torch not compiled with CUDA enabled”
- This error means the script is trying to run the model on a GPU, but your PyTorch installation does not have CUDA support. Either install a CUDA build of PyTorch (see Step 2), or run on the CPU by leaving the model.cuda() line commented out.
File Not Found Errors
- Make sure the paths to the configuration file, speaker waveform, and pre-trained weights are correct. Update the paths in the script if needed.
Performance Issues
- If you’re running on CPU and experience slow performance, consider installing PyTorch with GPU (CUDA) support if your system has a compatible GPU.
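
One way to avoid the CUDA error above is to move the model to the GPU only when PyTorch reports one. This is a small sketch under that assumption; move_to_gpu_if_available is a made-up helper name, and it works with any object that has a .cuda() method, such as the Xtts model from Step 5.

```python
def move_to_gpu_if_available(model) -> str:
    """Call model.cuda() only when a CUDA device is usable; return the device name."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to move; stay on the CPU.
        return "cpu"
    if torch.cuda.is_available():
        model.cuda()
        return "cuda"
    return "cpu"
```

In the script, the commented-out model.cuda() line could then become device = move_to_gpu_if_available(model), so the same file runs on machines with and without a GPU.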

Commands used in live session to setup TTS

# python version 3.9.6
# python --version
# python -m venv venv
# venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install TTS --cache-dir "D:/internship/tts_project/.cache"
# pip uninstall torch torchvision torchaudio
# pip install transformers datasets torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --cache-dir "D:/internship/tts_project/.cache"
# https://developer.nvidia.com/cuda-downloads
# pip install soundfile --cache-dir "D:/internship/tts_project/.cache"
# pip install deepspeed==0.10.3 --cache-dir "D:/internship/tts_project/.cache"   (optional)
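
To compare your environment with the versions used in the session, you could list what is installed. The sketch below uses the standard importlib.metadata module; installed_versions and PACKAGES are illustrative names, and the package list is simply taken from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = (
    "TTS", "torch", "torchvision", "torchaudio",
    "transformers", "datasets", "soundfile", "deepspeed",
)


def installed_versions(packages=PACKAGES) -> dict:
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output when reporting a problem makes it easier to spot a version mismatch.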

Conclusion

You’ve now successfully set up a Text-to-Speech (TTS) project using the XTTS model. By following the steps outlined in this guide, you should be able to synthesize speech from text and save it as a .wav file. This project can be expanded upon by integrating it into applications, experimenting with different voices, or training models on custom datasets. Happy coding!

7 Comments

Harmanpreet singh

3 months ago

Hello sir,
I am getting this error after running the code:

File "C:\Users\harman\Desktop\Full stack Revison\New folder\main.py", line 19, in <module>
  outputs = model.synthesize(
File "C:\Users\harman\Desktop\Full stack Revison\New folder\venv\lib\site-packages\TTS\tts\models\xtts.py", line 419, in synthesize
  return func(*args, **kwargs)
File "C:\Users\harman\Desktop\Full stack Revison\New folder\venv\lib\site-packages\TTS\tts\models\xtts.py", line 488, in full_inference
  return self.inference(
File "C:\Users\harman\Desktop\Full stack Revison\New folder\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
  return func(*args, **kwargs)
File "C:\Users\harman\Desktop\Full stack Revison\New folder\venv\lib\site-packages\TTS\tts\models\xtts.py", line 534, in inference
  text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
File "C:\Users\harman\Desktop\Full stack Revison\New folder\venv\lib\site-packages\TTS\tts\layers\xtts\tokenizer.py", line 653, in encode
  return self.tokenizer.encode(txt).ids
AttributeError: 'NoneType' object has no attribute 'encode'

Shubham Gupta

Author

Reply to Harmanpreet singh

Please share your code using https://pastebin.com. Paste the code there and post the link here. Only then can I understand the situation.

2 months ago

I have tried running the code and it is running fine.

Make sure you have downloaded all the model files into one single folder.
Make sure the path to the input.wav file exists. It should be a recorded or downloaded voice sample.
The rest is OK.

preeti singh

ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (TTS, gruut, encodec, bnnumerizer, jieba, docopt, gruut-ipa, gruut-lang-de, gruut-lang-en, gruut-lang-es, gruut-lang-fr)
Sir, I get this error when running this command: pip install TTS --cache-dir "C:/intnship/tts/.cache"

Reply to preeti singh

You have only provided the output. I need to see the complete code and the steps to reproduce the problem; only then can I help. To share the code you can use https://pastebin.com. You can also show me the problem in the live session. One possible solution is to install the Visual Studio Build Tools and make sure the C++ development environment is included.

Pritam Singh

1 month ago

Hello,
I tried to set up a Text-to-Speech project with the XTTS model on Linux, but this error always occurs.

Traceback (most recent call last):
File "/home/prit/internship/tts/main.py", line 26, in <module>
  model.load_checkpoint(config, checkpoint_dir="/home/prit/internship/tts/assets/tts_configs", eval=True)
File "/home/prit/internship/tts/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 764, in load_checkpoint
  self.speaker_manager = SpeakerManager(speaker_file_path)
File "/home/prit/internship/tts/venv/lib/python3.9/site-packages/TTS/tts/layers/xtts/xtts_manager.py", line 6, in __init__
  self.speakers = torch.load(speaker_file_path, map_location='cpu')
File "/home/prit/internship/tts/venv/lib/python3.9/site-packages/torch/serialization.py", line 1028, in load
  return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/prit/internship/tts/venv/lib/python3.9/site-packages/torch/serialization.py", line 1246, in _legacy_load
  magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Last edited 1 month ago by Pritam Singh

Reply to Pritam Singh

Check that you have downloaded all the required model files and placed them in the tts_configs folder. It looks like a model file is corrupted.