Mirror of https://github.com/neonbjb/tortoise-tts.git

Commit 6c5f1fadb0: Merge branch 'main' of https://github.com/neonbjb/tortoise-tts into main

Advanced_Usage.md (new file, 103 lines)
@@ -0,0 +1,103 @@
## Advanced Usage

### Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
that can be turned, which I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
these settings (and it's very likely that I missed something!).

These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
`api.tts` for a full list.
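
A minimal sketch of calling `api.tts` directly with a few of these knobs. The parameter names below are illustrative of the kind of settings exposed; check the signature of `TextToSpeech.tts` in `tortoise/api.py` for the authoritative list and defaults.

```python
from tortoise import utils, api

# Load reference clips at Tortoise's expected 22,050 Hz sample rate.
reference_clips = [utils.audio.load_audio(p, 22050) for p in ["clip1.wav", "clip2.wav"]]

tts = api.TextToSpeech()
pcm_audio = tts.tts(
    "Hello, world.",
    voice_samples=reference_clips,
    num_autoregressive_samples=256,  # more candidates for CLVP to re-rank (slower)
    temperature=0.8,                 # autoregressive sampling temperature
    diffusion_iterations=200,        # more diffusion steps -> cleaner audio (slower)
    cond_free=True,                  # conditioning-free diffusion guidance
)
```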

### Prompt engineering

Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).

### Playing with the voice latent

Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
what it thinks the "average" of those two voices sounds like.
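
A hypothetical sketch of that trick. It assumes `load_voices` from `tortoise/utils/audio.py` behaves as in the bundled scripts and that 'pat' and 'william' are two of the packaged voices:

```python
from tortoise import api
from tortoise.utils.audio import load_voices

tts = api.TextToSpeech()
# Reference clips (and/or latents) from both speakers are averaged into one voice.
voice_samples, conditioning_latents = load_voices(['pat', 'william'])
pcm_audio = tts.tts_with_preset(
    "A blend of two speakers.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset='fast',
)
```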

#### Generating conditioning latents from voices

Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).

Alternatively, use `api.TextToSpeech.get_conditioning_latents()` to fetch the latents.
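
A short sketch of the API route, dumping the tuple to disk the same way the script does; exact argument names are worth double-checking against `tortoise/api.py`:

```python
import torch
from tortoise import utils, api

tts = api.TextToSpeech()
clips = [utils.audio.load_audio(p, 22050) for p in ["clip1.wav", "clip2.wav"]]
autoregressive_latent, diffusion_latent = tts.get_conditioning_latents(clips)
torch.save((autoregressive_latent, diffusion_latent), "myvoice.pth")
```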

#### Using raw conditioning latents to generate speech

After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).

## Tortoise-detect

Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
came from Tortoise.

This classifier can be run on any computer; usage is as follows:

```commandline
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
```

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false
positives.

## Model architecture

Tortoise TTS is inspired by OpenAI's DALL-E, applied to speech data and using a better decoder. It is made up of 5 separate
models that work together. I've assembled a write-up of the system architecture here:
[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)

## Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
[DLAS](https://github.com/neonbjb/DL-Art-School) trainer.

I currently do not have plans to release the training configurations or methodology. See the next section.

## Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
5. If I, a tinkerer with a BS in computer science and a ~$15k computer, can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

### Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
or of people who speak with strong accents.

## Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla
Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the
NLP realm. I see no reason to believe that the same is not true of TTS.

README.md (181 lines changed)
@@ -12,43 +12,10 @@ Manuscript: https://arxiv.org/abs/2305.07243

Please duplicate the Space if you don't want to wait in a queue:
https://huggingface.co/spaces/Manmay/tortoise-tts

## Install via pip

```
pip install tortoise-tts==3.0.0
```

## What's in a name?

@@ -56,6 +23,8 @@ I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise
is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder; both known for their low
sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.

Well... not so slow anymore: we can now hit a **0.25-0.3 RTF** on 4 GB of VRAM, and with streaming we can get under **500 ms** latency (an RTF of 0.25 means 10 seconds of audio takes roughly 2.5 seconds to generate)!

## Demos

See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.

@@ -225,144 +194,6 @@ reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```
## Acknowledgements

VERSIONS.md (new file, 36 lines)
@@ -0,0 +1,36 @@

## Version history

#### v3.0.0; 2023/10/18
- Added fast inference for Tortoise with a HiFi-GAN decoder (inspired by XTTS by [coquiTTS](https://github.com/coqui-ai/TTS) 🐸; check out their multi-lingual model)

#### v2.8.0; 2023/9/13
- Added a custom tokenizer for non-English models

#### v2.7.0; 2023/7/26
- Bug fixes
- Added Apple Silicon support
- Updated Transformers version

#### v2.6.0; 2023/7/26
- Bug fixes

#### v2.5.0; 2023/7/09
- Added kv_cache support (5x faster)
- Added DeepSpeed support (10x faster)
- Added half-precision support

#### v2.4.0; 2022/5/17
- Removed the CVVP model. Found that it does not, in fact, make an appreciable difference in the output.
- Added better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs.

#### v2.3.0; 2022/5/12
- New CLVP-large model for further improved decoding guidance.
- Improvements to read.py and do_tts.py (new options)

#### v2.2.0; 2022/5/5
- Added several new voices from the training set.
- Automated redaction. Wrap the text you want to use to prompt the model but not be spoken in brackets.
- Bug fixes

#### v2.1.0; 2022/5/2
- Added the ability to produce totally random voices.
- Added the ability to download a voice conditioning latent via a script, and then use a user-provided conditioning latent.
- Added the ability to use your own pretrained models.
- Refactored directory structures.
- Performance improvements & bug fixes.

Voice_customization_guide.md (new file, 34 lines)
@@ -0,0 +1,34 @@

## Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.

These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.

### Provided voices

This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform
far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see
what Tortoise can do for zero-shot mimicking, take a look at the others.

### Adding a new voice

To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
3. Save the clips as WAV files in floating-point format with a 22,050 Hz sample rate (see the sketch after this list).
4. Create a subdirectory in voices/.
5. Put your clips in that subdirectory.
6. Run the Tortoise utilities with `--voice=<your_subdirectory_name>`.
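
A minimal preparation sketch, not part of the repo: it downmixes, resamples to 22,050 Hz, and writes ~10 second floating-point WAV segments. File and directory names are placeholders, and torchaudio is assumed to be installed.

```python
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("raw_interview.wav")       # float32 tensor of shape (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                  # downmix to mono
wav = F.resample(wav, orig_freq=sr, new_freq=22050)  # match Tortoise's expected 22,050 Hz

segment = 22050 * 10                                 # ~10 second chunks
for i in range(0, wav.shape[1] - segment, segment):
    torchaudio.save(f"voices/myvoice/clip_{i // segment}.wav",
                    wav[:, i:i + segment], 22050)
```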

### Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking
good clips:

1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
2. Avoid speeches. These generally have distortion caused by the amplification system.
3. Avoid clips from phone calls.
4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
5. Try to find clips that are spoken in the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.

@@ -230,6 +230,10 @@ class HifiganGenerator(torch.nn.Module):
         if not conv_post_weight_norm:
             remove_weight_norm(self.conv_post)

+        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+        if torch.backends.mps.is_available():
+            self.device = torch.device('mps')

     def forward(self, x, g=None):
         """
         Args:

@@ -287,7 +291,7 @@ class HifiganGenerator(torch.nn.Module):
             mode="linear",
         )
         g = g.unsqueeze(0)
-        return self.forward(up_2.to("cuda"), g.transpose(1,2))
+        return self.forward(up_2.to(self.device), g.transpose(1,2))

     def remove_weight_norm(self):
         print("Removing weight norm...")