Add support for extracting and feeding conditioning latents directly into the model

- Adds a new script and API endpoints for doing this
- Reworks autoregressive and diffusion models so that the conditioning is computed separately (which will actually provide a mild performance boost)
- Updates README

This is untested. Need to do the following manual tests (and someday write unit tests for this behemoth before
it becomes a problem..)
1) Does get_conditioning_latents.py work?
2) Can I feed those latents back into the model by creating a new voice?
3) Can I still mix and match voices (both with conditioning latents and normal voices) with read.py?
This commit is contained in:
James Betker 2022-05-01 17:25:18 -06:00
parent a8264f5cef
commit 0ffc191408
8 changed files with 165 additions and 78 deletions

View file

@ -118,12 +118,24 @@ These settings are not available in the normal scripts packaged with Tortoise. T
### Playing with the voice latent
Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting
everything from tone to speaking rate to speech abnormalities.
Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent,
then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output what it thinks the "average" of those two voices sounds like. You could also theoretically build a small extension to Tortoise that gradually shifts the
latent from one speaker to another, then apply it across a bit of spoken text (something I havent implemented yet, but might
get to soon!) I am sure there are other interesting things that can be done here. Please let me know what you find!
This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output
what it thinks the "average" of those two voices sounds like.
#### Generating conditioning latents from voices
Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents.
#### Using raw conditioning latents to generate speech
After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
### Send me feedback!