diff --git a/tortoise_v2_examples.html b/tortoise_v2_examples.html
index 1a457d1..0810792 100644
--- a/tortoise_v2_examples.html
+++ b/tortoise_v2_examples.html
@@ -1,128 +1,1508 @@
-TorToiSe is a text-to-speech program built in April 2022 by jbetker@. TorToiSe is open source, with trained model weights
-available at https://github.com/neonbjb/tortoise-tts
+TorToiSe is a text-to-speech program built in April 2022 by jbetker@. TorToiSe is open source, with trained model
+ weights
+ available at https://github.com/neonbjb/tortoise-tts
-This page demonstrates some of the results of TorToiSe.
+This page demonstrates some of the results of TorToiSe.
-Following are several particularly good results generated by the model.
+Following are several particularly good results generated by the model.
-LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model; following is how
-it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2
-model paired with the Waveglow vocoder.
-| Tacotron2+Waveglow | TorToiSe | TorToiSe Finetuned |
|---|---|---|
+| Tacotron2+Waveglow | TorToiSe | TorToiSe Finetuned |
|---|---|---|
-NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody
-and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other
-than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there
-right now.
-| Natural Voice | TorToiSe Finetuned |
|---|---|
-It is important to note that it is not actually fair to compare any of these models: Tortoise is a multi-voice probabilistic
-model trained on millions of hours of speech with an exceptionally slow inference time. Tacotron and NaturalVoice are efficient,
-fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much in the way of actually comparable
-research to Tortoise.
+NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody
+ and end-to-end generation with no need for a vocoder. While not much has actually been released about this model
+ other
+ than five samples, those samples are quite good and I would consider this the most competitive TTS engine out
+ there
+ right now.
+| Natural Voice | TorToiSe Finetuned |
|---|---|
+It is important to note that it is not actually fair to compare any of these models: Tortoise is a multi-voice
+ probabilistic
+ model trained on millions of hours of speech with an exceptionally slow inference time. Tacotron and
+ NaturalVoice are efficient,
+ fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much in the way of actually
+ comparable
+ research to Tortoise.
-Following are all the results from which the hand-picked results were drawn. Also included is the reference
- audio that the program is trying to mimic. This will give you a better sense of how TorToiSe really performs.
+Following are all the results from which the hand-picked results were drawn. Also included is the reference
+ audio that the program is trying to mimic. This will give you a better sense of how TorToiSe really performs.
+
-| text | angie | daniel | deniro | emma | freeman | geralt | halle | jlaw | lj | myself | pat | snakes | tom | train_atkins | train_dotrice | train_kennard | weaver | william |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reference clip | ||||||||||||||||||
| autoregressive_ml | ||||||||||||||||||
| bengio_it_needs_to_know_what_is_bad | ||||||||||||||||||
| dickinson_stop_for_death | ||||||||||||||||||
| espn_basketball | ||||||||||||||||||
| frost_oar_to_oar | ||||||||||||||||||
| frost_road_not_taken | ||||||||||||||||||
| gatsby_and_so_we_beat_on | ||||||||||||||||||
| harrypotter_differences_of_habit_and_language | ||||||||||||||||||
| i_am_a_language_model | ||||||||||||||||||
| melodie_kao | ||||||||||||||||||
| nyt_covid | ||||||||||||||||||
| real_courage_is_when_you_know_your_licked | ||||||||||||||||||
| rolling_stone_review | ||||||||||||||||||
| spacecraft_interview | ||||||||||||||||||
| tacotron2_sample1 | ||||||||||||||||||
| tacotron2_sample2 | ||||||||||||||||||
| tacotron2_sample3 | ||||||||||||||||||
| tacotron2_sample4 | ||||||||||||||||||
| watts_this_is_the_real_secret_of_life | ||||||||||||||||||
| wilde_nowadays_people_know_the_price |
+| text | angie | daniel | deniro | emma | freeman | geralt | halle | jlaw | lj | myself | pat | snakes | tom | train_atkins | train_dotrice | train_kennard | weaver | william |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reference clip | ||||||||||||||||||
| autoregressive_ml | ||||||||||||||||||
| bengio_it_needs_to_know_what_is_bad | ||||||||||||||||||
| dickinson_stop_for_death | ||||||||||||||||||
| espn_basketball | ||||||||||||||||||
| frost_oar_to_oar | ||||||||||||||||||
| frost_road_not_taken | ||||||||||||||||||
| gatsby_and_so_we_beat_on | ||||||||||||||||||
| harrypotter_differences_of_habit_and_language | ||||||||||||||||||
| i_am_a_language_model | ||||||||||||||||||
| melodie_kao | ||||||||||||||||||
| nyt_covid | ||||||||||||||||||
| real_courage_is_when_you_know_your_licked | ||||||||||||||||||
| rolling_stone_review | ||||||||||||||||||
| spacecraft_interview | ||||||||||||||||||
| tacotron2_sample1 | ||||||||||||||||||
| tacotron2_sample2 | ||||||||||||||||||
| tacotron2_sample3 | ||||||||||||||||||
| tacotron2_sample4 | ||||||||||||||||||
| watts_this_is_the_real_secret_of_life | ||||||||||||||||||
| wilde_nowadays_people_know_the_price | ||||||||||||||||||
-Tortoise is capable of "prompt-engineering" in that tone and prosody are affected by the emotions inflected in the words
-fed to the program. For example, prompting the model with "[I am so angry,] I went to the park and threw a ball" will
-result in it outputting "I went to the park and threw a ball" with an angry tone.
+Tortoise is capable of "prompt-engineering" in that tone and prosody are affected by the emotions inflected in the
+ words
+ fed to the program. For example, prompting the model with "[I am so angry,] I went to the park and threw a ball"
+ will
+ result in it outputting "I went to the park and threw a ball" with an angry tone.
-Following are a few examples of different prompts. The effect is subtle, but is definitely there. Many voices are
-less affected by this.
+Following are a few examples of different prompts. The effect is subtle, but is definitely there. Many voices are
+ less affected by this.
-Angry: