diff --git a/.gitignore b/.gitignore index 82504f8..7693938 100644 --- a/.gitignore +++ b/.gitignore @@ -131,4 +131,5 @@ dmypy.json .idea/* .models/* .custom/* -results/* \ No newline at end of file +results/* +debug_states/* \ No newline at end of file diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000..d19c969 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,2 @@ +recursive-include tortoise/data * +recursive-include tortoise/voices * diff --git a/README.md b/README.md index b2d345c..368fd59 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,14 @@ Tortoise is a text-to-speech program built with the following priorities: This repo contains all the code needed to run Tortoise TTS in inference mode. -### New features +A (*very*) rough draft of the Tortoise paper is now available in doc format. I would definitely appreciate any comments, suggestions or reviews: +https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA + +### Version history + +#### v2.4; 2022/5/17 +- Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output. +- Add better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs. #### v2.3; 2022/5/12 - New CLVP-large model for further improved decoding guidance. @@ -35,6 +42,8 @@ sampling rates. On a K80, expect to generate a medium sized sentence every 2 min See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs. +Cool application of Tortoise+GPT-3 (not by me): https://twitter.com/lexman_ai + ## Usage guide ### Colab @@ -44,7 +53,7 @@ https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sh ### Local Installation -If you want to use this on your own computer, you must have an NVIDIA GPU. +If you want to use this on your own computer, you must have an NVIDIA GPU. First, install pytorch using these instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). On Windows, I **highly** recommend using the Conda installation path. I have been told that if you do not do this, you @@ -55,6 +64,7 @@ Next, install TorToiSe and it's dependencies: ```shell git clone https://github.com/neonbjb/tortoise-tts.git cd tortoise-tts +python -m pip install -r ./requirements.txt python setup.py install ``` @@ -75,7 +85,7 @@ This script provides tools for reading large amounts of text. python tortoise/read.py --textfile --voice random ``` -This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series +This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well. @@ -89,7 +99,7 @@ Tortoise can be used programmatically, like so: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech() -pcm_audio = tts.tts_with_preset("your text here", reference_clips, preset='fast') +pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` ## Voice customization guide @@ -100,7 +110,7 @@ These reference clips are recordings of a speaker that you provide to guide spee ### Random voice -I've included a feature which randomly generates a voice. 
These voices don't actually exist and will be random every time you run +I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run it. The results are quite fascinating and I recommend you play around with it! You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest. @@ -111,7 +121,7 @@ For the those in the ML space: this is created by projecting a random vector ont This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see -what Tortoise can do for zero-shot mimicing, take a look at the others. +what Tortoise can do for zero-shot mimicking, take a look at the others. ### Adding a new voice @@ -158,11 +168,11 @@ prompt "\[I am really sad,\] Please feed me." will only speak the words "Please ### Playing with the voice latent -Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, -then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents +Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, +then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities. -This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output +This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output what it thinks the "average" of those two voices sounds like. #### Generating conditioning latents from voices @@ -201,13 +211,13 @@ positives. ## Model architecture -Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate +Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/) ## Training -These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of +These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own [DLAS](https://github.com/neonbjb/DL-Art-School) trainer. @@ -243,14 +253,14 @@ of the model increases multiplicatively. On enterprise-grade hardware, this is n exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck. I want to mention here -that I think Tortoise could do be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks +that I think Tortoise could be a **lot** better. 
The three major components of Tortoise are either vanilla Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS. The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller that the original DALLE transformer. Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve. -If you are an ethical organization with computational resources to spare interested in seeing what this model could do +If you are an ethical organization with computational resources to spare interested in seeing what this model could do if properly scaled out, please reach out to me! I would love to collaborate on this. ## Acknowledgements @@ -262,6 +272,7 @@ credit a few of the amazing folks in the community that have helped make this ha - [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf) who authored the DALLE paper, which is the inspiration behind Tortoise. - [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf) who authored the (revision of) the code that drives the diffusion model. - [Jang et al](https://arxiv.org/pdf/2106.07889.pdf) who developed and open-sourced univnet, the vocoder this repo uses. +- [Kim and Jung](https://github.com/mindslab-ai/univnet) who implemented univnet pytorch model. - [lucidrains](https://github.com/lucidrains) who writes awesome open source pytorch models, many of which are used here. - [Patrick von Platen](https://huggingface.co/patrickvonplaten) whose guides on setting up wav2vec were invaluable to building my dataset. @@ -269,4 +280,4 @@ credit a few of the amazing folks in the community that have helped make this ha Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development. -If you use this repo or the ideas therein for your research, please cite it! A bibtex entree can be found in the right pane on GitHub. \ No newline at end of file +If you use this repo or the ideas therein for your research, please cite it! A bibtex entree can be found in the right pane on GitHub. 
diff --git a/examples/naturalspeech_comparison/fibers/naturalspeech.mp3 b/examples/naturalspeech_comparison/fibers/naturalspeech.mp3 new file mode 100644 index 0000000..57e540e Binary files /dev/null and b/examples/naturalspeech_comparison/fibers/naturalspeech.mp3 differ diff --git a/examples/naturalspeech_comparison/fibers/tortoise.mp3 b/examples/naturalspeech_comparison/fibers/tortoise.mp3 new file mode 100644 index 0000000..1788df8 Binary files /dev/null and b/examples/naturalspeech_comparison/fibers/tortoise.mp3 differ diff --git a/examples/naturalspeech_comparison/lax/naturalspeech.mp3 b/examples/naturalspeech_comparison/lax/naturalspeech.mp3 new file mode 100644 index 0000000..ebcb779 Binary files /dev/null and b/examples/naturalspeech_comparison/lax/naturalspeech.mp3 differ diff --git a/examples/naturalspeech_comparison/lax/tortoise.mp3 b/examples/naturalspeech_comparison/lax/tortoise.mp3 new file mode 100644 index 0000000..2901215 Binary files /dev/null and b/examples/naturalspeech_comparison/lax/tortoise.mp3 differ diff --git a/examples/naturalspeech_comparison/maltby/naturalspeech.mp3 b/examples/naturalspeech_comparison/maltby/naturalspeech.mp3 new file mode 100644 index 0000000..4cee574 Binary files /dev/null and b/examples/naturalspeech_comparison/maltby/naturalspeech.mp3 differ diff --git a/examples/naturalspeech_comparison/maltby/tortoise.mp3 b/examples/naturalspeech_comparison/maltby/tortoise.mp3 new file mode 100644 index 0000000..1831056 Binary files /dev/null and b/examples/naturalspeech_comparison/maltby/tortoise.mp3 differ diff --git a/requirements.txt b/requirements.txt index 12fd2bd..c1846c9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,6 @@ tqdm rotary_embedding_torch -transformers +transformers==4.19 tokenizers inflect progressbar @@ -10,3 +10,9 @@ scipy==0.10.1 librosa==0.9.1 numba==0.48.0 ffmpeg +numpy==1.20.0 +numba==0.48.0 +torchaudio +threadpoolctl +llvmlite +appdirs \ No newline at end of file diff --git a/scripts/tortoise_tts.py b/scripts/tortoise_tts.py new file mode 100755 index 0000000..932a780 --- /dev/null +++ b/scripts/tortoise_tts.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python3 + +import argparse +import os +import sys +import tempfile +import time + +import torch +import torchaudio + +from tortoise.api import MODELS_DIR, TextToSpeech +from tortoise.utils.audio import get_voices, load_voices, load_audio +from tortoise.utils.text import split_and_recombine_text + +parser = argparse.ArgumentParser( + description='TorToiSe is a text-to-speech program that is capable of synthesizing speech ' + 'in multiple voices with realistic prosody and intonation.') + +parser.add_argument( + 'text', type=str, nargs='*', + help='Text to speak. If omitted, text is read from stdin.') +parser.add_argument( + '-v, --voice', type=str, default='random', metavar='VOICE', dest='voice', + help='Selects the voice to use for generation. Use the & character to join two voices together. ' + 'Use a comma to perform inference on multiple voices. Set to "all" to use all available voices. ' + 'Note that multiple voices require the --output-dir option to be set.') +parser.add_argument( + '-V, --voices-dir', metavar='VOICES_DIR', type=str, dest='voices_dir', + help='Path to directory containing extra voices to be loaded. 
Use a comma to specify multiple directories.') +parser.add_argument( + '-p, --preset', type=str, default='fast', choices=['ultra_fast', 'fast', 'standard', 'high_quality'], dest='preset', + help='Which voice quality preset to use.') +parser.add_argument( + '-q, --quiet', default=False, action='store_true', dest='quiet', + help='Suppress all output.') + +output_group = parser.add_mutually_exclusive_group(required=True) +output_group.add_argument( + '-l, --list-voices', default=False, action='store_true', dest='list_voices', + help='List available voices and exit.') +output_group.add_argument( + '-P, --play', action='store_true', dest='play', + help='Play the audio (requires pydub).') +output_group.add_argument( + '-o, --output', type=str, metavar='OUTPUT', dest='output', + help='Save the audio to a file.') +output_group.add_argument( + '-O, --output-dir', type=str, metavar='OUTPUT_DIR', dest='output_dir', + help='Save the audio to a directory as individual segments.') + +multi_output_group = parser.add_argument_group('multi-output options (requires --output-dir)') +multi_output_group.add_argument( + '--candidates', type=int, default=1, + help='How many output candidates to produce per-voice. Note that only the first candidate is used in the combined output.') +multi_output_group.add_argument( + '--regenerate', type=str, default=None, + help='Comma-separated list of clip numbers to re-generate.') +multi_output_group.add_argument( + '--skip-existing', action='store_true', + help='Set to skip re-generating existing clips.') + +advanced_group = parser.add_argument_group('advanced options') +advanced_group.add_argument( + '--produce-debug-state', default=False, action='store_true', + help='Whether or not to produce debug_states in current directory, which can aid in reproducing problems.') +advanced_group.add_argument( + '--seed', type=int, default=None, + help='Random seed which can be used to reproduce results.') +advanced_group.add_argument( + '--models-dir', type=str, default=MODELS_DIR, + help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to ' + '~/.cache/tortoise/.models, so this should only be specified if you have custom checkpoints.') +advanced_group.add_argument( + '--text-split', type=str, default=None, + help='How big chunks to split the text into, in the format <desired_length>,<max_length>.') +advanced_group.add_argument( + '--disable-redaction', default=False, action='store_true', + help='Normally text enclosed in brackets are automatically redacted from the spoken output ' + '(but are still rendered by the model), this can be used for prompt engineering. ' + 'Set this to disable this behavior.') +advanced_group.add_argument( + '--device', type=str, default=None, + help='Device to use for inference.') +advanced_group.add_argument( + '--batch-size', type=int, default=None, + help='Batch size to use for inference. If omitted, the batch size is set based on available GPU memory.') + +tuning_group = parser.add_argument_group('tuning options (overrides preset settings)') +tuning_group.add_argument( + '--num-autoregressive-samples', type=int, default=None, + help='Number of samples taken from the autoregressive model, all of which are filtered using CLVP. 
' + 'As TorToiSe is a probabilistic model, more samples means a higher probability of creating something "great".') +tuning_group.add_argument( + '--temperature', type=float, default=None, + help='The softmax temperature of the autoregressive model.') +tuning_group.add_argument( + '--length-penalty', type=float, default=None, + help='A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs.') +tuning_group.add_argument( + '--repetition-penalty', type=float, default=None, + help='A penalty that prevents the autoregressive decoder from repeating itself during decoding. ' + 'Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.') +tuning_group.add_argument( + '--top-p', type=float, default=None, + help='P value used in nucleus sampling. 0 to 1. Lower values mean the decoder produces more "likely" (aka boring) outputs.') +tuning_group.add_argument( + '--max-mel-tokens', type=int, default=None, + help='Restricts the output length. 1 to 600. Each unit is 1/20 of a second.') +tuning_group.add_argument( + '--cvvp-amount', type=float, default=None, + help='How much the CVVP model should influence the output.' + 'Increasing this can in some cases reduce the likelihood of multiple speakers.') +tuning_group.add_argument( + '--diffusion-iterations', type=int, default=None, + help='Number of diffusion steps to perform. More steps means the network has more chances to iteratively' + 'refine the output, which should theoretically mean a higher quality output. ' + 'Generally a value above 250 is not noticeably better, however.') +tuning_group.add_argument( + '--cond-free', type=bool, default=None, + help='Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for ' + 'each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output ' + 'of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and ' + 'dramatically improves realism.') +tuning_group.add_argument( + '--cond-free-k', type=float, default=None, + help='Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. ' + 'As cond_free_k increases, the output becomes dominated by the conditioning-free signal. ' + 'Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k') +tuning_group.add_argument( + '--diffusion-temperature', type=float, default=None, + help='Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 ' + 'are the "mean" prediction of the diffusion network and will sound bland and smeared. ') + +usage_examples = f''' +Examples: + +Read text using random voice and place it in a file: + + {parser.prog} -o hello.wav "Hello, how are you?" + +Read text from stdin and play it using the tom voice: + + echo "Say it like you mean it!" 
| {parser.prog} -P -v tom + +Read a text file using multiple voices and save the audio clips to a directory: + + {parser.prog} -O /tmp/tts-results -v tom,emma max_length: + parser.error(f'--text-split: desired_length ({desired_length}) must be <= max_length ({max_length})') + texts = split_and_recombine_text(text, desired_length, max_length) +else: + texts = split_and_recombine_text(text) +if len(texts) == 0: + parser.error('no text provided') + +if args.output_dir: + os.makedirs(args.output_dir, exist_ok=True) +else: + if len(selected_voices) > 1: + parser.error('cannot have multiple voices without --output-dir"') + if args.candidates > 1: + parser.error('cannot have multiple candidates without --output-dir"') + +# error out early if pydub isn't installed +if args.play: + try: + import pydub + import pydub.playback + except ImportError: + parser.error('--play requires pydub to be installed, which can be done with "pip install pydub"') + +seed = int(time.time()) if args.seed is None else args.seed +if not args.quiet: + print('Loading tts...') +tts = TextToSpeech(models_dir=args.models_dir, enable_redaction=not args.disable_redaction, + device=args.device, autoregressive_batch_size=args.batch_size) +gen_settings = { + 'use_deterministic_seed': seed, + 'verbose': not args.quiet, + 'k': args.candidates, + 'preset': args.preset, +} +tuning_options = [ + 'num_autoregressive_samples', 'temperature', 'length_penalty', 'repetition_penalty', 'top_p', + 'max_mel_tokens', 'cvvp_amount', 'diffusion_iterations', 'cond_free', 'cond_free_k', 'diffusion_temperature'] +for option in tuning_options: + if getattr(args, option) is not None: + gen_settings[option] = getattr(args, option) +total_clips = len(texts) * len(selected_voices) +regenerate_clips = [int(x) for x in args.regenerate.split(',')] if args.regenerate else None +for voice_idx, voice in enumerate(selected_voices): + audio_parts = [] + voice_samples, conditioning_latents = load_voices(voice, extra_voice_dirs) + for text_idx, text in enumerate(texts): + clip_name = f'{"-".join(voice)}_{text_idx:02d}' + if args.output_dir: + first_clip = os.path.join(args.output_dir, f'{clip_name}_00.wav') + if (args.skip_existing or (regenerate_clips and text_idx not in regenerate_clips)) and os.path.exists(first_clip): + audio_parts.append(load_audio(first_clip, 24000)) + if not args.quiet: + print(f'Skipping {clip_name}') + continue + if not args.quiet: + print(f'Rendering {clip_name} ({(voice_idx * len(texts) + text_idx + 1)} of {total_clips})...') + print(' ' + text) + gen = tts.tts_with_preset( + text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, **gen_settings) + gen = gen if args.candidates > 1 else [gen] + for candidate_idx, audio in enumerate(gen): + audio = audio.squeeze(0).cpu() + if candidate_idx == 0: + audio_parts.append(audio) + if args.output_dir: + filename = f'{clip_name}_{candidate_idx:02d}.wav' + torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000) + + audio = torch.cat(audio_parts, dim=-1) + if args.output_dir: + filename = f'{"-".join(voice)}_combined.wav' + torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000) + elif args.output: + filename = args.output if args.output else os.tmp + torchaudio.save(args.output, audio, 24000) + elif args.play: + f = tempfile.NamedTemporaryFile(suffix='.wav', delete=True) + torchaudio.save(f.name, audio, 24000) + pydub.playback.play(pydub.AudioSegment.from_wav(f.name)) + + if args.produce_debug_state: + os.makedirs('debug_states', exist_ok=True) + 
dbg_state = (seed, texts, voice_samples, conditioning_latents, args) + torch.save(dbg_state, os.path.join('debug_states', f'debug_{"-".join(voice)}.pth')) diff --git a/setup.py b/setup.py index fce0bd9..99bae37 100644 --- a/setup.py +++ b/setup.py @@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh: setuptools.setup( name="TorToiSe", packages=setuptools.find_packages(), - version="2.3.0", + version="2.4.2", author="James Betker", author_email="james@adamant.ai", description="A high quality multi-voice text-to-speech library", @@ -14,6 +14,10 @@ setuptools.setup( long_description_content_type="text/markdown", url="https://github.com/neonbjb/tortoise-tts", project_urls={}, + scripts=[ + 'scripts/tortoise_tts.py', + ], + include_package_data=True, install_requires=[ 'tqdm', 'rotary_embedding_torch', @@ -32,4 +36,4 @@ setuptools.setup( "Operating System :: OS Independent", ], python_requires=">=3.6", -) \ No newline at end of file +) diff --git a/tortoise/api.py b/tortoise/api.py index ca8d825..296ef14 100644 --- a/tortoise/api.py +++ b/tortoise/api.py @@ -1,6 +1,7 @@ import os import random import uuid +from time import time from urllib import request import torch @@ -9,13 +10,13 @@ import progressbar import torchaudio from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead -from tortoise.models.cvvp import CVVP from tortoise.models.diffusion_decoder import DiffusionTts from tortoise.models.autoregressive import UnifiedVoice from tqdm import tqdm from tortoise.models.arch_util import TorchMelSpectrogram from tortoise.models.clvp import CLVP +from tortoise.models.cvvp import CVVP from tortoise.models.random_latent_generator import RandomLatentConverter from tortoise.models.vocoder import UnivNetGenerator from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel @@ -25,22 +26,25 @@ from tortoise.utils.wav2vec_alignment import Wav2VecAlignment pbar = None +DEFAULT_MODELS_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'tortoise', 'models') +MODELS_DIR = os.environ.get('TORTOISE_MODELS_DIR', DEFAULT_MODELS_DIR) +MODELS = { + 'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth', + 'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth', + 'clvp2.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth', + 'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth', + 'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth', + 'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth', + 'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth', + 'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth', +} def download_models(specific_models=None): """ Call to download all the models that Tortoise uses. 
""" - MODELS = { - 'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth', - 'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth', - 'clvp2.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth', - 'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth', - 'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth', - 'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth', - 'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth', - 'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth', - } - os.makedirs('.models', exist_ok=True) + os.makedirs(MODELS_DIR, exist_ok=True) + def show_progress(block_num, block_size, total_size): global pbar if pbar is None: @@ -56,13 +60,26 @@ def download_models(specific_models=None): for model_name, url in MODELS.items(): if specific_models is not None and model_name not in specific_models: continue - if os.path.exists(f'.models/{model_name}'): + model_path = os.path.join(MODELS_DIR, model_name) + if os.path.exists(model_path): continue print(f'Downloading {model_name} from {url}...') - request.urlretrieve(url, f'.models/{model_name}', show_progress) + request.urlretrieve(url, model_path, show_progress) print('Done.') +def get_model_path(model_name, models_dir=MODELS_DIR): + """ + Get path to given model, download it if it doesn't exist. + """ + if model_name not in MODELS: + raise ValueError(f'Model {model_name} not found in available models.') + model_path = os.path.join(models_dir, model_name) + if not os.path.exists(model_path) and models_dir == MODELS_DIR: + download_models([model_name]) + return model_path + + def pad_or_truncate(t, length): """ Utility function for forcing to have the specified sequence length, whether by clipping it or padding it with 0s. @@ -84,7 +101,7 @@ def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusi conditioning_free=cond_free, conditioning_free_k=cond_free_k) -def format_conditioning(clip, cond_length=132300): +def format_conditioning(clip, cond_length=132300, device='cuda'): """ Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models. """ @@ -95,7 +112,7 @@ def format_conditioning(clip, cond_length=132300): rand_start = random.randint(0, gap) clip = clip[:, rand_start:rand_start + cond_length] mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0) - return mel_clip.unsqueeze(0).cuda() + return mel_clip.unsqueeze(0).to(device) def fix_autoregressive_output(codes, stop_token, complain=True): @@ -150,22 +167,38 @@ def classify_audio_clip(clip): :param clip: torch tensor containing audio waveform data (get it from load_audio) :return: True if the clip was classified as coming from Tortoise and false if it was classified as real. 
""" - download_models(['classifier.pth']) classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4, resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32, dropout=0, kernel_size=5, distribute_zero_label=False) - classifier.load_state_dict(torch.load('.models/classifier.pth', map_location=torch.device('cpu'))) + classifier.load_state_dict(torch.load(get_model_path('classifier.pth'), map_location=torch.device('cpu'))) clip = clip.cpu().unsqueeze(0) results = F.softmax(classifier(clip), dim=-1) return results[0][0] +def pick_best_batch_size_for_gpu(): + """ + Tries to pick a batch size that will fit in your GPU. These sizes aren't guaranteed to work, but they should give + you a good shot. + """ + if torch.cuda.is_available(): + _, available = torch.cuda.mem_get_info() + availableGb = available / (1024 ** 3) + if availableGb > 14: + return 16 + elif availableGb > 10: + return 8 + elif availableGb > 7: + return 4 + return 1 + + class TextToSpeech: """ Main entry point into Tortoise. """ - def __init__(self, autoregressive_batch_size=16, models_dir='.models', enable_redaction=True): + def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None): """ Constructor :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing @@ -175,14 +208,16 @@ class TextToSpeech: :param enable_redaction: When true, text enclosed in brackets are automatically redacted from the spoken output (but are still rendered by the model). This can be used for prompt engineering. Default is true. + :param device: Device to use when running the model. If omitted, the device will be automatically chosen. """ - self.autoregressive_batch_size = autoregressive_batch_size + self.models_dir = models_dir + self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size self.enable_redaction = enable_redaction + self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') if self.enable_redaction: self.aligner = Wav2VecAlignment() self.tokenizer = VoiceBpeTokenizer() - download_models() if os.path.exists(f'{models_dir}/autoregressive.ptt'): # Assume this is a traced directory. 
@@ -193,31 +228,34 @@ class TextToSpeech: model_dim=1024, heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False, train_solo_embeddings=False).cpu().eval() - self.autoregressive.load_state_dict(torch.load(f'{models_dir}/autoregressive.pth')) + self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir))) self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200, in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16, layer_drop=0, unconditioned_percentage=0).cpu().eval() - self.diffusion.load_state_dict(torch.load(f'{models_dir}/diffusion_decoder.pth')) + self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir))) self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20, text_seq_len=350, text_heads=12, num_speech_tokens=8192, speech_enc_depth=20, speech_heads=12, speech_seq_len=430, use_xformers=True).cpu().eval() - self.clvp.load_state_dict(torch.load(f'{models_dir}/clvp2.pth')) - - self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0, - speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval() - self.cvvp.load_state_dict(torch.load(f'{models_dir}/cvvp.pth')) + self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir))) + self.cvvp = None # CVVP model is only loaded if used. self.vocoder = UnivNetGenerator().cpu() - self.vocoder.load_state_dict(torch.load(f'{models_dir}/vocoder.pth')['model_g']) + self.vocoder.load_state_dict(torch.load(get_model_path('vocoder.pth', models_dir), map_location=torch.device('cpu'))['model_g']) self.vocoder.eval(inference=True) # Random latent generators (RLGs) are loaded lazily. self.rlg_auto = None self.rlg_diffusion = None + def load_cvvp(self): + """Load CVVP model.""" + self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0, + speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval() + self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir))) + def get_conditioning_latents(self, voice_samples, return_mels=False): """ Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent). @@ -226,15 +264,15 @@ class TextToSpeech: :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data. 
""" with torch.no_grad(): - voice_samples = [v.to('cuda') for v in voice_samples] + voice_samples = [v.to(self.device) for v in voice_samples] auto_conds = [] if not isinstance(voice_samples, list): voice_samples = [voice_samples] for vs in voice_samples: - auto_conds.append(format_conditioning(vs)) + auto_conds.append(format_conditioning(vs, device=self.device)) auto_conds = torch.stack(auto_conds, dim=1) - self.autoregressive = self.autoregressive.cuda() + self.autoregressive = self.autoregressive.to(self.device) auto_latent = self.autoregressive.get_conditioning(auto_conds) self.autoregressive = self.autoregressive.cpu() @@ -243,11 +281,11 @@ class TextToSpeech: # The diffuser operates at a sample rate of 24000 (except for the latent inputs) sample = torchaudio.functional.resample(sample, 22050, 24000) sample = pad_or_truncate(sample, 102400) - cond_mel = wav_to_univnet_mel(sample.to('cuda'), do_normalization=False) + cond_mel = wav_to_univnet_mel(sample.to(self.device), do_normalization=False, device=self.device) diffusion_conds.append(cond_mel) diffusion_conds = torch.stack(diffusion_conds, dim=1) - self.diffusion = self.diffusion.cuda() + self.diffusion = self.diffusion.to(self.device) diffusion_latent = self.diffusion.get_conditioning(diffusion_conds) self.diffusion = self.diffusion.cpu() @@ -260,9 +298,9 @@ class TextToSpeech: # Lazy-load the RLG models. if self.rlg_auto is None: self.rlg_auto = RandomLatentConverter(1024).eval() - self.rlg_auto.load_state_dict(torch.load('.models/rlg_auto.pth', map_location=torch.device('cpu'))) + self.rlg_auto.load_state_dict(torch.load(get_model_path('rlg_auto.pth', self.models_dir), map_location=torch.device('cpu'))) self.rlg_diffusion = RandomLatentConverter(2048).eval() - self.rlg_diffusion.load_state_dict(torch.load('.models/rlg_diffuser.pth', map_location=torch.device('cpu'))) + self.rlg_diffusion.load_state_dict(torch.load(get_model_path('rlg_diffuser.pth', self.models_dir), map_location=torch.device('cpu'))) with torch.no_grad(): return self.rlg_auto(torch.tensor([0.0])), self.rlg_diffusion(torch.tensor([0.0])) @@ -275,9 +313,9 @@ class TextToSpeech: 'high_quality': Use if you want the absolute best. This is not really worth the compute, though. """ # Use generally found best tuning knobs for generation. - kwargs.update({'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0, - 'top_p': .8, - 'cond_free_k': 2.0, 'diffusion_temperature': 1.0}) + settings = {'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0, + 'top_p': .8, + 'cond_free_k': 2.0, 'diffusion_temperature': 1.0} # Presets are defined here. 
presets = { 'ultra_fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False}, @@ -285,14 +323,16 @@ class TextToSpeech: 'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200}, 'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400}, } - kwargs.update(presets[preset]) - return self.tts(text, **kwargs) + settings.update(presets[preset]) + settings.update(kwargs) # allow overriding of preset settings with kwargs + return self.tts(text, **settings) - def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, + def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None, + return_deterministic_state=False, # autoregressive generation parameters follow num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500, - # CLVP & CVVP parameters - clvp_cvvp_slider=.5, + # CVVP parameters follow + cvvp_amount=.0, # diffusion generation parameters follow diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0, **hf_generate_kwargs): @@ -303,10 +343,10 @@ class TextToSpeech: :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which can be provided in lieu of voice_samples. This is ignored unless voice_samples=None. Conditioning latents can be retrieved via get_conditioning_latents(). - :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP and CVVP models) clips are returned. + :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP model) clips are returned. :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true. ~~AUTOREGRESSIVE KNOBS~~ - :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP+CVVP. + :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP. As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great". :param temperature: The softmax temperature of the autoregressive model. :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs. @@ -319,10 +359,8 @@ class TextToSpeech: could use some tuning. :param typical_mass: The typical_mass parameter from the typical_sampling algorithm. ~~CLVP-CVVP KNOBS~~ - :param clvp_cvvp_slider: Controls the influence of the CLVP and CVVP models in selecting the best output from the autoregressive model. - [0,1]. Values closer to 1 will cause Tortoise to emit clips that follow the text more. Values closer to - 0 will cause Tortoise to emit clips that more closely follow the reference clip (e.g. the voice sounds more - similar). + :param cvvp_amount: Controls the influence of the CVVP model in selecting the best output from the autoregressive model. + [0,1]. Values closer to 1 mean the CVVP model is more important, 0 disables the CVVP model. ~~DIFFUSION KNOBS~~ :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine the output, which should theoretically mean a higher quality output. 
Generally a value above 250 is not noticeably better, @@ -343,7 +381,9 @@ class TextToSpeech: :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length. Sample rate is 24kHz. """ - text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).cuda() + deterministic_seed = self.deterministic_state(seed=use_deterministic_seed) + + text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device) text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary. assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.' @@ -354,8 +394,8 @@ class TextToSpeech: auto_conditioning, diffusion_conditioning = conditioning_latents else: auto_conditioning, diffusion_conditioning = self.get_random_conditioning_latents() - auto_conditioning = auto_conditioning.cuda() - diffusion_conditioning = diffusion_conditioning.cuda() + auto_conditioning = auto_conditioning.to(self.device) + diffusion_conditioning = diffusion_conditioning.to(self.device) diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k) @@ -364,7 +404,7 @@ class TextToSpeech: num_batches = num_autoregressive_samples // self.autoregressive_batch_size stop_mel_token = self.autoregressive.stop_mel_token calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output" - self.autoregressive = self.autoregressive.cuda() + self.autoregressive = self.autoregressive.to(self.device) if verbose: print("Generating autoregressive samples..") for b in tqdm(range(num_batches), disable=not verbose): @@ -383,33 +423,44 @@ class TextToSpeech: self.autoregressive = self.autoregressive.cpu() clip_results = [] - self.clvp = self.clvp.cuda() - self.cvvp = self.cvvp.cuda() + self.clvp = self.clvp.to(self.device) + if cvvp_amount > 0: + if self.cvvp is None: + self.load_cvvp() + self.cvvp = self.cvvp.to(self.device) if verbose: - print("Computing best candidates using CLVP and CVVP") + if self.cvvp is None: + print("Computing best candidates using CLVP") + else: + print(f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%") for batch in tqdm(samples, disable=not verbose): for i in range(batch.shape[0]): batch[i] = fix_autoregressive_output(batch[i], stop_mel_token) - clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False) - if auto_conds is not None: + if cvvp_amount != 1: + clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False) + if auto_conds is not None and cvvp_amount > 0: cvvp_accumulator = 0 for cl in range(auto_conds.shape[1]): cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False) cvvp = cvvp_accumulator / auto_conds.shape[1] - clip_results.append(clvp * clvp_cvvp_slider + cvvp * (1-clvp_cvvp_slider)) + if cvvp_amount == 1: + clip_results.append(cvvp) + else: + clip_results.append(cvvp * cvvp_amount + clvp * (1-cvvp_amount)) else: clip_results.append(clvp) clip_results = torch.cat(clip_results, dim=0) samples = torch.cat(samples, dim=0) best_results = samples[torch.topk(clip_results, k=k).indices] self.clvp = self.clvp.cpu() - self.cvvp = self.cvvp.cpu() + if self.cvvp is not None: + self.cvvp = self.cvvp.cpu() del samples # The diffusion model actually wants the last hidden layer from the autoregressive model 
as conditioning # inputs. Re-produce those for the top results. This could be made more efficient by storing all of these # results, but will increase memory usage. - self.autoregressive = self.autoregressive.cuda() + self.autoregressive = self.autoregressive.to(self.device) best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1), torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results, torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device), @@ -420,8 +471,8 @@ class TextToSpeech: if verbose: print("Transforming autoregressive outputs into audio..") wav_candidates = [] - self.diffusion = self.diffusion.cuda() - self.vocoder = self.vocoder.cuda() + self.diffusion = self.diffusion.to(self.device) + self.vocoder = self.vocoder.to(self.device) for b in range(best_results.shape[0]): codes = best_results[b].unsqueeze(0) latents = best_latents[b].unsqueeze(0) @@ -449,7 +500,26 @@ class TextToSpeech: return self.aligner.redact(clip.squeeze(1), text).unsqueeze(1) return clip wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates] - if len(wav_candidates) > 1: - return wav_candidates - return wav_candidates[0] + if len(wav_candidates) > 1: + res = wav_candidates + else: + res = wav_candidates[0] + + if return_deterministic_state: + return res, (deterministic_seed, text, voice_samples, conditioning_latents) + else: + return res + + def deterministic_state(self, seed=None): + """ + Sets the random seeds that tortoise uses to the current time() and returns that seed so results can be + reproduced. + """ + seed = int(time()) if seed is None else seed + torch.manual_seed(seed) + random.seed(seed) + # Can't currently set this because of CUBLAS. TODO: potentially enable it if necessary. + # torch.use_deterministic_algorithms(True) + + return seed diff --git a/tortoise/data/got.txt b/tortoise/data/got.txt new file mode 100644 index 0000000..a7180b9 --- /dev/null +++ b/tortoise/data/got.txt @@ -0,0 +1,276 @@ +Chapter One + + +Bran + + +The morning had dawned clear and cold, with a crispness that hinted at the end of summer. They set forth at daybreak to see a man beheaded, twenty in all, and Bran rode among them, nervous with excitement. This was the first time he had been deemed old enough to go with his lord father and his brothers to see the king's justice done. It was the ninth year of summer, and the seventh of Bran's life. + + +The man had been taken outside a small holdfast in the hills. Robb thought he was a wildling, his sword sworn to Mance Rayder, the King-beyond-the-Wall. It made Bran's skin prickle to think of it. He remembered the hearth tales Old Nan told them. The wildlings were cruel men, she said, slavers and slayers and thieves. They consorted with giants and ghouls, stole girl children in the dead of night, and drank blood from polished horns. And their women lay with the Others in the Long Night to sire terrible half-human children. + + +But the man they found bound hand and foot to the holdfast wall awaiting the king's justice was old and scrawny, not much taller than Robb. He had lost both ears and a finger to frostbite, and he dressed all in black, the same as a brother of the Night's Watch, except that his furs were ragged and greasy. + + +The breath of man and horse mingled, steaming, in the cold morning air as his lord father had the man cut down from the wall and dragged before them. 
Robb and Jon sat tall and still on their horses, with Bran between them on his pony, trying to seem older than seven, trying to pretend that he'd seen all this before. A faint wind blew through the holdfast gate. Over their heads flapped the banner of the Starks of Winterfell: a grey direwolf racing across an ice-white field. + +Bran's father sat solemnly on his horse, long brown hair stirring in the wind. His closely trimmed beard was shot with white, making him look older than his thirty-five years. He had a grim cast to his grey eyes this day, and he seemed not at all the man who would sit before the fire in the evening and talk softly of the age of heroes and the children of the forest. He had taken off Father's face, Bran thought, and donned the face of Lord Stark of Winterfell. + + +There were questions asked and answers given there in the chill of morning, but afterward Bran could not recall much of what had been said. Finally his lord father gave a command, and two of his guardsmen dragged the ragged man to the ironwood stump in the center of the square. They forced his head down onto the hard black wood. Lord Eddard Stark dismounted and his ward Theon Greyjoy brought forth the sword. "Ice," that sword was called. It was as wide across as a man's hand, and taller even than Robb. The blade was Valyrian steel, spell-forged and dark as smoke. Nothing held an edge like Valyrian steel. + + +His father peeled off his gloves and handed them to Jory Cassel, the captain of his household guard. He took hold of Ice with both hands and said, "In the name of Robert of the House Baratheon, the First of his Name, King of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms and Protector of the Realm, by the word of Eddard of the House Stark, Lord of Winterfell and Warden of the North, I do sentence you to die." He lifted the greatsword high above his head. + + +Bran's bastard brother Jon Snow moved closer. "Keep the pony well in hand," he whispered. "And don't look away. Father will know if you do." + + +Bran kept his pony well in hand, and did not look away. + + +His father took off the man's head with a single sure stroke. Blood sprayed out across the snow, as red as surnmerwine. One of the horses reared and had to be restrained to keep from bolting. Bran could not take his eyes off the blood. The snows around the stump drank it eagerly, reddening as he watched. + +The head bounced off a thick root and rolled. It came up near Greyjoy's feet. Theon was a lean, dark youth of nineteen who found everything amusing. He laughed, put his boot on the head, and kicked it away. + + +"Ass," Jon muttered, low enough so Greyjoy did not hear. He put a hand on Bran's shoulder, and Bran looked over at his bastard brother. "You did well," Jon told him solemnly. Jon was fourteen, an old hand at justice. + + +It seemed colder on the long ride back to Winterfell, though the wind had died by then and the sun was higher in the sky. Bran rode with his brothers, well ahead of the main party, his pony struggling hard to keep up with their horses. + + +"The deserter died bravely," Robb said. He was big and broad and growing every day, with his mother's coloring, the fair skin, red-brown hair, and blue eyes of the Tullys of Riverrun. "He had courage, at the least." + + +"No," Jon Snow said quietly. "It was not courage. This one was dead of fear. You could see it in his eyes, Stark." Jon's eyes were a grey so dark they seemed almost black, but there was little they did not see. 
He was of an age with Robb, but they did not look alike. Jon was slender where Robb was muscular, dark where Robb was fair, graceful and quick where his half brother was strong and fast. + + +Robb was not impressed. "The Others take his eyes," he swore. "He died well. Race you to the bridge?" + + +"Done," Jon said, kicking his horse forward. Robb cursed and followed, and they galloped off down the trail, Robb laughing and hooting, Jon silent and intent. The hooves of their horses kicked up showers of snow as they went. + +Bran did not try to follow. His pony could not keep up. He had seen the ragged man's eyes, and he was thinking of them now. After a while, the sound of Robb's laughter receded, and the woods grew silent again. + + +So deep in thought was he that he never heard the rest of the party until his father moved up to ride beside him. "Are you well, Bran?" he asked, not unkindly. + + +"Yes, Father," Bran told him. He looked up. Wrapped in his furs and leathers, mounted on his great warhorse, his lord father loomed over him like a giant. "Robb says the man died bravely, but Jon says he was afraid." + + +"What do you think?" his father asked. + + +Bran thought about it. "Can a man still be brave if he's afraid?" + + +"That is the only time a man can be brave," his father told him. "Do you understand why I did it?" + + +"He was a wildling," Bran said. "They carry off women and sell them to the Others." + + +His lord father smiled. "Old Nan has been telling you stories again. In truth, the man was an oathbreaker, a deserter from the Night's Watch. No man is more dangerous. The deserter knows his life is forfeit if he is taken, so he will not flinch from any crime, no matter how vile. But you mistake me. The question was not why the man had to die, but why I must do it." + + +Bran had no answer for that. "King Robert has a headsman," he said, uncertainly. + + +"He does," his father admitted. "As did the Targaryen kings before him. Yet our way is the older way. The blood of the First Men still flows in the veins of the Starks, and we hold to the belief that the man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die. + + +"One day, Bran, you will be Robb's bannerman, holding a keep of your own for your brother and your king, and justice will fall to you. When that day comes, you must take no pleasure in the task, but neither must you look away. A ruler who hides behind paid executioners soon forgets what death is." + + +That was when Jon reappeared on the crest of the hill before them. He waved and shouted down at them. "Father, Bran, come quickly, see what Robb has found!" Then he was gone again. + + +Jory rode up beside them. "Trouble, my lord?" + + +"Beyond a doubt," his lord father said. "Come, let us see what mischief my sons have rooted out now." He sent his horse into a trot. Jory and Bran and the rest came after. + + +They found Robb on the riverbank north of the bridge, with Jon still mounted beside him. The late summer snows had been heavy this moonturn. Robb stood knee-deep in white, his hood pulled back so the sun shone in his hair. He was cradling something in his arm, while the boys talked in hushed, excited voices. + + +The riders picked their way carefully through the drifts, groping for solid footing on the hidden, uneven ground . Jory Cassel and Theon Greyjoy were the first to reach the boys. 
Greyjoy was laughing and joking as he rode. Bran heard the breath go out of him. "Gods!" he exclaimed, struggling to keep control of his horse as he reached for his sword. + + +Jory's sword was already out. "Robb, get away from it!" he called as his horse reared under him. + + +Robb grinned and looked up from the bundle in his arms. "She can't hurt you," he said. "She's dead, Jory." + + +Bran was afire with curiosity by then. He would have spurred the pony faster, but his father made them dismount beside the bridge and approach on foot. Bran jumped off and ran. + + +By then Jon, Jory, and Theon Greyjoy had all dismounted as well. "What in the seven hells is it?" Greyjoy was saying. + + +"A wolf," Robb told him. + + +"A freak," Greyjoy said. "Look at the size of it." + + +Bran's heart was thumping in his chest as he pushed through a waist-high drift to his brothers' side. + + +Half-buried in bloodstained snow, a huge dark shape slumped in death. Ice had formed in its shaggy grey fur, and the faint smell of corruption clung to it like a woman's perfume. Bran glimpsed blind eyes crawling with maggots, a wide mouth full of yellowed teeth. But it was the size of it that made him gasp. It was bigger than his pony, twice the size of the largest hound in his father's kennel. + + +"It's no freak," Jon said calmly. "That's a direwolf. They grow larger than the other kind." + + +Theon Greyjoy said, "There's not been a direwolf sighted south of the Wall in two hundred years." + + +"I see one now," Jon replied. + + +Bran tore his eyes away from the monster. That was when he noticed the bundle in Robb's arms. He gave a cry of delight and moved closer. The pup was a tiny ball of grey-black fur, its eyes still closed. It nuzzled blindly against Robb's chest as he cradled it, searching for milk among his leathers, making a sad little whimpery sound. Bran reached out hesitantly. "Go on," Robb told him. "You can touch him." + + +Bran gave the pup a quick nervous stroke, then turned as Jon said, "Here you go." His half brother put a second pup into his arms. "There are five of them." Bran sat down in the snow and hugged the wolf pup to his face. Its fur was soft and warm against his cheek. + + +"Direwolves loose in the realm, after so many years," muttered Hullen, the master of horse. "I like it not." + + +"It is a sign," Jory said. + + +Father frowned. "This is only a dead animal, Jory," he said. Yet he seemed troubled. Snow crunched under his boots as he moved around the body. "Do we know what killed her?" + + +"There's something in the throat," Robb told him, proud to have found the answer before his father even asked. "There, just under the jaw." + + +His father knelt and groped under the beast's head with his hand. He gave a yank and held it up for all to see. A foot of shattered antler, tines snapped off, all wet with blood. + + +A sudden silence descended over the party. The men looked at the antler uneasily, and no one dared to speak. Even Bran could sense their fear, though he did not understand. + + +His father tossed the antler to the side and cleansed his hands in the snow. "I'm surprised she lived long enough to whelp," he said. His voice broke the spell. + + +"Maybe she didn't," Jory said. "I've heard tales . . . maybe the bitch was already dead when the pups came." + + +"Born with the dead," another man put in. "Worse luck." + + +"No matter," said Hullen. "They be dead soon enough too." + + +Bran gave a wordless cry of dismay. + + +"The sooner the better," Theon Greyjoy agreed. 
He drew his sword. "Give the beast here, Bran." + + +The little thing squirmed against him, as if it heard and understood. "No!" Bran cried out fiercely. "It's mine." + + +"Put away your sword, Greyjoy," Robb said. For a moment he sounded as commanding as their father, like the lord he would someday be. "We will keep these pups." + + +"You cannot do that, boy," said Harwin, who was Hullen's son. + + +"It be a mercy to kill them," Hullen said. + + +Bran looked to his lord father for rescue, but got only a frown, a furrowed brow. "Hullen speaks truly, son. Better a swift death than a hard one from cold and starvation." + + +"No!" He could feel tears welling in his eyes, and he looked away. He did not want to cry in front of his father. + + +Robb resisted stubbornly. "Ser Rodrik's red bitch whelped again last week," he said. "It was a small litter, only two live pups. She'll have milk enough." + + +"She'll rip them apart when they try to nurse." + + +"Lord Stark," Jon said. It was strange to hear him call Father that, so formal. Bran looked at him with desperate hope. "There are five pups," he told Father. "Three male, two female." + + +"What of it, Jon?" + + +"You have five trueborn children," Jon said. "Three sons, two daughters. The direwolf is the sigil of your House. Your children were meant to have these pups, my lord." + + +Bran saw his father's face change, saw the other men exchange glances. He loved Jon with all his heart at that moment. Even at seven, Bran understood what his brother had done. The count had come right only because Jon had omitted himself. He had included the girls, included even Rickon, the baby, but not the bastard who bore the surname Snow, the name that custom decreed be given to all those in the north unlucky enough to be born with no name of their own. + + +Their father understood as well. "You want no pup for yourself, Jon?" he asked softly. + + +"The direwolf graces the banners of House Stark," Jon pointed out. "I am no Stark, Father." + + +Their lord father regarded Jon thoughtfully. Robb rushed into the silence he left. "I will nurse him myself, Father," he promised. "I will soak a towel with warm milk, and give him suck from that." + + +"Me too!" Bran echoed. + + +The lord weighed his sons long and carefully with his eyes. "Easy to say, and harder to do. I will not have you wasting the servants' time with this. If you want these pups, you will feed them yourselves. Is that understood?" + + +Bran nodded eagerly. The pup squirmed in his grasp, licked at his face with a warm tongue. + + +"You must train them as well," their father said. "You must train them. The kennelmaster will have nothing to do with these monsters, I promise you that. And the gods help you if you neglect them, or brutalize them, or train them badly. These are not dogs to beg for treats and slink off at a kick. A direwolf will rip a man's arm off his shoulder as easily as a dog will kill a rat. Are you sure you want this?" + +"Yes, Father," Bran said. + + +"Yes," Robb agreed. + + +"The pups may die anyway, despite all you do." + + +"They won't die," Robb said. "We won't let them die." + + +"Keep them, then. Jory, Desmond, gather up the other pups. It's time we were back to Winterfell." + + +It was not until they were mounted and on their way that Bran allowed himself to taste the sweet air of victory. By then, his pup was snuggled inside his leathers, warm against him, safe for the long ride home. Bran was wondering what to name him. + + +Halfway across the bridge, Jon pulled up suddenly. 
+ + +"What is it, Jon?" their lord father asked. + + +"Can't you hear it?" + + +Bran could hear the wind in the trees, the clatter of their hooves on the ironwood planks, the whimpering of his hungry pup, but Jon was listening to something else. + + +"There," Jon said. He swung his horse around and galloped back across the bridge. They watched him dismount where the direwolf lay dead in the snow, watched him kneel. A moment later he was riding back to them, smiling. + + +"He must have crawled away from the others," Jon said. + + +"Or been driven away," their father said, looking at the sixth pup. His fur was white, where the rest of the litter was grey. His eyes were as red as the blood of the ragged man who had died that morning. Bran thought it curious that this pup alone would have opened his eyes while the others were still blind. + + +"An albino," Theon Greyjoy said with wry amusement. "This one will die even faster than the others." + + +Jon Snow gave his father's ward a long, chilling look. "I think not, Greyjoy," he said. "This one belongs to me." \ No newline at end of file diff --git a/tortoise/do_tts.py b/tortoise/do_tts.py index b74466c..522afa0 100644 --- a/tortoise/do_tts.py +++ b/tortoise/do_tts.py @@ -1,10 +1,11 @@ import argparse import os +import torch import torchaudio -from api import TextToSpeech -from tortoise.utils.audio import load_audio, get_voices, load_voice +from api import TextToSpeech, MODELS_DIR +from utils.audio import load_voices if __name__ == '__main__': parser = argparse.ArgumentParser() @@ -12,26 +13,36 @@ if __name__ == '__main__': parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast') - parser.add_argument('--voice_diversity_intelligibility_slider', type=float, - help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intellibility', - default=.5) parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/') parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' - 'should only be specified if you have custom checkpoints.', default='.models') + 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice.', default=3) + parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) + parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True) + parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output.' + 'Increasing this can in some cases reduce the likelihood of multiple speakers. 
Defaults to 0 (disabled)', default=.0) args = parser.parse_args() os.makedirs(args.output_path, exist_ok=True) tts = TextToSpeech(models_dir=args.model_dir) selected_voices = args.voice.split(',') - for k, voice in enumerate(selected_voices): - voice_samples, conditioning_latents = load_voice(voice) - gen = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents, - preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider) + for k, selected_voice in enumerate(selected_voices): + if '&' in selected_voice: + voice_sel = selected_voice.split('&') + else: + voice_sel = [selected_voice] + voice_samples, conditioning_latents = load_voices(voice_sel) + + gen, dbg_state = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents, + preset=args.preset, use_deterministic_seed=args.seed, return_deterministic_state=True, cvvp_amount=args.cvvp_amount) if isinstance(gen, list): for j, g in enumerate(gen): - torchaudio.save(os.path.join(args.output_path, f'{voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000) + torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000) else: - torchaudio.save(os.path.join(args.output_path, f'{voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000) + torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000) + + if args.produce_debug_state: + os.makedirs('debug_states', exist_ok=True) + torch.save(dbg_state, f'debug_states/do_tts_debug_{selected_voice}.pth') diff --git a/tortoise/models/arch_util.py b/tortoise/models/arch_util.py index 5d8c36e..661ee1f 100644 --- a/tortoise/models/arch_util.py +++ b/tortoise/models/arch_util.py @@ -1,3 +1,4 @@ +import os import functools import math @@ -42,7 +43,7 @@ def normalization(channels): class QKVAttentionLegacy(nn.Module): """ - A module which performs QKV attention. Matches legacy QKVAttention + input/ouput heads shaping + A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping """ def __init__(self, n_heads): @@ -288,9 +289,12 @@ class AudioMiniEncoder(nn.Module): return h[:, :, 0] +DEFAULT_MEL_NORM_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data/mel_norms.pth') + + class TorchMelSpectrogram(nn.Module): def __init__(self, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0, mel_fmax=8000, - sampling_rate=22050, normalize=False, mel_norm_file='tortoise/data/mel_norms.pth'): + sampling_rate=22050, normalize=False, mel_norm_file=DEFAULT_MEL_NORM_FILE): super().__init__() # These are the default tacotron values for the MEL spectrogram. self.filter_length = filter_length @@ -338,7 +342,7 @@ class CheckpointedLayer(nn.Module): for k, v in kwargs.items(): assert not (isinstance(v, torch.Tensor) and v.requires_grad) # This would screw up checkpointing. 
partial = functools.partial(self.wrap, **kwargs) - return torch.utils.checkpoint.checkpoint(partial, x, *args) + return partial(x, *args) class CheckpointedXTransformerEncoder(nn.Module): diff --git a/tortoise/models/classifier.py b/tortoise/models/classifier.py index ce574ea..f92d99e 100644 --- a/tortoise/models/classifier.py +++ b/tortoise/models/classifier.py @@ -1,6 +1,5 @@ import torch import torch.nn as nn -from torch.utils.checkpoint import checkpoint from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock @@ -64,14 +63,6 @@ class ResBlock(nn.Module): self.skip_connection = nn.Conv1d(dims, channels, self.out_channels, 1) def forward(self, x): - if self.do_checkpoint: - return checkpoint( - self._forward, x - ) - else: - return self._forward(x) - - def _forward(self, x): if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) @@ -125,7 +116,7 @@ class AudioMiniEncoder(nn.Module): h = self.res(h) h = self.final(h) for blk in self.attn: - h = checkpoint(blk, h) + h = blk(h) return h[:, :, 0] diff --git a/tortoise/models/cvvp.py b/tortoise/models/cvvp.py index d094649..544ca47 100644 --- a/tortoise/models/cvvp.py +++ b/tortoise/models/cvvp.py @@ -2,7 +2,6 @@ import torch import torch.nn as nn import torch.nn.functional as F from torch import einsum -from torch.utils.checkpoint import checkpoint from tortoise.models.arch_util import AttentionBlock from tortoise.models.xtransformers import ContinuousTransformerWrapper, Encoder @@ -14,7 +13,7 @@ def exists(val): def masked_mean(t, mask): t = t.masked_fill(~mask, 0.) - return t.sum(dim = 1) / mask.sum(dim = 1) + return t.sum(dim=1) / mask.sum(dim=1) class CollapsingTransformer(nn.Module): @@ -36,14 +35,15 @@ class CollapsingTransformer(nn.Module): **encoder_kwargs, )) self.pre_combiner = nn.Sequential(nn.Conv1d(model_dim, output_dims, 1), - AttentionBlock(output_dims, num_heads=heads, do_checkpoint=False), - nn.Conv1d(output_dims, output_dims, 1)) + AttentionBlock( + output_dims, num_heads=heads, do_checkpoint=False), + nn.Conv1d(output_dims, output_dims, 1)) self.mask_percentage = mask_percentage def forward(self, x, **transformer_kwargs): h = self.transformer(x, **transformer_kwargs) - h = h.permute(0,2,1) - h = checkpoint(self.pre_combiner, h).permute(0,2,1) + h = h.permute(0, 2, 1) + h = self.pre_combiner(h).permute(0, 2, 1) if self.training: mask = torch.rand_like(h.float()) > self.mask_percentage else: @@ -58,7 +58,7 @@ class ConvFormatEmbedding(nn.Module): def forward(self, x): y = self.emb(x) - return y.permute(0,2,1) + return y.permute(0, 2, 1) class CVVP(nn.Module): @@ -81,15 +81,20 @@ class CVVP(nn.Module): self.cond_emb = nn.Sequential(nn.Conv1d(mel_channels, model_dim//2, kernel_size=5, stride=2, padding=2), nn.Conv1d(model_dim//2, model_dim, kernel_size=3, stride=2, padding=1)) - self.conditioning_transformer = CollapsingTransformer(model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage) - self.to_conditioning_latent = nn.Linear(latent_dim, latent_dim, bias=False) + self.conditioning_transformer = CollapsingTransformer( + model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage) + self.to_conditioning_latent = nn.Linear( + latent_dim, latent_dim, bias=False) if mel_codes is None: - self.speech_emb = nn.Conv1d(mel_channels, model_dim, kernel_size=5, padding=2) + self.speech_emb = nn.Conv1d( + mel_channels, model_dim, kernel_size=5, padding=2) else: self.speech_emb = 
ConvFormatEmbedding(mel_codes, model_dim) - self.speech_transformer = CollapsingTransformer(model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage) - self.to_speech_latent = nn.Linear(latent_dim, latent_dim, bias=False) + self.speech_transformer = CollapsingTransformer( + model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage) + self.to_speech_latent = nn.Linear( + latent_dim, latent_dim, bias=False) def get_grad_norm_parameter_groups(self): return { @@ -103,31 +108,35 @@ class CVVP(nn.Module): mel_input, return_loss=False ): - cond_emb = self.cond_emb(mel_cond).permute(0,2,1) + cond_emb = self.cond_emb(mel_cond).permute(0, 2, 1) enc_cond = self.conditioning_transformer(cond_emb) cond_latents = self.to_conditioning_latent(enc_cond) - speech_emb = self.speech_emb(mel_input).permute(0,2,1) + speech_emb = self.speech_emb(mel_input).permute(0, 2, 1) enc_speech = self.speech_transformer(speech_emb) speech_latents = self.to_speech_latent(enc_speech) - - cond_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (cond_latents, speech_latents)) + cond_latents, speech_latents = map(lambda t: F.normalize( + t, p=2, dim=-1), (cond_latents, speech_latents)) temp = self.temperature.exp() if not return_loss: - sim = einsum('n d, n d -> n', cond_latents, speech_latents) * temp + sim = einsum('n d, n d -> n', cond_latents, + speech_latents) * temp return sim - sim = einsum('i d, j d -> i j', cond_latents, speech_latents) * temp - labels = torch.arange(cond_latents.shape[0], device=mel_input.device) - loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2 + sim = einsum('i d, j d -> i j', cond_latents, + speech_latents) * temp + labels = torch.arange( + cond_latents.shape[0], device=mel_input.device) + loss = (F.cross_entropy(sim, labels) + + F.cross_entropy(sim.t(), labels)) / 2 return loss if __name__ == '__main__': clvp = CVVP() - clvp(torch.randn(2,80,100), - torch.randn(2,80,95), - return_loss=True) \ No newline at end of file + clvp(torch.randn(2, 80, 100), + torch.randn(2, 80, 95), + return_loss=True) diff --git a/tortoise/models/transformer.py b/tortoise/models/transformer.py index aa59b46..707e9eb 100644 --- a/tortoise/models/transformer.py +++ b/tortoise/models/transformer.py @@ -216,4 +216,4 @@ class Transformer(nn.Module): self.layers = execute_type(layers, args_route = attn_route_map) def forward(self, x, **kwargs): - return self.layers(x, **kwargs) \ No newline at end of file + return self.layers(x, **kwargs) diff --git a/tortoise/models/vocoder.py b/tortoise/models/vocoder.py index 346f381..8b60dbd 100644 --- a/tortoise/models/vocoder.py +++ b/tortoise/models/vocoder.py @@ -223,7 +223,11 @@ class LVCBlock(torch.nn.Module): class UnivNetGenerator(nn.Module): - """UnivNet Generator""" + """ + UnivNet Generator + + Originally from https://github.com/mindslab-ai/univnet/blob/master/model/generator.py. + """ def __init__(self, noise_dim=64, channel_size=32, dilations=[1,3,9,27], strides=[8,8,4], lReLU_slope=.2, kpnet_conv_size=3, # Below are MEL configurations options that this generator requires. 
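The hunks above (arch_util.py, classifier.py, cvvp.py) and the xtransformers.py hunk below all strip out `torch.utils.checkpoint.checkpoint` wrappers: gradient checkpointing only pays off when a backward pass is run, and this package ships inference-only code. A minimal sketch of the trade-off, with a hypothetical wrapper module that is not part of this diff:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class MaybeCheckpointed(nn.Module):
    """Hypothetical wrapper: checkpoint an expensive block only when it can help."""

    def __init__(self, block: nn.Module, do_checkpoint: bool = False):
        super().__init__()
        self.block = block
        self.do_checkpoint = do_checkpoint

    def forward(self, x):
        if self.do_checkpoint and self.training and x.requires_grad:
            # Training path: drop intermediate activations and recompute them in
            # backward, trading extra compute for lower memory use.
            return checkpoint(self.block, x)
        # Inference path (what these hunks switch to): call the block directly;
        # checkpointing would only add overhead when no gradients are taken.
        return self.block(x)
```

Removing the wrappers does not change model outputs; it only removes the recompute-in-backward behaviour, which is irrelevant when the models are never trained through this code path.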
diff --git a/tortoise/models/xtransformers.py b/tortoise/models/xtransformers.py index df9ee25..8be2df4 100644 --- a/tortoise/models/xtransformers.py +++ b/tortoise/models/xtransformers.py @@ -1,16 +1,12 @@ -import functools import math -import torch -from torch import nn, einsum -import torch.nn.functional as F +from collections import namedtuple from functools import partial from inspect import isfunction -from collections import namedtuple -from einops import rearrange, repeat, reduce -from einops.layers.torch import Rearrange - -from torch.utils.checkpoint import checkpoint +import torch +import torch.nn.functional as F +from einops import rearrange, repeat +from torch import nn, einsum DEFAULT_DIM_HEAD = 64 @@ -969,16 +965,16 @@ class AttentionLayers(nn.Module): layer_past = None if layer_type == 'a': - out, inter, k, v = checkpoint(block, x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb, + out, inter, k, v = block(x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb, prev_attn, layer_mem, layer_past) elif layer_type == 'c': if exists(full_context): - out, inter, k, v = checkpoint(block, x, full_context[cross_attn_count], mask, context_mask, None, None, + out, inter, k, v = block(x, full_context[cross_attn_count], mask, context_mask, None, None, None, prev_attn, None, layer_past) else: - out, inter, k, v = checkpoint(block, x, context, mask, context_mask, None, None, None, prev_attn, None, layer_past) + out, inter, k, v = block(x, context, mask, context_mask, None, None, None, prev_attn, None, layer_past) elif layer_type == 'f': - out = checkpoint(block, x) + out = block(x) if layer_type == 'a' or layer_type == 'c' and present_key_values is not None: present_key_values.append((k.detach(), v.detach())) diff --git a/tortoise/read.py b/tortoise/read.py index e81bd71..05b6658 100644 --- a/tortoise/read.py +++ b/tortoise/read.py @@ -1,11 +1,12 @@ import argparse import os +from time import time import torch import torchaudio -from api import TextToSpeech -from utils.audio import load_audio, get_voices, load_voices +from api import TextToSpeech, MODELS_DIR +from utils.audio import load_audio, load_voices from utils.text import split_and_recombine_text @@ -17,11 +18,12 @@ if __name__ == '__main__': parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard') parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None) - parser.add_argument('--voice_diversity_intelligibility_slider', type=float, - help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intellibility', - default=.5) + parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice. Only the first candidate is actually used in the final product, the others can be used manually.', default=1) parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. 
Tortoise automatically downloads these to .models, so this' - 'should only be specified if you have custom checkpoints.', default='.models') + 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) + parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) + parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True) + args = parser.parse_args() tts = TextToSpeech(models_dir=args.model_dir) @@ -41,6 +43,7 @@ if __name__ == '__main__': else: texts = split_and_recombine_text(text) + seed = int(time()) if args.seed is None else args.seed for selected_voice in selected_voices: voice_outpath = os.path.join(outpath, selected_voice) os.makedirs(voice_outpath, exist_ok=True) @@ -57,10 +60,34 @@ if __name__ == '__main__': all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000)) continue gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, - preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider) - gen = gen.squeeze(0).cpu() - torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), gen, 24000) + preset=args.preset, k=args.candidates, use_deterministic_seed=seed) + if args.candidates == 1: + gen = gen.squeeze(0).cpu() + torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), gen, 24000) + else: + candidate_dir = os.path.join(voice_outpath, str(j)) + os.makedirs(candidate_dir, exist_ok=True) + for k, g in enumerate(gen): + torchaudio.save(os.path.join(candidate_dir, f'{k}.wav'), g.squeeze(0).cpu(), 24000) + gen = gen[0].squeeze(0).cpu() all_parts.append(gen) - full_audio = torch.cat(all_parts, dim=-1) - torchaudio.save(os.path.join(voice_outpath, 'combined.wav'), full_audio, 24000) + if args.candidates == 1: + full_audio = torch.cat(all_parts, dim=-1) + torchaudio.save(os.path.join(voice_outpath, 'combined.wav'), full_audio, 24000) + + if args.produce_debug_state: + os.makedirs('debug_states', exist_ok=True) + dbg_state = (seed, texts, voice_samples, conditioning_latents) + torch.save(dbg_state, f'debug_states/read_debug_{selected_voice}.pth') + + # Combine each candidate's audio clips. 
+ if args.candidates > 1: + audio_clips = [] + for candidate in range(args.candidates): + for line in range(len(texts)): + wav_file = os.path.join(voice_outpath, str(line), f"{candidate}.wav") + audio_clips.append(load_audio(wav_file, 24000)) + audio_clips = torch.cat(audio_clips, dim=-1) + torchaudio.save(os.path.join(voice_outpath, f"combined_{candidate:02d}.wav"), audio_clips, 24000) + audio_clips = [] diff --git a/tortoise/utils/audio.py b/tortoise/utils/audio.py index 2422e97..91237dd 100644 --- a/tortoise/utils/audio.py +++ b/tortoise/utils/audio.py @@ -10,6 +10,9 @@ from scipy.io.wavfile import read from tortoise.utils.stft import STFT +BUILTIN_VOICES_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../voices') + + def load_wav_to_torch(full_path): sampling_rate, data = read(full_path) if data.dtype == np.int32: @@ -82,21 +85,23 @@ def dynamic_range_decompression(x, C=1): return torch.exp(x) / C -def get_voices(): - subs = os.listdir('tortoise/voices') +def get_voices(extra_voice_dirs=[]): + dirs = [BUILTIN_VOICES_DIR] + extra_voice_dirs voices = {} - for sub in subs: - subj = os.path.join('tortoise/voices', sub) - if os.path.isdir(subj): - voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.pth')) + for d in dirs: + subs = os.listdir(d) + for sub in subs: + subj = os.path.join(d, sub) + if os.path.isdir(subj): + voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.pth')) return voices -def load_voice(voice): +def load_voice(voice, extra_voice_dirs=[]): if voice == 'random': return None, None - voices = get_voices() + voices = get_voices(extra_voice_dirs) paths = voices[voice] if len(paths) == 1 and paths[0].endswith('.pth'): return None, torch.load(paths[0]) @@ -108,25 +113,28 @@ def load_voice(voice): return conds, None -def load_voices(voices): +def load_voices(voices, extra_voice_dirs=[]): latents = [] clips = [] for voice in voices: if voice == 'random': - print("Cannot combine a random voice with a non-random voice. Just using a random voice.") + if len(voices) > 1: + print("Cannot combine a random voice with a non-random voice. Just using a random voice.") return None, None - clip, latent = load_voice(voice) + clip, latent = load_voice(voice, extra_voice_dirs) if latent is None: assert len(latents) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this." clips.extend(clip) - elif voice is None: - assert len(voices) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this." + elif clip is None: + assert len(clips) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this." 
latents.append(latent) if len(latents) == 0: return clips, None else: - latents = torch.stack(latents, dim=0) - return None, latents.mean(dim=0) + latents_0 = torch.stack([l[0] for l in latents], dim=0).mean(dim=0) + latents_1 = torch.stack([l[1] for l in latents], dim=0).mean(dim=0) + latents = (latents_0,latents_1) + return None, latents class TacotronSTFT(torch.nn.Module): @@ -172,10 +180,10 @@ class TacotronSTFT(torch.nn.Module): return mel_output -def wav_to_univnet_mel(wav, do_normalization=False): +def wav_to_univnet_mel(wav, do_normalization=False, device='cuda'): stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000) - stft = stft.cuda() + stft = stft.to(device) mel = stft.mel_spectrogram(wav) if do_normalization: mel = normalize_tacotron_mel(mel) - return mel \ No newline at end of file + return mel diff --git a/tortoise/utils/samples_generator.py b/tortoise/utils/samples_generator.py deleted file mode 100644 index 61d3014..0000000 --- a/tortoise/utils/samples_generator.py +++ /dev/null @@ -1,51 +0,0 @@ -import os - -# This script builds the sample webpage. - -if __name__ == '__main__': - result = "These words were never spoken.

Handpicked results

" - for fv in os.listdir('../../results/favorites'): - url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/{fv}' - result = result + f'
\n' - - result = result + "

Handpicked longform result:

" - url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/favorite_riding_hood.mp3' - result = result + f'
\n' - - result = result + "

Compared to Tacotron2 (with the LJSpeech voice):

" - for k in range(2,5,1): - url1 = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/{k}-tacotron2.mp3' - url2 = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/{k}-tortoise.mp3' - result = result + f'' \ - f'' - result = result + "
Tacotron2+WaveglowTorToiSe

\n

\n
" - - result = result + "

Various spoken texts for all voices:

" - voices = ['angie', 'daniel', 'deniro', 'emma', 'freeman', 'geralt', 'halle', 'jlaw', 'lj', 'myself', - 'pat', 'snakes', 'tom', 'train_atkins', 'train_dotrice', 'train_kennard', 'weaver', 'william'] - lines = ['' + ''.join([f'' for v in voices])] - line = f'' - for v in voices: - url = f'https://github.com/neonbjb/tortoise-tts/raw/main/voices/{v}/1.wav' - line = line + f'' - line = line + "" - lines.append(line) - for txt in os.listdir('../../results/various/'): - if 'desktop' in txt: - continue - line = f'' - for v in voices: - url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/various/{txt}/{v}.mp3' - line = line + f'' - line = line + "" - lines.append(line) - result = result + '\n'.join(lines) + "
text{v}
reference clip
{txt}
" - - result = result + "

Longform result for all voices:

" - for lf in os.listdir('../../results/riding_hood'): - url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/{lf}' - result = result + f'
\n' - - result = result + "" - with open('result.html', 'w', encoding='utf-8') as f: - f.write(result) diff --git a/tortoise/utils/text.py b/tortoise/utils/text.py index 18bcebb..e28c867 100644 --- a/tortoise/utils/text.py +++ b/tortoise/utils/text.py @@ -13,18 +13,25 @@ def split_and_recombine_text(text, desired_length=200, max_length=300): current = "" split_pos = [] pos = -1 + end_pos = len(text) - 1 def seek(delta): - nonlocal pos, in_quote, text + nonlocal pos, in_quote, current is_neg = delta < 0 for _ in range(abs(delta)): if is_neg: pos -= 1 + current = current[:-1] else: pos += 1 + current += text[pos] if text[pos] == '"': in_quote = not in_quote - return text[pos], text[pos+1] if pos < len(text)-1 else "" + return text[pos] + + def peek(delta): + p = pos + delta + return text[p] if p < end_pos and p >= 0 else "" def commit(): nonlocal rv, current, split_pos @@ -32,37 +39,42 @@ def split_and_recombine_text(text, desired_length=200, max_length=300): current = "" split_pos = [] - while pos < len(text) - 1: - c, next_c = seek(1) - current += c + while pos < end_pos: + c = seek(1) # do we need to force a split? if len(current) >= max_length: if len(split_pos) > 0 and len(current) > (desired_length / 2): # we have at least one sentence and we are over half the desired length, seek back to the last split d = pos - split_pos[-1] seek(-d) - current = current[:-d] else: # no full sentences, seek back until we are not in the middle of a word and split there while c not in '!?.\n ' and pos > 0 and len(current) > desired_length: - c, _ = seek(-1) - current = current[:-1] + c = seek(-1) commit() # check for sentence boundaries - elif not in_quote and (c in '!?\n' or (c == '.' and next_c in '\n ')): + elif not in_quote and (c in '!?\n' or (c == '.' and peek(1) in '\n ')): + # seek forward if we have consecutive boundary markers but still within the max length + while pos < len(text) - 1 and len(current) < max_length and peek(1) in '!?.': + c = seek(1) split_pos.append(pos) if len(current) >= desired_length: commit() + # treat end of quote as a boundary if its followed by a space or newline + elif in_quote and peek(1) == '"' and peek(2) in '\n ': + seek(2) + split_pos.append(pos) rv.append(current) - # clean up + # clean up, remove lines with only whitespace or punctuation rv = [s.strip() for s in rv] - rv = [s for s in rv if len(s) > 0] + rv = [s for s in rv if len(s) > 0 and not re.match(r'^[\s\.,;:!?]*$', s)] return rv if __name__ == '__main__': + import os import unittest class Test(unittest.TestCase): @@ -81,4 +93,40 @@ if __name__ == '__main__': 'inthemiddlebutinotinthislongword.', '"Don\'t split my quote... please"']) + def test_split_and_recombine_text_2(self): + text = """ + When you are really angry sometimes you use consecutive exclamation marks!!!!!! Is this a good thing to do?!?!?! + I don't know but we should handle this situation.......................... + """ + self.assertEqual(split_and_recombine_text(text, desired_length=30, max_length=50), + ['When you are really angry sometimes you use', + 'consecutive exclamation marks!!!!!!', + 'Is this a good thing to do?!?!?!', + 'I don\'t know but we should handle this situation.']) + + def test_split_and_recombine_text_3(self): + text_src = os.path.join(os.path.dirname(__file__), '../data/riding_hood.txt') + with open(text_src, 'r') as f: + text = f.read() + self.assertEqual( + split_and_recombine_text(text), + [ + 'Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. 
Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her.', + 'It suited the girl so extremely well that everybody called her Little Red Riding Hood. One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter."', + 'Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village. As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest.', + 'He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother." "Does she live far off?" said the wolf "Oh I say,"', + 'answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village." "Well," said the wolf, "and I\'ll go and see her too. I\'ll go this way and go you that, and we shall see who will be there first."', + 'The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers.', + 'It was not long before the wolf arrived at the old woman\'s house. He knocked at the door: tap, tap. "Who\'s there?" "Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother."', + 'The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up."', + 'The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it been more than three days since he had eaten.', + 'He then shut the door and got into the grandmother\'s bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap. "Who\'s there?"', + 'Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you."', + 'The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up." Little Red Riding Hood pulled the bobbin, and the door opened.', + 'The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me." Little Red Riding Hood took off her clothes and got into bed.', + 'She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!" "All the better to hug you with, my dear." "Grandmother, what big legs you have!" "All the better to run with, my child." "Grandmother, what big ears you have!"', + '"All the better to hear with, my child." "Grandmother, what big eyes you have!" "All the better to see with, my child." "Grandmother, what big teeth you have got!" "All the better to eat you up with." 
And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up.', + ] + ) + unittest.main() diff --git a/tortoise/utils/tokenizer.py b/tortoise/utils/tokenizer.py index 2f36a06..3ab1c31 100644 --- a/tortoise/utils/tokenizer.py +++ b/tortoise/utils/tokenizer.py @@ -1,3 +1,4 @@ +import os import re import inflect @@ -148,6 +149,7 @@ def english_cleaners(text): text = text.replace('"', '') return text + def lev_distance(s1, s2): if len(s1) > len(s2): s1, s2 = s2, s1 @@ -163,8 +165,12 @@ def lev_distance(s1, s2): distances = distances_ return distances[-1] + +DEFAULT_VOCAB_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data/tokenizer.json') + + class VoiceBpeTokenizer: - def __init__(self, vocab_file='tortoise/data/tokenizer.json'): + def __init__(self, vocab_file=DEFAULT_VOCAB_FILE): if vocab_file is not None: self.tokenizer = Tokenizer.from_file(vocab_file) diff --git a/tortoise/utils/wav2vec_alignment.py b/tortoise/utils/wav2vec_alignment.py index bfcb7e1..aeadb73 100644 --- a/tortoise/utils/wav2vec_alignment.py +++ b/tortoise/utils/wav2vec_alignment.py @@ -7,13 +7,15 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTCTo from tortoise.utils.audio import load_audio -def max_alignment(s1, s2, skip_character='~', record={}): +def max_alignment(s1, s2, skip_character='~', record=None): """ A clever function that aligns s1 to s2 as best it can. Wherever a character from s1 is not found in s2, a '~' is used to replace that character. Finally got to use my DP skills! """ + if record is None: + record = {} assert skip_character not in s1, f"Found the skip character {skip_character} in the provided string, {s1}" if len(s1) == 0: return '' @@ -47,17 +49,18 @@ class Wav2VecAlignment: """ Uses wav2vec2 to perform audio<->text alignment. 
""" - def __init__(self): + def __init__(self, device='cuda'): self.model = Wav2Vec2ForCTC.from_pretrained("jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli").cpu() self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(f"facebook/wav2vec2-large-960h") self.tokenizer = Wav2Vec2CTCTokenizer.from_pretrained('jbetker/tacotron-symbols') + self.device = device def align(self, audio, expected_text, audio_sample_rate=24000): orig_len = audio.shape[-1] with torch.no_grad(): - self.model = self.model.cuda() - audio = audio.to('cuda') + self.model = self.model.to(self.device) + audio = audio.to(self.device) audio = torchaudio.functional.resample(audio, audio_sample_rate, 16000) clip_norm = (audio - audio.mean()) / torch.sqrt(audio.var() + 1e-7) logits = self.model(clip_norm).logits @@ -145,4 +148,3 @@ class Wav2VecAlignment: start, stop = nri output_audio.append(audio[:, alignments[start]:alignments[stop]]) return torch.cat(output_audio, dim=-1) - diff --git a/tortoise/voices/cond_latent_example/pat.pth b/tortoise/voices/cond_latent_example/pat.pth new file mode 100644 index 0000000..2c369be Binary files /dev/null and b/tortoise/voices/cond_latent_example/pat.pth differ diff --git a/tortoise/voices/train_empire/2.mp3 b/tortoise/voices/train_empire/2.mp3 index 45aa4da..0a59abd 100644 Binary files a/tortoise/voices/train_empire/2.mp3 and b/tortoise/voices/train_empire/2.mp3 differ diff --git a/tortoise/voices/train_lescault/1.wav b/tortoise/voices/train_lescault/1.wav deleted file mode 100644 index f64a714..0000000 Binary files a/tortoise/voices/train_lescault/1.wav and /dev/null differ diff --git a/tortoise/voices/train_lescault/2.wav b/tortoise/voices/train_lescault/2.wav deleted file mode 100644 index cb42f94..0000000 Binary files a/tortoise/voices/train_lescault/2.wav and /dev/null differ diff --git a/tortoise/voices/train_lescault/lescault_new1.wav b/tortoise/voices/train_lescault/lescault_new1.wav new file mode 100644 index 0000000..56673ae Binary files /dev/null and b/tortoise/voices/train_lescault/lescault_new1.wav differ diff --git a/tortoise/voices/train_lescault/lescault_new2.wav b/tortoise/voices/train_lescault/lescault_new2.wav new file mode 100644 index 0000000..5ef7635 Binary files /dev/null and b/tortoise/voices/train_lescault/lescault_new2.wav differ diff --git a/tortoise/voices/train_lescault/lescault_new3.wav b/tortoise/voices/train_lescault/lescault_new3.wav new file mode 100644 index 0000000..85f416e Binary files /dev/null and b/tortoise/voices/train_lescault/lescault_new3.wav differ diff --git a/tortoise/voices/train_lescault/lescault_new4.wav b/tortoise/voices/train_lescault/lescault_new4.wav new file mode 100644 index 0000000..92d6580 Binary files /dev/null and b/tortoise/voices/train_lescault/lescault_new4.wav differ diff --git a/tortoise/voices/train_lescault/lescault_new5.wav b/tortoise/voices/train_lescault/lescault_new5.wav new file mode 100644 index 0000000..17496bf Binary files /dev/null and b/tortoise/voices/train_lescault/lescault_new5.wav differ diff --git a/tortoise/voices/train_mouse/3.mp3 b/tortoise/voices/train_mouse/3.mp3 deleted file mode 100644 index fe197c7..0000000 Binary files a/tortoise/voices/train_mouse/3.mp3 and /dev/null differ diff --git a/tortoise_v2_examples.html b/tortoise_v2_examples.html index 6b4b6c7..1a457d1 100644 --- a/tortoise_v2_examples.html +++ b/tortoise_v2_examples.html @@ -29,13 +29,13 @@ available at https://github.co

-

Short-form

+

Long-form


-

Compared to Tacotron2 (with the LJSpeech voice): 🐢

+

Comparisons (with the LJSpeech voice): 🐢

LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model, following is how -it renders the LJSpeech voice with no fine-tuning, compared with results for the same text from the popular Tacotron2 -model paired with the Waveglow transformer:

+it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2 +model paired with the Waveglow vocoder.

@@ -50,6 +50,22 @@ model paired with the Waveglow transformer:

Tacotron2+Waveglow | TorToiSe | TorToiSe Finetuned


+

NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody +and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other +than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there +right now.

+ + + + + + +
Natural Voice | TorToiSe Finetuned





+

+

It is important to note that it is not actually fair to compare any of these models: Tortoise is a multi-voice probabilistic +model trained on tens of thousands of hours of speech with an exceptionally slow inference time. Tacotron and NaturalVoice are efficient, +fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much in the way of actually comparable +research to Tortoise.

All Results 🐢

Following are all the results from which the hand-picked results were drawn from. Also included is the reference @@ -109,4 +125,4 @@ less effected by this.

Happy:
Scared:
- \ No newline at end of file +
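For reference, the updated do_tts.py flow above can be exercised programmatically roughly as follows. This is a sketch that mirrors the script rather than a definitive recipe: the voice name, seed, example text and output paths are placeholders, and it assumes it is run from the tortoise/ directory so that `api` and `utils.audio` resolve the same way they do in do_tts.py.

```python
import os

import torch
import torchaudio

from api import TextToSpeech, MODELS_DIR
from utils.audio import load_voices

tts = TextToSpeech(models_dir=MODELS_DIR)

# As with --voice, several names joined by '&' are averaged into one voice;
# a single built-in voice is used here as a placeholder.
voice_samples, conditioning_latents = load_voices(['pat'])

gen, dbg_state = tts.tts_with_preset(
    "Text to synthesize goes here.",   # placeholder text
    k=1,                               # --candidates
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset='fast',                     # --preset
    use_deterministic_seed=42,         # --seed, for reproducible runs
    return_deterministic_state=True,
    cvvp_amount=0.0)                   # --cvvp_amount; 0 disables CVVP

# Save the candidate(s), as the script does, at Tortoise's 24 kHz output rate.
os.makedirs('results', exist_ok=True)
if isinstance(gen, list):
    for j, g in enumerate(gen):
        torchaudio.save(f'results/pat_{j}.wav', g.squeeze(0).cpu(), 24000)
else:
    torchaudio.save('results/pat_0.wav', gen.squeeze(0).cpu(), 24000)

# Equivalent of --produce_debug_state: persist what is needed to reproduce a bad run.
os.makedirs('debug_states', exist_ok=True)
torch.save(dbg_state, 'debug_states/do_tts_debug_pat.pth')
```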