diff --git a/README.md b/README.md
index 40c242c8..63b8931a 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.

 ## Features

-- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), and [ExLlamaV2](https://github.com/turboderp-org/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is supported via its own [Dockerfile](https://github.com/oobabooga/text-generation-webui/blob/main/docker/TensorRT-LLM/Dockerfile), and the Transformers loader is compatible with libraries like [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM), but they must be installed manually.
+- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), [ExLlamaV3](https://github.com/turboderp-org/exllamav3), and [ExLlamaV2](https://github.com/turboderp-org/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is supported via its own [Dockerfile](https://github.com/oobabooga/text-generation-webui/blob/main/docker/TensorRT-LLM/Dockerfile), and the Transformers loader is compatible with libraries like [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM), but they must be installed manually.
 - OpenAI-compatible API with Chat and Completions endpoints – see [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
 - Automatic prompt formatting using Jinja2 templates.
 - Three chat modes: `instruct`, `chat-instruct`, and `chat`, with automatic prompt templates in `chat-instruct`.
@@ -78,25 +78,19 @@ conda activate textgen

 | System | GPU | Command |
 |--------|---------|---------|
-| Linux/WSL | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
-| Linux/WSL | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cpu` |
-| Linux | AMD | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/rocm6.1` |
-| MacOS + MPS | Any | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
-| Windows | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
-| Windows | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
+| Linux/WSL | NVIDIA | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Linux/WSL | CPU only | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu` |
+| Linux | AMD | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.1` |
+| MacOS + MPS | Any | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0` |
+| Windows | NVIDIA | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Windows | CPU only | `pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0` |

 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.

-For NVIDIA, you also need to install the CUDA runtime libraries:
+If you need `nvcc` to compile some library manually, you will additionally need to install this:

 ```
-conda install -y -c "nvidia/label/cuda-12.1.1" cuda-runtime
-```
-
-If you need `nvcc` to compile some library manually, replace the command above with
-
-```
-conda install -y -c "nvidia/label/cuda-12.1.1" cuda
+conda install -y -c "nvidia/label/cuda-12.4.1" cuda
 ```

 #### 3. Install the web UI
@@ -143,19 +137,6 @@ Then browse to
 3) Manually install AutoGPTQ: [Installation](https://github.com/PanQiWei/AutoGPTQ#install-from-source).
     * Perform the from-source installation - there are no prebuilt ROCm packages for Windows.

-##### Older NVIDIA GPUs
-
-1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:
-
-```
-pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
-conda install -y -c "nvidia/label/cuda-11.8.0" cuda-runtime
-```
-
-2) bitsandbytes >= 0.39 may not work. In that case, to use `--load-in-8bit`, you may have to downgrade like this:
-    * Linux: `pip install bitsandbytes==0.38.1`
-    * Windows: `pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.38.1-py3-none-any.whl`
-
 ##### Manual install

 The `requirements*.txt` above contain various wheels precompiled through GitHub Actions. If you wish to compile things manually, or if you need to because no suitable wheels are available for your hardware, you can use `requirements_nowheels.txt` and then install your desired loaders manually.
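The README hunks above move the recommended PyTorch build to 2.6.0 with CUDA 12.4 wheels. A minimal sanity check after running one of those install commands (a sketch, not part of the patch; the expected version strings are assumptions based on the NVIDIA cu124 row):

```python
# Quick verification of the PyTorch install described in the table above.
# Run inside the "textgen" conda environment; expected values assume the cu124 NVIDIA row.
import torch

print(torch.__version__)          # should start with "2.6.0"
print(torch.version.cuda)         # "12.4" for the CUDA build, None for CPU-only/MPS builds
print(torch.cuda.is_available())  # True only if a compatible NVIDIA GPU is visible
```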
diff --git a/css/chat_style-Dark.css b/css/chat_style-Dark.css
new file mode 100644
index 00000000..368a2a16
--- /dev/null
+++ b/css/chat_style-Dark.css
@@ -0,0 +1,128 @@
+.message {
+  display: grid;
+  grid-template-columns: 60px minmax(0, 1fr);
+  padding-bottom: 28px;
+  font-size: 18px;
+  font-family: Roboto, Arial, sans-serif; /* Modern font */
+  line-height: 1.5;
+}
+
+.circle-you,
+.circle-bot {
+  background-color: #2b2b2b; /* Darker background for circles */
+  border-radius: 50%; /* Perfect circle */
+  border: 1px solid #4a90e2; /* Soft blue border */
+  box-shadow: 0 4px 8px rgb(0 0 0 / 50%); /* Soft shadow for depth */
+}
+
+.circle-bot img,
+.circle-you img {
+  border-radius: 50%; /* Make images circular */
+  width: 100%;
+  height: 100%;
+  object-fit: cover;
+}
+
+.circle-you, .circle-bot {
+  width: 64px; /* Smaller size for modern look */
+  height: 64px;
+}
+
+.text {
+  padding-left: 12px; /* Reduced padding for a cleaner layout */
+  color: #f0f0f0; /* Light text color for readability */
+}
+
+.text p {
+  margin-top: 2px;
+}
+
+.username {
+  padding-left: 10px;
+  font-size: 20px;
+  font-weight: bold;
+  color: #e0e0e0; /* Light gray text */
+  transition: color 0.3s ease; /* Smooth color transition */
+}
+
+.username:hover {
+  color: #4a90e2; /* Blue color on hover */
+}
+
+.message-body {
+  position: relative;
+  border: 1px solid rgb(255 255 255 / 10%); /* Soft white border */
+  border-radius: 8px; /* Slightly rounded corners */
+  padding: 15px;
+  background: #1e1e1e; /* Dark background */
+  box-shadow: 0 4px 10px rgb(0 0 0 / 30%); /* Subtle shadow for depth */
+  transition: background 0.3s ease; /* Smooth transition for background */
+}
+
+.message-body:hover {
+  background: #252525; /* Slightly lighter on hover */
+}
+
+/* Adds 2 extra lines at the top and bottom of the message */
+.message-body::before,
+.message-body::after {
+  content: "";
+  position: absolute;
+  left: 10px;
+  right: 10px;
+  height: 1px;
+  background-color: rgb(255 255 255 / 5%); /* Faded lines for subtle separation */
+}
+
+.message-body::before {
+  top: 4px;
+}
+
+.message-body::after {
+  bottom: 4px;
+}
+
+.message-body img {
+  max-width: 300px;
+  max-height: 300px;
+  border-radius: 10px; /* Rounded corners for images */
+}
+
+.message-body p {
+  margin-bottom: 0 !important;
+  font-size: 16px !important;
+  line-height: 1.5 !important;
+  color: #e0e0e0 !important; /* Light color for text */
+}
+
+.message-body p em {
+  color: #a6a6a6 !important; /* Softer gray for emphasized text */
+}
+
+@media screen and (width <= 688px) {
+  .message {
+    display: grid;
+    grid-template-columns: 60px minmax(0, 1fr);
+    padding-bottom: 25px;
+    font-size: 15px;
+    font-family: Roboto, Arial, sans-serif; /* Modern font */
+    line-height: 1.5;
+  }
+
+  .circle-you, .circle-bot {
+    width: 40px; /* Smaller size for mobile */
+    height: 40px;
+  }
+
+  .text {
+    padding-left: 10px; /* Reduced padding for mobile */
+  }
+
+  .message-body p {
+    font-size: 14px !important; /* Smaller text for mobile */
+  }
+
+  .username {
+    font-size: 18px; /* Smaller username for mobile */
+  }
+}
diff --git a/docker/amd/Dockerfile b/docker/amd/Dockerfile
index cfbcf7e4..66e5863c 100644
--- a/docker/amd/Dockerfile
+++ b/docker/amd/Dockerfile
@@ -13,7 +13,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,rw \
 WORKDIR /home/app/
 RUN git clone https://github.com/oobabooga/text-generation-webui.git
 WORKDIR /home/app/text-generation-webui
-RUN GPU_CHOICE=C LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
+RUN GPU_CHOICE=B LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
 COPY CMD_FLAGS.txt /home/app/text-generation-webui/
 EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
 WORKDIR /home/app/text-generation-webui
diff --git a/docker/intel/Dockerfile b/docker/intel/Dockerfile
index d2ed671e..cab62442 100644
--- a/docker/intel/Dockerfile
+++ b/docker/intel/Dockerfile
@@ -13,7 +13,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,rw \
 WORKDIR /home/app/
 RUN git clone https://github.com/oobabooga/text-generation-webui.git
 WORKDIR /home/app/text-generation-webui
-RUN GPU_CHOICE=E LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
+RUN GPU_CHOICE=D LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
 COPY CMD_FLAGS.txt /home/app/text-generation-webui/
 EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
 # set umask to ensure group read / write at runtime
diff --git a/extensions/ngrok/README.md b/extensions/ngrok/README.md
index 0324bf98..2e9eb82d 100644
--- a/extensions/ngrok/README.md
+++ b/extensions/ngrok/README.md
@@ -9,9 +9,9 @@ the `settings.json` file, see the Examples below. Retrieve your authtoken on the

 # Documentation

-For a list of all available options, see [the configuration documentation](https://ngrok.com/docs/ngrok-agent/config/) or [the connect example](https://github.com/ngrok/ngrok-py/blob/main/examples/ngrok-connect-full.py).
+For a list of all available options, see [the configuration documentation](https://ngrok.com/docs/ngrok-agent/config/) or [the forward example](https://github.com/ngrok/ngrok-python/blob/main/examples/ngrok-forward-full.py).

-The ngrok Python SDK is [on github here](https://github.com/ngrok/ngrok-py). A quickstart guide and a full API reference are included in the [ngrok-py Python API documentation](https://ngrok.github.io/ngrok-py/).
+The ngrok Python SDK is [on GitHub here](https://github.com/ngrok/ngrok-python). A quickstart guide and a full API reference are included in the [ngrok-python API documentation](https://ngrok.github.io/ngrok-python/).

# Running @@ -66,4 +66,4 @@ To add an authtoken instead of using the NGROK_AUTHTOKEN environment variable: "authtoken_from_env":false } } -``` \ No newline at end of file +``` diff --git a/modules/exllamav3_hf.py b/modules/exllamav3_hf.py new file mode 100644 index 00000000..3bf44c9b --- /dev/null +++ b/modules/exllamav3_hf.py @@ -0,0 +1,179 @@ +import os +import traceback +from pathlib import Path +from typing import Any, Dict, Optional, Union + +import torch +from exllamav3 import Cache, Config, Model +from torch.nn import CrossEntropyLoss +from transformers import GenerationConfig, PretrainedConfig, PreTrainedModel +from transformers.modeling_outputs import CausalLMOutputWithPast + +from modules import shared +from modules.logging_colors import logger + +try: + import flash_attn +except Exception: + logger.warning('Failed to load flash-attention due to the following error:\n') + traceback.print_exc() + + +class Exllamav3HF(PreTrainedModel): + def __init__(self, model_dir): + super().__init__(PretrainedConfig()) + self.generation_config = GenerationConfig() + + config = Config.from_directory(model_dir) + self.ex_model = Model.from_config(config) + + # Calculate the closest multiple of 256 at or above the chosen value + max_tokens = shared.args.max_seq_len + if max_tokens % 256 != 0: + adjusted_tokens = ((max_tokens // 256) + 1) * 256 + logger.warning(f"max_num_tokens must be a multiple of 256. Adjusting from {max_tokens} to {adjusted_tokens}") + max_tokens = adjusted_tokens + + self.ex_cache = Cache(self.ex_model, max_num_tokens=max_tokens) + + # Create load parameters dictionary + load_params = {'progressbar': True} + if shared.args.gpu_split: + split = [float(alloc) for alloc in shared.args.gpu_split.split(",")] + load_params['use_per_device'] = split + + self.ex_model.load(**load_params) + self.past_seq = None + self.max_tokens = max_tokens + + def _validate_model_class(self): + pass + + def _validate_model_kwargs(self, model_kwargs: Dict[str, Any]): + pass + + def prepare_inputs_for_generation(self, input_ids, **kwargs): + return {'input_ids': input_ids, **kwargs} + + @property + def device(self) -> torch.device: + return torch.device(0) + + def __call__(self, *args, **kwargs): + use_cache = kwargs.get('use_cache', True) + labels = kwargs.get('labels', None) + past_key_values = kwargs.get('past_key_values', None) + + if len(args) > 0: + if not shared.args.cfg_cache: + logger.error("Please enable the cfg-cache option to use CFG with ExLlamav3_HF.") + return + + input_ids = args[0] + is_negative = True + past_seq = self.past_seq_negative + ex_cache = self.ex_cache_negative + else: + input_ids = kwargs['input_ids'] + is_negative = False + past_seq = self.past_seq + ex_cache = self.ex_cache + + seq = input_ids[0].tolist() + if is_negative and past_key_values is not None: + seq = past_key_values + seq + + seq_tensor = torch.tensor(seq) + reset = True + + # Make the forward call + if labels is None: + if past_seq is not None: + min_length = min(past_seq.shape[0], seq_tensor.shape[0]) + indices = torch.nonzero(~torch.eq(past_seq[:min_length], seq_tensor[:min_length])) + if len(indices) > 0: + longest_prefix = indices[0].item() + else: + longest_prefix = min_length + + if longest_prefix > 0: + reset = False + current_len = longest_prefix + if len(seq_tensor) - longest_prefix > 1: + self.ex_model.forward( + input_ids=seq_tensor[longest_prefix:-1].view(1, -1), + params={ + "attn_mode": "flash_attn", + "cache": ex_cache, + "past_len": longest_prefix, + "batch_shape": (1, self.max_tokens) + } + 
) + + current_len = longest_prefix + len(seq_tensor) - longest_prefix - 1 + + if reset: + if len(seq_tensor) > 1: + self.ex_model.forward( + input_ids=seq_tensor[:-1].view(1, -1), + params={ + "attn_mode": "flash_attn", + "cache": ex_cache, + "past_len": 0, + "batch_shape": (1, self.max_tokens) + } + ) + + current_len = len(seq_tensor) - 1 + else: + current_len = 0 + + logits = self.ex_model.forward( + input_ids=seq_tensor[-1:].view(1, -1), + params={ + "attn_mode": "flash_attn", + "cache": ex_cache, + "past_len": current_len, + "batch_shape": (1, self.max_tokens) + } + ).to(input_ids.device).float() + else: + logits = self.ex_model.forward( + input_ids=seq_tensor.view(1, -1), + params={ + "attn_mode": "flash_attn", + "cache": ex_cache, + "past_len": 0, + "batch_shape": (1, self.max_tokens) + } + ).float() + + if is_negative: + self.past_seq_negative = seq_tensor + else: + self.past_seq = seq_tensor + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + shift_logits = shift_logits.view(-1, logits.shape[-1]) + shift_labels = shift_labels.view(-1) + # Enable model parallelism + shift_labels = shift_labels.to(shift_logits.device) + loss = loss_fct(shift_logits, shift_labels) + + return CausalLMOutputWithPast(logits=logits, past_key_values=seq if use_cache else None, loss=loss) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs): + assert len(model_args) == 0 and len(kwargs) == 0, "extra args is currently not supported" + if isinstance(pretrained_model_name_or_path, str): + pretrained_model_name_or_path = Path(pretrained_model_name_or_path) + + pretrained_model_name_or_path = Path(f'{shared.args.model_dir}') / Path(pretrained_model_name_or_path) + + return Exllamav3HF(pretrained_model_name_or_path) diff --git a/modules/loaders.py b/modules/loaders.py index 88ded1d1..980a13e6 100644 --- a/modules/loaders.py +++ b/modules/loaders.py @@ -23,7 +23,6 @@ loaders_and_params = OrderedDict({ 'use_double_quant', 'use_eager_attention', 'bf16', - 'trust_remote_code', 'no_use_fast', ], @@ -76,6 +75,13 @@ loaders_and_params = OrderedDict({ 'no_use_fast', 'llamacpp_HF_info', ], + 'ExLlamav3_HF': [ + 'max_seq_len', + 'gpu_split', + 'cfg_cache', + 'trust_remote_code', + 'no_use_fast', + ], 'ExLlamav2_HF': [ 'max_seq_len', 'cache_type', @@ -174,30 +180,38 @@ def transformers_samplers(): loaders_samplers = { 'Transformers': transformers_samplers(), 'HQQ': transformers_samplers(), - 'ExLlamav2': { + 'ExLlamav3_HF': { 'temperature', 'dynatemp_low', 'dynatemp_high', 'dynatemp_exponent', 'smoothing_factor', + 'smoothing_curve', 'min_p', 'top_p', 'top_k', 'typical_p', 'xtc_threshold', 'xtc_probability', + 'epsilon_cutoff', + 'eta_cutoff', 'tfs', 'top_a', + 'top_n_sigma', 'dry_multiplier', 'dry_allowed_length', 'dry_base', 'repetition_penalty', 'frequency_penalty', 'presence_penalty', + 'encoder_repetition_penalty', + 'no_repeat_ngram_size', 'repetition_penalty_range', + 'guidance_scale', 'mirostat_mode', 'mirostat_tau', 'mirostat_eta', + 'do_sample', 'dynamic_temperature', 'temperature_last', 'auto_max_new_tokens', @@ -205,8 +219,12 @@ loaders_samplers = { 'add_bos_token', 'skip_special_tokens', 'seed', + 'sampler_priority', 'custom_token_bans', + 'negative_prompt', 'dry_sequence_breakers', + 'grammar_string', + 'grammar_file_row', }, 'ExLlamav2_HF': { 
'temperature', @@ -254,6 +272,40 @@ loaders_samplers = { 'grammar_string', 'grammar_file_row', }, + 'ExLlamav2': { + 'temperature', + 'dynatemp_low', + 'dynatemp_high', + 'dynatemp_exponent', + 'smoothing_factor', + 'min_p', + 'top_p', + 'top_k', + 'typical_p', + 'xtc_threshold', + 'xtc_probability', + 'tfs', + 'top_a', + 'dry_multiplier', + 'dry_allowed_length', + 'dry_base', + 'repetition_penalty', + 'frequency_penalty', + 'presence_penalty', + 'repetition_penalty_range', + 'mirostat_mode', + 'mirostat_tau', + 'mirostat_eta', + 'dynamic_temperature', + 'temperature_last', + 'auto_max_new_tokens', + 'ban_eos_token', + 'add_bos_token', + 'skip_special_tokens', + 'seed', + 'custom_token_bans', + 'dry_sequence_breakers', + }, 'llama.cpp': { 'temperature', 'min_p', diff --git a/modules/models.py b/modules/models.py index 3951fe82..288bc1b6 100644 --- a/modules/models.py +++ b/modules/models.py @@ -69,8 +69,9 @@ def load_model(model_name, loader=None): 'Transformers': huggingface_loader, 'llama.cpp': llamacpp_loader, 'llamacpp_HF': llamacpp_HF_loader, - 'ExLlamav2': ExLlamav2_loader, + 'ExLlamav3_HF': ExLlamav3_HF_loader, 'ExLlamav2_HF': ExLlamav2_HF_loader, + 'ExLlamav2': ExLlamav2_loader, 'HQQ': HQQ_loader, 'TensorRT-LLM': TensorRT_LLM_loader, } @@ -304,11 +305,10 @@ def llamacpp_HF_loader(model_name): return model -def ExLlamav2_loader(model_name): - from modules.exllamav2 import Exllamav2Model +def ExLlamav3_HF_loader(model_name): + from modules.exllamav3_hf import Exllamav3HF - model, tokenizer = Exllamav2Model.from_pretrained(model_name) - return model, tokenizer + return Exllamav3HF.from_pretrained(model_name) def ExLlamav2_HF_loader(model_name): @@ -317,6 +317,13 @@ def ExLlamav2_HF_loader(model_name): return Exllamav2HF.from_pretrained(model_name) +def ExLlamav2_loader(model_name): + from modules.exllamav2 import Exllamav2Model + + model, tokenizer = Exllamav2Model.from_pretrained(model_name) + return model, tokenizer + + def HQQ_loader(model_name): try: from hqq.core.quantize import HQQBackend, HQQLinear diff --git a/modules/models_settings.py b/modules/models_settings.py index 8d658523..51994e23 100644 --- a/modules/models_settings.py +++ b/modules/models_settings.py @@ -17,6 +17,7 @@ def get_fallback_settings(): 'compress_pos_emb': 1, 'alpha_value': 1, 'truncation_length': shared.settings['truncation_length'], + 'truncation_length_info': shared.settings['truncation_length'], 'skip_special_tokens': shared.settings['skip_special_tokens'], 'custom_stopping_strings': shared.settings['custom_stopping_strings'], } @@ -53,7 +54,8 @@ def get_model_metadata(model): for k in metadata: if k.endswith('context_length'): - model_settings['n_ctx'] = metadata[k] + model_settings['n_ctx'] = min(metadata[k], 8192) + model_settings['truncation_length_info'] = metadata[k] elif k.endswith('rope.freq_base'): model_settings['rope_freq_base'] = metadata[k] elif k.endswith('rope.scale_linear'): @@ -89,7 +91,8 @@ def get_model_metadata(model): for k in ['max_position_embeddings', 'model_max_length', 'max_seq_len']: if k in metadata: model_settings['truncation_length'] = metadata[k] - model_settings['max_seq_len'] = metadata[k] + model_settings['truncation_length_info'] = metadata[k] + model_settings['max_seq_len'] = min(metadata[k], 8192) if 'rope_theta' in metadata: model_settings['rope_freq_base'] = metadata['rope_theta'] @@ -155,14 +158,14 @@ def infer_loader(model_name, model_settings): path_to_model = Path(f'{shared.args.model_dir}/{model_name}') if not path_to_model.exists(): loader = None - elif 
(path_to_model / 'quantize_config.json').exists(): # Old GPTQ metadata file - loader = 'ExLlamav2_HF' elif len(list(path_to_model.glob('*.gguf'))) > 0 and path_to_model.is_dir() and (path_to_model / 'tokenizer_config.json').exists(): loader = 'llamacpp_HF' elif len(list(path_to_model.glob('*.gguf'))) > 0: loader = 'llama.cpp' elif re.match(r'.*\.gguf', model_name.lower()): loader = 'llama.cpp' + elif re.match(r'.*exl3', model_name.lower()): + loader = 'ExLlamav3_HF' elif re.match(r'.*exl2', model_name.lower()): loader = 'ExLlamav2_HF' elif re.match(r'.*-hqq', model_name.lower()): diff --git a/modules/shared.py b/modules/shared.py index 2e91f4d5..0981f6fb 100644 --- a/modules/shared.py +++ b/modules/shared.py @@ -53,7 +53,7 @@ settings = { 'skip_special_tokens': True, 'stream': True, 'static_cache': False, - 'truncation_length': 2048, + 'truncation_length': 8192, 'seed': -1, 'custom_stopping_strings': '', 'custom_token_bans': '', @@ -79,7 +79,6 @@ group.add_argument('--model', type=str, help='Name of the model to load by defau group.add_argument('--lora', type=str, nargs='+', help='The list of LoRAs to load. If you want to load more than one LoRA, write the names separated by spaces.') group.add_argument('--model-dir', type=str, default='models/', help='Path to directory with all the models.') group.add_argument('--lora-dir', type=str, default='loras/', help='Path to directory with all the loras.') -group.add_argument('--model-menu', action='store_true', help='Show a model menu in the terminal when the web UI is first launched.') group.add_argument('--settings', type=str, help='Load the default interface settings from this yaml file. See settings-template.yaml for an example. If you create a file called settings.yaml, this file will be loaded by default without the need to use the --settings flag.') group.add_argument('--extensions', type=str, nargs='+', help='The list of extensions to load. If you want to load more than one extension, write the names separated by spaces.') group.add_argument('--verbose', action='store_true', help='Print the prompts to the terminal.') @@ -87,7 +86,7 @@ group.add_argument('--idle-timeout', type=int, default=0, help='Unload model aft # Model loader group = parser.add_argument_group('Model loader') -group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, HQQ, TensorRT-LLM.') +group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav3_HF, ExLlamav2_HF, ExLlamav2, HQQ, TensorRT-LLM.') # Transformers/Accelerate group = parser.add_argument_group('Transformers/Accelerate') @@ -118,7 +117,7 @@ group.add_argument('--quant_type', type=str, default='nf4', help='quant_type for group = parser.add_argument_group('llama.cpp') group.add_argument('--flash-attn', action='store_true', help='Use flash-attention.') group.add_argument('--tensorcores', action='store_true', help='NVIDIA only: use llama-cpp-python compiled without GGML_CUDA_FORCE_MMQ. 
This may improve performance on newer cards.') -group.add_argument('--n_ctx', type=int, default=2048, help='Size of the prompt context.') +group.add_argument('--n_ctx', type=int, default=8192, help='Size of the prompt context.') group.add_argument('--threads', type=int, default=0, help='Number of threads to use.') group.add_argument('--threads-batch', type=int, default=0, help='Number of threads to use for batches/prompt processing.') group.add_argument('--no_mul_mat_q', action='store_true', help='Disable the mulmat kernels.') @@ -140,7 +139,7 @@ group.add_argument('--tokenizer-dir', type=str, help='Load the tokenizer from th group = parser.add_argument_group('ExLlamaV2') group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.') group.add_argument('--autosplit', action='store_true', help='Autosplit the model tensors across the available GPUs. This causes --gpu-split to be ignored.') -group.add_argument('--max_seq_len', type=int, default=2048, help='Maximum sequence length.') +group.add_argument('--max_seq_len', type=int, default=8192, help='Maximum sequence length.') group.add_argument('--cfg-cache', action='store_true', help='ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader.') group.add_argument('--no_flash_attn', action='store_true', help='Force flash-attention to not be used.') group.add_argument('--no_xformers', action='store_true', help='Force xformers to not be used.') @@ -215,6 +214,7 @@ group.add_argument('--disable_exllama', action='store_true', help='DEPRECATED') group.add_argument('--disable_exllamav2', action='store_true', help='DEPRECATED') group.add_argument('--wbits', type=int, default=0, help='DEPRECATED') group.add_argument('--groupsize', type=int, default=-1, help='DEPRECATED') +group.add_argument('--model-menu', action='store_true', help='DEPRECATED') args = parser.parse_args() args_defaults = parser.parse_args([]) @@ -273,6 +273,8 @@ def fix_loader_name(name): return 'ExLlamav2' elif name in ['exllamav2-hf', 'exllamav2_hf', 'exllama-v2-hf', 'exllama_v2_hf', 'exllama-v2_hf', 'exllama2-hf', 'exllama2_hf', 'exllama-2-hf', 'exllama_2_hf', 'exllama-2_hf']: return 'ExLlamav2_HF' + elif name in ['exllamav3-hf', 'exllamav3_hf', 'exllama-v3-hf', 'exllama_v3_hf', 'exllama-v3_hf', 'exllama3-hf', 'exllama3_hf', 'exllama-3-hf', 'exllama_3_hf', 'exllama-3_hf']: + return 'ExLlamav3_HF' elif name in ['hqq']: return 'HQQ' elif name in ['tensorrt', 'tensorrtllm', 'tensorrt_llm', 'tensorrt-llm', 'tensort', 'tensortllm']: diff --git a/modules/ui_model_menu.py b/modules/ui_model_menu.py index 1264a9fd..4fc1de08 100644 --- a/modules/ui_model_menu.py +++ b/modules/ui_model_menu.py @@ -200,8 +200,10 @@ def create_event_handlers(): def load_model_wrapper(selected_model, loader, autoload=False): + settings = get_model_metadata(selected_model) + if not autoload: - yield f"The settings for `{selected_model}` have been updated.\n\nClick on \"Load\" to load it." + yield "### {}\n\n- Settings updated: Click \"Load\" to load the model\n- Max sequence length: {}".format(selected_model, settings['truncation_length_info']) return if selected_model == 'None': @@ -214,13 +216,7 @@ def load_model_wrapper(selected_model, loader, autoload=False): shared.model, shared.tokenizer = load_model(selected_model, loader) if shared.model is not None: - output = f"Successfully loaded `{selected_model}`." 
- - settings = get_model_metadata(selected_model) - if 'instruction_template' in settings: - output += '\n\nIt seems to be an instruction-following model with template "{}". In the chat tab, instruct or chat-instruct modes should be used.'.format(settings['instruction_template']) - - yield output + yield f"Successfully loaded `{selected_model}`." else: yield f"Failed to load `{selected_model}`." except: diff --git a/modules/ui_parameters.py b/modules/ui_parameters.py index 846fcfe7..c3245a9d 100644 --- a/modules/ui_parameters.py +++ b/modules/ui_parameters.py @@ -87,7 +87,7 @@ def create_ui(default_preset): shared.gradio['static_cache'] = gr.Checkbox(value=shared.settings['static_cache'], label='Static KV cache', info='Use a static cache for improved performance.') with gr.Column(): - shared.gradio['truncation_length'] = gr.Number(precision=0, step=256, value=get_truncation_length(), label='Truncate the prompt up to this length', info='The leftmost tokens are removed if the prompt exceeds this length. Most models require this to be at most 2048.') + shared.gradio['truncation_length'] = gr.Number(precision=0, step=256, value=get_truncation_length(), label='Truncate the prompt up to this length', info='The leftmost tokens are removed if the prompt exceeds this length.') shared.gradio['seed'] = gr.Number(value=shared.settings['seed'], label='Seed (-1 for random)') shared.gradio['sampler_priority'] = gr.Textbox(value=generate_params['sampler_priority'], lines=12, label='Sampler priority', info='Parameter names separated by new lines or commas.', elem_classes=['add_scrollbar']) diff --git a/one_click.py b/one_click.py index effc7d43..9f46a2df 100644 --- a/one_click.py +++ b/one_click.py @@ -16,10 +16,12 @@ import sys # os.environ["HCC_AMDGPU_TARGET"] = 'gfx1030' -# Define the required PyTorch version -TORCH_VERSION = "2.4.1" -TORCHVISION_VERSION = "0.19.1" -TORCHAUDIO_VERSION = "2.4.1" +# Define the required versions +TORCH_VERSION = "2.6.0" +TORCHVISION_VERSION = "0.21.0" +TORCHAUDIO_VERSION = "2.6.0" +PYTHON_VERSION = "3.11" +LIBSTDCXX_VERSION_LINUX = "12.1.0" # Environment script_dir = os.getcwd() @@ -101,15 +103,20 @@ def torch_version(): return torver -def update_pytorch(): +def update_pytorch_and_python(): print_big_message("Checking for PyTorch updates.") + + # Update the Python version. Left here for future reference in case this becomes necessary. 
+ # print_big_message("Checking for PyTorch and Python updates.") + # current_python_version = f"{sys.version_info.major}.{sys.version_info.minor}" + # if current_python_version != PYTHON_VERSION: + # run_cmd(f"conda install -y python={PYTHON_VERSION}", assert_success=True, environment=True) + torver = torch_version() base_cmd = f"python -m pip install --upgrade torch=={TORCH_VERSION} torchvision=={TORCHVISION_VERSION} torchaudio=={TORCHAUDIO_VERSION}" - if "+cu118" in torver: - install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu118" - elif "+cu" in torver: - install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu121" + if "+cu" in torver: + install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/cu124" elif "+rocm" in torver: install_cmd = f"{base_cmd} --index-url https://download.pytorch.org/whl/rocm6.1" elif "+cpu" in torver: @@ -236,24 +243,21 @@ def install_webui(): choice = os.environ["GPU_CHOICE"].upper() print_big_message(f"Selected GPU choice \"{choice}\" based on the GPU_CHOICE environment variable.") - # Warn about changed meanings and handle old NVIDIA choice + # Warn about changed meanings and handle old choices if choice == "B": - print_big_message("Warning: GPU_CHOICE='B' now means 'NVIDIA (CUDA 11.8)' in the new version.") + print_big_message("Warning: GPU_CHOICE='B' now means 'AMD' in the new version.") elif choice == "C": - print_big_message("Warning: GPU_CHOICE='C' now means 'AMD' in the new version.") + print_big_message("Warning: GPU_CHOICE='C' now means 'Apple M Series' in the new version.") elif choice == "D": - print_big_message("Warning: GPU_CHOICE='D' now means 'Apple M Series' in the new version.") - elif choice == "A" and "USE_CUDA118" in os.environ: - choice = "B" if os.environ.get("USE_CUDA118", "").lower() in ("yes", "y", "true", "1", "t", "on") else "A" + print_big_message("Warning: GPU_CHOICE='D' now means 'Intel Arc' in the new version.") else: choice = get_user_choice( "What is your GPU?", { - 'A': 'NVIDIA - CUDA 12.1 (recommended)', - 'B': 'NVIDIA - CUDA 11.8 (legacy GPUs)', - 'C': 'AMD - Linux/macOS only, requires ROCm 6.1', - 'D': 'Apple M Series', - 'E': 'Intel Arc (beta)', + 'A': 'NVIDIA - CUDA 12.4', + 'B': 'AMD - Linux/macOS only, requires ROCm 6.1', + 'C': 'Apple M Series', + 'D': 'Intel Arc (beta)', 'N': 'CPU mode' }, ) @@ -261,15 +265,13 @@ def install_webui(): # Convert choices to GPU names for compatibility gpu_choice_to_name = { "A": "NVIDIA", - "B": "NVIDIA", - "C": "AMD", - "D": "APPLE", - "E": "INTEL", + "B": "AMD", + "C": "APPLE", + "D": "INTEL", "N": "NONE" } selected_gpu = gpu_choice_to_name[choice] - use_cuda118 = (choice == "B") # CUDA version is now determined by menu choice # Write a flag to CMD_FLAGS.txt for CPU mode if selected_gpu == "NONE": @@ -280,10 +282,7 @@ def install_webui(): # Handle CUDA version display elif any((is_windows(), is_linux())) and selected_gpu == "NVIDIA": - if use_cuda118: - print("CUDA: 11.8") - else: - print("CUDA: 12.1") + print("CUDA: 12.4") # No PyTorch for AMD on Windows (?) 
elif is_windows() and selected_gpu == "AMD": @@ -294,10 +293,7 @@ def install_webui(): install_pytorch = f"python -m pip install torch=={TORCH_VERSION} torchvision=={TORCHVISION_VERSION} torchaudio=={TORCHAUDIO_VERSION} " if selected_gpu == "NVIDIA": - if use_cuda118 == 'Y': - install_pytorch += "--index-url https://download.pytorch.org/whl/cu118" - else: - install_pytorch += "--index-url https://download.pytorch.org/whl/cu121" + install_pytorch += "--index-url https://download.pytorch.org/whl/cu124" elif selected_gpu == "AMD": install_pytorch += "--index-url https://download.pytorch.org/whl/rocm6.1" elif selected_gpu in ["APPLE", "NONE"]: @@ -310,14 +306,14 @@ def install_webui(): # Install Git and then Pytorch print_big_message("Installing PyTorch.") - run_cmd(f"conda install -y -k ninja git && {install_pytorch} && python -m pip install py-cpuinfo==9.0.0", assert_success=True, environment=True) + run_cmd(f"conda install -y ninja git && {install_pytorch} && python -m pip install py-cpuinfo==9.0.0", assert_success=True, environment=True) if selected_gpu == "INTEL": # Install oneAPI dependencies via conda print_big_message("Installing Intel oneAPI runtime libraries.") - run_cmd("conda install -y -c https://software.repos.intel.com/python/conda/ -c conda-forge dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0") + run_cmd("conda install -y -c https://software.repos.intel.com/python/conda/ -c conda-forge dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0", environment=True) # Install libuv required by Intel-patched torch - run_cmd("conda install -y libuv") + run_cmd("conda install -y libuv", environment=True) # Install the webui requirements update_requirements(initial_installation=True, pull=False) @@ -336,6 +332,24 @@ def install_extensions_requirements(): run_cmd(f"python -m pip install -r {extension_req_path} --upgrade", assert_success=False, environment=True) +def clean_outdated_pytorch_cuda_dependencies(): + patterns = ["cu121", "cu122", "torch2.4"] + result = run_cmd("python -m pip list --format=freeze", capture_output=True, environment=True) + matching_packages = [] + + for line in result.stdout.decode('utf-8').splitlines(): + if "==" in line: + pkg_name, version = line.split('==', 1) + if any(pattern in version for pattern in patterns): + matching_packages.append(pkg_name) + + if matching_packages: + print(f"\nUninstalling: {', '.join(matching_packages)}\n") + run_cmd(f"python -m pip uninstall -y {' '.join(matching_packages)}", assert_success=True, environment=True) + + return matching_packages + + def update_requirements(initial_installation=False, pull=True): # Create .git directory if missing if not os.path.exists(os.path.join(script_dir, ".git")): @@ -421,9 +435,14 @@ def update_requirements(initial_installation=False, pull=True): if os.environ.get("INSTALL_EXTENSIONS", "").lower() in ("yes", "y", "true", "1", "t", "on"): install_extensions_requirements() + if is_linux(): + run_cmd(f"conda install -y -c conda-forge libstdcxx-ng=={LIBSTDCXX_VERSION_LINUX}", assert_success=True, environment=True) + # Update PyTorch if not initial_installation: - update_pytorch() + update_pytorch_and_python() + torver = torch_version() + clean_outdated_pytorch_cuda_dependencies() print_big_message(f"Installing webui requirements from file: {requirements_file}") print(f"TORCH: {torver}\n") @@ -434,16 +453,6 @@ def update_requirements(initial_installation=False, pull=True): if not initial_installation and not wheels_changed: textgen_requirements = [line for line in textgen_requirements if '.whl' not in line] - if "+cu118" in 
torver: - textgen_requirements = [ - req.replace('+cu121', '+cu118').replace('+cu122', '+cu118') - for req in textgen_requirements - if "autoawq" not in req.lower() - ] - - if is_windows() and "+cu118" in torver: # No flash-attention on Windows for CUDA 11 - textgen_requirements = [req for req in textgen_requirements if 'oobabooga/flash-attention' not in req] - with open('temp_requirements.txt', 'w') as file: file.write('\n'.join(textgen_requirements)) diff --git a/requirements.txt b/requirements.txt index 83bd3a53..b9b4ea7a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* bitsandbytes==0.45.* colorama datasets @@ -10,7 +10,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -21,7 +21,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -33,29 +33,21 @@ tiktoken # llama-cpp-python (CPU only, AVX2) https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" # llama-cpp-python (CUDA, with GGML_CUDA_FORCE_MMQ) -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" +https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" # llama-cpp-python (CUDA, without GGML_CUDA_FORCE_MMQ) -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == 
"3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" +https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" # CUDA wheels -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" +https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" +https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" +https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64" -https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" -https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; 
platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" +https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" diff --git a/requirements_amd.txt b/requirements_amd.txt index 1e757ffe..3d24891f 100644 --- a/requirements_amd.txt +++ b/requirements_amd.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* colorama datasets einops @@ -9,7 +9,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -20,7 +20,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -32,13 +32,9 @@ tiktoken # llama-cpp-python (CPU only, AVX2) https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" # AMD wheels https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.8+rocm6.1.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.8+rocm6.1.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64" diff --git a/requirements_amd_noavx2.txt 
b/requirements_amd_noavx2.txt index f74ebf69..057b631d 100644 --- a/requirements_amd_noavx2.txt +++ b/requirements_amd_noavx2.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* colorama datasets einops @@ -9,7 +9,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -20,7 +20,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -32,11 +32,8 @@ tiktoken # llama-cpp-python (CPU only, no AVX2) https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" # AMD wheels -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" -https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10" +https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+rocm6.1.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11" https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64" diff --git a/requirements_apple_intel.txt b/requirements_apple_intel.txt index dcdeae3f..eba21ec2 100644 --- a/requirements_apple_intel.txt +++ b/requirements_apple_intel.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* colorama datasets einops @@ -9,7 +9,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -20,7 +20,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -32,7 +32,6 @@ tiktoken # Mac wheels https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11" 
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10" +https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1-py3-none-any.whl https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl diff --git a/requirements_apple_silicon.txt b/requirements_apple_silicon.txt index b823e40e..2048c99b 100644 --- a/requirements_apple_silicon.txt +++ b/requirements_apple_silicon.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* colorama datasets einops @@ -9,7 +9,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -20,7 +20,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -32,9 +32,7 @@ tiktoken # Mac wheels https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10" https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11" -https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.8-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10" +https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1-py3-none-any.whl https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl diff --git a/requirements_cpu_only.txt b/requirements_cpu_only.txt index fe3f522a..c7e2687c 100644 --- a/requirements_cpu_only.txt +++ b/requirements_cpu_only.txt @@ -1,4 +1,4 @@ -accelerate==1.4.* +accelerate==1.5.* colorama datasets einops @@ -9,7 +9,7 @@ markdown numba==0.59.* numpy==1.26.* pandas -peft==0.12.* +peft==0.15.* Pillow>=9.5.0 psutil pydantic==2.8.2 @@ -20,7 +20,7 @@ safetensors==0.5.* scipy sentencepiece tensorboard -transformers==4.49.* +transformers==4.50.* tqdm wandb @@ -32,6 +32,4 @@ tiktoken # llama-cpp-python (CPU only, AVX2) https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and 
python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
diff --git a/requirements_cpu_only_noavx2.txt b/requirements_cpu_only_noavx2.txt
index 014e2e5d..2003c544 100644
--- a/requirements_cpu_only_noavx2.txt
+++ b/requirements_cpu_only_noavx2.txt
@@ -1,4 +1,4 @@
-accelerate==1.4.*
+accelerate==1.5.*
 colorama
 datasets
 einops
@@ -9,7 +9,7 @@ markdown
 numba==0.59.*
 numpy==1.26.*
 pandas
-peft==0.12.*
+peft==0.15.*
 Pillow>=9.5.0
 psutil
 pydantic==2.8.2
@@ -20,7 +20,7 @@ safetensors==0.5.*
 scipy
 sentencepiece
 tensorboard
-transformers==4.49.*
+transformers==4.50.*
 tqdm
 wandb

@@ -32,6 +32,4 @@ tiktoken

 # llama-cpp-python (CPU only, no AVX2)
 https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
diff --git a/requirements_noavx2.txt b/requirements_noavx2.txt
index 6139c46e..60b71ac1 100644
--- a/requirements_noavx2.txt
+++ b/requirements_noavx2.txt
@@ -1,4 +1,4 @@
-accelerate==1.4.*
+accelerate==1.5.*
 bitsandbytes==0.45.*
 colorama
 datasets
@@ -10,7 +10,7 @@ markdown
 numba==0.59.*
 numpy==1.26.*
 pandas
-peft==0.12.*
+peft==0.15.*
 Pillow>=9.5.0
 psutil
 pydantic==2.8.2
@@ -21,7 +21,7 @@ safetensors==0.5.*
 scipy
 sentencepiece
 tensorboard
-transformers==4.49.*
+transformers==4.50.*
 tqdm
 wandb

@@ -33,29 +33,21 @@ tiktoken

 # llama-cpp-python (CPU only, no AVX2)
 https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.8+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"

 # llama-cpp-python (CUDA, with GGML_CUDA_FORCE_MMQ)
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.8+cu124avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"

 # llama-cpp-python (CUDA, without GGML_CUDA_FORCE_MMQ)
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.8+cu124avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"

 # CUDA wheels
-https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.1/exllamav3-0.0.1+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/oobabooga/exllamav2/releases/download/v0.2.8/exllamav2-0.2.8-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/oobabooga/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
diff --git a/requirements_nowheels.txt b/requirements_nowheels.txt
index 858ffff5..3b61ca39 100644
--- a/requirements_nowheels.txt
+++ b/requirements_nowheels.txt
@@ -1,4 +1,4 @@
-accelerate==1.4.*
+accelerate==1.5.*
 colorama
 datasets
 einops
@@ -9,7 +9,7 @@ markdown
 numba==0.59.*
 numpy==1.26.*
 pandas
-peft==0.12.*
+peft==0.15.*
 Pillow>=9.5.0
 psutil
 pydantic==2.8.2
@@ -20,7 +20,7 @@ safetensors==0.5.*
 scipy
 sentencepiece
 tensorboard
-transformers==4.49.*
+transformers==4.50.*
 tqdm
 wandb

diff --git a/server.py b/server.py
index 31e1c4c6..1f227350 100644
--- a/server.py
+++ b/server.py
@@ -218,28 +218,10 @@ if __name__ == "__main__":
         if extension not in shared.args.extensions:
             shared.args.extensions.append(extension)

-    available_models = utils.get_available_models()
-
     # Model defined through --model
     if shared.args.model is not None:
         shared.model_name = shared.args.model

-    # Select the model from a command-line menu
-    elif shared.args.model_menu:
-        if len(available_models) == 0:
-            logger.error('No models are available! Please download at least one.')
-            sys.exit(0)
-        else:
-            print('The following models are available:\n')
-            for i, model in enumerate(available_models):
-                print(f'{i+1}. {model}')
-
-            print(f'\nWhich one do you want to load? 1-{len(available_models)}\n')
-            i = int(input()) - 1
-            print()
-
-        shared.model_name = available_models[i]
-
     # If any model has been selected, load it
     if shared.model_name != 'None':
         p = Path(shared.model_name)
diff --git a/settings-template.yaml b/settings-template.yaml
index 74935a60..0343df0a 100644
--- a/settings-template.yaml
+++ b/settings-template.yaml
@@ -25,7 +25,7 @@ add_bos_token: true
 skip_special_tokens: true
 stream: true
 static_cache: false
-truncation_length: 2048
+truncation_length: 8192
 seed: -1
 custom_stopping_strings: ''
 custom_token_bans: ''