Merge pull request #7205 from oobabooga/dev

Merge dev branch
Author: oobabooga, 2025-08-19 11:51:06 -03:00, committed by GitHub
commit cb00db15c9
GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)
33 changed files with 217 additions and 246 deletions

README.md
@@ -2,8 +2,6 @@
 A Gradio web UI for Large Language Models.
-Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) of text generation.
 [Try the Deep Reason extension](https://oobabooga.gumroad.com/l/deep_reason)
 |![Image1](https://github.com/oobabooga/screenshots/raw/main/INSTRUCT-3.5.png) | ![Image2](https://github.com/oobabooga/screenshots/raw/main/CHAT-3.5.png) |
@@ -32,54 +30,15 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 ## How to install
 #### Option 1: Portable builds (get started in 1 minute)
 No installation needed: just download, unzip and run. All dependencies included.
 Compatible with GGUF (llama.cpp) models on Windows, Linux, and macOS.
-Download from here: https://github.com/oobabooga/text-generation-webui/releases
+Download from here: **https://github.com/oobabooga/text-generation-webui/releases**
-#### Option 2: One-click installer
+#### Option 2: Manual portable install with venv
-For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.
-1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
-2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
-3. When prompted, select your GPU vendor.
-4. After installation, open `http://127.0.0.1:7860` in your browser.
-To restart the web UI later, run the same `start_` script.
-To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.
-You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).
-To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.
-<details>
-<summary>
-One-click installer details
-</summary>
-### One-click-installer
-The script uses Miniforge to set up a Conda environment in the `installer_files` folder.
-If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.
-* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
-* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
-* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.
-</details>
-<details>
-<summary>
-Manual portable installation with venv
-</summary>
-### Manual portable installation with venv
 Very fast setup that should work on any Python 3.9+:
@@ -98,7 +57,7 @@ venv\Scripts\activate
 source venv/bin/activate
 # Install dependencies (choose appropriate file under requirements/portable for your hardware)
-pip install -r requirements/portable/requirements.txt
+pip install -r requirements/portable/requirements.txt --upgrade
 # Launch server (basic command)
 python server.py --portable --api --auto-launch
@@ -106,6 +65,39 @@ python server.py --portable --api --auto-launch
 # When done working, deactivate
 deactivate
 ```
+#### Option 3: One-click installer
+For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.
+1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
+2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
+3. When prompted, select your GPU vendor.
+4. After installation, open `http://127.0.0.1:7860` in your browser.
+To restart the web UI later, run the same `start_` script.
+You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).
+To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.
+To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.
+<details>
+<summary>
+One-click installer details
+</summary>
+### One-click-installer
+The script uses Miniforge to set up a Conda environment in the `installer_files` folder.
+If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.
+* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
+* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
+* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.
+</details>
 </details>
 <details>
@@ -139,19 +131,19 @@ conda activate textgen
 | System | GPU | Command |
 |--------|---------|---------|
-| Linux/WSL | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Linux/WSL | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
-| Linux/WSL | CPU only | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu` |
+| Linux/WSL | CPU only | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cpu` |
-| Linux | AMD | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
+| Linux | AMD | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
-| MacOS + MPS | Any | `pip3 install torch==2.6.0` |
+| MacOS + MPS | Any | `pip3 install torch==2.7.1` |
-| Windows | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Windows | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
-| Windows | CPU only | `pip3 install torch==2.6.0` |
+| Windows | CPU only | `pip3 install torch==2.7.1` |
 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
 If you need `nvcc` to compile some library manually, you will additionally need to install this:
 ```
-conda install -y -c "nvidia/label/cuda-12.4.1" cuda
+conda install -y -c "nvidia/label/cuda-12.8.1" cuda
 ```
 #### 3. Install the web UI
@@ -238,13 +230,13 @@ usage: server.py [-h] [--multi-user] [--model MODEL] [--lora LORA [LORA ...]] [-
 [--extensions EXTENSIONS [EXTENSIONS ...]] [--verbose] [--idle-timeout IDLE_TIMEOUT] [--loader LOADER] [--cpu] [--cpu-memory CPU_MEMORY] [--disk] [--disk-cache-dir DISK_CACHE_DIR]
 [--load-in-8bit] [--bf16] [--no-cache] [--trust-remote-code] [--force-safetensors] [--no_use_fast] [--attn-implementation IMPLEMENTATION] [--load-in-4bit] [--use_double_quant]
 [--compute_dtype COMPUTE_DTYPE] [--quant_type QUANT_TYPE] [--flash-attn] [--threads THREADS] [--threads-batch THREADS_BATCH] [--batch-size BATCH_SIZE] [--no-mmap] [--mlock]
-[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--ctx-size N] [--cache-type N]
+[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--mmproj MMPROJ] [--ctx-size N] [--cache-type N]
-[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--gpu-split GPU_SPLIT]
+[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--enable-tp]
-[--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--enable_tp] [--cpp-runner] [--deepspeed] [--nvme-offload-dir NVME_OFFLOAD_DIR]
+[--tp-backend TP_BACKEND] [--gpu-split GPU_SPLIT] [--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--cpp-runner] [--deepspeed]
-[--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen] [--listen-port LISTEN_PORT]
+[--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen]
-[--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
+[--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE]
-[--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY]
+[--ssl-certfile SSL_CERTFILE] [--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY]
-[--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
+[--admin-key ADMIN_KEY] [--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
 Text generation web UI
@@ -301,6 +293,7 @@ llama.cpp:
 --row-split Split the model by rows across GPUs. This may improve multi-gpu performance.
 --extra-flags EXTRA_FLAGS Extra flags to pass to llama-server. Format: "flag1=value1,flag2,flag3=value3". Example: "override-tensor=exps=CPU"
 --streaming-llm Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.
+--mmproj MMPROJ Path to the mmproj file for vision models.
 Context and cache:
 --ctx-size N, --n_ctx N, --max_seq_len N Context size in tokens.
@@ -314,6 +307,10 @@ Speculative decoding:
 --device-draft DEVICE_DRAFT Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1
 --ctx-size-draft CTX_SIZE_DRAFT Size of the prompt context for the draft model. If 0, uses the same as the main model.
+ExLlamaV3:
+--enable-tp, --enable_tp Enable Tensor Parallelism (TP) to split the model across GPUs.
+--tp-backend TP_BACKEND The backend for tensor parallelism. Valid options: native, nccl. Default: native.
 ExLlamaV2:
 --gpu-split GPU_SPLIT Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.
 --autosplit Autosplit the model tensors across the available GPUs. This causes --gpu-split to be ignored.
@@ -322,7 +319,6 @@ ExLlamaV2:
 --no_xformers Force xformers to not be used.
 --no_sdpa Force Torch SDPA to not be used.
 --num_experts_per_token N Number of experts to use for generation. Applies to MoE models like Mixtral.
---enable_tp Enable Tensor Parallelism (TP) in ExLlamaV2.
 TensorRT-LLM:
 --cpp-runner Use the ModelRunnerCpp runner, which is faster than the default ModelRunner but doesn't support streaming yet.
@@ -382,7 +378,7 @@ text-generation-webui
 └── llama-2-13b-chat.Q4_K_M.gguf
 ```
-* The remaining model types (like 16-bit Transformers models and EXL2 models) are made of several files and must be placed in a subfolder. Example:
+* The remaining model types (like 16-bit Transformers models and EXL3 models) are made of several files and must be placed in a subfolder. Example:
 ```
 text-generation-webui


@@ -93,7 +93,10 @@ curl http://127.0.0.1:5000/v1/chat/completions \
         {"type": "image_url", "image_url": {"url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"}}
       ]
     }
-  ]
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20
 }'
 ```
@@ -127,7 +130,10 @@ curl http://127.0.0.1:5000/v1/completions \
       }
     ]
   }
-  ]
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20
 }'
 ```


@@ -269,7 +269,7 @@ def generate_chat_prompt(user_input, state, **kwargs):
         enhanced_user_msg = user_msg
         # Add attachment content if present AND if past attachments are enabled
-        if (state.get('include_past_attachments', True) and user_key in metadata and "attachments" in metadata[user_key]):
+        if user_key in metadata and "attachments" in metadata[user_key]:
             attachments_text = ""
             image_refs = ""
@@ -277,7 +277,7 @@ def generate_chat_prompt(user_input, state, **kwargs):
             if attachment.get("type") == "image":
                 # Add image reference for multimodal models
                 image_refs += "<__media__>"
-            else:
+            elif state.get('include_past_attachments', True):
                 # Handle text/PDF attachments
                 filename = attachment.get("name", "file")
                 content = attachment.get("content", "")


@@ -91,6 +91,11 @@ class Exllamav3Model:
             split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
             load_params['use_per_device'] = split
+        # Tensor-parallelism
+        if shared.args.enable_tp:
+            load_params['tensor_p'] = True
+            load_params['tp_backend'] = shared.args.tp_backend
         model.load(**load_params)
         tokenizer = Tokenizer.from_config(config)
@@ -177,9 +182,6 @@ class Exllamav3Model:
         Process all possible image inputs and return modified prompt + embeddings.
         Returns: (processed_prompt, image_embeddings)
         """
-        if not self.is_multimodal():
-            return prompt, []
         # Collect images from various sources using shared utilities
         pil_images = []
@@ -234,8 +236,12 @@ class Exllamav3Model:
         """
         Generate text with streaming using native ExLlamaV3 API
         """
-        # Process images and modify prompt (ExLlamaV3-specific)
-        prompt, image_embeddings = self._process_images_for_generation(prompt, state)
+        if shared.is_multimodal:
+            # Process images and modify prompt (ExLlamaV3-specific)
+            prompt, image_embeddings = self._process_images_for_generation(prompt, state)
+        else:
+            image_embeddings = []
         # Greedy decoding is a special case
         if state['temperature'] == 0:


@@ -74,6 +74,11 @@ class Exllamav3HF(PreTrainedModel, GenerationMixin):
             split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
             load_params['use_per_device'] = split
+        # Tensor-parallelism
+        if shared.args.enable_tp:
+            load_params['tensor_p'] = True
+            load_params['tp_backend'] = shared.args.tp_backend
         self.ex_model.load(**load_params)
         self.past_seq = None
         self.max_tokens = max_tokens


@@ -306,6 +306,9 @@ def process_markdown_content(string):
     # Convert to HTML using markdown
     html_output = markdown.markdown(result, extensions=['fenced_code', 'tables', SaneListExtension()])
+    # Remove extra newlines before </code>
+    html_output = re.sub(r'\s*</code>', '</code>', html_output)
     # Unescape code blocks
     pattern = re.compile(r'<code[^>]*>(.*?)</code>', re.DOTALL)
     html_output = pattern.sub(lambda x: html.unescape(x.group()), html_output)
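To see what the added `re.sub` does, here is a tiny self-contained demonstration (the sample HTML string is made up):

```python
import re

# markdown can leave a trailing newline inside <code> blocks; the new cleanup
# step strips any whitespace that directly precedes the closing tag.
html_output = "<pre><code>print('hi')\n\n</code></pre>"
html_output = re.sub(r'\s*</code>', '</code>', html_output)
print(html_output)  # <pre><code>print('hi')</code></pre>
```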


@@ -8,6 +8,7 @@ import sys
 import threading
 import time
 from pathlib import Path
+from typing import Any, List
 import llama_cpp_binaries
 import requests
@@ -129,10 +130,10 @@ class LlamaServer:
         return payload
-    def generate_with_streaming(self, prompt, state):
-        url = f"http://127.0.0.1:{self.port}/completion"
-        payload = self.prepare_payload(state)
+    def _process_images_for_generation(self, state: dict) -> List[Any]:
+        """
+        Process all possible image inputs and return PIL images
+        """
         pil_images = []
         # Source 1: Web UI (from chatbot_wrapper)
         if 'image_attachments' in state and state['image_attachments']:
@@ -144,6 +145,21 @@ class LlamaServer:
         elif 'raw_images' in state and state['raw_images']:
             pil_images.extend(state.get('raw_images', []))
+        return pil_images
+
+    def is_multimodal(self) -> bool:
+        """Check if this model supports multimodal input."""
+        return shared.args.mmproj not in [None, 'None']
+
+    def generate_with_streaming(self, prompt, state):
+        url = f"http://127.0.0.1:{self.port}/completion"
+        payload = self.prepare_payload(state)
+
+        pil_images = []
+        if shared.is_multimodal:
+            pil_images = self._process_images_for_generation(state)
         if pil_images:
             # Multimodal case
             IMAGE_TOKEN_COST_ESTIMATE = 600  # A safe, conservative estimate per image
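The multimodal gate itself is simple enough to demonstrate standalone. A sketch mirroring `is_multimodal()` above, with the `shared.args` plumbing replaced by a plain parameter:

```python
def is_multimodal(mmproj):
    # A llama.cpp model counts as multimodal only when an --mmproj file was
    # supplied; the string 'None' covers the UI's unset placeholder value.
    return mmproj not in [None, 'None']

print(is_multimodal(None))           # False: flag never set
print(is_multimodal('None'))         # False: placeholder from the UI
print(is_multimodal('mmproj.gguf'))  # True: vision projector configured
```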


@@ -56,6 +56,8 @@ loaders_and_params = OrderedDict({
         'cfg_cache',
         'trust_remote_code',
         'no_use_fast',
+        'enable_tp',
+        'tp_backend',
     ],
     'ExLlamav3': [
         'ctx_size',
@@ -65,6 +67,8 @@ loaders_and_params = OrderedDict({
         'draft_max',
         'ctx_size_draft',
         'speculative_decoding_accordion',
+        'enable_tp',
+        'tp_backend',
     ],
     'ExLlamav2_HF': [
         'ctx_size',


@@ -55,6 +55,10 @@ def load_model(model_name, loader=None):
     if loader.lower().startswith('exllama') or loader.lower().startswith('tensorrt') or loader == 'llama.cpp':
         shared.settings['truncation_length'] = shared.args.ctx_size
+    shared.is_multimodal = False
+    if loader.lower() in ('exllamav3', 'llama.cpp'):
+        shared.is_multimodal = model.is_multimodal()
     logger.info(f"Loaded \"{model_name}\" in {(time.time()-t0):.2f} seconds.")
     logger.info(f"LOADER: \"{loader}\"")
     logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")
@@ -124,10 +128,12 @@ def unload_model(keep_model_name=False):
     if shared.model is None:
         return
-    is_llamacpp = (shared.model.__class__.__name__ == 'LlamaServer')
-    if shared.args.loader in ['ExLlamav3_HF', 'ExLlamav3']:
+    model_class_name = shared.model.__class__.__name__
+    is_llamacpp = (model_class_name == 'LlamaServer')
+    if model_class_name in ['Exllamav3Model', 'Exllamav3HF']:
         shared.model.unload()
-    elif shared.args.loader in ['ExLlamav2_HF', 'ExLlamav2'] and hasattr(shared.model, 'unload'):
+    elif model_class_name in ['Exllamav2Model', 'Exllamav2HF'] and hasattr(shared.model, 'unload'):
         shared.model.unload()
     shared.model = shared.tokenizer = None
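Dispatching on the model object's class name rather than `shared.args.loader` presumably makes unloading robust when the loaded object and the loader flag disagree. A minimal sketch of the pattern, with a hypothetical stand-in class:

```python
class Exllamav3Model:  # stand-in for the real loader class of the same name
    def unload(self):
        print("freeing ExLlamaV3 weights")

model = Exllamav3Model()

# Dispatch on what was actually loaded, not on a CLI flag
model_class_name = model.__class__.__name__
if model_class_name in ['Exllamav3Model', 'Exllamav3HF']:
    model.unload()
```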


@@ -251,7 +251,7 @@ def apply_model_settings_to_state(model, state):
     model_settings = get_model_metadata(model)
     if 'loader' in model_settings:
         loader = model_settings.pop('loader')
-        if not (loader == 'ExLlamav2_HF' and state['loader'] in ['ExLlamav2']):
+        if not ((loader == 'ExLlamav2_HF' and state['loader'] == 'ExLlamav2') or (loader == 'ExLlamav3_HF' and state['loader'] == 'ExLlamav3')):
             state['loader'] = loader
     for k in model_settings:
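The widened condition reads awkwardly inline; extracted as a predicate it is easier to check. A sketch (the helper name is invented):

```python
def keep_user_loader(model_loader, state_loader):
    # The user's explicit ExLlamav2/ExLlamav3 pick now survives a model whose
    # metadata recommends the corresponding _HF wrapper; any other combination
    # is still overridden by the metadata.
    return ((model_loader == 'ExLlamav2_HF' and state_loader == 'ExLlamav2')
            or (model_loader == 'ExLlamav3_HF' and state_loader == 'ExLlamav3'))

print(keep_user_loader('ExLlamav3_HF', 'ExLlamav3'))  # True: keep user's choice
print(keep_user_loader('ExLlamav3_HF', 'llama.cpp'))  # False: apply metadata
```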


@@ -16,6 +16,7 @@ model = None
 tokenizer = None
 model_name = 'None'
 is_seq2seq = False
+is_multimodal = False
 model_dirty_from_training = False
 lora_names = []
@@ -100,6 +101,11 @@ group.add_argument('--gpu-layers-draft', type=int, default=256, help='Number of
 group.add_argument('--device-draft', type=str, default=None, help='Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1')
 group.add_argument('--ctx-size-draft', type=int, default=0, help='Size of the prompt context for the draft model. If 0, uses the same as the main model.')
+
+# ExLlamaV3
+group = parser.add_argument_group('ExLlamaV3')
+group.add_argument('--enable-tp', '--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) to split the model across GPUs.')
+group.add_argument('--tp-backend', type=str, default='native', help='The backend for tensor parallelism. Valid options: native, nccl. Default: native.')
 # ExLlamaV2
 group = parser.add_argument_group('ExLlamaV2')
 group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.')
@@ -109,7 +115,6 @@ group.add_argument('--no_flash_attn', action='store_true', help='Force flash-att
 group.add_argument('--no_xformers', action='store_true', help='Force xformers to not be used.')
 group.add_argument('--no_sdpa', action='store_true', help='Force Torch SDPA to not be used.')
 group.add_argument('--num_experts_per_token', type=int, default=2, metavar='N', help='Number of experts to use for generation. Applies to MoE models like Mixtral.')
-group.add_argument('--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) in ExLlamaV2.')
 # TensorRT-LLM
 group = parser.add_argument_group('TensorRT-LLM')
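How the two new flags travel from the command line into ExLlamaV3's `load()` call can be reproduced in isolation. A sketch combining the argparse definitions above with the `load_params` logic from the exllamav3 hunks earlier in this diff:

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group('ExLlamaV3')
group.add_argument('--enable-tp', '--enable_tp', action='store_true',
                   help='Enable Tensor Parallelism (TP) to split the model across GPUs.')
group.add_argument('--tp-backend', type=str, default='native',
                   help='Valid options: native, nccl. Default: native.')

args = parser.parse_args(['--enable-tp', '--tp-backend', 'nccl'])

load_params = {}
if args.enable_tp:
    load_params['tensor_p'] = True
    load_params['tp_backend'] = args.tp_backend

print(load_params)  # {'tensor_p': True, 'tp_backend': 'nccl'}
```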


@@ -155,6 +155,7 @@ def list_model_elements():
         'bf16',
         'autosplit',
         'enable_tp',
+        'tp_backend',
         'no_flash_attn',
         'no_xformers',
         'no_sdpa',


@@ -46,6 +46,8 @@ def create_ui():
             shared.gradio['gpu_split'] = gr.Textbox(label='gpu-split', info='Comma-separated list of VRAM (in GB) to use per GPU. Example: 20,7,7')
             shared.gradio['attn_implementation'] = gr.Dropdown(label="attn-implementation", choices=['sdpa', 'eager', 'flash_attention_2'], value=shared.args.attn_implementation, info='Attention implementation.')
             shared.gradio['cache_type'] = gr.Dropdown(label="cache-type", choices=['fp16', 'q8_0', 'q4_0', 'fp8', 'q8', 'q7', 'q6', 'q5', 'q4', 'q3', 'q2'], value=shared.args.cache_type, allow_custom_value=True, info='Valid options: llama.cpp - fp16, q8_0, q4_0; ExLlamaV2 - fp16, fp8, q8, q6, q4; ExLlamaV3 - fp16, q2 to q8. For ExLlamaV3, you can type custom combinations for separate k/v bits (e.g. q4_q8).')
+            shared.gradio['tp_backend'] = gr.Dropdown(label="tp-backend", choices=['native', 'nccl'], value=shared.args.tp_backend, info='The backend for tensor parallelism.')
         with gr.Column():
             shared.gradio['vram_info'] = gr.HTML(value=get_initial_vram_info())
             shared.gradio['flash_attn'] = gr.Checkbox(label="flash-attn", value=shared.args.flash_attn, info='Use flash-attention.')
@@ -54,7 +56,7 @@ def create_ui():
             shared.gradio['load_in_4bit'] = gr.Checkbox(label="load-in-4bit", value=shared.args.load_in_4bit)
             shared.gradio['use_double_quant'] = gr.Checkbox(label="use_double_quant", value=shared.args.use_double_quant, info='Used by load-in-4bit.')
             shared.gradio['autosplit'] = gr.Checkbox(label="autosplit", value=shared.args.autosplit, info='Automatically split the model tensors across the available GPUs.')
-            shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable Tensor Parallelism (TP).')
+            shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable tensor parallelism (TP).')
             shared.gradio['cpp_runner'] = gr.Checkbox(label="cpp-runner", value=shared.args.cpp_runner, info='Enable inference with ModelRunnerCpp, which is faster than the default ModelRunner.')
             shared.gradio['trust_remote_code'] = gr.Checkbox(label="trust-remote-code", value=shared.args.trust_remote_code, info='Set trust_remote_code=True while loading the tokenizer/model. To enable this option, start the web UI with the --trust-remote-code flag.', interactive=shared.args.trust_remote_code)
             shared.gradio['tensorrt_llm_info'] = gr.Markdown('* TensorRT-LLM has to be installed manually in a separate Python 3.10 environment at the moment. For a guide, consult the description of [this PR](https://github.com/oobabooga/text-generation-webui/pull/5715). \n\n* `ctx_size` is only used when `cpp-runner` is checked.\n\n* `cpp_runner` does not support streaming at the moment.')


@@ -1,6 +1,8 @@
 import concurrent.futures
 import html
+import random
 import re
+import urllib.request
 from concurrent.futures import as_completed
 from datetime import datetime
 from urllib.parse import quote_plus
@@ -50,16 +52,21 @@ def download_web_page(url, timeout=10):
 def perform_web_search(query, num_pages=3, max_workers=5, timeout=10):
     """Perform web search and return results with content"""
     try:
-        # Use DuckDuckGo HTML search endpoint
         search_url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
-        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
-        response = requests.get(search_url, headers=headers, timeout=timeout)
-        response.raise_for_status()
+        agents = [
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
+        ]
+        response_text = ""
+        req = urllib.request.Request(search_url, headers={'User-Agent': random.choice(agents)})
+        with urllib.request.urlopen(req, timeout=timeout) as response:
+            response_text = response.read().decode('utf-8')
         # Extract results with regex
-        titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
-        urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
+        titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
+        urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
         # Prepare download tasks
         download_tasks = []
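The regex extraction step is easy to sanity-check against a trimmed sample of the result markup (the anchor tags below are illustrative, not a captured DuckDuckGo response):

```python
import re

response_text = (
    '<a class="result__a" href="#">Example title</a>'
    '<a class="result__url" href="#">example.com</a>'
)

titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)

print(titles)  # ['Example title']
print(urls)    # ['example.com']
```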


@@ -16,7 +16,7 @@ import sys
 # os.environ["HCC_AMDGPU_TARGET"] = 'gfx1030'
 # Define the required versions
-TORCH_VERSION = "2.6.0"
+TORCH_VERSION = "2.7.1"
 PYTHON_VERSION = "3.11"
 LIBSTDCXX_VERSION_LINUX = "12.1.0"
@@ -113,17 +113,16 @@ def get_gpu_choice():
     choice = get_user_choice(
         "What is your GPU?",
         {
-            'A': 'NVIDIA - CUDA 12.4',
+            'A': 'NVIDIA',
             'B': 'AMD - Linux/macOS only, requires ROCm 6.2.4',
             'C': 'Apple M Series',
             'D': 'Intel Arc (beta)',
-            'E': 'NVIDIA - CUDA 12.8',
             'N': 'CPU mode'
         },
     )
     # Convert choice to GPU name
-    gpu_choice = {"A": "NVIDIA", "B": "AMD", "C": "APPLE", "D": "INTEL", "E": "NVIDIA_CUDA128", "N": "NONE"}[choice]
+    gpu_choice = {"A": "NVIDIA_CUDA128", "B": "AMD", "C": "APPLE", "D": "INTEL", "N": "NONE"}[choice]
     # Save choice to state
     state['gpu_choice'] = gpu_choice
@@ -136,10 +135,8 @@ def get_pytorch_install_command(gpu_choice):
     """Get PyTorch installation command based on GPU choice"""
     base_cmd = f"python -m pip install torch=={TORCH_VERSION} "
-    if gpu_choice == "NVIDIA":
-        return base_cmd + "--index-url https://download.pytorch.org/whl/cu124"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        return "python -m pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
+    if gpu_choice == "NVIDIA_CUDA128":
+        return base_cmd + "--index-url https://download.pytorch.org/whl/cu128"
     elif gpu_choice == "AMD":
         return base_cmd + "--index-url https://download.pytorch.org/whl/rocm6.2.4"
     elif gpu_choice in ["APPLE", "NONE"]:
@@ -157,10 +154,8 @@ def get_pytorch_update_command(gpu_choice):
     """Get PyTorch update command based on GPU choice"""
     base_cmd = f"python -m pip install --upgrade torch=={TORCH_VERSION} "
-    if gpu_choice == "NVIDIA":
-        return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu124"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        return "python -m pip install --upgrade torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
+    if gpu_choice == "NVIDIA_CUDA128":
+        return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu128"
     elif gpu_choice == "AMD":
         return f"{base_cmd} --index-url https://download.pytorch.org/whl/rocm6.2.4"
     elif gpu_choice in ["APPLE", "NONE"]:
@@ -176,16 +171,14 @@ def get_requirements_file(gpu_choice):
     """Get requirements file path based on GPU choice"""
     requirements_base = os.path.join("requirements", "full")
-    if gpu_choice == "AMD":
+    if gpu_choice == "NVIDIA_CUDA128":
+        file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
+    elif gpu_choice == "AMD":
         file_name = f"requirements_amd{'_noavx2' if not cpu_has_avx2() else ''}.txt"
     elif gpu_choice == "APPLE":
         file_name = f"requirements_apple_{'intel' if is_x86_64() else 'silicon'}.txt"
     elif gpu_choice in ["INTEL", "NONE"]:
         file_name = f"requirements_cpu_only{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-    elif gpu_choice == "NVIDIA":
-        file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        file_name = f"requirements_cuda128{'_noavx2' if not cpu_has_avx2() else ''}.txt"
     else:
         raise ValueError(f"Unknown GPU choice: {gpu_choice}")
@@ -331,8 +324,6 @@ def install_webui():
         cmd_flags_file.write("\n--cpu\n")
     # Handle CUDA version display
-    elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA":
-        print("CUDA: 12.4")
     elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA_CUDA128":
         print("CUDA: 12.8")
@@ -368,6 +359,19 @@ def update_requirements(initial_installation=False, pull=True):
         assert_success=True
     )
+    # Check for outdated CUDA 12.4 installs and refuse to update
+    state = load_state()
+    if state.get('gpu_choice') == 'NVIDIA':
+        print_big_message(
+            "Your current installation uses CUDA 12.4, which has been removed.\n"
+            "To update to the new default (CUDA 12.8), a clean installation is required.\n\n"
+            "INSTRUCTIONS:\n"
+            "1. Delete the 'installer_files' folder in your text-generation-webui directory.\n"
+            "2. Run the start script again (e.g., start_windows.bat).\n\n"
+            "This will create a fresh environment with the latest software."
+        )
+        sys.exit(0)
     current_commit = get_current_commit()
     wheels_changed = not os.path.exists(state_file)
     if not wheels_changed:
@@ -404,7 +408,7 @@ def update_requirements(initial_installation=False, pull=True):
     with open(requirements_file, 'r') as f:
         after_pull_whl_lines = [line for line in f if '.whl' in line]
     wheels_changed = wheels_changed or (before_pull_whl_lines != after_pull_whl_lines)

     # Check for changes to installer files
     for file in files_to_check:

@@ -24,7 +24,7 @@ scipy
 sentencepiece
 tensorboard
 transformers==4.55.*
-triton-windows==3.2.0.post19; platform_system == "Windows"
+triton-windows==3.3.1.post19; platform_system == "Windows"
 tqdm
 wandb
@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"


@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"


@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"


@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl


@@ -33,8 +33,8 @@ sse-starlette==1.6.5
 tiktoken
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl


@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken
 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"


@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken
 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"


@@ -1,45 +0,0 @@
-accelerate==1.8.*
-bitsandbytes==0.46.*
-colorama
-datasets
-einops
-fastapi==0.112.4
-gradio==4.37.*
-html2text==2025.4.15
-jinja2==3.1.6
-markdown
-numpy==2.2.*
-pandas
-peft==0.16.*
-Pillow>=9.5.0
-psutil
-pydantic==2.8.2
-PyPDF2==3.0.1
-python-docx==1.1.2
-pyyaml
-requests
-rich
-safetensors==0.5.*
-scipy
-sentencepiece
-tensorboard
-transformers==4.55.*
-triton-windows==3.3.1.post19; platform_system == "Windows"
-tqdm
-wandb
-
-# API
-flask_cloudflared==0.0.14
-sse-starlette==1.6.5
-tiktoken
-
-# CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"


@@ -1,45 +0,0 @@
-accelerate==1.8.*
-bitsandbytes==0.46.*
-colorama
-datasets
-einops
-fastapi==0.112.4
-gradio==4.37.*
-html2text==2025.4.15
-jinja2==3.1.6
-markdown
-numpy==2.2.*
-pandas
-peft==0.16.*
-Pillow>=9.5.0
-psutil
-pydantic==2.8.2
-PyPDF2==3.0.1
-python-docx==1.1.2
-pyyaml
-requests
-rich
-safetensors==0.5.*
-scipy
-sentencepiece
-tensorboard
-transformers==4.55.*
-triton-windows==3.3.1.post19; platform_system == "Windows"
-tqdm
-wandb
-
-# API
-flask_cloudflared==0.0.14
-sse-starlette==1.6.5
-tiktoken
-
-# CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"


@@ -24,7 +24,7 @@ scipy
 sentencepiece
 tensorboard
 transformers==4.55.*
-triton-windows==3.2.0.post19; platform_system == "Windows"
+triton-windows==3.3.1.post19; platform_system == "Windows"
 tqdm
 wandb
@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
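Each change above touches only fields embedded in the wheel filename: the release version (`0.36.0` → `0.37.0`, `0.0.5` → `0.0.6`, `2.7.4.post1` → `2.8.3`) and the local build tag (`cu124.torch2.6.0` → `cu128.torch2.7.0`), which appears to encode the CUDA toolkit and PyTorch build the wheel was compiled against. A hedged sketch of how those filename fields decompose, using `packaging.utils` (an assumption; any PEP 427-aware parser would do):

```python
# Sketch: split one of the wheel filenames above into its PEP 427 fields.
# Assumes `packaging` is installed; the interpretation of the local tag is an
# inference from this repository's naming, not a documented contract.
from packaging.utils import parse_wheel_filename

name, version, build, tags = parse_wheel_filename(
    "llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl"
)

print(name)            # llama-cpp-binaries (normalized project name)
print(version.public)  # 0.37.0 -> the part bumped in this commit
print(version.local)   # cu124avx -> likely CUDA 12.4 with an AVX-only CPU baseline
print([str(t) for t in tags])  # ['py3-none-linux_x86_64']
```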


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"


@@ -18,6 +18,6 @@ sse-starlette==1.6.5
 tiktoken
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"
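Note that the macOS wheels are gated on `platform_release`, which is the Darwin kernel version, not the macOS marketing version. As a reference point (an assumption worth verifying against Apple's version tables), Darwin 22, 23, and 24 correspond to macOS 13, 14, and 15; a small sketch of the mapping these markers rely on:

```python
# Sketch: relate the Darwin kernel release (what the platform_release marker
# sees) to the macOS version. The mapping below is an assumption drawn from
# public Darwin/macOS version tables, not from this repository.
import platform

DARWIN_TO_MACOS = {"22": "13 (Ventura)", "23": "14 (Sonoma)", "24": "15 (Sequoia)"}

if platform.system() == "Darwin":
    major = platform.release().split(".")[0]
    print(f"Darwin {platform.release()} -> macOS {DARWIN_TO_MACOS.get(major, 'unknown')}")
else:
    print("Not macOS; every Darwin-gated wheel above would be skipped here.")
```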


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"


@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
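All of the variant files bumped above share one pattern: plain pins that install everywhere, plus URL lines guarded by markers so each platform pulls exactly one prebuilt binary. A hedged sketch of resolving such a file down to the lines that apply on the current machine (`requirements.txt` is a placeholder path, and bare URL-plus-marker lines are a pip extension, so the marker is split off manually rather than parsed as a full PEP 508 requirement):

```python
# Sketch: print only the requirement lines whose environment marker holds on
# this machine. Assumes `packaging` is installed; "requirements.txt" is a
# placeholder, not a file named in this commit.
from packaging.markers import Marker

def applies_here(line: str) -> bool:
    line = line.strip()
    if not line or line.startswith("#"):
        return False                      # skip blanks and comment lines
    _, _, marker = line.partition(";")    # pip puts the marker after ";"
    return not marker or Marker(marker.strip()).evaluate()

with open("requirements.txt") as f:
    for line in f:
        if applies_here(line):
            print(line.rstrip())
```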