mirror of https://github.com/oobabooga/text-generation-webui.git (synced 2025-12-06 07:12:10 +01:00)
commit cb00db15c9

README.md (118 lines changed)

@@ -2,8 +2,6 @@
 
 A Gradio web UI for Large Language Models.
 
-Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) of text generation.
-
 [Try the Deep Reason extension](https://oobabooga.gumroad.com/l/deep_reason)
 
 | |  |
 
@@ -32,54 +30,15 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 
 ## How to install
 
-#### Option 1: Portable builds (get started in 1 minute)
+#### ✅ Option 1: Portable builds (get started in 1 minute)
 
 No installation needed – just download, unzip and run. All dependencies included.
 
 Compatible with GGUF (llama.cpp) models on Windows, Linux, and macOS.
 
-Download from here: https://github.com/oobabooga/text-generation-webui/releases
+Download from here: **https://github.com/oobabooga/text-generation-webui/releases**
 
-#### Option 2: One-click installer
+#### Option 2: Manual portable install with venv
 
-For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.
-
-1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
-2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
-3. When prompted, select your GPU vendor.
-4. After installation, open `http://127.0.0.1:7860` in your browser.
-
-To restart the web UI later, run the same `start_` script.
-
-To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.
-
-You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).
-
-To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.
-
-<details>
-<summary>
-One-click installer details
-</summary>
-
-### One-click-installer
-
-The script uses Miniforge to set up a Conda environment in the `installer_files` folder.
-
-If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.
-
-* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
-* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
-* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.
-
-</details>
-
-<details>
-<summary>
-Manual portable installation with venv
-</summary>
-
-### Manual portable installation with venv
-
 Very fast setup that should work on any Python 3.9+:
 
@@ -98,7 +57,7 @@ venv\Scripts\activate
 source venv/bin/activate
 
 # Install dependencies (choose appropriate file under requirements/portable for your hardware)
-pip install -r requirements/portable/requirements.txt
+pip install -r requirements/portable/requirements.txt --upgrade
 
 # Launch server (basic command)
 python server.py --portable --api --auto-launch
 
@@ -106,6 +65,39 @@ python server.py --portable --api --auto-launch
 # When done working, deactivate
 deactivate
 ```
 
+#### Option 3: One-click installer
+
+For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.
+
+1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
+2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
+3. When prompted, select your GPU vendor.
+4. After installation, open `http://127.0.0.1:7860` in your browser.
+
+To restart the web UI later, run the same `start_` script.
+
+You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).
+
+To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.
+
+To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.
+
+<details>
+<summary>
+One-click installer details
+</summary>
+
+### One-click-installer
+
+The script uses Miniforge to set up a Conda environment in the `installer_files` folder.
+
+If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.
+
+* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
+* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
+* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.
+
 </details>
 
 <details>
 
@@ -139,19 +131,19 @@ conda activate textgen
 
 | System | GPU | Command |
 |--------|---------|---------|
-| Linux/WSL | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Linux/WSL | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
-| Linux/WSL | CPU only | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu` |
+| Linux/WSL | CPU only | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cpu` |
-| Linux | AMD | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
+| Linux | AMD | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
-| MacOS + MPS | Any | `pip3 install torch==2.6.0` |
+| MacOS + MPS | Any | `pip3 install torch==2.7.1` |
-| Windows | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
+| Windows | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
-| Windows | CPU only | `pip3 install torch==2.6.0` |
+| Windows | CPU only | `pip3 install torch==2.7.1` |
 
 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
 
 If you need `nvcc` to compile some library manually, you will additionally need to install this:
 
 ```
-conda install -y -c "nvidia/label/cuda-12.4.1" cuda
+conda install -y -c "nvidia/label/cuda-12.8.1" cuda
 ```
 
 #### 3. Install the web UI
 
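After running one of the `pip3 install torch==2.7.1` commands in the hunk above, a quick way to confirm that the intended build was picked up is the snippet below. This is a sketch added for reference, not part of the commit; the reported suffix depends on which wheel you chose, and the CUDA check is expected to be False on CPU-only and macOS installs.

```python
import torch

# Expect something like "2.7.1+cu128" (or "+cpu", "+rocm6.2.4") depending on the wheel
print(torch.__version__)
print(torch.version.cuda)          # CUDA toolkit the wheel was built against, or None
print(torch.cuda.is_available())   # True only with a compatible NVIDIA GPU and driver
```
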
@@ -238,13 +230,13 @@ usage: server.py [-h] [--multi-user] [--model MODEL] [--lora LORA [LORA ...]] [-
 [--extensions EXTENSIONS [EXTENSIONS ...]] [--verbose] [--idle-timeout IDLE_TIMEOUT] [--loader LOADER] [--cpu] [--cpu-memory CPU_MEMORY] [--disk] [--disk-cache-dir DISK_CACHE_DIR]
 [--load-in-8bit] [--bf16] [--no-cache] [--trust-remote-code] [--force-safetensors] [--no_use_fast] [--attn-implementation IMPLEMENTATION] [--load-in-4bit] [--use_double_quant]
 [--compute_dtype COMPUTE_DTYPE] [--quant_type QUANT_TYPE] [--flash-attn] [--threads THREADS] [--threads-batch THREADS_BATCH] [--batch-size BATCH_SIZE] [--no-mmap] [--mlock]
-[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--ctx-size N] [--cache-type N]
+[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--mmproj MMPROJ] [--ctx-size N] [--cache-type N]
-[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--gpu-split GPU_SPLIT]
+[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--enable-tp]
-[--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--enable_tp] [--cpp-runner] [--deepspeed] [--nvme-offload-dir NVME_OFFLOAD_DIR]
+[--tp-backend TP_BACKEND] [--gpu-split GPU_SPLIT] [--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--cpp-runner] [--deepspeed]
-[--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen] [--listen-port LISTEN_PORT]
+[--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen]
-[--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
+[--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE]
-[--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY]
+[--ssl-certfile SSL_CERTFILE] [--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY]
-[--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
+[--admin-key ADMIN_KEY] [--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
 
 Text generation web UI
 
@@ -301,6 +293,7 @@ llama.cpp:
 --row-split Split the model by rows across GPUs. This may improve multi-gpu performance.
 --extra-flags EXTRA_FLAGS Extra flags to pass to llama-server. Format: "flag1=value1,flag2,flag3=value3". Example: "override-tensor=exps=CPU"
 --streaming-llm Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.
+--mmproj MMPROJ Path to the mmproj file for vision models.
 
 Context and cache:
 --ctx-size N, --n_ctx N, --max_seq_len N Context size in tokens.
 
@@ -314,6 +307,10 @@ Speculative decoding:
 --device-draft DEVICE_DRAFT Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1
 --ctx-size-draft CTX_SIZE_DRAFT Size of the prompt context for the draft model. If 0, uses the same as the main model.
 
+ExLlamaV3:
+--enable-tp, --enable_tp Enable Tensor Parallelism (TP) to split the model across GPUs.
+--tp-backend TP_BACKEND The backend for tensor parallelism. Valid options: native, nccl. Default: native.
+
 ExLlamaV2:
 --gpu-split GPU_SPLIT Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.
 --autosplit Autosplit the model tensors across the available GPUs. This causes --gpu-split to be ignored.
 
@@ -322,7 +319,6 @@ ExLlamaV2:
 --no_xformers Force xformers to not be used.
 --no_sdpa Force Torch SDPA to not be used.
 --num_experts_per_token N Number of experts to use for generation. Applies to MoE models like Mixtral.
---enable_tp Enable Tensor Parallelism (TP) in ExLlamaV2.
 
 TensorRT-LLM:
 --cpp-runner Use the ModelRunnerCpp runner, which is faster than the default ModelRunner but doesn't support streaming yet.
 
@@ -382,7 +378,7 @@ text-generation-webui
 └── llama-2-13b-chat.Q4_K_M.gguf
 ```
 
-* The remaining model types (like 16-bit Transformers models and EXL2 models) are made of several files and must be placed in a subfolder. Example:
+* The remaining model types (like 16-bit Transformers models and EXL3 models) are made of several files and must be placed in a subfolder. Example:
 
 ```
 text-generation-webui
 
@@ -93,7 +93,10 @@ curl http://127.0.0.1:5000/v1/chat/completions \
 {"type": "image_url", "image_url": {"url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"}}
 ]
 }
-]
+],
+"temperature": 0.6,
+"top_p": 0.95,
+"top_k": 20
 }'
 ```
 
@@ -127,7 +130,10 @@ curl http://127.0.0.1:5000/v1/completions \
 }
 ]
 }
-]
+],
+"temperature": 0.6,
+"top_p": 0.95,
+"top_k": 20
 }'
 ```
 
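For reference, the same request can be made from Python. This is a sketch added alongside the diff, not part of the commit: it assumes the web UI is running with `--api` on the default port 5000 and uses the `requests` package; the endpoint and the `temperature`/`top_p`/`top_k` fields mirror the curl examples in the hunks above.

```python
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"
payload = {
    "messages": [
        {"role": "user", "content": "Describe this image."}
    ],
    # Sampling parameters added to the docs examples in this commit
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
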
@@ -269,7 +269,7 @@ def generate_chat_prompt(user_input, state, **kwargs):
 enhanced_user_msg = user_msg
 
 # Add attachment content if present AND if past attachments are enabled
-if (state.get('include_past_attachments', True) and user_key in metadata and "attachments" in metadata[user_key]):
+if user_key in metadata and "attachments" in metadata[user_key]:
 attachments_text = ""
 image_refs = ""

@@ -277,7 +277,7 @@ def generate_chat_prompt(user_input, state, **kwargs):
 if attachment.get("type") == "image":
 # Add image reference for multimodal models
 image_refs += "<__media__>"
-else:
+elif state.get('include_past_attachments', True):
 # Handle text/PDF attachments
 filename = attachment.get("name", "file")
 content = attachment.get("content", "")
 
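The net effect of these two hunks is that image attachments now always contribute a `<__media__>` placeholder, while text/PDF attachment bodies are only re-sent when `include_past_attachments` is enabled. A standalone sketch of that branching (function and formatting are illustrative, not the project's actual code):

```python
def summarize_attachments(attachments, include_past_attachments=True):
    """Return (media_placeholders, attachment_text) for one past message."""
    image_refs = ""
    attachments_text = ""
    for attachment in attachments:
        if attachment.get("type") == "image":
            # Image references are kept regardless of the setting
            image_refs += "<__media__>"
        elif include_past_attachments:
            # Text/PDF content is only included when past attachments are enabled
            name = attachment.get("name", "file")
            content = attachment.get("content", "")
            attachments_text += f"\n[{name}]\n{content}\n"
    return image_refs, attachments_text

# With the option disabled, only the image placeholder survives
refs, text = summarize_attachments(
    [{"type": "image"}, {"type": "text", "name": "notes.txt", "content": "hello"}],
    include_past_attachments=False,
)
print(refs, repr(text))  # <__media__> ''
```
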
@@ -91,6 +91,11 @@ class Exllamav3Model:
 split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
 load_params['use_per_device'] = split
 
+# Tensor-parallelism
+if shared.args.enable_tp:
+load_params['tensor_p'] = True
+load_params['tp_backend'] = shared.args.tp_backend
+
 model.load(**load_params)
 tokenizer = Tokenizer.from_config(config)

@@ -177,9 +182,6 @@ class Exllamav3Model:
 Process all possible image inputs and return modified prompt + embeddings.
 Returns: (processed_prompt, image_embeddings)
 """
-if not self.is_multimodal():
-return prompt, []
-
 # Collect images from various sources using shared utilities
 pil_images = []

@@ -234,8 +236,12 @@ class Exllamav3Model:
 """
 Generate text with streaming using native ExLlamaV3 API
 """
-# Process images and modify prompt (ExLlamaV3-specific)
-prompt, image_embeddings = self._process_images_for_generation(prompt, state)
+if shared.is_multimodal:
+# Process images and modify prompt (ExLlamaV3-specific)
+prompt, image_embeddings = self._process_images_for_generation(prompt, state)
+else:
+image_embeddings = []
 
 # Greedy decoding is a special case
 if state['temperature'] == 0:
 
@@ -74,6 +74,11 @@ class Exllamav3HF(PreTrainedModel, GenerationMixin):
 split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
 load_params['use_per_device'] = split
 
+# Tensor-parallelism
+if shared.args.enable_tp:
+load_params['tensor_p'] = True
+load_params['tp_backend'] = shared.args.tp_backend
+
 self.ex_model.load(**load_params)
 self.past_seq = None
 self.max_tokens = max_tokens
 
@@ -306,6 +306,9 @@ def process_markdown_content(string):
 # Convert to HTML using markdown
 html_output = markdown.markdown(result, extensions=['fenced_code', 'tables', SaneListExtension()])
 
+# Remove extra newlines before </code>
+html_output = re.sub(r'\s*</code>', '</code>', html_output)
+
 # Unescape code blocks
 pattern = re.compile(r'<code[^>]*>(.*?)</code>', re.DOTALL)
 html_output = pattern.sub(lambda x: html.unescape(x.group()), html_output)
 
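The new substitution strips any whitespace (including trailing newlines) that sits immediately before a closing `</code>` tag. A quick check of the pattern in isolation:

```python
import re

html_output = "<pre><code>print('hi')\n\n</code></pre>"
html_output = re.sub(r'\s*</code>', '</code>', html_output)
print(html_output)  # <pre><code>print('hi')</code></pre>
```
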
@@ -8,6 +8,7 @@ import sys
 import threading
 import time
 from pathlib import Path
+from typing import Any, List
 
 import llama_cpp_binaries
 import requests

@@ -129,10 +130,10 @@ class LlamaServer:
 
 return payload
 
-def generate_with_streaming(self, prompt, state):
-url = f"http://127.0.0.1:{self.port}/completion"
-payload = self.prepare_payload(state)
+def _process_images_for_generation(self, state: dict) -> List[Any]:
+"""
+Process all possible image inputs and return PIL images
+"""
 pil_images = []
 # Source 1: Web UI (from chatbot_wrapper)
 if 'image_attachments' in state and state['image_attachments']:

@@ -144,6 +145,21 @@ class LlamaServer:
 elif 'raw_images' in state and state['raw_images']:
 pil_images.extend(state.get('raw_images', []))
 
+return pil_images
+
+def is_multimodal(self) -> bool:
+"""Check if this model supports multimodal input."""
+return shared.args.mmproj not in [None, 'None']
+
+def generate_with_streaming(self, prompt, state):
+url = f"http://127.0.0.1:{self.port}/completion"
+payload = self.prepare_payload(state)
+
+pil_images = []
+
+if shared.is_multimodal:
+pil_images = self._process_images_for_generation(state)
+
 if pil_images:
 # Multimodal case
 IMAGE_TOKEN_COST_ESTIMATE = 600 # A safe, conservative estimate per image
 
@@ -56,6 +56,8 @@ loaders_and_params = OrderedDict({
 'cfg_cache',
 'trust_remote_code',
 'no_use_fast',
+'enable_tp',
+'tp_backend',
 ],
 'ExLlamav3': [
 'ctx_size',

@@ -65,6 +67,8 @@ loaders_and_params = OrderedDict({
 'draft_max',
 'ctx_size_draft',
 'speculative_decoding_accordion',
+'enable_tp',
+'tp_backend',
 ],
 'ExLlamav2_HF': [
 'ctx_size',
 
@@ -55,6 +55,10 @@ def load_model(model_name, loader=None):
 if loader.lower().startswith('exllama') or loader.lower().startswith('tensorrt') or loader == 'llama.cpp':
 shared.settings['truncation_length'] = shared.args.ctx_size
 
+shared.is_multimodal = False
+if loader.lower() in ('exllamav3', 'llama.cpp'):
+shared.is_multimodal = model.is_multimodal()
+
 logger.info(f"Loaded \"{model_name}\" in {(time.time()-t0):.2f} seconds.")
 logger.info(f"LOADER: \"{loader}\"")
 logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")

@@ -124,10 +128,12 @@ def unload_model(keep_model_name=False):
 if shared.model is None:
 return
 
-is_llamacpp = (shared.model.__class__.__name__ == 'LlamaServer')
-if shared.args.loader in ['ExLlamav3_HF', 'ExLlamav3']:
+model_class_name = shared.model.__class__.__name__
+is_llamacpp = (model_class_name == 'LlamaServer')
+
+if model_class_name in ['Exllamav3Model', 'Exllamav3HF']:
 shared.model.unload()
-elif shared.args.loader in ['ExLlamav2_HF', 'ExLlamav2'] and hasattr(shared.model, 'unload'):
+elif model_class_name in ['Exllamav2Model', 'Exllamav2HF'] and hasattr(shared.model, 'unload'):
 shared.model.unload()
 
 shared.model = shared.tokenizer = None
 
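Taken together with the loader hunks above, `shared.is_multimodal` acts as a module-level flag: it is reset on every `load_model()` call, set from the loader's own `is_multimodal()` check for ExLlamaV3 and llama.cpp, and consulted later at generation time. A minimal sketch of that flow, using simplified stand-ins rather than the project's real classes and module state:

```python
class FakeLlamaServer:
    def __init__(self, mmproj=None):
        self.mmproj = mmproj

    def is_multimodal(self):
        # Mirrors the new LlamaServer check: multimodal only when an mmproj file is set
        return self.mmproj not in [None, 'None']

is_multimodal = False  # stands in for shared.is_multimodal

def load_model(model, loader):
    global is_multimodal
    is_multimodal = False
    if loader.lower() in ('exllamav3', 'llama.cpp'):
        is_multimodal = model.is_multimodal()
    return model

load_model(FakeLlamaServer(mmproj="some-vision-model.mmproj.gguf"), loader="llama.cpp")
print(is_multimodal)  # True -> image inputs will be collected during generation
```
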
@@ -251,7 +251,7 @@ def apply_model_settings_to_state(model, state):
 model_settings = get_model_metadata(model)
 if 'loader' in model_settings:
 loader = model_settings.pop('loader')
-if not (loader == 'ExLlamav2_HF' and state['loader'] in ['ExLlamav2']):
+if not ((loader == 'ExLlamav2_HF' and state['loader'] == 'ExLlamav2') or (loader == 'ExLlamav3_HF' and state['loader'] == 'ExLlamav3')):
 state['loader'] = loader
 
 for k in model_settings:
 
@@ -16,6 +16,7 @@ model = None
 tokenizer = None
 model_name = 'None'
 is_seq2seq = False
+is_multimodal = False
 model_dirty_from_training = False
 lora_names = []
 
@@ -100,6 +101,11 @@ group.add_argument('--gpu-layers-draft', type=int, default=256, help='Number of
 group.add_argument('--device-draft', type=str, default=None, help='Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1')
 group.add_argument('--ctx-size-draft', type=int, default=0, help='Size of the prompt context for the draft model. If 0, uses the same as the main model.')
 
+# ExLlamaV3
+group = parser.add_argument_group('ExLlamaV3')
+group.add_argument('--enable-tp', '--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) to split the model across GPUs.')
+group.add_argument('--tp-backend', type=str, default='native', help='The backend for tensor parallelism. Valid options: native, nccl. Default: native.')
+
 # ExLlamaV2
 group = parser.add_argument_group('ExLlamaV2')
 group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.')

@@ -109,7 +115,6 @@ group.add_argument('--no_flash_attn', action='store_true', help='Force flash-att
 group.add_argument('--no_xformers', action='store_true', help='Force xformers to not be used.')
 group.add_argument('--no_sdpa', action='store_true', help='Force Torch SDPA to not be used.')
 group.add_argument('--num_experts_per_token', type=int, default=2, metavar='N', help='Number of experts to use for generation. Applies to MoE models like Mixtral.')
-group.add_argument('--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) in ExLlamaV2.')
 
 # TensorRT-LLM
 group = parser.add_argument_group('TensorRT-LLM')
 
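As a self-contained illustration of the new flags (a sketch under assumptions; the real code passes `shared.args` into the ExLlamaV3 loaders shown earlier in this commit), this is how `--enable-tp`/`--tp-backend` translate into the extra `load()` parameters:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--enable-tp', '--enable_tp', action='store_true')
parser.add_argument('--tp-backend', type=str, default='native')  # 'native' or 'nccl'
args = parser.parse_args(['--enable-tp', '--tp-backend', 'nccl'])

load_params = {}
if args.enable_tp:
    load_params['tensor_p'] = True
    load_params['tp_backend'] = args.tp_backend

print(load_params)  # {'tensor_p': True, 'tp_backend': 'nccl'}
```
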
@@ -155,6 +155,7 @@ def list_model_elements():
 'bf16',
 'autosplit',
 'enable_tp',
+'tp_backend',
 'no_flash_attn',
 'no_xformers',
 'no_sdpa',
 
@@ -46,6 +46,8 @@ def create_ui():
 shared.gradio['gpu_split'] = gr.Textbox(label='gpu-split', info='Comma-separated list of VRAM (in GB) to use per GPU. Example: 20,7,7')
 shared.gradio['attn_implementation'] = gr.Dropdown(label="attn-implementation", choices=['sdpa', 'eager', 'flash_attention_2'], value=shared.args.attn_implementation, info='Attention implementation.')
 shared.gradio['cache_type'] = gr.Dropdown(label="cache-type", choices=['fp16', 'q8_0', 'q4_0', 'fp8', 'q8', 'q7', 'q6', 'q5', 'q4', 'q3', 'q2'], value=shared.args.cache_type, allow_custom_value=True, info='Valid options: llama.cpp - fp16, q8_0, q4_0; ExLlamaV2 - fp16, fp8, q8, q6, q4; ExLlamaV3 - fp16, q2 to q8. For ExLlamaV3, you can type custom combinations for separate k/v bits (e.g. q4_q8).')
+shared.gradio['tp_backend'] = gr.Dropdown(label="tp-backend", choices=['native', 'nccl'], value=shared.args.tp_backend, info='The backend for tensor parallelism.')
+
 with gr.Column():
 shared.gradio['vram_info'] = gr.HTML(value=get_initial_vram_info())
 shared.gradio['flash_attn'] = gr.Checkbox(label="flash-attn", value=shared.args.flash_attn, info='Use flash-attention.')

@@ -54,7 +56,7 @@ def create_ui():
 shared.gradio['load_in_4bit'] = gr.Checkbox(label="load-in-4bit", value=shared.args.load_in_4bit)
 shared.gradio['use_double_quant'] = gr.Checkbox(label="use_double_quant", value=shared.args.use_double_quant, info='Used by load-in-4bit.')
 shared.gradio['autosplit'] = gr.Checkbox(label="autosplit", value=shared.args.autosplit, info='Automatically split the model tensors across the available GPUs.')
-shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable Tensor Parallelism (TP).')
+shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable tensor parallelism (TP).')
 shared.gradio['cpp_runner'] = gr.Checkbox(label="cpp-runner", value=shared.args.cpp_runner, info='Enable inference with ModelRunnerCpp, which is faster than the default ModelRunner.')
 shared.gradio['trust_remote_code'] = gr.Checkbox(label="trust-remote-code", value=shared.args.trust_remote_code, info='Set trust_remote_code=True while loading the tokenizer/model. To enable this option, start the web UI with the --trust-remote-code flag.', interactive=shared.args.trust_remote_code)
 shared.gradio['tensorrt_llm_info'] = gr.Markdown('* TensorRT-LLM has to be installed manually in a separate Python 3.10 environment at the moment. For a guide, consult the description of [this PR](https://github.com/oobabooga/text-generation-webui/pull/5715). \n\n* `ctx_size` is only used when `cpp-runner` is checked.\n\n* `cpp_runner` does not support streaming at the moment.')
 
@@ -1,6 +1,8 @@
 import concurrent.futures
 import html
+import random
 import re
+import urllib.request
 from concurrent.futures import as_completed
 from datetime import datetime
 from urllib.parse import quote_plus

@@ -50,16 +52,21 @@ def download_web_page(url, timeout=10):
 def perform_web_search(query, num_pages=3, max_workers=5, timeout=10):
 """Perform web search and return results with content"""
 try:
-# Use DuckDuckGo HTML search endpoint
 search_url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
 
-headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
-response = requests.get(search_url, headers=headers, timeout=timeout)
-response.raise_for_status()
+agents = [
+"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
+]
+
+response_text = ""
+req = urllib.request.Request(search_url, headers={'User-Agent': random.choice(agents)})
+with urllib.request.urlopen(req, timeout=timeout) as response:
+response_text = response.read().decode('utf-8')
 
 # Extract results with regex
-titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
-urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
+titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
+urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
 
 # Prepare download tasks
 download_tasks = []
 
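The web search now goes through the standard library instead of `requests`, picking a random User-Agent for each query. Below is a trimmed, runnable sketch of the same approach (the function name is mine; error handling and the result-download stage are omitted, and the URL/regex are copied from the hunk above):

```python
import random
import re
import urllib.request
from urllib.parse import quote_plus

def search_titles(query, timeout=10):
    search_url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    ]
    # Fetch the DuckDuckGo HTML endpoint with a randomly chosen User-Agent
    req = urllib.request.Request(search_url, headers={'User-Agent': random.choice(agents)})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        response_text = response.read().decode('utf-8')

    # Same regex as the module: anchor tags carrying the result__a class
    return re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)

# print(search_titles("text-generation-webui"))  # requires network access
```
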
one_click.py (44 lines changed)

@@ -16,7 +16,7 @@ import sys
 # os.environ["HCC_AMDGPU_TARGET"] = 'gfx1030'
 
 # Define the required versions
-TORCH_VERSION = "2.6.0"
+TORCH_VERSION = "2.7.1"
 PYTHON_VERSION = "3.11"
 LIBSTDCXX_VERSION_LINUX = "12.1.0"
 
@@ -113,17 +113,16 @@ def get_gpu_choice():
 choice = get_user_choice(
 "What is your GPU?",
 {
-'A': 'NVIDIA - CUDA 12.4',
+'A': 'NVIDIA',
 'B': 'AMD - Linux/macOS only, requires ROCm 6.2.4',
 'C': 'Apple M Series',
 'D': 'Intel Arc (beta)',
-'E': 'NVIDIA - CUDA 12.8',
 'N': 'CPU mode'
 },
 )
 
 # Convert choice to GPU name
-gpu_choice = {"A": "NVIDIA", "B": "AMD", "C": "APPLE", "D": "INTEL", "E": "NVIDIA_CUDA128", "N": "NONE"}[choice]
+gpu_choice = {"A": "NVIDIA_CUDA128", "B": "AMD", "C": "APPLE", "D": "INTEL", "N": "NONE"}[choice]
 
 # Save choice to state
 state['gpu_choice'] = gpu_choice
 
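Note that this hunk changes what the installer's letter choices mean, including for automated installs driven by the `GPU_CHOICE` environment variable mentioned in the README (e.g. `GPU_CHOICE=A ./start_linux.sh`): option A now resolves to the CUDA 12.8 build and the separate E entry is gone. A small before/after sketch of the mapping, for reference only:

```python
# Letter -> internal GPU choice used by the installer
old_mapping = {"A": "NVIDIA", "B": "AMD", "C": "APPLE", "D": "INTEL", "E": "NVIDIA_CUDA128", "N": "NONE"}
new_mapping = {"A": "NVIDIA_CUDA128", "B": "AMD", "C": "APPLE", "D": "INTEL", "N": "NONE"}

print(old_mapping["A"], "->", new_mapping["A"])  # NVIDIA -> NVIDIA_CUDA128
print("E" in new_mapping)                        # False: CUDA 12.8 is now the default NVIDIA path
```
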
@@ -136,10 +135,8 @@ def get_pytorch_install_command(gpu_choice):
 """Get PyTorch installation command based on GPU choice"""
 base_cmd = f"python -m pip install torch=={TORCH_VERSION} "
 
-if gpu_choice == "NVIDIA":
+if gpu_choice == "NVIDIA_CUDA128":
-return base_cmd + "--index-url https://download.pytorch.org/whl/cu124"
+return base_cmd + "--index-url https://download.pytorch.org/whl/cu128"
-elif gpu_choice == "NVIDIA_CUDA128":
-return "python -m pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
 elif gpu_choice == "AMD":
 return base_cmd + "--index-url https://download.pytorch.org/whl/rocm6.2.4"
 elif gpu_choice in ["APPLE", "NONE"]:
 
@@ -157,10 +154,8 @@ def get_pytorch_update_command(gpu_choice):
 """Get PyTorch update command based on GPU choice"""
 base_cmd = f"python -m pip install --upgrade torch=={TORCH_VERSION} "
 
-if gpu_choice == "NVIDIA":
+if gpu_choice == "NVIDIA_CUDA128":
-return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu124"
+return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu128"
-elif gpu_choice == "NVIDIA_CUDA128":
-return "python -m pip install --upgrade torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
 elif gpu_choice == "AMD":
 return f"{base_cmd} --index-url https://download.pytorch.org/whl/rocm6.2.4"
 elif gpu_choice in ["APPLE", "NONE"]:
 
@@ -176,16 +171,14 @@ def get_requirements_file(gpu_choice):
 """Get requirements file path based on GPU choice"""
 requirements_base = os.path.join("requirements", "full")
 
-if gpu_choice == "AMD":
+if gpu_choice == "NVIDIA_CUDA128":
+file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
+elif gpu_choice == "AMD":
 file_name = f"requirements_amd{'_noavx2' if not cpu_has_avx2() else ''}.txt"
 elif gpu_choice == "APPLE":
 file_name = f"requirements_apple_{'intel' if is_x86_64() else 'silicon'}.txt"
 elif gpu_choice in ["INTEL", "NONE"]:
 file_name = f"requirements_cpu_only{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-elif gpu_choice == "NVIDIA":
-file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-elif gpu_choice == "NVIDIA_CUDA128":
-file_name = f"requirements_cuda128{'_noavx2' if not cpu_has_avx2() else ''}.txt"
 else:
 raise ValueError(f"Unknown GPU choice: {gpu_choice}")
 
@@ -331,8 +324,6 @@ def install_webui():
 cmd_flags_file.write("\n--cpu\n")
 
 # Handle CUDA version display
-elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA":
-print("CUDA: 12.4")
 elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA_CUDA128":
 print("CUDA: 12.8")
 
@@ -368,6 +359,19 @@ def update_requirements(initial_installation=False, pull=True):
 assert_success=True
 )
 
+# Check for outdated CUDA 12.4 installs and refuse to update
+state = load_state()
+if state.get('gpu_choice') == 'NVIDIA':
+print_big_message(
+"Your current installation uses CUDA 12.4, which has been removed.\n"
+"To update to the new default (CUDA 12.8), a clean installation is required.\n\n"
+"INSTRUCTIONS:\n"
+"1. Delete the 'installer_files' folder in your text-generation-webui directory.\n"
+"2. Run the start script again (e.g., start_windows.bat).\n\n"
+"This will create a fresh environment with the latest software."
+)
+sys.exit(0)
+
 current_commit = get_current_commit()
 wheels_changed = not os.path.exists(state_file)
 if not wheels_changed:

@@ -404,7 +408,7 @@ def update_requirements(initial_installation=False, pull=True):
 with open(requirements_file, 'r') as f:
 after_pull_whl_lines = [line for line in f if '.whl' in line]
 
 wheels_changed = wheels_changed or (before_pull_whl_lines != after_pull_whl_lines)
 
 # Check for changes to installer files
 for file in files_to_check:
 
@@ -24,7 +24,7 @@ scipy
 sentencepiece
 tensorboard
 transformers==4.55.*
-triton-windows==3.2.0.post19; platform_system == "Windows"
+triton-windows==3.3.1.post19; platform_system == "Windows"
 tqdm
 wandb
 
@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 
@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 
 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
 
@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 
 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
 
@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl
 
@@ -33,8 +33,8 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl
 
|
@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"

@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"

@@ -1,45 +0,0 @@
-accelerate==1.8.*
-bitsandbytes==0.46.*
-colorama
-datasets
-einops
-fastapi==0.112.4
-gradio==4.37.*
-html2text==2025.4.15
-jinja2==3.1.6
-markdown
-numpy==2.2.*
-pandas
-peft==0.16.*
-Pillow>=9.5.0
-psutil
-pydantic==2.8.2
-PyPDF2==3.0.1
-python-docx==1.1.2
-pyyaml
-requests
-rich
-safetensors==0.5.*
-scipy
-sentencepiece
-tensorboard
-transformers==4.55.*
-triton-windows==3.3.1.post19; platform_system == "Windows"
-tqdm
-wandb
-
-# API
-flask_cloudflared==0.0.14
-sse-starlette==1.6.5
-tiktoken
-
-# CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"

@@ -1,45 +0,0 @@
-accelerate==1.8.*
-bitsandbytes==0.46.*
-colorama
-datasets
-einops
-fastapi==0.112.4
-gradio==4.37.*
-html2text==2025.4.15
-jinja2==3.1.6
-markdown
-numpy==2.2.*
-pandas
-peft==0.16.*
-Pillow>=9.5.0
-psutil
-pydantic==2.8.2
-PyPDF2==3.0.1
-python-docx==1.1.2
-pyyaml
-requests
-rich
-safetensors==0.5.*
-scipy
-sentencepiece
-tensorboard
-transformers==4.55.*
-triton-windows==3.3.1.post19; platform_system == "Windows"
-tqdm
-wandb
-
-# API
-flask_cloudflared==0.0.14
-sse-starlette==1.6.5
-tiktoken
-
-# CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"

@@ -24,7 +24,7 @@ scipy
 sentencepiece
 tensorboard
 transformers==4.55.*
-triton-windows==3.2.0.post19; platform_system == "Windows"
+triton-windows==3.3.1.post19; platform_system == "Windows"
 tqdm
 wandb
 

@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"

@@ -18,6 +18,6 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"

@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.36.0/llama_cpp_binaries-0.36.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
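Each requirement line in these hunks attaches a PEP 508 environment marker after the `;` (keys such as `platform_system`, `platform_machine`, `platform_release`, and `python_version`), so pip keeps exactly one wheel per platform and skips the rest. Below is a minimal sketch of that selection logic using the third-party `packaging` library (the same one pip vendors internally); the marker string is copied from one of the llama.cpp wheel lines in this commit, and the Windows override environment is purely illustrative.

```python
# Minimal sketch: how a PEP 508 environment marker from these requirements
# files is evaluated. Assumes the "packaging" library is available
# (pip install packaging).
from packaging.markers import Marker, default_environment

# Marker copied from one of the llama.cpp wheel lines above.
marker = Marker(
    'platform_system == "Linux" and platform_machine == "x86_64" '
    'and python_version == "3.11"'
)

# Evaluate against the running interpreter (what pip effectively does).
print(marker.evaluate())

# Evaluate against a hypothetical environment, e.g. to check that a
# Windows user would not receive this Linux wheel.
env = dict(default_environment(), platform_system="Windows", platform_machine="AMD64")
print(marker.evaluate(env))  # False
```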