Mirror of https://github.com/oobabooga/text-generation-webui.git

Merge branch 'main' into main
Commit 3f1f0f0f7f

README.md (119 changes)

@@ -2,8 +2,6 @@
A Gradio web UI for Large Language Models.

Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) of text generation.

[Try the Deep Reason extension](https://oobabooga.gumroad.com/l/deep_reason)

@@ -16,6 +14,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
- Easy setup: Choose between **portable builds** (zero setup, just unzip and run) for GGUF models on Windows/Linux/macOS, or the one-click installer that creates a self-contained `installer_files` directory.
- 100% offline and private, with zero telemetry, external resources, or remote update requests.
- **File attachments**: Upload text files, PDF documents, and .docx documents to talk about their contents.
- **Vision (multimodal models)**: Attach images to messages for visual understanding ([tutorial](https://github.com/oobabooga/text-generation-webui/wiki/Multimodal-Tutorial)).
- **Web search**: Optionally search the internet with LLM-generated queries to add context to the conversation.
- Aesthetic UI with dark and light themes.
- Syntax highlighting for code blocks and LaTeX rendering for mathematical expressions.

@@ -31,54 +30,15 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
## How to install

#### Option 1: Portable builds (get started in 1 minute)
#### ✅ Option 1: Portable builds (get started in 1 minute)

No installation needed – just download, unzip and run. All dependencies included.

Compatible with GGUF (llama.cpp) models on Windows, Linux, and macOS.

Download from here: https://github.com/oobabooga/text-generation-webui/releases
Download from here: **https://github.com/oobabooga/text-generation-webui/releases**

#### Option 2: One-click installer

For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.

1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
3. When prompted, select your GPU vendor.
4. After installation, open `http://127.0.0.1:7860` in your browser.

To restart the web UI later, run the same `start_` script.

To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.

You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).

To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.

<details>
<summary>
One-click installer details
</summary>

### One-click-installer

The script uses Miniforge to set up a Conda environment in the `installer_files` folder.

If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.

* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.

</details>

<details>
<summary>
Manual portable installation with venv
</summary>

### Manual portable installation with venv
#### Option 2: Manual portable install with venv

Very fast setup that should work on any Python 3.9+:

@@ -97,7 +57,7 @@ venv\Scripts\activate
source venv/bin/activate

# Install dependencies (choose appropriate file under requirements/portable for your hardware)
pip install -r requirements/portable/requirements.txt
pip install -r requirements/portable/requirements.txt --upgrade

# Launch server (basic command)
python server.py --portable --api --auto-launch

@@ -105,6 +65,39 @@ python server.py --portable --api --auto-launch

# When done working, deactivate
deactivate
```
#### Option 3: One-click installer

For users who need additional backends (ExLlamaV3, Transformers) or extensions (TTS, voice input, translation, etc). Requires ~10GB disk space and downloads PyTorch.

1. Clone the repository, or [download its source code](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) and extract it.
2. Run the startup script for your OS: `start_windows.bat`, `start_linux.sh`, or `start_macos.sh`.
3. When prompted, select your GPU vendor.
4. After installation, open `http://127.0.0.1:7860` in your browser.

To restart the web UI later, run the same `start_` script.

You can pass command-line flags directly (e.g., `./start_linux.sh --help`), or add them to `user_data/CMD_FLAGS.txt` (e.g., `--api` to enable the API).

To update, run the update script for your OS: `update_wizard_windows.bat`, `update_wizard_linux.sh`, or `update_wizard_macos.sh`.

To reinstall with a fresh Python environment, delete the `installer_files` folder and run the `start_` script again.

<details>
<summary>
One-click installer details
</summary>

### One-click-installer

The script uses Miniforge to set up a Conda environment in the `installer_files` folder.

If you ever need to install something manually in the `installer_files` environment, you can launch an interactive shell using the cmd script: `cmd_linux.sh`, `cmd_windows.bat`, or `cmd_macos.sh`.

* There is no need to run any of those scripts (`start_`, `update_wizard_`, or `cmd_`) as admin/root.
* To install requirements for extensions, it is recommended to use the update wizard script with the "Install/update extensions requirements" option. At the end, this script will install the main requirements for the project to make sure that they take precedence in case of version conflicts.
* For automated installation, you can use the `GPU_CHOICE`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh`.

</details>

<details>

@@ -138,19 +131,19 @@ conda activate textgen
| System | GPU | Command |
|--------|---------|---------|
| Linux/WSL | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
| Linux/WSL | CPU only | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
| MacOS + MPS | Any | `pip3 install torch==2.6.0` |
| Windows | NVIDIA | `pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124` |
| Windows | CPU only | `pip3 install torch==2.6.0` |
| Linux/WSL | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
| Linux/WSL | CPU only | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/rocm6.2.4` |
| MacOS + MPS | Any | `pip3 install torch==2.7.1` |
| Windows | NVIDIA | `pip3 install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128` |
| Windows | CPU only | `pip3 install torch==2.7.1` |

The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
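
If you want to confirm the PyTorch install before moving on, an optional sanity check from Python (a sketch, not part of the official steps) is:

```python
# Optional sanity check after installing PyTorch (not part of the official steps).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())          # NVIDIA / ROCm builds
print("MPS available:", torch.backends.mps.is_available())   # macOS Metal backend
```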

If you need `nvcc` to compile some library manually, you will additionally need to install this:

```
conda install -y -c "nvidia/label/cuda-12.4.1" cuda
conda install -y -c "nvidia/label/cuda-12.8.1" cuda
```

#### 3. Install the web UI

@@ -237,13 +230,13 @@ usage: server.py [-h] [--multi-user] [--model MODEL] [--lora LORA [LORA ...]] [-
[--extensions EXTENSIONS [EXTENSIONS ...]] [--verbose] [--idle-timeout IDLE_TIMEOUT] [--loader LOADER] [--cpu] [--cpu-memory CPU_MEMORY] [--disk] [--disk-cache-dir DISK_CACHE_DIR]
|
||||
[--load-in-8bit] [--bf16] [--no-cache] [--trust-remote-code] [--force-safetensors] [--no_use_fast] [--attn-implementation IMPLEMENTATION] [--load-in-4bit] [--use_double_quant]
|
||||
[--compute_dtype COMPUTE_DTYPE] [--quant_type QUANT_TYPE] [--flash-attn] [--threads THREADS] [--threads-batch THREADS_BATCH] [--batch-size BATCH_SIZE] [--no-mmap] [--mlock]
|
||||
[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--ctx-size N] [--cache-type N]
|
||||
[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--gpu-split GPU_SPLIT]
|
||||
[--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--enable_tp] [--cpp-runner] [--deepspeed] [--nvme-offload-dir NVME_OFFLOAD_DIR]
|
||||
[--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen] [--listen-port LISTEN_PORT]
|
||||
[--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
|
||||
[--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY]
|
||||
[--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
|
||||
[--gpu-layers N] [--tensor-split TENSOR_SPLIT] [--numa] [--no-kv-offload] [--row-split] [--extra-flags EXTRA_FLAGS] [--streaming-llm] [--mmproj MMPROJ] [--ctx-size N] [--cache-type N]
|
||||
[--model-draft MODEL_DRAFT] [--draft-max DRAFT_MAX] [--gpu-layers-draft GPU_LAYERS_DRAFT] [--device-draft DEVICE_DRAFT] [--ctx-size-draft CTX_SIZE_DRAFT] [--enable-tp]
|
||||
[--tp-backend TP_BACKEND] [--gpu-split GPU_SPLIT] [--autosplit] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa] [--num_experts_per_token N] [--cpp-runner] [--deepspeed]
|
||||
[--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen]
|
||||
[--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE]
|
||||
[--ssl-certfile SSL_CERTFILE] [--subpath SUBPATH] [--old-colors] [--portable] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY]
|
||||
[--admin-key ADMIN_KEY] [--api-enable-ipv6] [--api-disable-ipv4] [--nowebui]
|
||||
|
||||
Text generation web UI
|
||||
|
||||
|
|
@ -300,6 +293,7 @@ llama.cpp:
|
|||
--row-split Split the model by rows across GPUs. This may improve multi-gpu performance.
|
||||
--extra-flags EXTRA_FLAGS Extra flags to pass to llama-server. Format: "flag1=value1,flag2,flag3=value3". Example: "override-tensor=exps=CPU"
|
||||
--streaming-llm Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.
|
||||
--mmproj MMPROJ Path to the mmproj file for vision models.
|
||||
|
||||
Context and cache:
|
||||
--ctx-size N, --n_ctx N, --max_seq_len N Context size in tokens.
|
||||
|
|
@ -313,6 +307,10 @@ Speculative decoding:
|
|||
--device-draft DEVICE_DRAFT Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1
|
||||
--ctx-size-draft CTX_SIZE_DRAFT Size of the prompt context for the draft model. If 0, uses the same as the main model.
|
||||
|
||||
ExLlamaV3:
|
||||
--enable-tp, --enable_tp Enable Tensor Parallelism (TP) to split the model across GPUs.
|
||||
--tp-backend TP_BACKEND The backend for tensor parallelism. Valid options: native, nccl. Default: native.
|
||||
|
||||
ExLlamaV2:
|
||||
--gpu-split GPU_SPLIT Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.
|
||||
--autosplit Autosplit the model tensors across the available GPUs. This causes --gpu-split to be ignored.
|
||||
|
|
@ -321,7 +319,6 @@ ExLlamaV2:
|
|||
--no_xformers Force xformers to not be used.
|
||||
--no_sdpa Force Torch SDPA to not be used.
|
||||
--num_experts_per_token N Number of experts to use for generation. Applies to MoE models like Mixtral.
|
||||
--enable_tp Enable Tensor Parallelism (TP) in ExLlamaV2.
|
||||
|
||||
TensorRT-LLM:
|
||||
--cpp-runner Use the ModelRunnerCpp runner, which is faster than the default ModelRunner but doesn't support streaming yet.
|
||||
|
|
@ -381,7 +378,7 @@ text-generation-webui
|
|||
└── llama-2-13b-chat.Q4_K_M.gguf
|
||||
```
|
||||
|
||||
* The remaining model types (like 16-bit Transformers models and EXL2 models) are made of several files and must be placed in a subfolder. Example:
|
||||
* The remaining model types (like 16-bit Transformers models and EXL3 models) are made of several files and must be placed in a subfolder. Example:
|
||||
|
||||
```
|
||||
text-generation-webui
|
||||
|
|
|
|||
|
|
@ -99,3 +99,9 @@
|
|||
.message-body p em {
|
||||
color: rgb(110 110 110) !important;
|
||||
}
|
||||
.editing-textarea {
|
||||
width: max(30rem) !important;
|
||||
}
|
||||
.circle-you + .text .edit-control-button, .circle-you + .text .editing-textarea {
|
||||
color: #000 !important;
|
||||
}
|
||||
|
|
|
|||
|
|
@ -13,7 +13,7 @@
|
|||
line-height: 28px !important;
|
||||
}
|
||||
|
||||
.dark .chat .message-body :is(p, li, q, h1, h2, h3, h4, h5, h6) {
|
||||
.dark .chat .message-body :is(p, li, q, em, h1, h2, h3, h4, h5, h6) {
|
||||
color: #d1d5db !important;
|
||||
}
|
||||
|
||||
|
|
|
|||
css/main.css (14 changes)
|
|
@ -1577,6 +1577,20 @@ strong {
|
|||
margin-top: 4px;
|
||||
}
|
||||
|
||||
.image-attachment {
|
||||
flex-direction: column;
|
||||
max-width: 314px;
|
||||
}
|
||||
|
||||
.image-preview {
|
||||
border-radius: 16px;
|
||||
margin-bottom: 5px;
|
||||
object-fit: cover;
|
||||
object-position: center;
|
||||
border: 2px solid var(--border-color-primary);
|
||||
aspect-ratio: 1 / 1;
|
||||
}
|
||||
|
||||
button:focus {
|
||||
outline: none;
|
||||
}
|
||||
|
|
|
|||
|
|
@@ -77,6 +77,68 @@ curl http://127.0.0.1:5000/v1/chat/completions \
  }'
```

#### Multimodal/vision (llama.cpp and ExLlamaV3)

##### With /v1/chat/completions (recommended!)

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Please describe what you see in this image."},
          {"type": "image_url", "image_url": {"url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"}}
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

For base64-encoded images, just replace the inner "url" value with this format: `data:image/FORMAT;base64,BASE64_STRING` where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.
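
If you prefer Python over curl, the same request can be sent with the `requests` library. This is a hedged sketch rather than an official snippet: it assumes the server is running with `--api` on the default port 5000, and `image.png` stands in for your own file:

```python
# Sketch: send a local image to /v1/chat/completions as a base64 data URL.
# Assumes the web UI is running with --api on 127.0.0.1:5000; "image.png" is a placeholder.
import base64
import requests

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe what you see in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=120)
# Assuming the usual OpenAI-style response shape:
print(response.json()["choices"][0]["message"]["content"])
```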

##### With /v1/completions

```shell
curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "About image <__media__> and image <__media__>, what I can say is that the first one"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/strawberry.png?raw=true"
            }
          }
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

For base64-encoded images, just replace the inner "url" values with this format: `data:image/FORMAT;base64,BASE64_STRING` where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.

#### SSE streaming

```shell

docs/Multimodal Tutorial.md (new file, 66 lines)

@@ -0,0 +1,66 @@
## Getting started

### 1. Find a multimodal model

GGUF models with vision capabilities are uploaded alongside an `mmproj` file to Hugging Face.

For instance, [unsloth/gemma-3-4b-it-GGUF](https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/tree/main) has this:

<img width="414" height="270" alt="print1" src="https://github.com/user-attachments/assets/ac5aeb61-f6a2-491e-a1f0-47d6e27ea286" />

### 2. Download the model to `user_data/models`

As an example, download

https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_S.gguf?download=true

to your `text-generation-webui/user_data/models` folder.

### 3. Download the associated mmproj file to `user_data/mmproj`

Then download

https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true

to your `text-generation-webui/user_data/mmproj` folder. Name it `mmproj-gemma-3-4b-it-F16.gguf` to give it a recognizable name.
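
If you would rather script these two downloads than use the browser, a minimal sketch with the `huggingface_hub` package (an assumption: it is not bundled with the web UI, so `pip install huggingface_hub` first) could look like this, using the same target folders as above:

```python
# Sketch: fetch the GGUF model and its mmproj file with huggingface_hub.
# Assumes `pip install huggingface_hub`; run from the text-generation-webui folder.
import shutil
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="gemma-3-4b-it-Q4_K_S.gguf",
    local_dir="user_data/models",
)

mmproj_path = hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="mmproj-F16.gguf",
    local_dir="user_data/mmproj",
)

# Give the mmproj file a recognizable name, as suggested in the tutorial.
shutil.move(mmproj_path, "user_data/mmproj/mmproj-gemma-3-4b-it-F16.gguf")
```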

### 4. Load the model

1. Launch the web UI
2. Navigate to the Model tab
3. Select the GGUF model in the Model dropdown:

<img width="545" height="92" alt="print2" src="https://github.com/user-attachments/assets/3f920f50-e6c3-4768-91e2-20828dd63a1c" />

4. Select the mmproj file in the Multimodal (vision) menu:

<img width="454" height="172" alt="print3" src="https://github.com/user-attachments/assets/a657e20f-0ceb-4d71-9fe4-2b78571d20a6" />

5. Click "Load"

### 5. Send a message with an image

Select your image by clicking on the 📎 icon and send your message:

<img width="368" height="135" alt="print5" src="https://github.com/user-attachments/assets/6175ec9f-04f4-4dba-9382-4ac80d5b0b1f" />

The model will reply, showing a good understanding of the image contents:

<img width="809" height="884" alt="print6" src="https://github.com/user-attachments/assets/be4a8f4d-619d-49e6-86f5-012d89f8db8d" />

## Multimodal with ExLlamaV3

Multimodal also works with the ExLlamaV3 loader (the non-HF one).

No additional files are necessary; just load a multimodal EXL3 model and send an image.

Examples of models that you can use:

- https://huggingface.co/turboderp/gemma-3-27b-it-exl3
- https://huggingface.co/turboderp/Mistral-Small-3.1-24B-Instruct-2503-exl3

## Multimodal API examples

On the page below you can find some ready-to-use examples:

[Multimodal/vision (llama.cpp and ExLlamaV3)](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#multimodalvision-llamacpp-and-exllamav3)
|
@ -16,6 +16,8 @@ from modules.chat import (
|
|||
load_character_memoized,
|
||||
load_instruction_template_memoized
|
||||
)
|
||||
from modules.image_utils import convert_openai_messages_to_images
|
||||
from modules.logging_colors import logger
|
||||
from modules.presets import load_preset_memoized
|
||||
from modules.text_generation import decode, encode, generate_reply
|
||||
|
||||
|
|
@ -82,6 +84,33 @@ def process_parameters(body, is_legacy=False):
|
|||
return generate_params
|
||||
|
||||
|
||||
def process_multimodal_content(content):
|
||||
"""Extract text and add image placeholders from OpenAI multimodal format"""
|
||||
if isinstance(content, str):
|
||||
return content
|
||||
|
||||
if isinstance(content, list):
|
||||
text_parts = []
|
||||
image_placeholders = ""
|
||||
for item in content:
|
||||
if not isinstance(item, dict):
|
||||
continue
|
||||
|
||||
item_type = item.get('type', '')
|
||||
if item_type == 'text':
|
||||
text_parts.append(item.get('text', ''))
|
||||
elif item_type == 'image_url':
|
||||
image_placeholders += "<__media__>"
|
||||
|
||||
final_text = ' '.join(text_parts)
|
||||
if image_placeholders:
|
||||
return f"{image_placeholders}\n\n{final_text}"
|
||||
else:
|
||||
return final_text
|
||||
|
||||
return str(content)
|
||||
|
||||
|
||||
def convert_history(history):
|
||||
'''
|
||||
Chat histories in this program are in the format [message, reply].
|
||||
|
|
@ -99,8 +128,11 @@ def convert_history(history):
|
|||
role = entry["role"]
|
||||
|
||||
if role == "user":
|
||||
# Extract text content (images handled by model-specific code)
|
||||
content = process_multimodal_content(content)
|
||||
user_input = content
|
||||
user_input_last = True
|
||||
|
||||
if current_message:
|
||||
chat_dialogue.append([current_message, '', ''])
|
||||
current_message = ""
|
||||
|
|
@ -126,7 +158,11 @@ def convert_history(history):
|
|||
if not user_input_last:
|
||||
user_input = ""
|
||||
|
||||
return user_input, system_message, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}
|
||||
return user_input, system_message, {
|
||||
'internal': chat_dialogue,
|
||||
'visible': copy.deepcopy(chat_dialogue),
|
||||
'messages': history # Store original messages for multimodal models
|
||||
}
|
||||
|
||||
|
||||
def chat_completions_common(body: dict, is_legacy: bool = False, stream=False, prompt_only=False) -> dict:
|
||||
|
|
@ -150,9 +186,23 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False, p
|
|||
elif m['role'] == 'function':
|
||||
raise InvalidRequestError(message="role: function is not supported.", param='messages')
|
||||
|
||||
if 'content' not in m and "image_url" not in m:
|
||||
# Handle multimodal content validation
|
||||
content = m.get('content')
|
||||
if content is None:
|
||||
raise InvalidRequestError(message="messages: missing content", param='messages')
|
||||
|
||||
# Validate multimodal content structure
|
||||
if isinstance(content, list):
|
||||
for item in content:
|
||||
if not isinstance(item, dict) or 'type' not in item:
|
||||
raise InvalidRequestError(message="messages: invalid content item format", param='messages')
|
||||
if item['type'] not in ['text', 'image_url']:
|
||||
raise InvalidRequestError(message="messages: unsupported content type", param='messages')
|
||||
if item['type'] == 'text' and 'text' not in item:
|
||||
raise InvalidRequestError(message="messages: missing text in content item", param='messages')
|
||||
if item['type'] == 'image_url' and ('image_url' not in item or 'url' not in item['image_url']):
|
||||
raise InvalidRequestError(message="messages: missing image_url in content item", param='messages')
|
||||
|
||||
# Chat Completions
|
||||
object_type = 'chat.completion' if not stream else 'chat.completion.chunk'
|
||||
created_time = int(time.time())
|
||||
|
|
@ -336,9 +386,26 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
|
|||
|
||||
prompt_str = 'context' if is_legacy else 'prompt'
|
||||
|
||||
# ... encoded as a string, array of strings, array of tokens, or array of token arrays.
|
||||
if prompt_str not in body:
|
||||
raise InvalidRequestError("Missing required input", param=prompt_str)
|
||||
# Handle both prompt and messages format for unified multimodal support
|
||||
if prompt_str not in body or body[prompt_str] is None:
|
||||
if 'messages' in body:
|
||||
# Convert messages format to prompt for completions endpoint
|
||||
prompt_text = ""
|
||||
for message in body.get('messages', []):
|
||||
if isinstance(message, dict) and 'content' in message:
|
||||
# Extract text content from multimodal messages
|
||||
content = message['content']
|
||||
if isinstance(content, str):
|
||||
prompt_text += content
|
||||
elif isinstance(content, list):
|
||||
for item in content:
|
||||
if isinstance(item, dict) and item.get('type') == 'text':
|
||||
prompt_text += item.get('text', '')
|
||||
|
||||
# Allow empty prompts for image-only requests
|
||||
body[prompt_str] = prompt_text
|
||||
else:
|
||||
raise InvalidRequestError("Missing required input", param=prompt_str)
|
||||
|
||||
# common params
|
||||
generate_params = process_parameters(body, is_legacy=is_legacy)
|
||||
|
|
@ -349,9 +416,22 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
|
|||
suffix = body['suffix'] if body['suffix'] else ''
|
||||
echo = body['echo']
|
||||
|
||||
# Add messages to generate_params if present for multimodal processing
|
||||
if body.get('messages'):
|
||||
generate_params['messages'] = body['messages']
|
||||
raw_images = convert_openai_messages_to_images(generate_params['messages'])
|
||||
if raw_images:
|
||||
logger.info(f"Found {len(raw_images)} image(s) in request.")
|
||||
generate_params['raw_images'] = raw_images
|
||||
|
||||
if not stream:
|
||||
prompt_arg = body[prompt_str]
|
||||
if isinstance(prompt_arg, str) or (isinstance(prompt_arg, list) and isinstance(prompt_arg[0], int)):
|
||||
|
||||
# Handle empty/None prompts (e.g., image-only requests)
|
||||
if prompt_arg is None:
|
||||
prompt_arg = ""
|
||||
|
||||
if isinstance(prompt_arg, str) or (isinstance(prompt_arg, list) and len(prompt_arg) > 0 and isinstance(prompt_arg[0], int)):
|
||||
prompt_arg = [prompt_arg]
|
||||
|
||||
resp_list_data = []
|
||||
|
|
@ -359,7 +439,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
|
|||
total_prompt_token_count = 0
|
||||
|
||||
for idx, prompt in enumerate(prompt_arg, start=0):
|
||||
if isinstance(prompt[0], int):
|
||||
if isinstance(prompt, list) and len(prompt) > 0 and isinstance(prompt[0], int):
|
||||
# token lists
|
||||
if requested_model == shared.model_name:
|
||||
prompt = decode(prompt)[0]
|
||||
|
|
@ -448,7 +528,6 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
|
|||
# generate reply #######################################
|
||||
debug_msg({'prompt': prompt, 'generate_params': generate_params})
|
||||
generator = generate_reply(prompt, generate_params, is_chat=False)
|
||||
|
||||
answer = ''
|
||||
seen_content = ''
|
||||
completion_token_count = 0
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@ import json
|
|||
import time
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
from pydantic import BaseModel, Field, validator
|
||||
from pydantic import BaseModel, Field, model_validator, validator
|
||||
|
||||
|
||||
class GenerationOptions(BaseModel):
|
||||
|
|
@ -99,13 +99,14 @@ class ToolCall(BaseModel):
|
|||
|
||||
class CompletionRequestParams(BaseModel):
|
||||
model: str | None = Field(default=None, description="Unused parameter. To change the model, use the /v1/internal/model/load endpoint.")
|
||||
prompt: str | List[str]
|
||||
prompt: str | List[str] | None = Field(default=None, description="Text prompt for completion. Can also use 'messages' format for multimodal.")
|
||||
messages: List[dict] | None = Field(default=None, description="OpenAI messages format for multimodal support. Alternative to 'prompt'.")
|
||||
best_of: int | None = Field(default=1, description="Unused parameter.")
|
||||
echo: bool | None = False
|
||||
frequency_penalty: float | None = 0
|
||||
logit_bias: dict | None = None
|
||||
logprobs: int | None = None
|
||||
max_tokens: int | None = 16
|
||||
max_tokens: int | None = 512
|
||||
n: int | None = Field(default=1, description="Unused parameter.")
|
||||
presence_penalty: float | None = 0
|
||||
stop: str | List[str] | None = None
|
||||
|
|
@ -115,6 +116,12 @@ class CompletionRequestParams(BaseModel):
|
|||
top_p: float | None = 1
|
||||
user: str | None = Field(default=None, description="Unused parameter.")
|
||||
|
||||
@model_validator(mode='after')
|
||||
def validate_prompt_or_messages(self):
|
||||
if self.prompt is None and self.messages is None:
|
||||
raise ValueError("Either 'prompt' or 'messages' must be provided")
|
||||
return self
|
||||
|
||||
|
||||
class CompletionRequest(GenerationOptions, CompletionRequestParams):
|
||||
pass
|
||||
|
|
@ -220,7 +227,7 @@ class LogitsRequestParams(BaseModel):
|
|||
use_samplers: bool = False
|
||||
top_logits: int | None = 50
|
||||
frequency_penalty: float | None = 0
|
||||
max_tokens: int | None = 16
|
||||
max_tokens: int | None = 512
|
||||
presence_penalty: float | None = 0
|
||||
temperature: float | None = 1
|
||||
top_p: float | None = 1
|
||||
|
|
|
|||
|
|
@ -583,7 +583,7 @@ function moveToChatTab() {
|
|||
|
||||
const chatControlsFirstChild = document.querySelector("#chat-controls").firstElementChild;
|
||||
const newParent = chatControlsFirstChild;
|
||||
let newPosition = newParent.children.length - 2;
|
||||
let newPosition = newParent.children.length - 3;
|
||||
|
||||
newParent.insertBefore(grandParent, newParent.children[newPosition]);
|
||||
document.getElementById("save-character").style.display = "none";
|
||||
|
|
@ -977,7 +977,7 @@ if (document.readyState === "loading") {
|
|||
//------------------------------------------------
|
||||
|
||||
// File upload button
|
||||
document.querySelector("#chat-input .upload-button").title = "Upload text files, PDFs, and DOCX documents";
|
||||
document.querySelector("#chat-input .upload-button").title = "Upload text files, PDFs, DOCX documents, and images";
|
||||
|
||||
// Activate web search
|
||||
document.getElementById("web-search").title = "Search the internet with DuckDuckGo";
|
||||
|
|
|
|||
modules/chat.py (133 changes)
|
|
@ -269,18 +269,29 @@ def generate_chat_prompt(user_input, state, **kwargs):
|
|||
enhanced_user_msg = user_msg
|
||||
|
||||
# Add attachment content if present AND if past attachments are enabled
|
||||
if (state.get('include_past_attachments', True) and user_key in metadata and "attachments" in metadata[user_key]):
|
||||
if user_key in metadata and "attachments" in metadata[user_key]:
|
||||
attachments_text = ""
|
||||
for attachment in metadata[user_key]["attachments"]:
|
||||
filename = attachment.get("name", "file")
|
||||
content = attachment.get("content", "")
|
||||
if attachment.get("type") == "text/html" and attachment.get("url"):
|
||||
attachments_text += f"\nName: {filename}\nURL: {attachment['url']}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
else:
|
||||
attachments_text += f"\nName: {filename}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
image_refs = ""
|
||||
|
||||
if attachments_text:
|
||||
enhanced_user_msg = f"{user_msg}\n\nATTACHMENTS:\n{attachments_text}"
|
||||
for attachment in metadata[user_key]["attachments"]:
|
||||
if attachment.get("type") == "image":
|
||||
# Add image reference for multimodal models
|
||||
image_refs += "<__media__>"
|
||||
elif state.get('include_past_attachments', True):
|
||||
# Handle text/PDF attachments
|
||||
filename = attachment.get("name", "file")
|
||||
content = attachment.get("content", "")
|
||||
if attachment.get("type") == "text/html" and attachment.get("url"):
|
||||
attachments_text += f"\nName: {filename}\nURL: {attachment['url']}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
else:
|
||||
attachments_text += f"\nName: {filename}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
|
||||
if image_refs or attachments_text:
|
||||
enhanced_user_msg = user_msg
|
||||
if image_refs:
|
||||
enhanced_user_msg = f"{image_refs}\n\n{enhanced_user_msg}"
|
||||
if attachments_text:
|
||||
enhanced_user_msg += f"\n\nATTACHMENTS:\n{attachments_text}"
|
||||
|
||||
messages.insert(insert_pos, {"role": "user", "content": enhanced_user_msg})
|
||||
|
||||
|
|
@ -301,16 +312,25 @@ def generate_chat_prompt(user_input, state, **kwargs):
|
|||
|
||||
if user_key in metadata and "attachments" in metadata[user_key]:
|
||||
attachments_text = ""
|
||||
for attachment in metadata[user_key]["attachments"]:
|
||||
filename = attachment.get("name", "file")
|
||||
content = attachment.get("content", "")
|
||||
if attachment.get("type") == "text/html" and attachment.get("url"):
|
||||
attachments_text += f"\nName: {filename}\nURL: {attachment['url']}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
else:
|
||||
attachments_text += f"\nName: {filename}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
image_refs = ""
|
||||
|
||||
if attachments_text:
|
||||
user_input = f"{user_input}\n\nATTACHMENTS:\n{attachments_text}"
|
||||
for attachment in metadata[user_key]["attachments"]:
|
||||
if attachment.get("type") == "image":
|
||||
image_refs += "<__media__>"
|
||||
else:
|
||||
filename = attachment.get("name", "file")
|
||||
content = attachment.get("content", "")
|
||||
if attachment.get("type") == "text/html" and attachment.get("url"):
|
||||
attachments_text += f"\nName: {filename}\nURL: {attachment['url']}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
else:
|
||||
attachments_text += f"\nName: {filename}\nContents:\n\n=====\n{content}\n=====\n\n"
|
||||
|
||||
if image_refs or attachments_text:
|
||||
user_input = user_input
|
||||
if image_refs:
|
||||
user_input = f"{image_refs}\n\n{user_input}"
|
||||
if attachments_text:
|
||||
user_input += f"\n\nATTACHMENTS:\n{attachments_text}"
|
||||
|
||||
messages.append({"role": "user", "content": user_input})
|
||||
|
||||
|
|
@ -594,29 +614,63 @@ def add_message_attachment(history, row_idx, file_path, is_user=True):
|
|||
file_extension = path.suffix.lower()
|
||||
|
||||
try:
|
||||
# Handle different file types
|
||||
if file_extension == '.pdf':
|
||||
# Handle image files
|
||||
if file_extension in ['.jpg', '.jpeg', '.png', '.webp', '.bmp', '.gif']:
|
||||
# Convert image to base64
|
||||
with open(path, 'rb') as f:
|
||||
image_data = base64.b64encode(f.read()).decode('utf-8')
|
||||
|
||||
# Determine MIME type from extension
|
||||
mime_type_map = {
|
||||
'.jpg': 'image/jpeg',
|
||||
'.jpeg': 'image/jpeg',
|
||||
'.png': 'image/png',
|
||||
'.webp': 'image/webp',
|
||||
'.bmp': 'image/bmp',
|
||||
'.gif': 'image/gif'
|
||||
}
|
||||
mime_type = mime_type_map.get(file_extension, 'image/jpeg')
|
||||
|
||||
# Format as data URL
|
||||
data_url = f"data:{mime_type};base64,{image_data}"
|
||||
|
||||
# Generate unique image ID
|
||||
image_id = len([att for att in history['metadata'][key]["attachments"] if att.get("type") == "image"]) + 1
|
||||
|
||||
attachment = {
|
||||
"name": filename,
|
||||
"type": "image",
|
||||
"image_data": data_url,
|
||||
"image_id": image_id,
|
||||
}
|
||||
elif file_extension == '.pdf':
|
||||
# Process PDF file
|
||||
content = extract_pdf_text(path)
|
||||
file_type = "application/pdf"
|
||||
attachment = {
|
||||
"name": filename,
|
||||
"type": "application/pdf",
|
||||
"content": content,
|
||||
}
|
||||
elif file_extension == '.docx':
|
||||
content = extract_docx_text(path)
|
||||
file_type = "application/docx"
|
||||
attachment = {
|
||||
"name": filename,
|
||||
"type": "application/docx",
|
||||
"content": content,
|
||||
}
|
||||
else:
|
||||
# Default handling for text files
|
||||
with open(path, 'r', encoding='utf-8') as f:
|
||||
content = f.read()
|
||||
file_type = "text/plain"
|
||||
|
||||
# Add attachment
|
||||
attachment = {
|
||||
"name": filename,
|
||||
"type": file_type,
|
||||
"content": content,
|
||||
}
|
||||
attachment = {
|
||||
"name": filename,
|
||||
"type": "text/plain",
|
||||
"content": content,
|
||||
}
|
||||
|
||||
history['metadata'][key]["attachments"].append(attachment)
|
||||
return content # Return the content for reuse
|
||||
return attachment # Return the attachment for reuse
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing attachment {filename}: {e}")
|
||||
return None
|
||||
|
|
@ -814,6 +868,22 @@ def chatbot_wrapper(text, state, regenerate=False, _continue=False, loading_mess
|
|||
'metadata': output['metadata']
|
||||
}
|
||||
|
||||
row_idx = len(output['internal']) - 1
|
||||
|
||||
# Collect image attachments for multimodal generation from the entire history
|
||||
all_image_attachments = []
|
||||
if 'metadata' in output:
|
||||
for i in range(len(output['internal'])):
|
||||
user_key = f"user_{i}"
|
||||
if user_key in output['metadata'] and "attachments" in output['metadata'][user_key]:
|
||||
for attachment in output['metadata'][user_key]["attachments"]:
|
||||
if attachment.get("type") == "image":
|
||||
all_image_attachments.append(attachment)
|
||||
|
||||
# Add all collected image attachments to state for the generation
|
||||
if all_image_attachments:
|
||||
state['image_attachments'] = all_image_attachments
|
||||
|
||||
# Generate the prompt
|
||||
kwargs = {
|
||||
'_continue': _continue,
|
||||
|
|
@ -828,7 +898,6 @@ def chatbot_wrapper(text, state, regenerate=False, _continue=False, loading_mess
|
|||
prompt = generate_chat_prompt(text, state, **kwargs)
|
||||
|
||||
# Add timestamp for assistant's response at the start of generation
|
||||
row_idx = len(output['internal']) - 1
|
||||
update_message_metadata(output['metadata'], "assistant", row_idx, timestamp=get_current_timestamp(), model_name=shared.model_name)
|
||||
|
||||
# Generate
|
||||
|
|
|
|||
|
|
@ -135,7 +135,8 @@ class Exllamav2Model:
|
|||
return result, result
|
||||
|
||||
def encode(self, string, **kwargs):
|
||||
return self.tokenizer.encode(string, add_bos=True, encode_special_tokens=True)
|
||||
add_bos = kwargs.pop('add_bos', True)
|
||||
return self.tokenizer.encode(string, add_bos=add_bos, encode_special_tokens=True, **kwargs)
|
||||
|
||||
def decode(self, ids, **kwargs):
|
||||
if isinstance(ids, list):
|
||||
|
|
|
|||
modules/exllamav3.py (new file, 415 lines)
|
|
@ -0,0 +1,415 @@
|
|||
import traceback
|
||||
from pathlib import Path
|
||||
from typing import Any, List, Tuple
|
||||
|
||||
from exllamav3 import Cache, Config, Generator, Model, Tokenizer
|
||||
from exllamav3.cache import CacheLayer_fp16, CacheLayer_quant
|
||||
from exllamav3.generator import Job
|
||||
from exllamav3.generator.sampler import (
|
||||
CustomSampler,
|
||||
SS_Argmax,
|
||||
SS_MinP,
|
||||
SS_PresFreqP,
|
||||
SS_RepP,
|
||||
SS_Sample,
|
||||
SS_Temperature,
|
||||
SS_TopK,
|
||||
SS_TopP
|
||||
)
|
||||
|
||||
from modules import shared
|
||||
from modules.image_utils import (
|
||||
convert_image_attachments_to_pil,
|
||||
convert_openai_messages_to_images
|
||||
)
|
||||
from modules.logging_colors import logger
|
||||
from modules.text_generation import get_max_prompt_length
|
||||
|
||||
try:
|
||||
import flash_attn
|
||||
except Exception:
|
||||
logger.warning('Failed to load flash-attention due to the following error:\n')
|
||||
traceback.print_exc()
|
||||
|
||||
|
||||
class Exllamav3Model:
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, path_to_model):
|
||||
path_to_model = Path(f'{shared.args.model_dir}') / Path(path_to_model)
|
||||
|
||||
# Reset global MMTokenAllocator to prevent token ID corruption when switching models
|
||||
from exllamav3.tokenizer.mm_embedding import (
|
||||
FIRST_MM_EMBEDDING_INDEX,
|
||||
global_allocator
|
||||
)
|
||||
global_allocator.next_token_index = FIRST_MM_EMBEDDING_INDEX
|
||||
|
||||
config = Config.from_directory(str(path_to_model))
|
||||
model = Model.from_config(config)
|
||||
|
||||
# Calculate the closest multiple of 256 at or above the chosen value
|
||||
max_tokens = shared.args.ctx_size
|
||||
if max_tokens % 256 != 0:
|
||||
adjusted_tokens = ((max_tokens // 256) + 1) * 256
|
||||
logger.warning(f"max_num_tokens must be a multiple of 256. Adjusting from {max_tokens} to {adjusted_tokens}")
|
||||
max_tokens = adjusted_tokens
|
||||
|
||||
# Parse cache type (ExLlamaV2 pattern)
|
||||
cache_type = shared.args.cache_type.lower()
|
||||
cache_kwargs = {}
|
||||
if cache_type == 'fp16':
|
||||
layer_type = CacheLayer_fp16
|
||||
elif cache_type.startswith('q'):
|
||||
layer_type = CacheLayer_quant
|
||||
if '_' in cache_type:
|
||||
# Different bits for k and v (e.g., q4_q8)
|
||||
k_part, v_part = cache_type.split('_')
|
||||
k_bits = int(k_part[1:])
|
||||
v_bits = int(v_part[1:])
|
||||
else:
|
||||
# Same bits for k and v (e.g., q4)
|
||||
k_bits = v_bits = int(cache_type[1:])
|
||||
|
||||
# Validate bit ranges
|
||||
if not (2 <= k_bits <= 8 and 2 <= v_bits <= 8):
|
||||
logger.warning(f"Invalid quantization bits: k_bits={k_bits}, v_bits={v_bits}. Must be between 2 and 8. Falling back to fp16.")
|
||||
layer_type = CacheLayer_fp16
|
||||
else:
|
||||
cache_kwargs = {'k_bits': k_bits, 'v_bits': v_bits}
|
||||
else:
|
||||
logger.warning(f"Unrecognized cache type: {cache_type}. Falling back to fp16.")
|
||||
layer_type = CacheLayer_fp16
|
||||
|
||||
cache = Cache(model, max_num_tokens=max_tokens, layer_type=layer_type, **cache_kwargs)
|
||||
|
||||
load_params = {'progressbar': True}
|
||||
split = None
|
||||
if shared.args.gpu_split:
|
||||
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
|
||||
load_params['use_per_device'] = split
|
||||
|
||||
# Tensor-parallelism
|
||||
if shared.args.enable_tp:
|
||||
load_params['tensor_p'] = True
|
||||
load_params['tp_backend'] = shared.args.tp_backend
|
||||
|
||||
model.load(**load_params)
|
||||
tokenizer = Tokenizer.from_config(config)
|
||||
|
||||
# Initialize draft model for speculative decoding
|
||||
draft_model = None
|
||||
draft_cache = None
|
||||
if shared.args.model_draft and shared.args.model_draft.lower() not in ["", "none"]:
|
||||
logger.info(f"Loading draft model for speculative decoding: {shared.args.model_draft}")
|
||||
|
||||
draft_path = Path(shared.args.model_draft)
|
||||
if not draft_path.is_dir():
|
||||
draft_path = Path(f'{shared.args.model_dir}') / Path(shared.args.model_draft)
|
||||
|
||||
if not draft_path.is_dir():
|
||||
logger.warning(f"Draft model not found at {draft_path}, speculative decoding disabled.")
|
||||
else:
|
||||
draft_config = Config.from_directory(str(draft_path))
|
||||
|
||||
# Set context size for draft model with 256-multiple validation
|
||||
if shared.args.ctx_size_draft > 0:
|
||||
draft_max_tokens = shared.args.ctx_size_draft
|
||||
else:
|
||||
draft_max_tokens = shared.args.ctx_size
|
||||
|
||||
# Validate draft model context size is a multiple of 256
|
||||
if draft_max_tokens % 256 != 0:
|
||||
adjusted_draft_tokens = ((draft_max_tokens // 256) + 1) * 256
|
||||
logger.warning(f"Draft model max_num_tokens must be a multiple of 256. Adjusting from {draft_max_tokens} to {adjusted_draft_tokens}")
|
||||
draft_max_tokens = adjusted_draft_tokens
|
||||
|
||||
draft_config.max_seq_len = draft_max_tokens
|
||||
|
||||
draft_model = Model.from_config(draft_config)
|
||||
draft_cache = Cache(draft_model, max_num_tokens=draft_max_tokens, layer_type=layer_type, **cache_kwargs)
|
||||
|
||||
draft_load_params = {'progressbar': True}
|
||||
if split:
|
||||
draft_load_params['use_per_device'] = split
|
||||
|
||||
draft_model.load(**draft_load_params)
|
||||
logger.info(f"Draft model loaded successfully. Max speculative tokens: {shared.args.draft_max}")
|
||||
|
||||
# Load vision model component (ExLlamaV3 native)
|
||||
vision_model = None
|
||||
if "vision_config" in config.config_dict:
|
||||
logger.info("Vision component detected in model config. Attempting to load...")
|
||||
try:
|
||||
vision_model = Model.from_config(config, component="vision")
|
||||
vision_model.load(progressbar=True)
|
||||
logger.info("Vision model loaded successfully.")
|
||||
except Exception as e:
|
||||
logger.warning(f"Vision model loading failed (multimodal disabled): {e}")
|
||||
else:
|
||||
logger.info("No vision component in model config. Skipping multimodal setup.")
|
||||
|
||||
generator = Generator(
|
||||
model=model,
|
||||
cache=cache,
|
||||
tokenizer=tokenizer,
|
||||
draft_model=draft_model,
|
||||
draft_cache=draft_cache,
|
||||
num_speculative_tokens=shared.args.draft_max if draft_model is not None else 0,
|
||||
)
|
||||
|
||||
result = cls()
|
||||
result.model = model
|
||||
result.cache = cache
|
||||
result.tokenizer = tokenizer
|
||||
result.generator = generator
|
||||
result.config = config
|
||||
result.max_tokens = max_tokens
|
||||
result.vision_model = vision_model
|
||||
result.draft_model = draft_model
|
||||
result.draft_cache = draft_cache
|
||||
|
||||
return result
|
||||
|
||||
def is_multimodal(self) -> bool:
|
||||
"""Check if this model supports multimodal input."""
|
||||
return hasattr(self, 'vision_model') and self.vision_model is not None
|
||||
|
||||
def _process_images_for_generation(self, prompt: str, state: dict) -> Tuple[str, List[Any]]:
|
||||
"""
|
||||
Process all possible image inputs and return modified prompt + embeddings.
|
||||
Returns: (processed_prompt, image_embeddings)
|
||||
"""
|
||||
# Collect images from various sources using shared utilities
|
||||
pil_images = []
|
||||
|
||||
# From webui image_attachments (preferred format)
|
||||
if 'image_attachments' in state and state['image_attachments']:
|
||||
pil_images.extend(convert_image_attachments_to_pil(state['image_attachments']))
|
||||
# From OpenAI API raw_images
|
||||
elif 'raw_images' in state and state['raw_images']:
|
||||
pil_images.extend(state['raw_images'])
|
||||
# From OpenAI API messages format
|
||||
elif 'messages' in state and state['messages']:
|
||||
pil_images.extend(convert_openai_messages_to_images(state['messages']))
|
||||
|
||||
if not pil_images:
|
||||
return prompt, []
|
||||
|
||||
# ExLlamaV3-specific: Generate embeddings
|
||||
try:
|
||||
# Use pre-computed embeddings if available (proper MMEmbedding lifetime)
|
||||
if 'image_embeddings' in state and state['image_embeddings']:
|
||||
# Use existing embeddings - this preserves MMEmbedding lifetime
|
||||
image_embeddings = state['image_embeddings']
|
||||
else:
|
||||
# Do not reset the cache/allocator index; it causes token ID conflicts during generation.
|
||||
logger.info(f"Processing {len(pil_images)} image(s) with ExLlamaV3 vision model")
|
||||
image_embeddings = [
|
||||
self.vision_model.get_image_embeddings(tokenizer=self.tokenizer, image=img)
|
||||
for img in pil_images
|
||||
]
|
||||
|
||||
# ExLlamaV3-specific: Handle prompt processing with placeholders
|
||||
placeholders = [ie.text_alias for ie in image_embeddings]
|
||||
|
||||
if '<__media__>' in prompt:
|
||||
# Web chat: Replace <__media__> placeholders
|
||||
for alias in placeholders:
|
||||
prompt = prompt.replace('<__media__>', alias, 1)
|
||||
logger.info(f"Replaced {len(placeholders)} <__media__> placeholder(s)")
|
||||
else:
|
||||
# API: Prepend embedding aliases
|
||||
combined_placeholders = "\n".join(placeholders)
|
||||
prompt = combined_placeholders + "\n" + prompt
|
||||
logger.info(f"Prepended {len(placeholders)} embedding(s) to prompt")
|
||||
|
||||
return prompt, image_embeddings
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to process images: {e}")
|
||||
return prompt, []
|
||||
|
||||
def generate_with_streaming(self, prompt, state):
|
||||
"""
|
||||
Generate text with streaming using native ExLlamaV3 API
|
||||
"""
|
||||
|
||||
if shared.is_multimodal:
|
||||
# Process images and modify prompt (ExLlamaV3-specific)
|
||||
prompt, image_embeddings = self._process_images_for_generation(prompt, state)
|
||||
else:
|
||||
image_embeddings = []
|
||||
|
||||
# Greedy decoding is a special case
|
||||
if state['temperature'] == 0:
|
||||
sampler = CustomSampler([SS_Argmax()])
|
||||
else:
|
||||
# 1. Create a list of all active, unordered samplers
|
||||
unordered_samplers = []
|
||||
|
||||
# Penalties
|
||||
penalty_range = state['repetition_penalty_range']
|
||||
if penalty_range <= 0:
|
||||
penalty_range = int(10e7) # Use large number for "full context"
|
||||
rep_decay = 0 # Not a configurable parameter
|
||||
|
||||
# Add penalty samplers if they are active
|
||||
if state['repetition_penalty'] != 1.0:
|
||||
unordered_samplers.append(SS_RepP(state['repetition_penalty'], penalty_range, rep_decay))
|
||||
if state['presence_penalty'] != 0.0 or state['frequency_penalty'] != 0.0:
|
||||
unordered_samplers.append(SS_PresFreqP(state['presence_penalty'], state['frequency_penalty'], penalty_range, rep_decay))
|
||||
|
||||
# Standard samplers
|
||||
if state['top_k'] > 0:
|
||||
unordered_samplers.append(SS_TopK(state['top_k']))
|
||||
if state['top_p'] < 1.0:
|
||||
unordered_samplers.append(SS_TopP(state['top_p']))
|
||||
if state['min_p'] > 0.0:
|
||||
unordered_samplers.append(SS_MinP(state['min_p']))
|
||||
|
||||
# Temperature (SS_NoOp is returned if temp is 1.0)
|
||||
unordered_samplers.append(SS_Temperature(state['temperature']))
|
||||
|
||||
# 2. Define the mapping from class names to the priority list keys
|
||||
class_name_to_nickname = {
|
||||
'SS_RepP': 'repetition_penalty',
|
||||
'SS_PresFreqP': 'presence_frequency_penalty',
|
||||
'SS_TopK': 'top_k',
|
||||
'SS_TopP': 'top_p',
|
||||
'SS_MinP': 'min_p',
|
||||
'SS_Temperature': 'temperature',
|
||||
}
|
||||
|
||||
# 3. Get the priority list and handle temperature_last
|
||||
default_priority = ['repetition_penalty', 'presence_frequency_penalty', 'top_k', 'top_p', 'min_p', 'temperature']
|
||||
sampler_priority = state.get('sampler_priority') or default_priority
|
||||
|
||||
if state['temperature_last'] and 'temperature' in sampler_priority:
|
||||
sampler_priority.append(sampler_priority.pop(sampler_priority.index('temperature')))
|
||||
|
||||
# 4. Sort the unordered list based on the priority list
|
||||
def custom_sort_key(sampler_obj):
|
||||
class_name = sampler_obj.__class__.__name__
|
||||
nickname = class_name_to_nickname.get(class_name)
|
||||
if nickname and nickname in sampler_priority:
|
||||
return sampler_priority.index(nickname)
|
||||
return -1
|
||||
|
||||
ordered_samplers = sorted(unordered_samplers, key=custom_sort_key)
|
||||
|
||||
# 5. Add the final sampling stage and build the sampler
|
||||
ordered_samplers.append(SS_Sample())
|
||||
sampler = CustomSampler(ordered_samplers)
|
||||
|
||||
# Encode prompt with embeddings (ExLlamaV3-specific)
|
||||
input_ids = self.tokenizer.encode(
|
||||
prompt,
|
||||
add_bos=state['add_bos_token'],
|
||||
encode_special_tokens=True,
|
||||
embeddings=image_embeddings,
|
||||
)
|
||||
|
||||
input_ids = input_ids[:, -get_max_prompt_length(state):]
|
||||
|
||||
self._last_prompt_token_count = input_ids.shape[-1]
|
||||
|
||||
# Determine max_new_tokens
|
||||
if state['auto_max_new_tokens']:
|
||||
max_new_tokens = state['truncation_length'] - self._last_prompt_token_count
|
||||
else:
|
||||
max_new_tokens = state['max_new_tokens']
|
||||
|
||||
# Get stop conditions
|
||||
stop_conditions = []
|
||||
if not state['ban_eos_token']:
|
||||
if hasattr(self.tokenizer, 'eos_token_id') and self.tokenizer.eos_token_id is not None:
|
||||
stop_conditions.append(self.tokenizer.eos_token_id)
|
||||
|
||||
job = Job(
|
||||
input_ids=input_ids,
|
||||
max_new_tokens=max_new_tokens,
|
||||
decode_special_tokens=not state['skip_special_tokens'],
|
||||
embeddings=image_embeddings if image_embeddings else None,
|
||||
sampler=sampler,
|
||||
stop_conditions=stop_conditions if stop_conditions else None,
|
||||
)
|
||||
|
||||
# Stream generation
|
||||
self.generator.enqueue(job)
|
||||
|
||||
response_text = ""
|
||||
|
||||
try:
|
||||
while self.generator.num_remaining_jobs():
|
||||
results = self.generator.iterate()
|
||||
for result in results:
|
||||
if "eos" in result and result["eos"]:
|
||||
break
|
||||
|
||||
chunk = result.get("text", "")
|
||||
if chunk:
|
||||
response_text += chunk
|
||||
yield response_text
|
||||
|
||||
finally:
|
||||
self.generator.clear_queue()
|
||||
|
||||
def generate(self, prompt, state):
|
||||
output = ""
|
||||
for chunk in self.generate_with_streaming(prompt, state):
|
||||
output = chunk
|
||||
|
||||
return output
|
||||
|
||||
def encode(self, string, **kwargs):
|
||||
add_bos = kwargs.pop('add_bos', True)
|
||||
return self.tokenizer.encode(string, add_bos=add_bos, **kwargs)
|
||||
|
||||
def decode(self, ids, **kwargs):
|
||||
return self.tokenizer.decode(ids, **kwargs)
|
||||
|
||||
@property
|
||||
def last_prompt_token_count(self):
|
||||
return getattr(self, '_last_prompt_token_count', 0)
|
||||
|
||||
def unload(self):
|
||||
logger.info("Unloading ExLlamaV3 model components...")
|
||||
|
||||
if hasattr(self, 'vision_model') and self.vision_model is not None:
|
||||
try:
|
||||
del self.vision_model
|
||||
except Exception as e:
|
||||
logger.warning(f"Error unloading vision model: {e}")
|
||||
self.vision_model = None
|
||||
|
||||
if hasattr(self, 'draft_model') and self.draft_model is not None:
|
||||
try:
|
||||
self.draft_model.unload()
|
||||
del self.draft_model
|
||||
except Exception as e:
|
||||
logger.warning(f"Error unloading draft model: {e}")
|
||||
self.draft_model = None
|
||||
|
||||
if hasattr(self, 'draft_cache') and self.draft_cache is not None:
|
||||
self.draft_cache = None
|
||||
|
||||
if hasattr(self, 'model') and self.model is not None:
|
||||
try:
|
||||
self.model.unload()
|
||||
del self.model
|
||||
except Exception as e:
|
||||
logger.warning(f"Error unloading main model: {e}")
|
||||
|
||||
self.model = None
|
||||
|
||||
if hasattr(self, 'cache') and self.cache is not None:
|
||||
self.cache = None
|
||||
|
||||
if hasattr(self, 'generator') and self.generator is not None:
|
||||
self.generator = None
|
||||
|
||||
if hasattr(self, 'tokenizer') and self.tokenizer is not None:
|
||||
self.tokenizer = None
|
||||
|
|
@ -74,6 +74,11 @@ class Exllamav3HF(PreTrainedModel, GenerationMixin):
|
|||
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
|
||||
load_params['use_per_device'] = split
|
||||
|
||||
# Tensor-parallelism
|
||||
if shared.args.enable_tp:
|
||||
load_params['tensor_p'] = True
|
||||
load_params['tp_backend'] = shared.args.tp_backend
|
||||
|
||||
self.ex_model.load(**load_params)
|
||||
self.past_seq = None
|
||||
self.max_tokens = max_tokens
|
||||
|
|
|
|||
|
|
@ -306,6 +306,9 @@ def process_markdown_content(string):
|
|||
# Convert to HTML using markdown
|
||||
html_output = markdown.markdown(result, extensions=['fenced_code', 'tables', SaneListExtension()])
|
||||
|
||||
# Remove extra newlines before </code>
|
||||
html_output = re.sub(r'\s*</code>', '</code>', html_output)
|
||||
|
||||
# Unescape code blocks
|
||||
pattern = re.compile(r'<code[^>]*>(.*?)</code>', re.DOTALL)
|
||||
html_output = pattern.sub(lambda x: html.unescape(x.group()), html_output)
|
||||
|
|
@@ -406,16 +409,26 @@ def format_message_attachments(history, role, index):
     for attachment in attachments:
         name = html.escape(attachment["name"])

-        # Make clickable if URL exists
-        if "url" in attachment:
-            name = f'<a href="{html.escape(attachment["url"])}" target="_blank" rel="noopener noreferrer">{name}</a>'
-
-        attachments_html += (
-            f'<div class="attachment-box">'
-            f'<div class="attachment-icon">{attachment_svg}</div>'
-            f'<div class="attachment-name">{name}</div>'
-            f'</div>'
-        )
+        if attachment.get("type") == "image":
+            image_data = attachment.get("image_data", "")
+            attachments_html += (
+                f'<div class="attachment-box image-attachment">'
+                f'<img src="{image_data}" alt="{name}" class="image-preview" />'
+                f'<div class="attachment-name">{name}</div>'
+                f'</div>'
+            )
+        else:
+            # Make clickable if URL exists (web search)
+            if "url" in attachment:
+                name = f'<a href="{html.escape(attachment["url"])}" target="_blank" rel="noopener noreferrer">{name}</a>'
+
+            attachments_html += (
+                f'<div class="attachment-box">'
+                f'<div class="attachment-icon">{attachment_svg}</div>'
+                f'<div class="attachment-name">{name}</div>'
+                f'</div>'
+            )

     attachments_html += '</div>'
     return attachments_html
modules/image_utils.py (new file, 106 lines)
@@ -0,0 +1,106 @@
"""
Shared image processing utilities for multimodal support.
Used by both ExLlamaV3 and llama.cpp implementations.
"""
import base64
import io
from typing import Any, List, Tuple

from PIL import Image

from modules.logging_colors import logger


def convert_pil_to_base64(image: Image.Image) -> str:
    """Converts a PIL Image to a base64 encoded string."""
    buffered = io.BytesIO()
    # Save image to an in-memory bytes buffer in PNG format
    image.save(buffered, format="PNG")
    # Encode the bytes to a base64 string
    return base64.b64encode(buffered.getvalue()).decode('utf-8')


def decode_base64_image(base64_string: str) -> Image.Image:
    """Decodes a base64 string to a PIL Image."""
    try:
        if base64_string.startswith('data:image/'):
            base64_string = base64_string.split(',', 1)[1]

        image_data = base64.b64decode(base64_string)
        image = Image.open(io.BytesIO(image_data))
        return image
    except Exception as e:
        logger.error(f"Failed to decode base64 image: {e}")
        raise ValueError(f"Invalid base64 image data: {e}")


def process_message_content(content: Any) -> Tuple[str, List[Image.Image]]:
    """
    Processes message content that may contain text and images.
    Returns: A tuple of (text_content, list_of_pil_images).
    """
    if isinstance(content, str):
        return content, []

    if isinstance(content, list):
        text_parts = []
        images = []
        for item in content:
            if not isinstance(item, dict):
                continue

            item_type = item.get('type', '')
            if item_type == 'text':
                text_parts.append(item.get('text', ''))
            elif item_type == 'image_url':
                image_url_data = item.get('image_url', {})
                image_url = image_url_data.get('url', '')

                if image_url.startswith('data:image/'):
                    try:
                        images.append(decode_base64_image(image_url))
                    except Exception as e:
                        logger.warning(f"Failed to process a base64 image: {e}")
                elif image_url.startswith('http'):
                    # Support external URLs
                    try:
                        import requests
                        response = requests.get(image_url, timeout=10)
                        response.raise_for_status()
                        image_data = response.content
                        image = Image.open(io.BytesIO(image_data))
                        images.append(image)
                        logger.info("Successfully loaded external image from URL")
                    except Exception as e:
                        logger.warning(f"Failed to fetch external image: {e}")
                else:
                    logger.warning(f"Unsupported image URL format: {image_url[:70]}...")

        return ' '.join(text_parts), images

    return str(content), []


def convert_image_attachments_to_pil(image_attachments: List[dict]) -> List[Image.Image]:
    """Convert webui image_attachments format to PIL Images."""
    pil_images = []
    for attachment in image_attachments:
        if attachment.get('type') == 'image' and 'image_data' in attachment:
            try:
                image = decode_base64_image(attachment['image_data'])
                if image.mode != 'RGB':
                    image = image.convert('RGB')
                pil_images.append(image)
            except Exception as e:
                logger.warning(f"Failed to process image attachment: {e}")
    return pil_images


def convert_openai_messages_to_images(messages: List[dict]) -> List[Image.Image]:
    """Convert OpenAI messages format to PIL Images."""
    all_images = []
    for message in messages:
        if isinstance(message, dict) and 'content' in message:
            _, images = process_message_content(message['content'])
            all_images.extend(images)
    return all_images
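A minimal, hedged usage sketch of the module above; it assumes only the helpers it defines plus a working Pillow install, and nothing below comes from the patch itself.

```python
# Round-tripping an image through the new helpers.
from PIL import Image

from modules.image_utils import (
    convert_pil_to_base64,
    decode_base64_image,
    process_message_content,
)

img = Image.new("RGB", (8, 8), color="red")
b64 = convert_pil_to_base64(img)  # bare base64, no "data:" prefix
restored = decode_base64_image(f"data:image/png;base64,{b64}")  # prefix is stripped internally

# OpenAI-style multimodal message content, as handled by process_message_content()
content = [
    {"type": "text", "text": "What color is this square?"},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
]
text, images = process_message_content(content)
print(text, restored.size, len(images))  # "What color is this square? (8, 8) 1"
```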
@@ -8,11 +8,17 @@ import sys
 import threading
 import time
 from pathlib import Path
+from typing import Any, List

 import llama_cpp_binaries
 import requests

 from modules import shared
+from modules.image_utils import (
+    convert_image_attachments_to_pil,
+    convert_openai_messages_to_images,
+    convert_pil_to_base64
+)
 from modules.logging_colors import logger

 llamacpp_valid_cache_types = {"fp16", "q8_0", "q4_0"}

@@ -124,19 +130,61 @@ class LlamaServer:

         return payload

+    def _process_images_for_generation(self, state: dict) -> List[Any]:
+        """
+        Process all possible image inputs and return PIL images.
+        """
+        pil_images = []
+        # Source 1: Web UI (from chatbot_wrapper)
+        if 'image_attachments' in state and state['image_attachments']:
+            pil_images.extend(convert_image_attachments_to_pil(state['image_attachments']))
+        # Source 2: Chat Completions API (/v1/chat/completions)
+        elif 'history' in state and state.get('history', {}).get('messages'):
+            pil_images.extend(convert_openai_messages_to_images(state['history']['messages']))
+        # Source 3: Legacy Completions API (/v1/completions)
+        elif 'raw_images' in state and state['raw_images']:
+            pil_images.extend(state.get('raw_images', []))

+        return pil_images
+
+    def is_multimodal(self) -> bool:
+        """Check if this model supports multimodal input."""
+        return shared.args.mmproj not in [None, 'None']
+
     def generate_with_streaming(self, prompt, state):
         url = f"http://127.0.0.1:{self.port}/completion"
         payload = self.prepare_payload(state)

-        token_ids = self.encode(prompt, add_bos_token=state["add_bos_token"])
-        self.last_prompt_token_count = len(token_ids)
+        pil_images = []
+
+        if shared.is_multimodal:
+            pil_images = self._process_images_for_generation(state)
+
+        if pil_images:
+            # Multimodal case
+            IMAGE_TOKEN_COST_ESTIMATE = 600  # A safe, conservative estimate per image
+
+            base64_images = [convert_pil_to_base64(img) for img in pil_images]
+            payload["prompt"] = {
+                "prompt_string": prompt,
+                "multimodal_data": base64_images
+            }
+
+            # Calculate an estimated token count
+            text_tokens = self.encode(prompt, add_bos_token=state["add_bos_token"])
+            self.last_prompt_token_count = len(text_tokens) + (len(pil_images) * IMAGE_TOKEN_COST_ESTIMATE)
+        else:
+            # Text only case
+            token_ids = self.encode(prompt, add_bos_token=state["add_bos_token"])
+            self.last_prompt_token_count = len(token_ids)
+            payload["prompt"] = token_ids

         if state['auto_max_new_tokens']:
-            max_new_tokens = state['truncation_length'] - len(token_ids)
+            max_new_tokens = state['truncation_length'] - self.last_prompt_token_count
         else:
             max_new_tokens = state['max_new_tokens']

         payload.update({
-            "prompt": token_ids,
             "n_predict": max_new_tokens,
             "stream": True,
             "cache_prompt": True
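To make the branches above easier to follow, here is a sketch of the inputs they expect and of the prompt object that ends up in the payload posted to the bundled llama.cpp server. Field names are taken from the code in this diff; the literal values are placeholders.

```python
# Shapes consumed by _process_images_for_generation (values are placeholders).
state_from_web_ui = {
    "image_attachments": [
        {"type": "image", "name": "photo.png", "image_data": "<base64-encoded PNG>"},
    ],
}

state_from_chat_api = {
    "history": {"messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
        ]},
    ]},
}

# What the multimodal branch of generate_with_streaming() puts into payload["prompt"]:
multimodal_prompt = {
    "prompt_string": "<formatted chat prompt>",
    "multimodal_data": ["<base64 image 1>", "<base64 image 2>"],
}
```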
@@ -144,7 +192,7 @@ class LlamaServer:

         if shared.args.verbose:
             logger.info("GENERATE_PARAMS=")
-            printable_payload = {k: v for k, v in payload.items() if k != "prompt"}
+            printable_payload = {k: (v if k != "prompt" else "[multimodal object]" if pil_images else v) for k, v in payload.items()}
             pprint.PrettyPrinter(indent=4, sort_dicts=False).pprint(printable_payload)
             print()

@@ -295,6 +343,13 @@ class LlamaServer:
             cmd += ["--rope-freq-scale", str(1.0 / shared.args.compress_pos_emb)]
         if shared.args.rope_freq_base > 0:
             cmd += ["--rope-freq-base", str(shared.args.rope_freq_base)]
+        if shared.args.mmproj not in [None, 'None']:
+            path = Path(shared.args.mmproj)
+            if not path.exists():
+                path = Path('user_data/mmproj') / shared.args.mmproj
+
+            if path.exists():
+                cmd += ["--mmproj", str(path)]
         if shared.args.model_draft not in [None, 'None']:
             path = Path(shared.args.model_draft)
             if not path.exists():

@@ -316,6 +371,7 @@ class LlamaServer:
             cmd += ["--ctx-size-draft", str(shared.args.ctx_size_draft)]
         if shared.args.streaming_llm:
             cmd += ["--cache-reuse", "1"]
+            cmd += ["--swa-full"]
         if shared.args.extra_flags:
             # Clean up the input
             extra_flags = shared.args.extra_flags.strip()
@@ -28,6 +28,8 @@ loaders_and_params = OrderedDict({
         'device_draft',
         'ctx_size_draft',
         'speculative_decoding_accordion',
+        'mmproj',
+        'mmproj_accordion',
         'vram_info',
     ],
     'Transformers': [

@@ -54,6 +56,19 @@ loaders_and_params = OrderedDict({
         'cfg_cache',
         'trust_remote_code',
         'no_use_fast',
+        'enable_tp',
+        'tp_backend',
     ],
+    'ExLlamav3': [
+        'ctx_size',
+        'cache_type',
+        'gpu_split',
+        'model_draft',
+        'draft_max',
+        'ctx_size_draft',
+        'speculative_decoding_accordion',
+        'enable_tp',
+        'tp_backend',
+    ],
     'ExLlamav2_HF': [
         'ctx_size',

@@ -251,6 +266,24 @@ loaders_samplers = {
         'grammar_string',
         'grammar_file_row',
     },
+    'ExLlamav3': {
+        'temperature',
+        'min_p',
+        'top_p',
+        'top_k',
+        'repetition_penalty',
+        'frequency_penalty',
+        'presence_penalty',
+        'repetition_penalty_range',
+        'temperature_last',
+        'sampler_priority',
+        'auto_max_new_tokens',
+        'ban_eos_token',
+        'add_bos_token',
+        'enable_thinking',
+        'seed',
+        'skip_special_tokens',
+    },
     'ExLlamav2': {
         'temperature',
         'dynatemp_low',
@@ -19,6 +19,7 @@ def load_model(model_name, loader=None):
         'llama.cpp': llama_cpp_server_loader,
         'Transformers': transformers_loader,
         'ExLlamav3_HF': ExLlamav3_HF_loader,
+        'ExLlamav3': ExLlamav3_loader,
         'ExLlamav2_HF': ExLlamav2_HF_loader,
         'ExLlamav2': ExLlamav2_loader,
         'TensorRT-LLM': TensorRT_LLM_loader,

@@ -55,6 +56,10 @@ def load_model(model_name, loader=None):
     if loader.lower().startswith('exllama') or loader.lower().startswith('tensorrt') or loader == 'llama.cpp' or loader == 'MLX':
         shared.settings['truncation_length'] = shared.args.ctx_size

+    shared.is_multimodal = False
+    if loader.lower() in ('exllamav3', 'llama.cpp'):
+        shared.is_multimodal = model.is_multimodal()
+
     logger.info(f"Loaded \"{model_name}\" in {(time.time()-t0):.2f} seconds.")
     logger.info(f"LOADER: \"{loader}\"")
     logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")

@@ -89,6 +94,14 @@ def ExLlamav3_HF_loader(model_name):
     return Exllamav3HF.from_pretrained(model_name)


+def ExLlamav3_loader(model_name):
+    from modules.exllamav3 import Exllamav3Model
+
+    model = Exllamav3Model.from_pretrained(model_name)
+    tokenizer = model.tokenizer
+    return model, tokenizer
+
+
 def ExLlamav2_HF_loader(model_name):
     from modules.exllamav2_hf import Exllamav2HF

@@ -129,8 +142,12 @@ def unload_model(keep_model_name=False):
     if shared.model is None:
         return

-    is_llamacpp = (shared.model.__class__.__name__ == 'LlamaServer')
-    if shared.model.__class__.__name__ == 'Exllamav3HF':
+    model_class_name = shared.model.__class__.__name__
+    is_llamacpp = (model_class_name == 'LlamaServer')
+
+    if model_class_name in ['Exllamav3Model', 'Exllamav3HF']:
+        shared.model.unload()
+    elif model_class_name in ['Exllamav2Model', 'Exllamav2HF'] and hasattr(shared.model, 'unload'):
         shared.model.unload()
     elif shared.model.__class__.__name__ == 'MLXModel':
         shared.model.unload()
@@ -15,7 +15,7 @@ from modules.logging_colors import logger
 def get_fallback_settings():
     return {
         'bf16': False,
-        'ctx_size': 2048,
+        'ctx_size': 8192,
         'rope_freq_base': 0,
         'compress_pos_emb': 1,
         'alpha_value': 1,

@@ -106,9 +106,16 @@ def get_model_metadata(model):

     for k in ['max_position_embeddings', 'model_max_length', 'max_seq_len']:
         if k in metadata:
-            model_settings['truncation_length'] = metadata[k]
-            model_settings['truncation_length_info'] = metadata[k]
-            model_settings['ctx_size'] = min(metadata[k], 8192)
+            value = metadata[k]
+        elif k in metadata.get('text_config', {}):
+            value = metadata['text_config'][k]
+        else:
+            continue
+
+        model_settings['truncation_length'] = value
+        model_settings['truncation_length_info'] = value
+        model_settings['ctx_size'] = min(value, 8192)
+        break

     if 'rope_theta' in metadata:
         model_settings['rope_freq_base'] = metadata['rope_theta']
@@ -132,16 +139,26 @@ def get_model_metadata(model):
             with open(jinja_path, 'r', encoding='utf-8') as f:
                 template = f.read()

+        # 2. If no .jinja file, try chat_template.json
+        if template is None:
+            json_template_path = Path(f'{shared.args.model_dir}/{model}') / 'chat_template.json'
+            if json_template_path.exists():
+                with open(json_template_path, 'r', encoding='utf-8') as f:
+                    json_data = json.load(f)
+                    if 'chat_template' in json_data:
+                        template = json_data['chat_template']
+
+        # 3. Fall back to tokenizer_config.json metadata
         if path.exists():
             metadata = json.loads(open(path, 'r', encoding='utf-8').read())

-            # 2. Only read from metadata if we haven't already loaded from .jinja
+            # Only read from metadata if we haven't already loaded from .jinja or .json
             if template is None and 'chat_template' in metadata:
                 template = metadata['chat_template']
                 if isinstance(template, list):
                     template = template[0]['template']

-        # 3. If a template was found from either source, process it
+        # 4. If a template was found from any source, process it
         if template:
             for k in ['eos_token', 'bos_token']:
                 if k in metadata:
@@ -184,34 +201,31 @@ def get_model_metadata(model):


 def infer_loader(model_name, model_settings, hf_quant_method=None):
-    import platform
-
-    # Check for MLX models first (before path checks)
-    if (model_name.startswith('mlx-community/') or model_name.startswith('mlx-community_')) and platform.system() == "Darwin" and platform.machine() == "arm64":
-        loader = 'MLX'
-    else:
-        # Original logic for other loaders
-        path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
-        if not path_to_model.exists():
-            loader = None
-        elif shared.args.portable:
-            loader = 'llama.cpp'
-        elif len(list(path_to_model.glob('*.gguf'))) > 0:
-            loader = 'llama.cpp'
-        elif re.match(r'.*\.gguf', model_name.lower()):
-            loader = 'llama.cpp'
-        elif re.match(r'.*\.mlx', model_name.lower()) and platform.system() == "Darwin" and platform.machine() == "arm64":
-            loader = 'MLX'
-        elif hf_quant_method == 'exl3':
-            loader = 'ExLlamav3_HF'
-        elif hf_quant_method in ['exl2', 'gptq']:
-            loader = 'ExLlamav2_HF'
-        elif re.match(r'.*exl3', model_name.lower()):
-            loader = 'ExLlamav3_HF'
-        elif re.match(r'.*exl2', model_name.lower()):
-            loader = 'ExLlamav2_HF'
-        else:
-            loader = 'Transformers'
+    path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
+    if not path_to_model.exists():
+        loader = None
+    elif shared.args.portable:
+        loader = 'llama.cpp'
+    elif len(list(path_to_model.glob('*.gguf'))) > 0:
+        loader = 'llama.cpp'
+    elif re.match(r'.*\.gguf', model_name.lower()):
+        loader = 'llama.cpp'
+    elif hf_quant_method == 'mlx':
+        loader = 'MLX'
+    elif re.match(r'.*\.mlx', model_name.lower()):
+        loader = 'MLX'
+    elif model_name.lower().startswith('mlx-community'):
+        loader = 'MLX'
+    elif hf_quant_method == 'exl3':
+        loader = 'ExLlamav3'
+    elif hf_quant_method in ['exl2', 'gptq']:
+        loader = 'ExLlamav2_HF'
+    elif re.match(r'.*exl3', model_name.lower()):
+        loader = 'ExLlamav3'
+    elif re.match(r'.*exl2', model_name.lower()):
+        loader = 'ExLlamav2_HF'
+    else:
+        loader = 'Transformers'

     return loader
@@ -243,7 +257,7 @@ def apply_model_settings_to_state(model, state):
     model_settings = get_model_metadata(model)
     if 'loader' in model_settings:
         loader = model_settings.pop('loader')
-        if not (loader == 'ExLlamav2_HF' and state['loader'] in ['ExLlamav2']):
+        if not ((loader == 'ExLlamav2_HF' and state['loader'] == 'ExLlamav2') or (loader == 'ExLlamav3_HF' and state['loader'] == 'ExLlamav3')):
             state['loader'] = loader

     for k in model_settings:

@@ -16,6 +16,7 @@ model = None
 tokenizer = None
 model_name = 'None'
 is_seq2seq = False
+is_multimodal = False
 model_dirty_from_training = False
 lora_names = []
@@ -85,6 +86,7 @@ group.add_argument('--no-kv-offload', action='store_true', help='Do not offload
 group.add_argument('--row-split', action='store_true', help='Split the model by rows across GPUs. This may improve multi-gpu performance.')
 group.add_argument('--extra-flags', type=str, default=None, help='Extra flags to pass to llama-server. Format: "flag1=value1,flag2,flag3=value3". Example: "override-tensor=exps=CPU"')
 group.add_argument('--streaming-llm', action='store_true', help='Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.')
+group.add_argument('--mmproj', type=str, default=None, help='Path to the mmproj file for vision models.')

 # Cache
 group = parser.add_argument_group('Context and cache')

@@ -99,6 +101,11 @@ group.add_argument('--gpu-layers-draft', type=int, default=256, help='Number of
 group.add_argument('--device-draft', type=str, default=None, help='Comma-separated list of devices to use for offloading the draft model. Example: CUDA0,CUDA1')
 group.add_argument('--ctx-size-draft', type=int, default=0, help='Size of the prompt context for the draft model. If 0, uses the same as the main model.')

+# ExLlamaV3
+group = parser.add_argument_group('ExLlamaV3')
+group.add_argument('--enable-tp', '--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) to split the model across GPUs.')
+group.add_argument('--tp-backend', type=str, default='native', help='The backend for tensor parallelism. Valid options: native, nccl. Default: native.')
+
 # ExLlamaV2
 group = parser.add_argument_group('ExLlamaV2')
 group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.')

@@ -108,7 +115,6 @@ group.add_argument('--no_flash_attn', action='store_true', help='Force flash-att
 group.add_argument('--no_xformers', action='store_true', help='Force xformers to not be used.')
 group.add_argument('--no_sdpa', action='store_true', help='Force Torch SDPA to not be used.')
 group.add_argument('--num_experts_per_token', type=int, default=2, metavar='N', help='Number of experts to use for generation. Applies to MoE models like Mixtral.')
-group.add_argument('--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) in ExLlamaV2.')

 # TensorRT-LLM
 group = parser.add_argument_group('TensorRT-LLM')

@@ -318,6 +324,8 @@ def fix_loader_name(name):
         return 'ExLlamav2_HF'
     elif name in ['exllamav3-hf', 'exllamav3_hf', 'exllama-v3-hf', 'exllama_v3_hf', 'exllama-v3_hf', 'exllama3-hf', 'exllama3_hf', 'exllama-3-hf', 'exllama_3_hf', 'exllama-3_hf']:
         return 'ExLlamav3_HF'
+    elif name in ['exllamav3']:
+        return 'ExLlamav3'
     elif name in ['tensorrt', 'tensorrtllm', 'tensorrt_llm', 'tensorrt-llm', 'tensort', 'tensortllm']:
         return 'TensorRT-LLM'
@@ -40,7 +40,7 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False, escap
         yield ''
         return

-    if shared.model.__class__.__name__ in ['LlamaServer', 'Exllamav2Model', 'TensorRTLLMModel', 'MLXModel']:
+    if shared.model.__class__.__name__ in ['LlamaServer', 'Exllamav2Model', 'Exllamav3Model', 'TensorRTLLMModel', 'MLXModel']:
         generate_func = generate_reply_custom
     else:
         generate_func = generate_reply_HF

@@ -128,9 +128,9 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt

     from modules.torch_utils import get_device

-    if shared.model.__class__.__name__ in ['Exllamav2Model', 'TensorRTLLMModel']:
+    if shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav3Model', 'TensorRTLLMModel']:
         input_ids = shared.tokenizer.encode(str(prompt))
-        if shared.model.__class__.__name__ != 'Exllamav2Model':
+        if shared.model.__class__.__name__ not in ['Exllamav2Model', 'Exllamav3Model']:
             input_ids = np.array(input_ids).reshape(1, len(input_ids))
     else:
         input_ids = shared.tokenizer.encode(str(prompt), return_tensors='pt', add_special_tokens=add_special_tokens)

@@ -148,7 +148,7 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt
     if truncation_length is not None:
         input_ids = input_ids[:, -truncation_length:]

-    if shared.model.__class__.__name__ in ['Exllamav2Model', 'TensorRTLLMModel', 'MLXModel'] or shared.args.cpu:
+    if shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav3Model', 'TensorRTLLMModel', 'MLXModel'] or shared.args.cpu:
         return input_ids
     else:
         device = get_device()
@@ -155,6 +155,7 @@ def list_model_elements():
         'bf16',
         'autosplit',
         'enable_tp',
+        'tp_backend',
         'no_flash_attn',
         'no_xformers',
         'no_sdpa',

@@ -167,6 +168,7 @@ def list_model_elements():
         'gpu_layers_draft',
         'device_draft',
         'ctx_size_draft',
+        'mmproj',
     ]

     return elements
@@ -54,7 +54,7 @@ def create_ui():
                 gr.HTML(value='<div class="hover-element" onclick="void(0)"><span style="width: 100px; display: block" id="hover-element-button">☰</span><div class="hover-menu" id="hover-menu"></div>', elem_id='gr-hover')

             with gr.Column(scale=10, elem_id='chat-input-container'):
-                shared.gradio['textbox'] = gr.MultimodalTextbox(label='', placeholder='Send a message', file_types=['text', '.pdf'], file_count="multiple", elem_id='chat-input', elem_classes=['add_scrollbar'])
+                shared.gradio['textbox'] = gr.MultimodalTextbox(label='', placeholder='Send a message', file_types=['text', '.pdf', 'image'], file_count="multiple", elem_id='chat-input', elem_classes=['add_scrollbar'])
                 shared.gradio['typing-dots'] = gr.HTML(value='<div class="typing"><span></span><span class="dot1"></span><span class="dot2"></span></div>', label='typing', elem_id='typing-container')

             with gr.Column(scale=1, elem_id='generate-stop-container'):

@@ -78,12 +78,19 @@ def create_ui():
         with gr.Row():
             shared.gradio['start_with'] = gr.Textbox(label='Start reply with', placeholder='Sure thing!', value=shared.settings['start_with'], elem_classes=['add_scrollbar'])

+        gr.HTML("<div style='margin: 0; border-bottom: 1px solid rgba(255,255,255,0.1);'></div>")
+
         shared.gradio['reasoning_effort'] = gr.Dropdown(value=shared.settings['reasoning_effort'], choices=['low', 'medium', 'high'], label='Reasoning effort', info='Used by GPT-OSS.')
         shared.gradio['enable_thinking'] = gr.Checkbox(value=shared.settings['enable_thinking'], label='Enable thinking', info='Used by pre-2507 Qwen3.')

+        gr.HTML("<div style='margin: 0; border-bottom: 1px solid rgba(255,255,255,0.1);'></div>")
+
         shared.gradio['enable_web_search'] = gr.Checkbox(value=shared.settings.get('enable_web_search', False), label='Activate web search', elem_id='web-search')
         with gr.Row(visible=shared.settings.get('enable_web_search', False)) as shared.gradio['web_search_row']:
             shared.gradio['web_search_pages'] = gr.Number(value=shared.settings.get('web_search_pages', 3), precision=0, label='Number of pages to download', minimum=1, maximum=10)

+        gr.HTML("<div style='margin: 0; border-bottom: 1px solid rgba(255,255,255,0.1);'></div>")
+
         with gr.Row():
             shared.gradio['mode'] = gr.Radio(choices=['instruct', 'chat-instruct', 'chat'], value=None, label='Mode', info='Defines how the chat prompt is generated. In instruct and chat-instruct modes, the instruction template Parameters > Instruction template is used.', elem_id='chat-mode')

@@ -93,6 +100,8 @@ def create_ui():
         with gr.Row():
             shared.gradio['chat-instruct_command'] = gr.Textbox(value=shared.settings['chat-instruct_command'], lines=12, label='Command for chat-instruct mode', info='<|character|> and <|prompt|> get replaced with the bot name and the regular chat prompt respectively.', visible=shared.settings['mode'] == 'chat-instruct', elem_classes=['add_scrollbar'])

+        gr.HTML("<div style='margin: 0; border-bottom: 1px solid rgba(255,255,255,0.1);'></div>")
+
         with gr.Row():
             shared.gradio['count_tokens'] = gr.Button('Count tokens', size='sm')
@@ -42,10 +42,12 @@ def create_ui():
            with gr.Row():
                with gr.Column():
                    shared.gradio['gpu_layers'] = gr.Slider(label="gpu-layers", minimum=0, maximum=get_initial_gpu_layers_max(), step=1, value=shared.args.gpu_layers, info='Must be greater than 0 for the GPU to be used. ⚠️ Lower this value if you can\'t load the model.')
-                   shared.gradio['ctx_size'] = gr.Slider(label='ctx-size', minimum=256, maximum=131072, step=256, value=shared.args.ctx_size, info='Context length. Common values: 4096, 8192, 16384, 32768, 65536, 131072. ⚠️ Lower this value if you can\'t load the model.')
+                   shared.gradio['ctx_size'] = gr.Slider(label='ctx-size', minimum=256, maximum=131072, step=256, value=shared.args.ctx_size, info='Context length. Common values: 4096, 8192, 16384, 32768, 65536, 131072.')
                    shared.gradio['gpu_split'] = gr.Textbox(label='gpu-split', info='Comma-separated list of VRAM (in GB) to use per GPU. Example: 20,7,7')
+                   shared.gradio['attn_implementation'] = gr.Dropdown(label="attn-implementation", choices=['sdpa', 'eager', 'flash_attention_2'], value=shared.args.attn_implementation, info='Attention implementation.')
                    shared.gradio['cache_type'] = gr.Dropdown(label="cache-type", choices=['fp16', 'q8_0', 'q4_0', 'fp8', 'q8', 'q7', 'q6', 'q5', 'q4', 'q3', 'q2'], value=shared.args.cache_type, allow_custom_value=True, info='Valid options: llama.cpp - fp16, q8_0, q4_0; ExLlamaV2 - fp16, fp8, q8, q6, q4; ExLlamaV3 - fp16, q2 to q8. For ExLlamaV3, you can type custom combinations for separate k/v bits (e.g. q4_q8).')
+                   shared.gradio['tp_backend'] = gr.Dropdown(label="tp-backend", choices=['native', 'nccl'], value=shared.args.tp_backend, info='The backend for tensor parallelism.')

                with gr.Column():
                    shared.gradio['vram_info'] = gr.HTML(value=get_initial_vram_info())
                    shared.gradio['flash_attn'] = gr.Checkbox(label="flash-attn", value=shared.args.flash_attn, info='Use flash-attention.')

@@ -54,11 +56,17 @@ def create_ui():
                    shared.gradio['load_in_4bit'] = gr.Checkbox(label="load-in-4bit", value=shared.args.load_in_4bit)
                    shared.gradio['use_double_quant'] = gr.Checkbox(label="use_double_quant", value=shared.args.use_double_quant, info='Used by load-in-4bit.')
                    shared.gradio['autosplit'] = gr.Checkbox(label="autosplit", value=shared.args.autosplit, info='Automatically split the model tensors across the available GPUs.')
-                   shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable Tensor Parallelism (TP).')
+                   shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable tensor parallelism (TP).')
                    shared.gradio['cpp_runner'] = gr.Checkbox(label="cpp-runner", value=shared.args.cpp_runner, info='Enable inference with ModelRunnerCpp, which is faster than the default ModelRunner.')
                    shared.gradio['trust_remote_code'] = gr.Checkbox(label="trust-remote-code", value=shared.args.trust_remote_code, info='Set trust_remote_code=True while loading the tokenizer/model. To enable this option, start the web UI with the --trust-remote-code flag.', interactive=shared.args.trust_remote_code)
                    shared.gradio['tensorrt_llm_info'] = gr.Markdown('* TensorRT-LLM has to be installed manually in a separate Python 3.10 environment at the moment. For a guide, consult the description of [this PR](https://github.com/oobabooga/text-generation-webui/pull/5715). \n\n* `ctx_size` is only used when `cpp-runner` is checked.\n\n* `cpp_runner` does not support streaming at the moment.')

+           # Multimodal
+           with gr.Accordion("Multimodal (vision)", open=False, elem_classes='tgw-accordion') as shared.gradio['mmproj_accordion']:
+               with gr.Row():
+                   shared.gradio['mmproj'] = gr.Dropdown(label="mmproj file", choices=utils.get_available_mmproj(), value=lambda: shared.args.mmproj or 'None', elem_classes='slim-dropdown', info='Select a file that matches your model. Must be placed in user_data/mmproj/', interactive=not mu)
+                   ui.create_refresh_button(shared.gradio['mmproj'], lambda: None, lambda: {'choices': utils.get_available_mmproj()}, 'refresh-button', interactive=not mu)
+
            # Speculative decoding
            with gr.Accordion("Speculative decoding", open=False, elem_classes='tgw-accordion') as shared.gradio['speculative_decoding_accordion']:
                with gr.Row():
@@ -154,6 +154,19 @@ def get_available_ggufs():
     return sorted(model_list, key=natural_keys)


+def get_available_mmproj():
+    mmproj_dir = Path('user_data/mmproj')
+    if not mmproj_dir.exists():
+        return ['None']
+
+    mmproj_files = []
+    for item in mmproj_dir.iterdir():
+        if item.is_file() and item.suffix.lower() in ('.gguf', '.bin'):
+            mmproj_files.append(item.name)
+
+    return ['None'] + sorted(mmproj_files, key=natural_keys)
+
+
 def get_available_presets():
     return sorted(set((k.stem for k in Path('user_data/presets').glob('*.yaml'))), key=natural_keys)
@@ -1,6 +1,8 @@
 import concurrent.futures
 import html
+import random
 import re
+import urllib.request
 from concurrent.futures import as_completed
 from datetime import datetime
 from urllib.parse import quote_plus

@@ -50,16 +52,21 @@ def download_web_page(url, timeout=10):
 def perform_web_search(query, num_pages=3, max_workers=5, timeout=10):
     """Perform web search and return results with content"""
     try:
         # Use DuckDuckGo HTML search endpoint
         search_url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
-        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
-
-        response = requests.get(search_url, headers=headers, timeout=timeout)
-        response.raise_for_status()
+        agents = [
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
+        ]
+
+        response_text = ""
+        req = urllib.request.Request(search_url, headers={'User-Agent': random.choice(agents)})
+        with urllib.request.urlopen(req, timeout=timeout) as response:
+            response_text = response.read().decode('utf-8')

         # Extract results with regex
-        titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
-        urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response.text, re.DOTALL)
+        titles = re.findall(r'<a[^>]*class="[^"]*result__a[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)
+        urls = re.findall(r'<a[^>]*class="[^"]*result__url[^"]*"[^>]*>(.*?)</a>', response_text, re.DOTALL)

         # Prepare download tasks
         download_tasks = []
one_click.py (44 changed lines)
@@ -16,7 +16,7 @@ import sys
 # os.environ["HCC_AMDGPU_TARGET"] = 'gfx1030'

 # Define the required versions
-TORCH_VERSION = "2.6.0"
+TORCH_VERSION = "2.7.1"
 PYTHON_VERSION = "3.11"
 LIBSTDCXX_VERSION_LINUX = "12.1.0"

@@ -113,17 +113,16 @@ def get_gpu_choice():
     choice = get_user_choice(
         "What is your GPU?",
         {
-            'A': 'NVIDIA - CUDA 12.4',
+            'A': 'NVIDIA',
             'B': 'AMD - Linux/macOS only, requires ROCm 6.2.4',
             'C': 'Apple M Series',
             'D': 'Intel Arc (beta)',
-            'E': 'NVIDIA - CUDA 12.8',
             'N': 'CPU mode'
         },
     )

     # Convert choice to GPU name
-    gpu_choice = {"A": "NVIDIA", "B": "AMD", "C": "APPLE", "D": "INTEL", "E": "NVIDIA_CUDA128", "N": "NONE"}[choice]
+    gpu_choice = {"A": "NVIDIA_CUDA128", "B": "AMD", "C": "APPLE", "D": "INTEL", "N": "NONE"}[choice]

     # Save choice to state
     state['gpu_choice'] = gpu_choice
@@ -136,10 +135,8 @@ def get_pytorch_install_command(gpu_choice):
     """Get PyTorch installation command based on GPU choice"""
     base_cmd = f"python -m pip install torch=={TORCH_VERSION} "

-    if gpu_choice == "NVIDIA":
-        return base_cmd + "--index-url https://download.pytorch.org/whl/cu124"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        return "python -m pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
+    if gpu_choice == "NVIDIA_CUDA128":
+        return base_cmd + "--index-url https://download.pytorch.org/whl/cu128"
     elif gpu_choice == "AMD":
         return base_cmd + "--index-url https://download.pytorch.org/whl/rocm6.2.4"
     elif gpu_choice in ["APPLE", "NONE"]:

@@ -157,10 +154,8 @@ def get_pytorch_update_command(gpu_choice):
     """Get PyTorch update command based on GPU choice"""
     base_cmd = f"python -m pip install --upgrade torch=={TORCH_VERSION} "

-    if gpu_choice == "NVIDIA":
-        return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu124"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        return "python -m pip install --upgrade torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128"
+    if gpu_choice == "NVIDIA_CUDA128":
+        return f"{base_cmd} --index-url https://download.pytorch.org/whl/cu128"
     elif gpu_choice == "AMD":
         return f"{base_cmd} --index-url https://download.pytorch.org/whl/rocm6.2.4"
     elif gpu_choice in ["APPLE", "NONE"]:

@@ -176,16 +171,14 @@ def get_requirements_file(gpu_choice):
     """Get requirements file path based on GPU choice"""
     requirements_base = os.path.join("requirements", "full")

-    if gpu_choice == "AMD":
+    if gpu_choice == "NVIDIA_CUDA128":
+        file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
+    elif gpu_choice == "AMD":
         file_name = f"requirements_amd{'_noavx2' if not cpu_has_avx2() else ''}.txt"
     elif gpu_choice == "APPLE":
         file_name = f"requirements_apple_{'intel' if is_x86_64() else 'silicon'}.txt"
     elif gpu_choice in ["INTEL", "NONE"]:
         file_name = f"requirements_cpu_only{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-    elif gpu_choice == "NVIDIA":
-        file_name = f"requirements{'_noavx2' if not cpu_has_avx2() else ''}.txt"
-    elif gpu_choice == "NVIDIA_CUDA128":
-        file_name = f"requirements_cuda128{'_noavx2' if not cpu_has_avx2() else ''}.txt"
     else:
         raise ValueError(f"Unknown GPU choice: {gpu_choice}")
@@ -331,8 +324,6 @@ def install_webui():
            cmd_flags_file.write("\n--cpu\n")

    # Handle CUDA version display
-   elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA":
-       print("CUDA: 12.4")
    elif any((is_windows(), is_linux())) and gpu_choice == "NVIDIA_CUDA128":
        print("CUDA: 12.8")

@@ -368,6 +359,19 @@ def update_requirements(initial_installation=False, pull=True):
         assert_success=True
     )

+    # Check for outdated CUDA 12.4 installs and refuse to update
+    state = load_state()
+    if state.get('gpu_choice') == 'NVIDIA':
+        print_big_message(
+            "Your current installation uses CUDA 12.4, which has been removed.\n"
+            "To update to the new default (CUDA 12.8), a clean installation is required.\n\n"
+            "INSTRUCTIONS:\n"
+            "1. Delete the 'installer_files' folder in your text-generation-webui directory.\n"
+            "2. Run the start script again (e.g., start_windows.bat).\n\n"
+            "This will create a fresh environment with the latest software."
+        )
+        sys.exit(0)
+
     current_commit = get_current_commit()
     wheels_changed = not os.path.exists(state_file)
     if not wheels_changed:
|
|||
with open(requirements_file, 'r') as f:
|
||||
after_pull_whl_lines = [line for line in f if '.whl' in line]
|
||||
|
||||
wheels_changed = wheels_changed or (before_pull_whl_lines != after_pull_whl_lines)
|
||||
wheels_changed = wheels_changed or (before_pull_whl_lines != after_pull_whl_lines)
|
||||
|
||||
# Check for changes to installer files
|
||||
for file in files_to_check:
|
||||
|
|
|
|||
|
|
@ -24,7 +24,7 @@ scipy
|
|||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.55.*
|
||||
triton-windows==3.2.0.post19; platform_system == "Windows"
|
||||
triton-windows==3.3.1.post19; platform_system == "Windows"
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
|
|
@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken

 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken

 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"

@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken

 # AMD wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+rocm6.2.4.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
@@ -33,7 +33,7 @@ sse-starlette==1.6.5
 tiktoken

 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl
@@ -34,8 +34,8 @@ sse-starlette==1.6.5
 tiktoken

 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5-py3-none-any.whl
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
+https://github.com/oobabooga/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6-py3-none-any.whl
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl
@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken

 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"

@@ -33,5 +33,5 @@ sse-starlette==1.6.5
 tiktoken

 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
@@ -1,45 +0,0 @@ (entire file removed)
accelerate==1.8.*
bitsandbytes==0.46.*
colorama
datasets
einops
fastapi==0.112.4
gradio==4.37.*
html2text==2025.4.15
jinja2==3.1.6
markdown
numpy==2.2.*
pandas
peft==0.16.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
PyPDF2==3.0.1
python-docx==1.1.2
pyyaml
requests
rich
safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.55.*
triton-windows==3.3.1.post19; platform_system == "Windows"
tqdm
wandb

# API
flask_cloudflared==0.0.14
sse-starlette==1.6.5
tiktoken

# CUDA wheels
https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
@@ -1,45 +0,0 @@ (entire file removed)
accelerate==1.8.*
bitsandbytes==0.46.*
colorama
datasets
einops
fastapi==0.112.4
gradio==4.37.*
html2text==2025.4.15
jinja2==3.1.6
markdown
numpy==2.2.*
pandas
peft==0.16.*
Pillow>=9.5.0
psutil
pydantic==2.8.2
PyPDF2==3.0.1
python-docx==1.1.2
pyyaml
requests
rich
safetensors==0.5.*
scipy
sentencepiece
tensorboard
transformers==4.55.*
triton-windows==3.3.1.post19; platform_system == "Windows"
tqdm
wandb

# API
flask_cloudflared==0.0.14
sse-starlette==1.6.5
tiktoken

# CUDA wheels
https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp-org/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
@ -24,7 +24,7 @@ scipy
|
|||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.55.*
|
||||
triton-windows==3.2.0.post19; platform_system == "Windows"
|
||||
triton-windows==3.3.1.post19; platform_system == "Windows"
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
|
|
@@ -34,12 +34,12 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/oobabooga/exllamav3/releases/download/v0.0.5/exllamav3-0.0.5+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu124.torch2.6.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2+cu128.torch2.7.0-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp-org/exllamav2/releases/download/v0.3.2/exllamav2-0.3.2-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
-https://github.com/kingbri1/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu124torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
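The hunk above moves the CUDA stack from cu124 / torch 2.6.0 wheels to cu128 / torch 2.7.0 ones; the target CUDA build, torch build, Python ABI, and platform are all encoded in the wheel filename. A hedged sketch of how to read those fields with `packaging` (assumed available, not pinned in these files; the filename is copied verbatim from the diff):

```python
# Break a pinned wheel filename into its distribution name, version
# (including the local "+cu128.torch2.7.0" tag), and compatibility tags.
from packaging.utils import parse_wheel_filename

name, version, build, tags = parse_wheel_filename(
    "exllamav3-0.0.6+cu128.torch2.7.0-cp311-cp311-win_amd64.whl"
)

print(name)      # exllamav3
print(version)   # 0.0.6+cu128.torch2.7.0  (local tag encodes CUDA 12.8 / torch 2.7.0)
print(sorted(str(t) for t in tags))  # ['cp311-cp311-win_amd64']
```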
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
@@ -19,6 +19,6 @@ sse-starlette==1.6.5
 tiktoken
 
 # Mac wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_15_0_arm64.whl; platform_system == "Darwin" and platform_release >= "24.0.0" and platform_release < "25.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0-py3-none-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0"
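Both Mac hunks select a wheel by `platform_release`, i.e. the Darwin kernel version, not the macOS product version. A hedged sketch for checking which bracket the current machine falls into; the Darwin-to-macOS mapping in the comments is an assumption about Apple's versioning scheme, not something stated in the diff:

```python
# Map the local Darwin kernel release to the macosx_* wheel bracket used by
# the markers above. Assumed mapping: Darwin 22.x ~ macOS 13, 23.x ~ macOS 14,
# 24.x ~ macOS 15. Meant to be run on macOS; elsewhere it simply finds no match.
import platform

release = platform.release()              # e.g. "24.1.0" on macOS 15
major = int(release.split(".")[0])

bracket = {
    22: "macosx_13_0 wheel",
    23: "macosx_14_0 wheel",
    24: "macosx_15_0 wheel",
}
print(f"Darwin {release} -> {bracket.get(major, 'no prebuilt wheel matches')}")
```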
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx2-py3-none-win_amd64.whl; platform_system == "Windows"
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # llama.cpp (CPU only, no AVX2)
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cpuavx-py3-none-win_amd64.whl; platform_system == "Windows"
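The two CPU-only variants above differ only in whether the llama.cpp binary was compiled with AVX2. A rough, Linux-only sketch for checking which build a given machine can use (a simplified heuristic for illustration, not taken from the project's installer):

```python
# Decide between the "+cpuavx2" and "+cpuavx" builds by checking whether the
# CPU advertises AVX2. Reads /proc/cpuinfo, so this only works on Linux;
# other platforms would need a different probe.
def has_avx2() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return False  # not Linux, or /proc unavailable
    # The "flags" lines list instruction-set extensions; "avx2" appears there
    # on CPUs that support it.
    return any(
        "avx2" in line.split()
        for line in cpuinfo.splitlines()
        if line.startswith("flags")
    )

print("pick the +cpuavx2 build" if has_avx2() else "fall back to the +cpuavx build")
```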
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+cu124avx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkan-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
@@ -18,5 +18,5 @@ sse-starlette==1.6.5
 tiktoken
 
 # CUDA wheels
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
-https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.33.0/llama_cpp_binaries-0.33.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-win_amd64.whl; platform_system == "Windows"
+https://github.com/oobabooga/llama-cpp-binaries/releases/download/v0.37.0/llama_cpp_binaries-0.37.0+vulkanavx-py3-none-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64"
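Every llama.cpp hunk in this commit is the same 0.33.0 to 0.37.0 bump; only the local version tag (+cu124avx, +vulkan, +cpuavx2, and so on) differs per backend. That tag survives installation in the distribution's version string, so it can be used to confirm which build ended up in an environment. A hedged sketch; the distribution name `llama_cpp_binaries` is inferred from the wheel filenames above, not confirmed elsewhere in the diff:

```python
# Report which llama.cpp binaries variant (if any) is installed, based on the
# local version tag preserved in the package metadata.
from importlib.metadata import PackageNotFoundError, version

try:
    v = version("llama_cpp_binaries")   # e.g. "0.37.0+vulkanavx"
    print(f"installed llama.cpp binaries build: {v}")
except PackageNotFoundError:
    print("llama_cpp_binaries is not installed in this environment")
```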
user_data/mmproj/place-your-mmproj-here.txt (new normal file, 0 lines)