Commit graph

413 commits

Author SHA1 Message Date
oobabooga
f5acf55207 Add --chat-template-file flag to override the default instruction template for API requests
Matches llama.cpp's flag name. Supports .jinja, .jinja2, and .yaml files.
Priority: per-request params > --chat-template-file > model's built-in template.
2026-03-06 14:04:16 -03:00
oobabooga
66fb79fe15 llama.cpp: Add --fit-target param 2026-03-06 01:55:48 -03:00
oobabooga
e81a47f708 Improve the API generation defaults --help message 2026-03-05 20:41:45 -08:00
oobabooga
27bcc45c18 API: Add command-line flags to override default generation parameters 2026-03-06 01:36:45 -03:00
oobabooga
e2548f69a9 Make user_data configurable: add --user-data-dir flag, auto-detect ../user_data
If --user-data-dir is not set, auto-detect: use ../user_data when
./user_data doesn't exist, making it easy to share user data across
portable builds by placing it one folder up.
2026-03-05 19:31:10 -08:00
oobabooga
f52d9336e5 TensorRT-LLM: Migrate from ModelRunner to LLM API, add concurrent API request support 2026-03-05 18:09:45 -08:00
oobabooga
9824c82cb6 API: Add parallel request support for llama.cpp and ExLlamaV3 2026-03-05 16:49:58 -08:00
oobabooga
2f08dce7b0 Remove ExLlamaV2 backend
- archived upstream: 7dc12af3a8
- replaced by ExLlamaV3, which has much better quantization accuracy
2026-03-05 14:02:13 -08:00
oobabooga
268cc3f100 Update TensorRT-LLM to v1.1.0 2026-03-05 09:32:28 -03:00
oobabooga
69fa4dd0b1 llama.cpp: allow ctx_size=0 for auto context via --fit 2026-03-04 19:33:20 -08:00
oobabooga
fbfcd59fe0 llama.cpp: Use -1 instead of 0 for auto gpu_layers 2026-03-04 19:21:45 -08:00
oobabooga
387cf9d8df Remove obsolete DeepSpeed inference code (2023 relic) 2026-03-04 17:20:34 -08:00
oobabooga
cdf0e392e6 llama.cpp: Reorganize speculative decoding UI and use recommended ngram-mod defaults 2026-03-04 12:05:08 -08:00
oobabooga
65de4c30c8 Add adaptive-p sampler and n-gram speculative decoding support 2026-03-04 09:41:29 -08:00
oobabooga
f4d787ab8d Delegate GPU layer allocation to llama.cpp's --fit 2026-03-04 06:37:50 -08:00
q5sys (JT)
7493fe7841
feat: Add a dropdown to save/load user personas (#7367) 2026-01-14 20:35:08 -03:00
oobabooga
e7c8b51fec Revert "Use flash_attention_2 by default for Transformers models"
This reverts commit 85f2df92e9.
2025-12-07 18:48:41 -08:00
oobabooga
85f2df92e9 Use flash_attention_2 by default for Transformers models 2025-12-07 06:56:58 -08:00
oobabooga
11937de517 Use flash attention for image generation by default 2025-12-05 12:13:24 -08:00
oobabooga
c11c14590a Image: Better LLM variation default prompt 2025-12-05 08:08:11 -08:00
oobabooga
8eac99599a Image: Better LLM variation default prompt 2025-12-04 19:58:06 -08:00
oobabooga
b4f06a50b0 fix: Pass bos_token and eos_token from metadata to jinja2
Fixes loading Seed-Instruct-36B
2025-12-04 19:11:31 -08:00
oobabooga
a90739f498 Image: Better LLM variation default prompt 2025-12-04 10:50:40 -08:00
oobabooga
ffef3c7b1d Image: Make the LLM Variations prompt configurable 2025-12-04 10:44:35 -08:00
oobabooga
2793153717 Image: Add LLM-generated prompt variations 2025-12-04 08:10:24 -08:00
oobabooga
c357eed4c7 Image: Remove the flash_attention_3 option (no idea how to get it working) 2025-12-03 18:40:34 -08:00
oobabooga
9448bf1caa Image generation: add torchao quantization (supports torch.compile) 2025-12-02 14:22:51 -08:00
oobabooga
6291e72129 Remove quanto for now (requires messy compilation) 2025-12-02 09:57:18 -08:00
oobabooga
b3666e140d
Add image generation support (#7328) 2025-12-02 14:55:38 -03:00
oobabooga
5327bc9397
Update modules/shared.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-28 22:48:05 -03:00
GodEmperor785
400bb0694b
Add slider for --ubatch-size for llama.cpp loader, change defaults for better MoE performance (#7316) 2025-11-21 16:56:02 -03:00
oobabooga
0d4eff284c Add a --cpu-moe model for llama.cpp 2025-11-19 05:23:43 -08:00
oobabooga
b5a6904c4a Make --trust-remote-code immutable from the UI/API 2025-10-14 20:47:01 -07:00
oobabooga
78ff21d512 Organize the --help message 2025-10-10 15:21:08 -07:00
oobabooga
13876a1ee8 llama.cpp: Remove the --flash-attn flag (it's always on now) 2025-08-30 20:28:26 -07:00
oobabooga
0b4518e61c "Text generation web UI" -> "Text Generation Web UI" 2025-08-27 05:53:09 -07:00
oobabooga
02ca96fa44 Multiple fixes 2025-08-25 22:17:22 -07:00
oobabooga
6c165d2e55 Fix the chat template 2025-08-25 18:28:43 -07:00
oobabooga
dbabe67e77 ExLlamaV3: Enable the --enable-tp option, add a --tp-backend option 2025-08-17 13:19:11 -07:00
altoiddealer
57f6e9af5a
Set multimodal status during Model Loading (#7199) 2025-08-13 16:47:27 -03:00
oobabooga
d86b0ec010
Add multimodal support (llama.cpp) (#7027) 2025-08-10 01:27:25 -03:00
Katehuuh
88127f46c1
Add multimodal support (ExLlamaV3) (#7174) 2025-08-08 23:31:16 -03:00
oobabooga
498778b8ac Add a new 'Reasoning effort' UI element 2025-08-05 15:19:11 -07:00
oobabooga
1d1b20bd77 Remove the --torch-compile option (it doesn't do anything currently) 2025-07-11 10:51:23 -07:00
oobabooga
273888f218 Revert "Use eager attention by default instead of sdpa"
This reverts commit bd4881c4dc.
2025-07-10 18:56:46 -07:00
oobabooga
e015355e4a Update README 2025-07-09 20:03:53 -07:00
oobabooga
bd4881c4dc Use eager attention by default instead of sdpa 2025-07-09 19:57:37 -07:00
oobabooga
6c2bdda0f0 Transformers loader: replace use_flash_attention_2/use_eager_attention with a unified attn_implementation
Closes #7107
2025-07-09 18:39:37 -07:00
oobabooga
faae4dc1b0
Autosave generated text in the Notebook tab (#7079) 2025-06-16 17:36:05 -03:00
oobabooga
de24b3bb31
Merge the Default and Notebook tabs into a single Notebook tab (#7078) 2025-06-16 13:19:29 -03:00