oobabooga
80d0c03bab
llama.cpp: Change the default --fit-target from 1024 to 512
2026-03-15 09:29:25 -07:00
oobabooga
f0c16813ef
Remove the rope scaling parameters
Now models have 131k+ context length. The parameters can still be
passed to llama.cpp through --extra-flags.
2026-03-14 19:43:25 -07:00
oobabooga
2d3a3794c9
Add a Top-P preset, make it the new default, clean up the built-in presets
2026-03-14 19:22:12 -07:00
oobabooga
4ae2bd86e2
Change the default ctx-size to 0 (auto) for llama.cpp
2026-03-14 15:30:01 -07:00
oobabooga
4b6c9db1c9
UI: Fix stale tool_sequence after edit and chat-instruct tool rendering
2026-03-12 13:12:18 -03:00
oobabooga
cf9ad8eafe
Initial tool-calling support in the UI
2026-03-12 01:16:19 -03:00
oobabooga
307c085d1b
Minor warning change
2026-03-09 21:44:53 -07:00
oobabooga
c604ca66de
Update the --multi-user warning
2026-03-09 21:36:04 -07:00
oobabooga
40f1837b42
README: Minor updates
2026-03-08 08:38:29 -07:00
oobabooga
f5acf55207
Add --chat-template-file flag to override the default instruction template for API requests
Matches llama.cpp's flag name. Supports .jinja, .jinja2, and .yaml files.
Priority: per-request params > --chat-template-file > model's built-in template.
2026-03-06 14:04:16 -03:00
oobabooga
66fb79fe15
llama.cpp: Add --fit-target param
2026-03-06 01:55:48 -03:00
oobabooga
e81a47f708
Improve the API generation defaults --help message
2026-03-05 20:41:45 -08:00
oobabooga
27bcc45c18
API: Add command-line flags to override default generation parameters
2026-03-06 01:36:45 -03:00
oobabooga
e2548f69a9
Make user_data configurable: add --user-data-dir flag, auto-detect ../user_data
If --user-data-dir is not set, auto-detect: use ../user_data when
./user_data doesn't exist, making it easy to share user data across
portable builds by placing it one folder up.
2026-03-05 19:31:10 -08:00
oobabooga
f52d9336e5
TensorRT-LLM: Migrate from ModelRunner to LLM API, add concurrent API request support
2026-03-05 18:09:45 -08:00
oobabooga
9824c82cb6
API: Add parallel request support for llama.cpp and ExLlamaV3
2026-03-05 16:49:58 -08:00
oobabooga
2f08dce7b0
Remove ExLlamaV2 backend
- archived upstream: 7dc12af3a8
- replaced by ExLlamaV3, which has much better quantization accuracy
2026-03-05 14:02:13 -08:00
oobabooga
268cc3f100
Update TensorRT-LLM to v1.1.0
2026-03-05 09:32:28 -03:00
oobabooga
69fa4dd0b1
llama.cpp: allow ctx_size=0 for auto context via --fit
2026-03-04 19:33:20 -08:00
oobabooga
fbfcd59fe0
llama.cpp: Use -1 instead of 0 for auto gpu_layers
2026-03-04 19:21:45 -08:00
oobabooga
387cf9d8df
Remove obsolete DeepSpeed inference code (2023 relic)
2026-03-04 17:20:34 -08:00
oobabooga
cdf0e392e6
llama.cpp: Reorganize speculative decoding UI and use recommended ngram-mod defaults
2026-03-04 12:05:08 -08:00
oobabooga
65de4c30c8
Add adaptive-p sampler and n-gram speculative decoding support
2026-03-04 09:41:29 -08:00
oobabooga
f4d787ab8d
Delegate GPU layer allocation to llama.cpp's --fit
2026-03-04 06:37:50 -08:00
q5sys (JT)
7493fe7841
feat: Add a dropdown to save/load user personas (#7367)
2026-01-14 20:35:08 -03:00
oobabooga
e7c8b51fec
Revert "Use flash_attention_2 by default for Transformers models"
This reverts commit 85f2df92e9.
2025-12-07 18:48:41 -08:00
oobabooga
85f2df92e9
Use flash_attention_2 by default for Transformers models
2025-12-07 06:56:58 -08:00
oobabooga
11937de517
Use flash attention for image generation by default
2025-12-05 12:13:24 -08:00
oobabooga
c11c14590a
Image: Better LLM variation default prompt
2025-12-05 08:08:11 -08:00
oobabooga
8eac99599a
Image: Better LLM variation default prompt
2025-12-04 19:58:06 -08:00
oobabooga
b4f06a50b0
fix: Pass bos_token and eos_token from metadata to jinja2
Fixes loading Seed-Instruct-36B
2025-12-04 19:11:31 -08:00
oobabooga
a90739f498
Image: Better LLM variation default prompt
2025-12-04 10:50:40 -08:00
oobabooga
ffef3c7b1d
Image: Make the LLM Variations prompt configurable
2025-12-04 10:44:35 -08:00
oobabooga
2793153717
Image: Add LLM-generated prompt variations
2025-12-04 08:10:24 -08:00
oobabooga
c357eed4c7
Image: Remove the flash_attention_3 option (no idea how to get it working)
2025-12-03 18:40:34 -08:00
oobabooga
9448bf1caa
Image generation: add torchao quantization (supports torch.compile)
2025-12-02 14:22:51 -08:00
oobabooga
6291e72129
Remove quanto for now (requires messy compilation)
2025-12-02 09:57:18 -08:00
oobabooga
b3666e140d
Add image generation support (#7328)
2025-12-02 14:55:38 -03:00
oobabooga
5327bc9397
Update modules/shared.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-28 22:48:05 -03:00
GodEmperor785
400bb0694b
Add slider for --ubatch-size for llama.cpp loader, change defaults for better MoE performance (#7316)
2025-11-21 16:56:02 -03:00
oobabooga
0d4eff284c
Add a --cpu-moe option for llama.cpp
2025-11-19 05:23:43 -08:00
oobabooga
b5a6904c4a
Make --trust-remote-code immutable from the UI/API
2025-10-14 20:47:01 -07:00
oobabooga
78ff21d512
Organize the --help message
2025-10-10 15:21:08 -07:00
oobabooga
13876a1ee8
llama.cpp: Remove the --flash-attn flag (it's always on now)
2025-08-30 20:28:26 -07:00
oobabooga
0b4518e61c
"Text generation web UI" -> "Text Generation Web UI"
2025-08-27 05:53:09 -07:00
oobabooga
02ca96fa44
Multiple fixes
2025-08-25 22:17:22 -07:00
oobabooga
6c165d2e55
Fix the chat template
2025-08-25 18:28:43 -07:00
oobabooga
dbabe67e77
ExLlamaV3: Enable the --enable-tp option, add a --tp-backend option
2025-08-17 13:19:11 -07:00
altoiddealer
57f6e9af5a
Set multimodal status during Model Loading (#7199)
2025-08-13 16:47:27 -03:00
oobabooga
d86b0ec010
Add multimodal support (llama.cpp) (#7027)
2025-08-10 01:27:25 -03:00