oobabooga
f0c16813ef
Remove the rope scaling parameters
...
Now models have 131k+ context length. The parameters can still be
passed to llama.cpp through --extra-flags.
2026-03-14 19:43:25 -07:00
oobabooga
4ae2bd86e2
Change the default ctx-size to 0 (auto) for llama.cpp
2026-03-14 15:30:01 -07:00
oobabooga
d0a4993cf4
UI: Increase ctx-size slider maximum to 1M and step to 1024
2026-03-14 09:53:12 -07:00
oobabooga
4f82b71ef3
UI: Bump the ctx-size max from 131072 to 262144 (256K)
2026-03-12 14:56:35 -07:00
oobabooga
bbd43d9463
UI: Correctly propagate truncation_length when ctx_size is auto
2026-03-12 14:54:05 -07:00
oobabooga
66fb79fe15
llama.cpp: Add --fit-target param
2026-03-06 01:55:48 -03:00
oobabooga
e2548f69a9
Make user_data configurable: add --user-data-dir flag, auto-detect ../user_data
...
If --user-data-dir is not set, auto-detect: use ../user_data when
./user_data doesn't exist, making it easy to share user data across
portable builds by placing it one folder up.
2026-03-05 19:31:10 -08:00
oobabooga
249bd6eea2
UI: Update the parallel info message
2026-03-05 18:11:55 -08:00
oobabooga
f52d9336e5
TensorRT-LLM: Migrate from ModelRunner to LLM API, add concurrent API request support
2026-03-05 18:09:45 -08:00
oobabooga
9824c82cb6
API: Add parallel request support for llama.cpp and ExLlamaV3
2026-03-05 16:49:58 -08:00
oobabooga
2f08dce7b0
Remove ExLlamaV2 backend
...
- archived upstream: 7dc12af3a8
- replaced by ExLlamaV3, which has much better quantization accuracy
2026-03-05 14:02:13 -08:00
oobabooga
69fa4dd0b1
llama.cpp: allow ctx_size=0 for auto context via --fit
2026-03-04 19:33:20 -08:00
oobabooga
fbfcd59fe0
llama.cpp: Use -1 instead of 0 for auto gpu_layers
2026-03-04 19:21:45 -08:00
Sense_wang
7bf15ad933
fix: replace bare except clauses with except Exception (#7400)
2026-03-04 18:06:17 -03:00
oobabooga
cdf0e392e6
llama.cpp: Reorganize speculative decoding UI and use recommended ngram-mod defaults
2026-03-04 12:05:08 -08:00
oobabooga
65de4c30c8
Add adaptive-p sampler and n-gram speculative decoding support
2026-03-04 09:41:29 -08:00
oobabooga
f4d787ab8d
Delegate GPU layer allocation to llama.cpp's --fit
2026-03-04 06:37:50 -08:00
oobabooga
e7c8b51fec
Revert "Use flash_attention_2 by default for Transformers models"
...
This reverts commit 85f2df92e9.
2025-12-07 18:48:41 -08:00
oobabooga
85f2df92e9
Use flash_attention_2 by default for Transformers models
2025-12-07 06:56:58 -08:00
GodEmperor785
400bb0694b
Add slider for --ubatch-size for llama.cpp loader, change defaults for better MoE performance (#7316)
2025-11-21 16:56:02 -03:00
oobabooga
0d4eff284c
Add a --cpu-moe option for llama.cpp
2025-11-19 05:23:43 -08:00
oobabooga
b5a6904c4a
Make --trust-remote-code immutable from the UI/API
2025-10-14 20:47:01 -07:00
oobabooga
13876a1ee8
llama.cpp: Remove the --flash-attn flag (it's always on now)
2025-08-30 20:28:26 -07:00
oobabooga
dbabe67e77
ExLlamaV3: Enable the --enable-tp option, add a --tp-backend option
2025-08-17 13:19:11 -07:00
oobabooga
7301452b41
UI: Minor info message change
2025-08-12 13:23:24 -07:00
oobabooga
d86b0ec010
Add multimodal support (llama.cpp) (#7027)
2025-08-10 01:27:25 -03:00
oobabooga
0c667de7a7
UI: Add a None option for the speculative decoding model (closes #7145)
2025-07-19 12:14:41 -07:00
oobabooga
1d1b20bd77
Remove the --torch-compile option (it doesn't do anything currently)
2025-07-11 10:51:23 -07:00
oobabooga
273888f218
Revert "Use eager attention by default instead of sdpa"
...
This reverts commit bd4881c4dc.
2025-07-10 18:56:46 -07:00
oobabooga
bd4881c4dc
Use eager attention by default instead of sdpa
2025-07-09 19:57:37 -07:00
oobabooga
6c2bdda0f0
Transformers loader: replace use_flash_attention_2/use_eager_attention with a unified attn_implementation
...
Closes #7107
2025-07-09 18:39:37 -07:00
Alidr79
e5767d4fc5
Update ui_model_menu.py to block --multi-user access in the backend (#7098)
2025-07-06 21:48:53 -03:00
oobabooga
acd57b6a85
Minor UI change
2025-06-19 15:39:43 -07:00
oobabooga
f08db63fbc
Change some comments
2025-06-19 15:26:45 -07:00
oobabooga
9c6913ad61
Show file sizes on "Get file list"
2025-06-18 21:35:07 -07:00
Miriam
f4f621b215
Ensure estimated VRAM is updated when switching between different models (#7071)
2025-06-13 02:56:33 -03:00
oobabooga
f337767f36
Add error handling for non-llama.cpp models in portable mode
2025-06-12 22:17:39 -07:00
oobabooga
889153952f
Lint
2025-06-10 09:02:52 -07:00
oobabooga
92adceb7b5
UI: Fix the model downloader progress bar
2025-06-01 19:22:21 -07:00
oobabooga
5d00574a56
Minor UI fixes
2025-05-20 16:20:49 -07:00
oobabooga
9ec46b8c44
Remove the HQQ loader (HQQ models can be loaded through Transformers)
2025-05-19 09:23:24 -07:00
oobabooga
2faaf18f1f
Add back the "Common values" to the ctx-size slider
2025-05-18 09:06:20 -07:00
oobabooga
1c549d176b
Fix GPU layers slider: honor saved settings and show true maximum
2025-05-16 17:26:13 -07:00
oobabooga
adb975a380
Prevent fractional gpu-layers in the UI
2025-05-16 12:52:43 -07:00
oobabooga
fc483650b5
Set the maximum gpu_layers value automatically when the model is loaded with --model
2025-05-16 11:58:17 -07:00
oobabooga
9ec9b1bf83
Auto-adjust GPU layers after model unload to utilize freed VRAM
2025-05-16 09:56:23 -07:00
oobabooga
4925c307cf
Auto-adjust GPU layers on context size and cache type changes + many fixes
2025-05-16 09:07:38 -07:00
oobabooga
cbf4daf1c8
Hide the LoRA menu in portable mode
2025-05-15 21:21:54 -07:00
oobabooga
5534d01da0
Estimate the VRAM for GGUF models + autoset gpu-layers (#6980)
2025-05-16 00:07:37 -03:00
oobabooga
c4a715fd1e
UI: Move the LoRA menu under "Other options"
2025-05-13 20:14:09 -07:00