oobabooga
66fb79fe15
llama.cpp: Add --fit-target param
2026-03-06 01:55:48 -03:00
oobabooga
e2548f69a9
Make user_data configurable: add --user-data-dir flag, auto-detect ../user_data
If --user-data-dir is not set, auto-detect: use ../user_data when
./user_data doesn't exist, making it easy to share user data across
portable builds by placing it one folder up.
2026-03-05 19:31:10 -08:00
oobabooga
249bd6eea2
UI: Update the parallel info message
2026-03-05 18:11:55 -08:00
oobabooga
f52d9336e5
TensorRT-LLM: Migrate from ModelRunner to LLM API, add concurrent API request support
2026-03-05 18:09:45 -08:00
oobabooga
9824c82cb6
API: Add parallel request support for llama.cpp and ExLlamaV3
2026-03-05 16:49:58 -08:00
oobabooga
2f08dce7b0
Remove ExLlamaV2 backend
- archived upstream: 7dc12af3a8
- replaced by ExLlamaV3, which has much better quantization accuracy
2026-03-05 14:02:13 -08:00
oobabooga
69fa4dd0b1
llama.cpp: allow ctx_size=0 for auto context via --fit
2026-03-04 19:33:20 -08:00
oobabooga
fbfcd59fe0
llama.cpp: Use -1 instead of 0 for auto gpu_layers
2026-03-04 19:21:45 -08:00
Sense_wang
7bf15ad933
fix: replace bare except clauses with except Exception (#7400)
2026-03-04 18:06:17 -03:00
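The motivation for #7400: a bare `except:` also swallows `KeyboardInterrupt` and `SystemExit`, making the process hard to stop cleanly, while `except Exception` lets those through. A minimal before/after sketch (the function is illustrative, not from the repo):

```python
def parse_port(value: str, default: int = 7860) -> int:
    # Before the fix this used a bare `except:`, which would also
    # catch KeyboardInterrupt and SystemExit.
    try:
        return int(value)
    except Exception:  # catches ValueError etc., but lets Ctrl+C propagate
        return default
```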
oobabooga
cdf0e392e6
llama.cpp: Reorganize speculative decoding UI and use recommended ngram-mod defaults
2026-03-04 12:05:08 -08:00
oobabooga
65de4c30c8
Add adaptive-p sampler and n-gram speculative decoding support
2026-03-04 09:41:29 -08:00
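N-gram speculative decoding (also known as prompt-lookup decoding) drafts tokens without a separate draft model: find an earlier occurrence of the last n generated tokens in the context and propose the tokens that followed it. A simplified sketch of the idea; the actual llama.cpp ngram-mod implementation differs in detail:

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Illustrative n-gram drafting: match the trailing n tokens against
    earlier positions in the context and return the continuation as a
    speculative draft for the main model to verify."""
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # scan earlier positions (most recent first) for a matching n-gram
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []  # no earlier occurrence: nothing to draft
```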
oobabooga
f4d787ab8d
Delegate GPU layer allocation to llama.cpp's --fit
2026-03-04 06:37:50 -08:00
oobabooga
e7c8b51fec
Revert "Use flash_attention_2 by default for Transformers models"
This reverts commit 85f2df92e9.
2025-12-07 18:48:41 -08:00
oobabooga
85f2df92e9
Use flash_attention_2 by default for Transformers models
2025-12-07 06:56:58 -08:00
GodEmperor785
400bb0694b
Add slider for --ubatch-size for llama.cpp loader, change defaults for better MoE performance (#7316)
2025-11-21 16:56:02 -03:00
oobabooga
0d4eff284c
Add a --cpu-moe option for llama.cpp
2025-11-19 05:23:43 -08:00
oobabooga
b5a6904c4a
Make --trust-remote-code immutable from the UI/API
2025-10-14 20:47:01 -07:00
oobabooga
13876a1ee8
llama.cpp: Remove the --flash-attn flag (it's always on now)
2025-08-30 20:28:26 -07:00
oobabooga
dbabe67e77
ExLlamaV3: Enable the --enable-tp option, add a --tp-backend option
2025-08-17 13:19:11 -07:00
oobabooga
7301452b41
UI: Minor info message change
2025-08-12 13:23:24 -07:00
oobabooga
d86b0ec010
Add multimodal support (llama.cpp) (#7027)
2025-08-10 01:27:25 -03:00
oobabooga
0c667de7a7
UI: Add a None option for the speculative decoding model (closes #7145)
2025-07-19 12:14:41 -07:00
oobabooga
1d1b20bd77
Remove the --torch-compile option (it doesn't do anything currently)
2025-07-11 10:51:23 -07:00
oobabooga
273888f218
Revert "Use eager attention by default instead of sdpa"
This reverts commit bd4881c4dc.
2025-07-10 18:56:46 -07:00
oobabooga
bd4881c4dc
Use eager attention by default instead of sdpa
2025-07-09 19:57:37 -07:00
oobabooga
6c2bdda0f0
Transformers loader: replace use_flash_attention_2/use_eager_attention with a unified attn_implementation
Closes #7107
2025-07-09 18:39:37 -07:00
Alidr79
e5767d4fc5
Update ui_model_menu.py blocking the --multi-user access in backend (#7098)
2025-07-06 21:48:53 -03:00
oobabooga
acd57b6a85
Minor UI change
2025-06-19 15:39:43 -07:00
oobabooga
f08db63fbc
Change some comments
2025-06-19 15:26:45 -07:00
oobabooga
9c6913ad61
Show file sizes on "Get file list"
2025-06-18 21:35:07 -07:00
Miriam
f4f621b215
ensure estimated vram is updated when switching between different models (#7071)
2025-06-13 02:56:33 -03:00
oobabooga
f337767f36
Add error handling for non-llama.cpp models in portable mode
2025-06-12 22:17:39 -07:00
oobabooga
889153952f
Lint
2025-06-10 09:02:52 -07:00
oobabooga
92adceb7b5
UI: Fix the model downloader progress bar
2025-06-01 19:22:21 -07:00
oobabooga
5d00574a56
Minor UI fixes
2025-05-20 16:20:49 -07:00
oobabooga
9ec46b8c44
Remove the HQQ loader (HQQ models can be loaded through Transformers)
2025-05-19 09:23:24 -07:00
oobabooga
2faaf18f1f
Add back the "Common values" to the ctx-size slider
2025-05-18 09:06:20 -07:00
oobabooga
1c549d176b
Fix GPU layers slider: honor saved settings and show true maximum
2025-05-16 17:26:13 -07:00
oobabooga
adb975a380
Prevent fractional gpu-layers in the UI
2025-05-16 12:52:43 -07:00
oobabooga
fc483650b5
Set the maximum gpu_layers value automatically when the model is loaded with --model
2025-05-16 11:58:17 -07:00
oobabooga
9ec9b1bf83
Auto-adjust GPU layers after model unload to utilize freed VRAM
2025-05-16 09:56:23 -07:00
oobabooga
4925c307cf
Auto-adjust GPU layers on context size and cache type changes + many fixes
2025-05-16 09:07:38 -07:00
oobabooga
cbf4daf1c8
Hide the LoRA menu in portable mode
2025-05-15 21:21:54 -07:00
oobabooga
5534d01da0
Estimate the VRAM for GGUF models + autoset gpu-layers (#6980)
2025-05-16 00:07:37 -03:00
oobabooga
c4a715fd1e
UI: Move the LoRA menu under "Other options"
2025-05-13 20:14:09 -07:00
oobabooga
3fa1a899ae
UI: Fix gpu-layers being ignored (closes #6973)
2025-05-13 12:07:59 -07:00
oobabooga
512bc2d0e0
UI: Update some labels
2025-05-08 23:43:55 -07:00
oobabooga
f8ef6e09af
UI: Make ctx-size a slider
2025-05-08 18:19:04 -07:00
oobabooga
a2ab42d390
UI: Remove the exllamav2 info message
2025-05-08 08:00:38 -07:00
oobabooga
348d4860c2
UI: Create a "Main options" section in the Model tab
2025-05-08 07:58:59 -07:00