Commit graph

155 commits

Author SHA1 Message Date
oobabooga 1972479610 Add the TP option to exllamav3_HF 2025-08-19 06:48:22 -07:00
oobabooga dbabe67e77 ExLlamaV3: Enable the --enable-tp option, add a --tp-backend option 2025-08-17 13:19:11 -07:00
oobabooga 2238302b49 ExLlamaV3: Add speculative decoding 2025-08-12 08:50:45 -07:00
oobabooga d86b0ec010 Add multimodal support (llama.cpp) (#7027) 2025-08-10 01:27:25 -03:00
oobabooga 544c3a7c9f Polish the new exllamav3 loader 2025-08-08 21:15:53 -07:00
Katehuuh 88127f46c1 Add multimodal support (ExLlamaV3) (#7174) 2025-08-08 23:31:16 -03:00
oobabooga 498778b8ac Add a new 'Reasoning effort' UI element 2025-08-05 15:19:11 -07:00
oobabooga 1d1b20bd77 Remove the --torch-compile option (it doesn't do anything currently) 2025-07-11 10:51:23 -07:00
oobabooga 6c2bdda0f0 Transformers loader: replace use_flash_attention_2/use_eager_attention with a unified attn_implementation (closes #7107) 2025-07-09 18:39:37 -07:00
oobabooga 9ec46b8c44 Remove the HQQ loader (HQQ models can be loaded through Transformers) 2025-05-19 09:23:24 -07:00
oobabooga 93e1850a2c Only show the VRAM info for llama.cpp 2025-05-15 21:42:15 -07:00
oobabooga 3fa1a899ae UI: Fix gpu-layers being ignored (closes #6973) 2025-05-13 12:07:59 -07:00
oobabooga a2ab42d390 UI: Remove the exllamav2 info message 2025-05-08 08:00:38 -07:00
oobabooga c4f36db0d8 llama.cpp: remove tfs (it doesn't get used) 2025-05-06 08:41:13 -07:00
oobabooga 1927afe894 Fix top_n_sigma not showing for llama.cpp 2025-05-06 08:18:49 -07:00
oobabooga d10bded7f8 UI: Add an enable_thinking option to enable/disable Qwen3 thinking 2025-04-28 22:37:01 -07:00
oobabooga 4a32e1f80c UI: show draft_max for ExLlamaV2 2025-04-26 18:01:44 -07:00
oobabooga d4017fbb6d ExLlamaV3: Add kv cache quantization (#6903) 2025-04-25 21:32:00 -03:00
oobabooga d4b1e31c49 Use --ctx-size to specify the context size for all loaders (old flags are still recognized as alternatives) 2025-04-25 16:59:03 -07:00
oobabooga 98f4c694b9 llama.cpp: Add --extra-flags parameter for passing additional flags to llama-server 2025-04-25 07:32:51 -07:00
oobabooga ae1fe87365 ExLlamaV2: Add speculative decoding (#6899) 2025-04-25 00:11:04 -03:00
oobabooga e99c20bcb0 llama.cpp: Add speculative decoding (#6891) 2025-04-23 20:10:16 -03:00
oobabooga 15989c2ed8 Make llama.cpp the default loader 2025-04-21 16:36:35 -07:00
oobabooga ae02ffc605 Refactor the transformers loader (#6859) 2025-04-20 13:33:47 -03:00
oobabooga 8144e1031e Remove deprecated command-line flags 2025-04-18 06:02:28 -07:00
oobabooga ae54d8faaa New llama.cpp loader (#6846) 2025-04-18 09:59:37 -03:00
oobabooga 2c2d453c8c Revert "Use ExLlamaV2 (instead of the HF one) for EXL2 models for now" (reverts commit 0ef1b8f8b4) 2025-04-17 21:31:32 -07:00
oobabooga 0ef1b8f8b4 Use ExLlamaV2 (instead of the HF one) for EXL2 models for now (it doesn't seem to have the "OverflowError" bug) 2025-04-17 05:47:40 -07:00
oobabooga 8b8d39ec4e Add ExLlamaV3 support (#6832) 2025-04-09 00:07:08 -03:00
oobabooga 5bcd2d7ad0 Add the top N-sigma sampler (#6796) 2025-03-14 16:45:11 -03:00
oobabooga 83c426e96b Organize internals (#6646) 2025-01-10 18:04:32 -03:00
oobabooga 7157257c3f Remove the AutoGPTQ loader (#6641) 2025-01-08 19:28:56 -03:00
oobabooga c0f600c887 Add a --torch-compile flag for transformers 2025-01-05 05:47:00 -08:00
oobabooga 11af199aff Add a "Static KV cache" option for transformers 2025-01-04 17:52:57 -08:00
oobabooga 3967520e71 Connect XTC, DRY, smoothing_factor, and dynatemp to ExLlamaV2 loader (non-HF) 2025-01-04 16:25:06 -08:00
oobabooga 39a5c9a49c UI organization (#6618) 2024-12-29 11:16:17 -03:00
Diner Burger addad3c63e Allow more granular KV cache settings (#6561) 2024-12-17 17:43:48 -03:00
oobabooga 93c250b9b6 Add a UI element for enable_tp 2024-10-01 11:16:15 -07:00
Philipp Emanuel Weidmann 301375834e Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition (#6335) 2024-09-27 22:50:12 -03:00
oobabooga e6181e834a Remove AutoAWQ as a standalone loader (it works better through transformers) 2024-07-23 15:31:17 -07:00
oobabooga aa809e420e Bump llama-cpp-python to 0.2.83, add back tensorcore wheels (also add back the progress bar patch) 2024-07-22 18:05:11 -07:00
oobabooga 11bbf71aa5 Bump back llama-cpp-python (#6257) 2024-07-22 16:19:41 -03:00
oobabooga 0f53a736c1 Revert the llama-cpp-python update 2024-07-22 12:02:25 -07:00
oobabooga a687f950ba Remove the tensorcores llama.cpp wheels (they are not faster than the default wheels anymore and they use a lot of space) 2024-07-22 11:54:35 -07:00
oobabooga e436d69e2b Add --no_xformers and --no_sdpa flags for ExllamaV2 2024-07-11 15:47:37 -07:00
GralchemOz 8a39f579d8 transformers: Add eager attention option to make Gemma-2 work properly (#6188) 2024-07-01 12:08:08 -03:00
oobabooga 4ea260098f llama.cpp: add 4-bit/8-bit kv cache options 2024-06-29 09:10:33 -07:00
oobabooga 577a8cd3ee Add TensorRT-LLM support (#5715) 2024-06-24 02:30:03 -03:00
oobabooga 536f8d58d4 Do not expose alpha_value to llama.cpp & rope_freq_base to transformers (to avoid confusion) 2024-06-23 22:09:24 -07:00
Forkoz 1d79aa67cf Fix flash-attn UI parameter to actually store true. (#6076) 2024-06-13 00:34:54 -03:00