oobabooga
04213dff14
Address copilot feedback
2026-03-12 19:55:20 -07:00
oobabooga
fb1b3b6ddf
API: Rewrite logprobs for OpenAI spec compliance across all backends
- Rewrite logprobs output format to match the OpenAI specification for
both chat completions and completions endpoints
- Fix top_logprobs count being ignored for llama.cpp and ExLlamav3
backends in chat completions (always returned 1 instead of requested N)
- Fix non-streaming responses only returning logprobs for the last token
instead of all generated tokens (affects all HF-based loaders)
- Fix logprobs returning null for non-streaming chat requests on HF loaders
- Fix off-by-one returning one extra top alternative on HF loaders
2026-03-12 14:17:32 -03:00
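For reference, the OpenAI-spec logprobs shape that the rewrite above targets looks roughly like this. This is a minimal sketch of the chat-completions response fragment: the field names follow the public OpenAI API, but the token and logprob values are invented, and this is not the repository's actual serialization code.

```python
import json
import math

# Minimal sketch of the OpenAI chat-completions logprobs format.
# With top_logprobs=N, each generated token carries N alternatives,
# and every generated token appears in "content" (not just the last one,
# which was the bug fixed above for non-streaming HF responses).
response_fragment = {
    "choices": [
        {
            "index": 0,
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.31,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.31},
                            {"token": "Hi", "logprob": -1.42},
                        ],
                    }
                ]
            },
        }
    ]
}

token_entry = response_fragment["choices"][0]["logprobs"]["content"][0]
probability = math.exp(token_entry["logprob"])  # logprob -> probability
print(json.dumps(token_entry["token"]), round(probability, 3))
```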
oobabooga
5f6754c267
Fix stop button being ignored when token throttling is off
2026-03-06 17:12:34 -03:00
oobabooga
f52d9336e5
TensorRT-LLM: Migrate from ModelRunner to LLM API, add concurrent API request support
2026-03-05 18:09:45 -08:00
oobabooga
9824c82cb6
API: Add parallel request support for llama.cpp and ExLlamaV3
2026-03-05 16:49:58 -08:00
oobabooga
2f08dce7b0
Remove ExLlamaV2 backend
- archived upstream: 7dc12af3a8
- replaced by ExLlamaV3, which has much better quantization accuracy
2026-03-05 14:02:13 -08:00
oobabooga
387cf9d8df
Remove obsolete DeepSpeed inference code (2023 relic)
2026-03-04 17:20:34 -08:00
oobabooga
65de4c30c8
Add adaptive-p sampler and n-gram speculative decoding support
2026-03-04 09:41:29 -08:00
oobabooga
a78ca6ffcd
Remove a comment
2025-08-11 12:33:38 -07:00
Katehuuh
88127f46c1
Add multimodal support (ExLlamaV3) (#7174)
2025-08-08 23:31:16 -03:00
oobabooga
635e6efd18
Ignore add_bos_token in instruct prompts, let the jinja2 template decide
2025-07-10 07:14:01 -07:00
oobabooga
609c3ac893
Optimize the end of generation with llama.cpp
2025-06-15 08:03:27 -07:00
oobabooga
efd9c9707b
Fix random seeds being saved to settings.yaml
2025-06-09 20:57:25 -07:00
oobabooga
bb409c926e
Update only the last message during streaming + add back dynamic UI update speed (#7038)
2025-06-02 09:50:17 -03:00
oobabooga
f59998d268
Don't limit the number of prompt characters printed with --verbose
2025-05-29 13:08:48 -07:00
oobabooga
126b3a768f
Revert "Dynamic Chat Message UI Update Speed (#6952)" (for now)
This reverts commit 8137eb8ef4.
2025-05-18 12:38:36 -07:00
oobabooga
2826c60044
Use logger for "Output generated in ..." messages
2025-05-13 14:45:46 -07:00
oobabooga
8984e95c67
UI: More friendly message when no model is loaded
2025-05-09 07:21:05 -07:00
mamei16
8137eb8ef4
Dynamic Chat Message UI Update Speed (#6952)
2025-05-05 18:05:23 -03:00
oobabooga
3f26b0408b
Fix after 9e3867dc83
2025-05-02 16:17:22 -07:00
oobabooga
9e3867dc83
llama.cpp: Fix manual random seeds
2025-05-02 09:36:15 -07:00
oobabooga
cd5c32dc19
UI: Fix max_updates_second not working
2025-04-30 14:54:05 -07:00
oobabooga
f1b64df8dd
EXL2: add another torch.cuda.synchronize() call to prevent errors
2025-04-24 09:03:49 -07:00
oobabooga
ff1c00bdd9
llama.cpp: set the random seed manually
2025-04-20 19:08:44 -07:00
oobabooga
ae02ffc605
Refactor the transformers loader (#6859)
2025-04-20 13:33:47 -03:00
oobabooga
6ba0164c70
Lint
2025-04-19 17:45:21 -07:00
oobabooga
5ab069786b
llama.cpp: add back the two encode calls (they are harmless now)
2025-04-19 17:38:36 -07:00
oobabooga
ba976d1390
llama.cpp: avoid two 'encode' calls
2025-04-19 16:35:01 -07:00
oobabooga
ae54d8faaa
New llama.cpp loader (#6846)
2025-04-18 09:59:37 -03:00
oobabooga
5c2f8d828e
Fix exllamav2 generating eos randomly after previous fix
2025-04-18 05:42:38 -07:00
oobabooga
ce9e2d94b1
Revert "Attempt at solving the ExLlamaV2 issue"
This reverts commit c9b3c9dfbf.
2025-04-17 22:03:21 -07:00
oobabooga
5dfab7d363
New attempt at solving the exl2 issue
2025-04-17 22:03:11 -07:00
oobabooga
c9b3c9dfbf
Attempt at solving the ExLlamaV2 issue
2025-04-17 21:45:15 -07:00
oobabooga
5bcd2d7ad0
Add the top N-sigma sampler (#6796)
2025-03-14 16:45:11 -03:00
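The top N-sigma sampler added above can be sketched as follows. This is my reading of the general technique, not the repository's implementation (which operates on tensors inside the sampling pipeline): keep only tokens whose logit is within n standard deviations of the maximum logit, mask the rest, then sample from the survivors.

```python
import math
import random

def top_n_sigma_filter(logits, n=1.0):
    """Keep tokens whose logit >= max(logits) - n * std(logits).

    A minimal sketch of the top N-sigma idea; parameter names and
    details are assumptions, not the project's actual code.
    """
    mean = sum(logits) / len(logits)
    std = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * std
    # Mask out everything below the threshold.
    return [x if x >= threshold else float("-inf") for x in logits]

def sample(logits):
    # Softmax over the surviving logits, then draw one index.
    exps = [math.exp(x) if x != float("-inf") else 0.0 for x in logits]
    total = sum(exps)
    r = random.random() * total
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1

# Two strong candidates survive; the two weak ones are masked.
filtered = top_n_sigma_filter([2.0, 1.9, -3.0, -5.0], n=1.0)
```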
oobabooga
83c426e96b
Organize internals (#6646)
2025-01-10 18:04:32 -03:00
oobabooga
11af199aff
Add a "Static KV cache" option for transformers
2025-01-04 17:52:57 -08:00
Petr Korolev
13c033c745
Fix CUDA error on MPS backend during API request (#6572)
Co-authored-by: oobabooga <oobabooga4@gmail.com>
2025-01-02 00:06:11 -03:00
oobabooga
7b88724711
Make responses start faster by removing unnecessary cleanup calls (#6625)
2025-01-01 18:33:38 -03:00
oobabooga
cca9d6e22d
Lint
2024-10-01 10:21:06 -07:00
Philipp Emanuel Weidmann
301375834e
Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition (#6335)
2024-09-27 22:50:12 -03:00
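As I understand the XTC sampler merged above (a sketch of the idea from the PR's description, not the merged code): with some probability, find every token whose probability exceeds a threshold and, if there are at least two, remove all of them except the least likely, shifting mass away from clichéd top choices. Parameter names below are my own assumptions.

```python
import random

def xtc_filter(probs, threshold=0.1, probability=1.0, rng=random.random):
    """Exclude Top Choices: a minimal sketch of the technique.

    If at least two tokens exceed `threshold`, zero out all of them
    except the least probable one, then renormalize. Applied only with
    chance `probability`. A hypothetical reimplementation, not the
    project's actual sampler code.
    """
    if rng() >= probability:
        return probs
    above = [i for i, p in enumerate(probs) if p > threshold]
    if len(above) < 2:
        return probs  # nothing to exclude
    # Keep the least probable of the "top choices", drop the rest.
    keep = min(above, key=lambda i: probs[i])
    out = [0.0 if (i in above and i != keep) else p
           for i, p in enumerate(probs)]
    total = sum(out)
    return [p / total for p in out]

# The two tokens above 0.2 are "top choices"; only the weaker survives.
filtered = xtc_filter([0.5, 0.3, 0.15, 0.05], threshold=0.2)
```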
GralchemOz
4c74c7a116
Fix UnicodeDecodeError for BPE-based Models (especially GLM-4) (#6357)
2024-09-02 23:00:59 -03:00
oobabooga
9dcff21da9
Remove unnecessary shared.previous_model_name variable
2024-07-28 18:35:11 -07:00
oobabooga
577a8cd3ee
Add TensorRT-LLM support (#5715)
2024-06-24 02:30:03 -03:00
Belladore
46174a2d33
Fix error when bos_token_id is None. (#6061)
2024-06-12 22:52:27 -03:00
Belladore
a363cdfca1
Fix missing bos token for some models (including Llama-3) (#6050)
2024-05-27 09:21:30 -03:00
Philipp Emanuel Weidmann
852c943769
DRY: A modern repetition penalty that reliably prevents looping (#5677)
2024-05-19 23:53:47 -03:00
oobabooga
9f77ed1b98
--idle-timeout flag to unload the model if unused for N minutes (#6026)
2024-05-19 23:29:39 -03:00
oobabooga
a4611232b7
Make --verbose output less spammy
2024-05-18 09:57:00 -07:00
oobabooga
70845c76fb
Add back the max_updates_second parameter (#5937)
2024-04-26 10:14:51 -03:00
oobabooga
6761b5e7c6
Improved instruct style (with syntax highlighting & LaTeX rendering) (#5936)
2024-04-26 10:13:11 -03:00