Commit graph

98 commits

oobabooga
42dfcdfc5b API: Add warning about vanilla llama-server not supporting prompt logprobs + instructions 2026-04-02 20:46:27 -07:00
oobabooga
71c1a52afe API: Implement echo + logprobs for /v1/completions endpoint 2026-03-31 07:43:11 -07:00
oobabooga
0466b6e271 ik_llama.cpp: Auto-enable Hadamard KV cache rotation with quantized cache 2026-03-29 15:52:36 -07:00
oobabooga
4979e87e48 Add ik_llama.cpp support via ik_llama_cpp_binaries package 2026-03-28 12:09:00 -03:00
oobabooga
4cbea02ed4 Add ik_llama.cpp support via --ik flag 2026-03-26 06:54:47 -07:00
oobabooga
a7ef430b38 Revert "llama.cpp: Don't suppress llama-server logs"
This reverts commit 9488df3e48.
2026-03-23 20:22:51 -07:00
oobabooga
286bbb685d Revert "Follow-up to previous commit"
This reverts commit 1dda5e4711.
2026-03-23 20:22:46 -07:00
oobabooga
1dda5e4711 Follow-up to previous commit 2026-03-21 20:58:45 -07:00
oobabooga
9488df3e48 llama.cpp: Don't suppress llama-server logs 2026-03-21 20:47:26 -07:00
oobabooga
e0e20ab9e7 Minor cleanup across multiple modules 2026-03-19 08:02:23 -07:00
oobabooga
7e54e7b7ae llama.cpp: Support literal flags in --extra-flags (e.g. --rpc, --jinja)
The old format is still accepted for backwards compatibility.
2026-03-17 19:47:55 -07:00
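The commit above accepts literal flags (e.g. `--rpc`, `--jinja`) in `--extra-flags` while keeping the old comma-separated `key=value` / bare-key format. A minimal sketch of such a parser, assuming the comma-separated input format; the function name and exact semantics are hypothetical, not the repository's actual implementation:

```python
def parse_extra_flags(extra_flags: str) -> list[str]:
    """Turn an --extra-flags string into llama-server argument tokens.

    Hypothetical sketch: accepts literal flags ("--rpc 192.168.1.2:50052,--jinja")
    alongside the older "key=value" and bare-key forms, which get a "--" prefix.
    """
    args = []
    for item in extra_flags.split(','):
        item = item.strip()
        if not item:
            continue
        if item.startswith('--'):
            args.extend(item.split())          # literal flag, optionally with a value
        elif '=' in item:
            key, value = item.split('=', 1)
            args.extend([f'--{key}', value])   # old key=value form
        else:
            args.append(f'--{item}')           # old bare-key form
    return args
```

Accepting both shapes in one pass is what lets the old format remain valid for backwards compatibility.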
oobabooga
2a6b1fdcba Fix --extra-flags breaking short long-form-only flags like --rpc
Closes #7357
2026-03-17 18:29:15 -07:00
oobabooga
9119ce0680 llama.cpp: Use --fit-ctx 8192 when --fit on is used
This sets the minimum acceptable context length, which by default is 4096.
2026-03-15 09:24:14 -07:00
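The flag wiring described above can be sketched as follows; the helper name is hypothetical, but the values match the commit message (llama.cpp's own minimum defaults to 4096):

```python
def build_fit_args(fit: str) -> list[str]:
    """Hypothetical sketch: when --fit on is requested, also pass
    --fit-ctx 8192 so the minimum acceptable context length is raised
    from llama.cpp's default of 4096 to 8192."""
    args = ['--fit', fit]
    if fit == 'on':
        args += ['--fit-ctx', '8192']
    return args
```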
oobabooga
f0c16813ef Remove the rope scaling parameters
Now models have 131k+ context length. The parameters can still be
passed to llama.cpp through --extra-flags.
2026-03-14 19:43:25 -07:00
oobabooga
4ae2bd86e2 Change the default ctx-size to 0 (auto) for llama.cpp 2026-03-14 15:30:01 -07:00
oobabooga
04213dff14 Address copilot feedback 2026-03-12 19:55:20 -07:00
oobabooga
bbd43d9463 UI: Correctly propagate truncation_length when ctx_size is auto 2026-03-12 14:54:05 -07:00
oobabooga
8aeaa76365 Forward logit_bias, logprobs, and n to llama.cpp backend
- Forward logit_bias and logprobs natively to llama.cpp
- Support n>1 completions with seed increment for diversity
- Fix logprobs returning empty dict when not requested
2026-03-10 10:41:45 -03:00
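The n>1 strategy from the bullet list above can be sketched in a few lines; `request_completion` stands in for a single-completion backend call (a hypothetical signature, not the repository's actual API):

```python
def generate_n_completions(request_completion, prompt: str, n: int, seed: int = 42):
    """Request n completions, incrementing the seed per request so that
    sampling differs between them (the diversity trick named above)."""
    return [request_completion(prompt, seed=seed + i) for i in range(n)]
```

With a fixed seed per request, each of the `n` samples is reproducible individually while still differing from its siblings.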
oobabooga
7a8ca9f2b0 Fix passing adaptive-p to llama-server 2026-03-08 04:09:40 -07:00
oobabooga
5fa709a3f4 llama.cpp server: use port+5 offset and suppress "No parser definition detected" logs 2026-03-06 18:52:34 -08:00
oobabooga
93ebfa2b7e Fix llama-server output filter for new log format 2026-03-06 02:38:13 -03:00
oobabooga
66fb79fe15 llama.cpp: Add --fit-target param 2026-03-06 01:55:48 -03:00
oobabooga
e2548f69a9 Make user_data configurable: add --user-data-dir flag, auto-detect ../user_data
If --user-data-dir is not set, auto-detect: use ../user_data when
./user_data doesn't exist, making it easy to share user data across
portable builds by placing it one folder up.
2026-03-05 19:31:10 -08:00
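The auto-detection rule described above can be sketched as a small resolver; the function name and signature are hypothetical:

```python
from pathlib import Path
from typing import Optional

def resolve_user_data_dir(cli_value: Optional[str], base: Path) -> Path:
    """Hypothetical sketch of the rule above: an explicit --user-data-dir
    wins; otherwise use ./user_data if it exists, falling back to
    ../user_data so portable builds can share one folder placed one
    level up."""
    if cli_value:
        return Path(cli_value)
    local = base / 'user_data'
    if local.exists():
        return local
    return base.parent / 'user_data'
```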
oobabooga
9824c82cb6 API: Add parallel request support for llama.cpp and ExLlamaV3 2026-03-05 16:49:58 -08:00
oobabooga
69fa4dd0b1 llama.cpp: allow ctx_size=0 for auto context via --fit 2026-03-04 19:33:20 -08:00
oobabooga
fbfcd59fe0 llama.cpp: Use -1 instead of 0 for auto gpu_layers 2026-03-04 19:21:45 -08:00
oobabooga
da3010c3ed tiny improvements to llama_cpp_server.py 2026-03-04 15:54:37 -08:00
Sense_wang
7bf15ad933 fix: replace bare except clauses with except Exception (#7400) 2026-03-04 18:06:17 -03:00
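The fix in #7400 matters because a bare `except:` also swallows `KeyboardInterrupt` and `SystemExit`, while `except Exception` catches only ordinary errors. An illustrative before/after (hypothetical function, not the repository's actual code):

```python
def safe_parse_port(text: str, default: int = 5001) -> int:
    # Bad (pre-fix) pattern:
    #     try:
    #         return int(text)
    #     except:            # also catches KeyboardInterrupt / SystemExit
    #         return default
    #
    # Fixed pattern, as in #7400: only catch Exception subclasses.
    try:
        return int(text)
    except Exception:
        return default
```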
oobabooga
cdf0e392e6 llama.cpp: Reorganize speculative decoding UI and use recommended ngram-mod defaults 2026-03-04 12:05:08 -08:00
oobabooga
65de4c30c8 Add adaptive-p sampler and n-gram speculative decoding support 2026-03-04 09:41:29 -08:00
oobabooga
f4d787ab8d Delegate GPU layer allocation to llama.cpp's --fit 2026-03-04 06:37:50 -08:00
oobabooga
8a3d866401 Fix temperature_last having no effect in llama.cpp server sampler order 2026-03-04 06:10:51 -08:00
oobabooga
c54e8a2b3d Try to spawn llama.cpp on port 5001 instead of random port 2026-01-28 08:23:55 -08:00
GodEmperor785
400bb0694b Add slider for --ubatch-size for llama.cpp loader, change defaults for better MoE performance (#7316) 2025-11-21 16:56:02 -03:00
oobabooga
0d4eff284c Add a --cpu-moe flag for llama.cpp 2025-11-19 05:23:43 -08:00
mamei16
308e726e11 log error when llama-server request exceeds context size (#7263) 2025-10-12 23:00:11 -03:00
oobabooga
f3829b268a llama.cpp: Always pass --flash-attn on 2025-09-02 12:12:17 -07:00
oobabooga
13876a1ee8 llama.cpp: Remove the --flash-attn flag (it's always on now) 2025-08-30 20:28:26 -07:00
oobabooga
3ad5970374 Make the llama.cpp --verbose output less verbose 2025-08-25 17:43:21 -07:00
oobabooga
8be798e15f llama.cpp: Fix stderr deadlock while loading some multimodal models 2025-08-24 12:20:05 -07:00
oobabooga
7fe8da8944 Minor simplification after f247c2ae62 2025-08-22 14:42:56 -07:00
altoiddealer
57f6e9af5a Set multimodal status during Model Loading (#7199) 2025-08-13 16:47:27 -03:00
oobabooga
8d7b88106a Revert "mtmd: Fail early if images are provided but the model doesn't support them (llama.cpp)"
This reverts commit d8fcc71616.
2025-08-12 13:20:16 -07:00
oobabooga
d8fcc71616 mtmd: Fail early if images are provided but the model doesn't support them (llama.cpp) 2025-08-11 18:02:33 -07:00
oobabooga
e6447cd24a mtmd: Update the llama-server request 2025-08-11 17:42:35 -07:00
oobabooga
0e3def449a llama.cpp: --swa-full to llama-server when streaming-llm is checked 2025-08-11 15:17:25 -07:00
oobabooga
b62c8845f3 mtmd: Fix /chat/completions for llama.cpp 2025-08-11 12:01:59 -07:00
oobabooga
2f90ac9880 Move the new image_utils.py file to modules/ 2025-08-09 21:41:38 -07:00
oobabooga
d86b0ec010 Add multimodal support (llama.cpp) (#7027) 2025-08-10 01:27:25 -03:00
oobabooga
609c3ac893 Optimize the end of generation with llama.cpp 2025-06-15 08:03:27 -07:00