llama.cpp: Use --fit-ctx 8192 when --fit on is used

This sets the minimum acceptable context length, which by default is 4096.
oobabooga 2026-03-15 09:22:38 -07:00
parent 5763cab3c4
commit 9119ce0680

@@ -378,6 +378,7 @@ class LlamaServer:
             cmd += ["--gpu-layers", str(shared.args.gpu_layers), "--fit", "off"]
         else:
             cmd += ["--fit", "on"]
+            cmd += ["--fit-ctx", "8192"]
         if shared.args.fit_target:
             cmd += ["--fit-target", shared.args.fit_target]
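The flag logic in the hunk above can be sketched as a standalone function. This is a minimal illustration, not the webui's actual code: `Args` and `build_fit_flags` are hypothetical names standing in for `shared.args` and the surrounding `LlamaServer` method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Args:
    # Stand-in for shared.args (hypothetical simplification).
    gpu_layers: int = 0               # user-pinned GPU layer count (0 = unset)
    fit_target: Optional[str] = None  # optional --fit-target value


def build_fit_flags(args: Args) -> list:
    cmd = []
    if args.gpu_layers:
        # The user chose an explicit layer split: turn automatic fitting off.
        cmd += ["--gpu-layers", str(args.gpu_layers), "--fit", "off"]
    else:
        # Let llama-server fit the model automatically, but don't let it
        # shrink the context below 8192 tokens (the built-in floor is 4096).
        cmd += ["--fit", "on"]
        cmd += ["--fit-ctx", "8192"]
    if args.fit_target:
        cmd += ["--fit-target", args.fit_target]
    return cmd


print(build_fit_flags(Args()))
# → ['--fit', 'on', '--fit-ctx', '8192']
```

With default arguments the automatic-fit branch is taken, so `--fit-ctx 8192` is always emitted alongside `--fit on`; a nonzero `gpu_layers` skips both.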