The monitor physically cannot paint faster than its refresh rate, so
intermediate morphdom calls between frames do redundant parsing, diffing,
and patching work that is never displayed.
Extract thinking/reasoning blocks (e.g. <think>...</think>) into a
separate reasoning_content field on the assistant message, matching
the convention used by DeepSeek, llama.cpp, and SGLang.
- Rewrite logprobs output format to match the OpenAI specification for
both chat completions and completions endpoints
- Fix top_logprobs count being ignored for llama.cpp and ExLlamav3
backends in chat completions (always returned 1 instead of requested N)
- Fix non-streaming responses only returning logprobs for the last token
instead of all generated tokens (affects all HF-based loaders)
- Fix logprobs returning null for non-streaming chat requests on HF loaders
- Fix off-by-one returning one extra top alternative on HF loaders
- Return proper OpenAI error format ({"error": {...}}) instead of HTTP 500 for validation errors
- Send data: [DONE] at the end of SSE streams
- Fix finish_reason so "tool_calls" takes priority over "length"
- Stop including usage in streaming chunks when include_usage is not set
- Handle "developer" role in messages (treated same as "system")
- Add logprobs and top_logprobs parameters for chat completions
- Fix chat completions logprobs not working with llama.cpp and ExLlamav3 backends
- Add max_completion_tokens as an alias for max_tokens in chat completions
- Forward logit_bias and logprobs natively to llama.cpp
- Support n>1 completions with seed increment for diversity
- Fix logprobs returning empty dict when not requested
- Use config.eos_token_id_list for all EOS tokens as stop conditions
(fixes models like Llama-3 that define multiple EOS token IDs)
- Load vision/draft models before main model so autosplit accounts
for their VRAM usage
- Fix loss computation in ExLlamav3_HF: use cache across chunks so
sequences longer than 2048 tokens get correct perplexity values