API: Rewrite logprobs for OpenAI spec compliance across all backends

- Rewrite logprobs output format to match the OpenAI specification for both chat completions and completions endpoints
- Fix top_logprobs count being ignored for llama.cpp and ExLlamav3 backends in chat completions (always returned 1 instead of requested N)
- Fix non-streaming responses only returning logprobs for the last token instead of all generated tokens (affects all HF-based loaders)
- Fix logprobs returning null for non-streaming chat requests on HF loaders
- Fix off-by-one returning one extra top alternative on HF loaders
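
For orientation, here is a minimal sketch of the per-token structure that the OpenAI chat completions format uses for `logprobs.content`. The helper name `build_chat_logprobs` and its input shape are hypothetical and are not taken from this commit. The sketch only illustrates two of the points listed above: one entry per generated token, and exactly `top_n` alternatives per entry.

```python
def build_chat_logprobs(steps, top_n):
    """Build an OpenAI-style chat `logprobs` object.

    `steps` is a hypothetical input: one (chosen_token, {token: logprob})
    pair per generated token, where the chosen token is assumed to be
    present in the candidate dict.
    """
    content = []
    for chosen, candidates in steps:
        # Keep exactly top_n alternatives, highest log-probability first.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        content.append({
            "token": chosen,
            "logprob": candidates[chosen],
            "bytes": list(chosen.encode("utf-8")),
            "top_logprobs": [
                {"token": tok, "logprob": lp, "bytes": list(tok.encode("utf-8"))}
                for tok, lp in ranked
            ],
        })
    # One entry per generated token, not only the last one.
    return {"content": content}


steps = [
    ("Hello", {"Hello": -0.1, "Hi": -2.5, "Hey": -3.0}),
    (" world", {" world": -0.2, " there": -1.9, " all": -4.0}),
]
result = build_chat_logprobs(steps, top_n=2)
```

With `top_n=2`, each of the two entries in `result["content"]` carries two alternatives, which is the behavior the `top_logprobs` and off-by-one fixes aim for.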
2026-04-07 15:43:49 +00:00 · 2026-03-12 14:16:34 -03:00 · fb1b3b6ddf
commit fb1b3b6ddf
parent 5a017aa338
3 changed files with 149 additions and 43 deletions
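
The hunk in modules/text_generation.py saves `state['logits_processor']` before `copy.deepcopy(state)` and restores it afterwards. The commit message does not state the reason; one plausible reading is that a deep copy would hand generation a clone of the processor, so anything it records would never reach the object the caller still holds. The sketch below illustrates that effect with a hypothetical `RecordingProcessor` class and `run` helper, neither of which comes from the repository.

```python
import copy


class RecordingProcessor:
    """Hypothetical stand-in for a logits processor that records per-token data."""

    def __init__(self):
        self.seen = []

    def __call__(self, token_id):
        self.seen.append(token_id)


def run(state, preserve):
    # Mirror the pattern in the hunk: remember the original object,
    # deep-copy the state, then optionally put the original back.
    original = state.get('logits_processor')
    state = copy.deepcopy(state)
    if preserve and original is not None:
        state['logits_processor'] = original
    for token_id in (1, 2, 3):
        state['logits_processor'](token_id)


caller_proc = RecordingProcessor()

# Without restoring, the deep-copied clone receives the calls.
run({'logits_processor': caller_proc, 'stream': False}, preserve=False)
lost = list(caller_proc.seen)

# With restoring, the caller's own object receives them.
run({'logits_processor': caller_proc, 'stream': False}, preserve=True)
kept = list(caller_proc.seen)
```

Here `lost` stays empty while `kept` holds the three recorded token ids, because only the second run uses the caller's original object.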
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@@ -78,10 +78,13 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False, escap
    reply = ''
    is_stream = state['stream']
    if len(all_stop_strings) > 0 and not state['stream']:
+        original_logits_processor = state.get('logits_processor')
        stop_event_ref = state.pop('stop_event', None)
        state = copy.deepcopy(state)
        if stop_event_ref is not None:
            state['stop_event'] = stop_event_ref
+        if original_logits_processor is not None:
+            state['logits_processor'] = original_logits_processor
        state['stream'] = True

    # Generate