OpenAI compatible API

The main API for this project is meant to be a drop-in replacement for the OpenAI API, including the Chat and Completions endpoints.

  • It is 100% offline and private.
  • It doesn't create any logs.
  • It doesn't connect to OpenAI.
  • It doesn't use the openai-python library.

Starting the API

Add --api to your command-line flags.

  • To create a public Cloudflare URL, add the --public-api flag.
  • To listen on your local network, add the --listen flag.
  • To change the port, which is 5000 by default, use --api-port 1234 (change 1234 to your desired port number).
  • To use SSL, add --ssl-keyfile key.pem --ssl-certfile cert.pem. ⚠️ Note: this doesn't work with --public-api since Cloudflare already uses HTTPS by default.
  • To use an API key for authentication, add --api-key yourkey.

Examples

For the documentation with all the endpoints, parameters and their types, consult http://127.0.0.1:5000/docs or the typing.py file.

The official examples in the OpenAI documentation should also work, and the same parameters apply (although the API here has more optional parameters).

Completions

curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
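
The same request can be sent from Python. The sketch below uses only the standard library. The helper names build_completion_request and complete are illustrative and not part of this project, and the response shape (choices[0].text) is assumed to follow the OpenAI Completions format.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:5000/v1/completions"  # default host and port


def build_completion_request(prompt, url=API_URL, **params):
    """Build a POST request; extra keyword arguments become JSON fields."""
    body = {"prompt": prompt, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **params):
    """Send the request and return the generated text (assumed response shape)."""
    request = build_completion_request(prompt, **params)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]
```

With the server running, complete("This is a cake recipe:\n\n1.", max_tokens=512, temperature=0.6, top_p=0.95, top_k=20) should mirror the curl call above.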

Chat completions

Works best with instruction-following models. If the "instruction_template" variable is not provided, it will be guessed automatically based on the model name using the regex patterns in user_data/models/config.yaml.

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

Chat completions with characters

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello! Who are you?"
      }
    ],
    "mode": "chat-instruct",
    "character": "Example",
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

Multimodal/vision (llama.cpp and ExLlamaV3)

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Please describe what you see in this image."},
          {"type": "image_url", "image_url": {"url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"}}
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

For base64-encoded images, just replace the inner "url" value with this format: data:image/FORMAT;base64,BASE64_STRING where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.
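
As a minimal sketch of the data URL format described above, the following helpers encode raw image bytes and wrap them in a user message. The function names image_data_url and vision_message are hypothetical and chosen only for illustration.

```python
import base64


def image_data_url(image_bytes, image_format="png"):
    """Return a data URL of the form data:image/FORMAT;base64,BASE64_STRING."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


def vision_message(text, image_bytes, image_format="png"):
    """Build one user message with a text part and an inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": image_data_url(image_bytes, image_format)},
            },
        ],
    }
```

The returned dictionary can be placed directly in the "messages" list of the request body, with the image bytes read from a local file.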

With /v1/completions

curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "About image <__media__> and image <__media__>, what I can say is that the first one"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/strawberry.png?raw=true"
            }
          }
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

For base64-encoded images, just replace the inner "url" values with this format: data:image/FORMAT;base64,BASE64_STRING where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.
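
In the request above, the text contains two <__media__> placeholders and the message carries two images. Assuming each placeholder is meant to be matched by exactly one image (an inference from that example, not something this page states), a small client-side check could catch mismatches before sending. The function name check_media_placeholders is hypothetical.

```python
MEDIA_PLACEHOLDER = "<__media__>"


def check_media_placeholders(content_parts):
    """Return (placeholder_count, image_count) and raise if they differ."""
    placeholders = sum(
        part["text"].count(MEDIA_PLACEHOLDER)
        for part in content_parts
        if part.get("type") == "text"
    )
    images = sum(1 for part in content_parts if part.get("type") == "image_url")
    if placeholders != images:
        raise ValueError(
            f"{placeholders} placeholder(s) but {images} image(s) in the message"
        )
    return placeholders, images
```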

Image generation

curl http://127.0.0.1:5000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "an orange tree",
    "steps": 9,
    "cfg_scale": 0,
    "batch_size": 1,
    "batch_count": 1
  }'

You need to load an image model first. You can do this via the UI, or by adding --image-model your_model_name when launching the server.

The output is a JSON object containing a data array. Each element has a b64_json field with the base64-encoded PNG image:

{
  "created": 1764791227,
  "data": [
    {
      "b64_json": "iVBORw0KGgo..."
    }
  ]
}
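
A minimal sketch of turning that JSON object back into image files: each b64_json value is decoded and written as a PNG. The function name save_generated_images and the image_<n>.png naming scheme are illustrative choices.

```python
import base64
from pathlib import Path


def save_generated_images(response_json, output_dir="."):
    """Decode every b64_json entry and write it as image_<n>.png."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, item in enumerate(response_json["data"]):
        target = output_path / f"image_{index}.png"
        target.write_bytes(base64.b64decode(item["b64_json"]))
        saved.append(target)
    return saved
```

Pass it the parsed JSON body of the /v1/images/generations response; it returns the list of paths it wrote.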

SSE streaming

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "stream": true
  }'
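
With "stream": true, the response arrives as server-sent events, where each event carries a JSON chunk on a line starting with "data:". The sketch below parses such lines without an SSE library. The chunk layout (choices[0].delta.content) matches the Python streaming example later on this page; the "[DONE]" end marker is the OpenAI convention and is only assumed here, so the parser tolerates its absence. Both function names are hypothetical.

```python
import json


def iter_sse_payloads(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # OpenAI-style end marker; assumed, may not be sent
            break
        yield json.loads(data)


def join_chat_deltas(payloads):
    """Concatenate the text carried in choices[0].delta.content of each chunk."""
    parts = []
    for payload in payloads:
        delta = payload["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

With the requests library, the lines can come from response.iter_lines() on a request made with stream=True.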

Logits

curl -k http://127.0.0.1:5000/v1/internal/logits \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Who is best, Asuka or Rei? Answer:",
    "use_samplers": false
  }'

Logits after sampling parameters

curl -k http://127.0.0.1:5000/v1/internal/logits \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Who is best, Asuka or Rei? Answer:",
    "use_samplers": true,
    "top_k": 3
  }'

List models

curl -k http://127.0.0.1:5000/v1/internal/model/list \
  -H "Content-Type: application/json"

Load model

curl -k http://127.0.0.1:5000/v1/internal/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen_Qwen3-0.6B-Q4_K_M.gguf",
    "args": {
      "ctx_size": 32768,
      "flash_attn": true,
      "cache_type": "q8_0"
    }
  }'

Python chat example

import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "messages": history,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20
    }

    response = requests.post(url, headers=headers, json=data, verify=False)
    assistant_message = response.json()['choices'][0]['message']['content']
    history.append({"role": "assistant", "content": assistant_message})
    print(assistant_message)

Python chat example with streaming

Start the script with python -u to see the output in real time.

import requests
import sseclient  # pip install sseclient-py
import json

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "stream": True,
        "messages": history,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20
    }

    stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
    client = sseclient.SSEClient(stream_response)

    assistant_message = ''
    for event in client.events():
        payload = json.loads(event.data)
        chunk = payload['choices'][0]['delta']['content']
        assistant_message += chunk
        print(chunk, end='')

    print()
    history.append({"role": "assistant", "content": assistant_message})

Python completions example with streaming

Start the script with python -u to see the output in real time.

import json
import requests
import sseclient  # pip install sseclient-py

url = "http://127.0.0.1:5000/v1/completions"

headers = {
    "Content-Type": "application/json"
}

data = {
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "stream": True,
}

stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
client = sseclient.SSEClient(stream_response)

print(data['prompt'], end='')
for event in client.events():
    payload = json.loads(event.data)
    print(payload['choices'][0]['text'], end='')

print()

Python example with API key

Replace

headers = {
    "Content-Type": "application/json"
}

with

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer yourPassword123"
}

in any of the examples above.
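
To keep one script usable with and without --api-key, the headers can be built by a small helper. This is a sketch; build_headers is a hypothetical name, and the key must match the value passed to --api-key.

```python
def build_headers(api_key=None):
    """Return request headers, adding a Bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```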

Tool/Function Calling Example

You need to use a model with tools support. The prompt will be automatically formatted using the model's Jinja2 template.

Request:

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What time is it currently in New York City?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_time",
          "description": "Get current time in a specific timezone",
          "parameters": {
            "type": "object",
            "required": ["timezone"],
            "properties": {
              "timezone": {
                "type": "string",
                "description": "IANA timezone name (e.g., America/New_York, Europe/London). Use Europe/Berlin as local timezone if no timezone provided by the user."
              }
            }
          }
        }
      }
    ]
  }'

Sample response:

{
    "id": "chatcmpl-1746532051477984256",
    "object": "chat.completion",
    "created": 1746532051,
    "model": "qwen2.5-coder-14b-instruct-q4_k_m.gguf",
    "choices": [
        {
            "index": 0,
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": "```xml\n<function>\n{\n  \"name\": \"get_current_time\",\n  \"arguments\": {\n    \"timezone\": \"America/New_York\"\n  }\n}\n</function>\n```"
            },
            "tool_calls": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_current_time",
                        "arguments": "{\"timezone\": \"America/New_York\"}"
                    },
                    "id": "call_52ij07mh",
                    "index": "0"
                }
            ]
        }
    ],
    "usage": {
        "prompt_tokens": 224,
        "completion_tokens": 38,
        "total_tokens": 262
    }
}
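
The server only reports which tool the model wants to call; running it is up to the client. The sketch below reads the tool calls from a response shaped like the sample above, parses the JSON-encoded arguments, and calls a matching local function. The TOOLS registry, run_tool_calls, and the local get_current_time implementation are all hypothetical, and the "tool" message format follows the OpenAI convention, which is assumed rather than confirmed for this server.

```python
import json
from datetime import datetime
from zoneinfo import ZoneInfo


def get_current_time(timezone):
    """Local implementation of the tool declared in the request."""
    return datetime.now(ZoneInfo(timezone)).isoformat()


TOOLS = {"get_current_time": get_current_time}


def run_tool_calls(response_json, tools=TOOLS):
    """Execute every tool call in the first choice and return tool messages."""
    choice = response_json["choices"][0]
    # The sample puts tool_calls next to "message"; OpenAI nests it inside.
    calls = (
        choice.get("tool_calls")
        or choice.get("message", {}).get("tool_calls")
        or []
    )
    results = []
    for call in calls:
        name = call["function"]["name"]
        arguments = json.loads(call["function"]["arguments"])  # JSON string
        output = tools[name](**arguments)
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(output)}
        )
    return results
```

The returned messages can be appended to the conversation and sent in a follow-up request so the model can phrase the final answer.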

Environment variables

The following environment variables can be used (they take precedence over everything else):

| Variable Name | Description | Example Value |
| --- | --- | --- |
| OPENEDAI_PORT | Port number | 5000 |
| OPENEDAI_CERT_PATH | SSL certificate file path | cert.pem |
| OPENEDAI_KEY_PATH | SSL key file path | key.pem |
| OPENEDAI_DEBUG | Enable debugging (set to 1) | 1 |
| OPENEDAI_EMBEDDING_MODEL | Embedding model (if applicable) | sentence-transformers/all-mpnet-base-v2 |
| OPENEDAI_EMBEDDING_DEVICE | Embedding device (if applicable) | cuda |
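
To make the stated precedence concrete, here is a sketch of how a port value might be resolved: the environment variable wins, then the --api-port value, then the default of 5000. This only illustrates the rule described above; it is not the project's actual code, and resolve_api_port is a hypothetical name.

```python
import os

DEFAULT_PORT = 5000


def resolve_api_port(cli_port=None, environ=None):
    """Pick the port: OPENEDAI_PORT first, then --api-port, then 5000."""
    environ = os.environ if environ is None else environ
    env_value = environ.get("OPENEDAI_PORT")
    if env_value:
        return int(env_value)
    if cli_port is not None:
        return int(cli_port)
    return DEFAULT_PORT
```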

Persistent settings with settings.yaml

You can also set the following variables in your settings.yaml file:

openai-embedding_device: cuda
openai-embedding_model: "sentence-transformers/all-mpnet-base-v2"
openai-debug: 1

Third-party application setup

You can usually force an application that uses the OpenAI API to connect to the local API by using the following environment variables:

OPENAI_API_HOST=http://127.0.0.1:5000

or

OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://127.0.0.1:5000/v1

With the official Python openai client (v1.x), the address can be set like this:

from openai import OpenAI

client = OpenAI(
    api_key="sk-111111111111111111111111111111111111111111111111",
    base_url="http://127.0.0.1:5000/v1"
)

response = client.chat.completions.create(
    model="x",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

With the official Node.js openai client (v4.x):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "http://127.0.0.1:5000/v1",
});

const response = await client.chat.completions.create({
  model: "x",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

Embeddings (alpha)

The embeddings endpoint requires the sentence-transformers package to be installed; chat and completions work without it. It currently uses the Hugging Face model sentence-transformers/all-mpnet-base-v2, which produces 768-dimensional embeddings and is small and fast. The model and embedding size may change in the future.

| model name | dimensions | input max tokens | speed | size | Avg. performance |
| --- | --- | --- | --- | --- | --- |
| all-mpnet-base-v2 | 768 | 384 | 2800 | 420M | 63.3 |
| all-MiniLM-L6-v2 | 384 | 256 | 14200 | 80M | 58.8 |

In short, all-MiniLM-L6-v2 is about 5x faster, uses about 5x less RAM and 2x less storage, and still offers good quality. The figures are from https://www.sbert.net/docs/pretrained_models.html. To change the default model, set the environment variable OPENEDAI_EMBEDDING_MODEL, e.g. OPENEDAI_EMBEDDING_MODEL=all-MiniLM-L6-v2.

Warning: You cannot mix embeddings from different models even if they have the same dimensions. They are not comparable.
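
Embeddings are typically compared with cosine similarity. The sketch below assumes the /v1/embeddings response follows the OpenAI layout, with each vector under data[i]["embedding"]; both function names are hypothetical. The length check catches vectors of different dimensions, but it cannot detect two different models that share a dimension, so the caller still has to track which model produced each vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors from the same model."""
    if len(a) != len(b):
        raise ValueError("vectors have different dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embeddings_from_response(response_json):
    """Extract vectors, assuming the OpenAI layout data[i]["embedding"]."""
    return [item["embedding"] for item in response_json["data"]]
```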

Compatibility

| API endpoint | Notes |
| --- | --- |
| /v1/chat/completions | Use with instruction-following models. Supports streaming, tool calls. |
| /v1/completions | Text completion endpoint. |
| /v1/embeddings | Using SentenceTransformer embeddings. |
| /v1/images/generations | Image generation, response_format='b64_json' only. |
| /v1/moderations | Basic support via embeddings. |
| /v1/models | Lists models. Currently loaded model first. |
| /v1/models/{id} | Returns model info. |
| /v1/audio/* | Supported. |
| /v1/images/edits | Not yet supported. |
| /v1/images/variations | Not yet supported. |

Applications

Almost every application needs the OPENAI_API_KEY and OPENAI_API_BASE environment variables set, but there are some exceptions.

| Application/Library | Website | Notes |
| --- | --- | --- |
| openai-python | https://github.com/openai/openai-python | Use OpenAI(base_url="http://127.0.0.1:5000/v1"). Only the endpoints from above work. |
| openai-node | https://github.com/openai/openai-node | Use new OpenAI({baseURL: "http://127.0.0.1:5000/v1"}). See example above. |
| anse | https://github.com/anse-app/anse | API Key & URL configurable in UI, Images also work. |
| shell_gpt | https://github.com/TheR1D/shell_gpt | OPENAI_API_HOST=http://127.0.0.1:5000 |
| gpt-shell | https://github.com/jla/gpt-shell | OPENAI_API_BASE=http://127.0.0.1:5000/v1 |
| gpt-discord-bot | https://github.com/openai/gpt-discord-bot | OPENAI_API_BASE=http://127.0.0.1:5000/v1 |
| OpenAI for Notepad++ | https://github.com/Krazal/nppopenai | api_url=http://127.0.0.1:5000 in the config file, or environment variables. |
| vscode-openai | https://marketplace.visualstudio.com/items?itemName=AndrewButson.vscode-openai | OPENAI_API_BASE=http://127.0.0.1:5000/v1 |
| langchain | https://github.com/hwchase17/langchain | Use base_url="http://127.0.0.1:5000/v1". Results depend on model and prompt formatting. |