## OpenAI compatible API

The main API for this project is meant to be a drop-in replacement for the OpenAI API, including Chat and Completions endpoints.

* It is 100% offline and private.
* It doesn't create any logs.
* It doesn't connect to OpenAI.
* It doesn't use the openai-python library.

### Starting the API

Add `--api` to your command-line flags.

* To create a public Cloudflare URL, add the `--public-api` flag.
* To listen on your local network, add the `--listen` flag.
* To change the port, which is 5000 by default, use `--api-port 1234` (change 1234 to your desired port number).
* To use SSL, add `--ssl-keyfile key.pem --ssl-certfile cert.pem`. ⚠️ **Note**: this doesn't work with `--public-api` since Cloudflare already uses HTTPS by default.
* To use an API key for authentication, add `--api-key yourkey`.

### Examples

For the documentation with all the endpoints, parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.

The official examples in the [OpenAI documentation](https://platform.openai.com/docs/api-reference) should also work, and the same parameters apply (although the API here has more optional parameters).

#### Completions

```shell
curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

#### Chat completions

Works best with instruction-following models. If the `instruction_template` parameter is not provided, it is guessed automatically from the model name using the regex patterns in `user_data/models/config.yaml`.

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

#### Chat completions with characters

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello! Who are you?"
      }
    ],
    "mode": "chat-instruct",
    "character": "Example",
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

#### Multimodal/vision (llama.cpp and ExLlamaV3)

##### With /v1/chat/completions (recommended!)

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Please describe what you see in this image."},
          {"type": "image_url", "image_url": {"url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"}}
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

For base64-encoded images, replace the inner "url" value with this format: `data:image/FORMAT;base64,BASE64_STRING`, where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.

##### With /v1/completions

```shell
curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "About image <__media__> and image <__media__>, what I can say is that the first one"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/cat.png?raw=true"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/turboderp-org/exllamav3/blob/master/examples/media/strawberry.png?raw=true"
            }
          }
        ]
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
```

For base64-encoded images, replace the inner "url" values with this format: `data:image/FORMAT;base64,BASE64_STRING`, where FORMAT is the file type (png, jpeg, gif, etc.) and BASE64_STRING is your base64-encoded image data.
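
The `data:image/FORMAT;base64,BASE64_STRING` format described above can be produced with the Python standard library. The following is a minimal sketch, not part of the project: the helper names `image_to_data_url` and `image_message` are hypothetical, and the message layout simply mirrors the curl examples in this section.

```python
import base64


def image_to_data_url(image_bytes: bytes, image_format: str = "png") -> str:
    """Return a data URL of the form data:image/FORMAT;base64,BASE64_STRING.

    Hypothetical helper: image_format is the file type (png, jpeg, gif, ...).
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


def image_message(text: str, data_url: str) -> dict:
    """Build a user message with one text part and one image part,
    following the structure used in the curl examples above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes for illustration; in practice read a real file, e.g.
    # with open("cat.png", "rb") as f: image_bytes = f.read()
    image_bytes = b"hi"
    url = image_to_data_url(image_bytes, "png")
    print(image_message("Please describe what you see in this image.", url))
```

The resulting dictionary can be placed in the `messages` list of a request body in place of a message that uses a remote image URL.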
#### Image generation

```shell
curl http://127.0.0.1:5000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "an orange tree",
    "steps": 9,
    "cfg_scale": 0,
    "batch_size": 1,
    "batch_count": 1
  }'
```

You need to load an image model first. You can do this via the UI, or by adding `--image-model your_model_name` when launching the server.

The output is a JSON object containing a `data` array. Each element has a `b64_json` field with the base64-encoded PNG image:

```json
{
  "created": 1764791227,
  "data": [
    {
      "b64_json": "iVBORw0KGgo..."
    }
  ]
}
```

#### SSE streaming

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "stream": true
  }'
```

#### Logits

```shell
curl -k http://127.0.0.1:5000/v1/internal/logits \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Who is best, Asuka or Rei? Answer:",
    "use_samplers": false
  }'
```

#### Logits after sampling parameters

```shell
curl -k http://127.0.0.1:5000/v1/internal/logits \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Who is best, Asuka or Rei? Answer:",
    "use_samplers": true,
    "top_k": 3
  }'
```

#### List models

```shell
curl -k http://127.0.0.1:5000/v1/internal/model/list \
  -H "Content-Type: application/json"
```

#### Load model

```shell
curl -k http://127.0.0.1:5000/v1/internal/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Qwen_Qwen3-0.6B-Q4_K_M.gguf",
    "args": {
      "ctx_size": 32768,
      "flash_attn": true,
      "cache_type": "q8_0"
    }
  }'
```

#### Python chat example

```python
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "messages": history,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20
    }

    response = requests.post(url, headers=headers, json=data, verify=False)
    assistant_message = response.json()['choices'][0]['message']['content']
    history.append({"role": "assistant", "content": assistant_message})
    print(assistant_message)
```

#### Python chat example with streaming

Start the script with `python -u` to see the output in real time.
```python
import requests
import sseclient  # pip install sseclient-py
import json

url = "http://127.0.0.1:5000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

history = []

while True:
    user_message = input("> ")
    history.append({"role": "user", "content": user_message})
    data = {
        "stream": True,
        "messages": history,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20
    }

    stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
    client = sseclient.SSEClient(stream_response)

    assistant_message = ''
    for event in client.events():
        payload = json.loads(event.data)
        chunk = payload['choices'][0]['delta']['content']
        assistant_message += chunk
        print(chunk, end='')

    print()
    history.append({"role": "assistant", "content": assistant_message})
```

#### Python completions example with streaming

Start the script with `python -u` to see the output in real time.

```python
import json
import requests
import sseclient  # pip install sseclient-py

url = "http://127.0.0.1:5000/v1/completions"

headers = {
    "Content-Type": "application/json"
}

data = {
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "stream": True,
}

stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
client = sseclient.SSEClient(stream_response)

print(data['prompt'], end='')
for event in client.events():
    payload = json.loads(event.data)
    print(payload['choices'][0]['text'], end='')

print()
```

#### Python example with API key

Replace

```python
headers = {
    "Content-Type": "application/json"
}
```

with

```python
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer yourPassword123"
}
```

in any of the examples above.

#### Tool/Function Calling Example

You need to use a model with tools support. The prompt will be automatically formatted using the model's Jinja2 template.

Request:

```shell
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What time is it currently in New York City?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_time",
          "description": "Get the current time in a specific timezone",
          "parameters": {
            "type": "object",
            "required": ["timezone"],
            "properties": {
              "timezone": {
                "type": "string",
                "description": "IANA timezone name (e.g., America/New_York, Europe/London). Use Europe/Berlin as local timezone if no timezone provided by the user."
              }
            }
          }
        }
      }
    ]
  }'
```

Sample response:

```json
{
    "id": "chatcmpl-1746532051477984256",
    "object": "chat.completion",
    "created": 1746532051,
    "model": "qwen2.5-coder-14b-instruct-q4_k_m.gguf",
    "choices": [
        {
            "index": 0,
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": "```xml\n\n{\n \"name\": \"get_current_time\",\n \"arguments\": {\n \"timezone\": \"America/New_York\"\n }\n}\n\n```"
            },
            "tool_calls": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_current_time",
                        "arguments": "{\"timezone\": \"America/New_York\"}"
                    },
                    "id": "call_52ij07mh",
                    "index": "0"
                }
            ]
        }
    ],
    "usage": {
        "prompt_tokens": 224,
        "completion_tokens": 38,
        "total_tokens": 262
    }
}
```

### Environment variables

The following environment variables can be used (they take precedence over everything else):

| Variable Name               | Description                      | Example Value                           |
|-----------------------------|----------------------------------|-----------------------------------------|
| `OPENEDAI_PORT`             | Port number                      | 5000                                    |
| `OPENEDAI_CERT_PATH`        | SSL certificate file path        | cert.pem                                |
| `OPENEDAI_KEY_PATH`         | SSL key file path                | key.pem                                 |
| `OPENEDAI_DEBUG`            | Enable debugging (set to 1)      | 1                                       |
| `OPENEDAI_EMBEDDING_MODEL`  | Embedding model (if applicable)  | sentence-transformers/all-mpnet-base-v2 |
| `OPENEDAI_EMBEDDING_DEVICE` | Embedding device (if applicable) | cuda                                    |

#### Persistent settings with `settings.yaml`

You can also set the following variables in your `settings.yaml` file:

```yaml
openai-embedding_device: cuda
openai-embedding_model: "sentence-transformers/all-mpnet-base-v2"
openai-debug: 1
```

### Third-party application setup

You can usually force an application that uses the OpenAI API to connect to the local API by using the following environment variables:

```shell
OPENAI_API_HOST=http://127.0.0.1:5000
```

or

```shell
OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
OPENAI_API_BASE=http://127.0.0.1:5000/v1
```

With the [official python openai client](https://github.com/openai/openai-python) (v1.x), the address can be set like this:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-111111111111111111111111111111111111111111111111",
    base_url="http://127.0.0.1:5000/v1"
)

response = client.chat.completions.create(
    model="x",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

With the [official Node.js openai client](https://github.com/openai/openai-node) (v4.x):

```js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "http://127.0.0.1:5000/v1",
});

const response = await client.chat.completions.create({
  model: "x",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```

### Embeddings (alpha)

Embeddings require `sentence-transformers` to be installed, but chat and completions work without it. The embeddings endpoint currently uses the HuggingFace model `sentence-transformers/all-mpnet-base-v2`, which produces 768-dimensional embeddings. The model is small and fast. The model and embedding size may change in the future.

| model name        | dimensions | input max tokens | speed | size | Avg. performance |
| ----------------- | ---------- | ---------------- | ----- | ---- | ---------------- |
| all-mpnet-base-v2 | 768        | 384              | 2800  | 420M | 63.3             |
| all-MiniLM-L6-v2  | 384        | 256              | 14200 | 80M  | 58.8             |

In short, the all-MiniLM-L6-v2 model is 5x faster, uses 5x less RAM and 2x less storage, and still offers good quality. Stats from (https://www.sbert.net/docs/pretrained_models.html). To change the model from the default, set the environment variable `OPENEDAI_EMBEDDING_MODEL`, e.g. `OPENEDAI_EMBEDDING_MODEL=all-MiniLM-L6-v2`.

Warning: You cannot mix embeddings from different models, even if they have the same dimensions. They are not comparable.

### Compatibility

| API endpoint           | notes                                                                  |
| ---------------------- | ---------------------------------------------------------------------- |
| /v1/chat/completions   | Use with instruction-following models. Supports streaming, tool calls. |
| /v1/completions        | Text completion endpoint.                                              |
| /v1/embeddings         | Using SentenceTransformer embeddings.                                  |
| /v1/images/generations | Image generation, response_format='b64_json' only.                     |
| /v1/moderations        | Basic support via embeddings.                                          |
| /v1/models             | Lists models. Currently loaded model first.                            |
| /v1/models/{id}        | Returns model info.                                                    |
| /v1/audio/\*           | Supported.                                                             |
| /v1/images/edits       | Not yet supported.                                                     |
| /v1/images/variations  | Not yet supported.                                                     |

#### Applications

Almost everything needs the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables set, but there are some exceptions.

| Compatibility | Application/Library  | Website                                                                        | Notes                                                                                      |
| ------------- | -------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------ |
| ✅❌          | openai-python        | https://github.com/openai/openai-python                                        | Use `OpenAI(base_url="http://127.0.0.1:5000/v1")`. Only the endpoints listed above work.   |
| ✅❌          | openai-node          | https://github.com/openai/openai-node                                          | Use `new OpenAI({baseURL: "http://127.0.0.1:5000/v1"})`. See example above.                |
| ✅            | anse                 | https://github.com/anse-app/anse                                               | API Key & URL configurable in UI, Images also work.                                        |
| ✅            | shell_gpt            | https://github.com/TheR1D/shell_gpt                                            | OPENAI_API_HOST=http://127.0.0.1:5000                                                      |
| ✅            | gpt-shell            | https://github.com/jla/gpt-shell                                               | OPENAI_API_BASE=http://127.0.0.1:5000/v1                                                   |
| ✅            | gpt-discord-bot      | https://github.com/openai/gpt-discord-bot                                      | OPENAI_API_BASE=http://127.0.0.1:5000/v1                                                   |
| ✅            | OpenAI for Notepad++ | https://github.com/Krazal/nppopenai                                            | api_url=http://127.0.0.1:5000 in the config file, or environment variables.                |
| ✅            | vscode-openai        | https://marketplace.visualstudio.com/items?itemName=AndrewButson.vscode-openai | OPENAI_API_BASE=http://127.0.0.1:5000/v1                                                   |
| ✅❌          | langchain            | https://github.com/hwchase17/langchain                                         | Use `base_url="http://127.0.0.1:5000/v1"`. Results depend on model and prompt formatting.  |
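
The notes in the table use two base-URL conventions: `OPENAI_API_HOST` without the `/v1` suffix and `OPENAI_API_BASE` with it. As a rough sketch of how a launcher script might prepare both for a third-party application, the Python below builds the variables in one place. The helpers `local_api_env` and `merged_env` are hypothetical names introduced for illustration, and the default key is the placeholder value used earlier in this document.

```python
import os


def local_api_env(
    host: str = "127.0.0.1",
    port: int = 5000,
    api_key: str = "sk-111111111111111111111111111111111111111111111111",
) -> dict:
    """Return the environment variables named in the table above.

    Hypothetical helper: OPENAI_API_HOST has no /v1 suffix,
    OPENAI_API_BASE includes it.
    """
    root = f"http://{host}:{port}"
    return {
        "OPENAI_API_KEY": api_key,
        "OPENAI_API_HOST": root,
        "OPENAI_API_BASE": f"{root}/v1",
    }


def merged_env(overrides: dict) -> dict:
    """Copy the current process environment and apply the overrides on top."""
    env = dict(os.environ)
    env.update(overrides)
    return env


if __name__ == "__main__":
    for name, value in local_api_env().items():
        print(f"{name}={value}")
```

The dictionary returned by `merged_env(local_api_env())` could be passed as the `env` argument of `subprocess.run` when starting an application that reads these variables. Which of the variables a given application actually honors depends on that application.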