API: Add parallel request support for llama.cpp and ExLlamaV3

This commit is contained in:
oobabooga 2026-03-05 16:49:58 -08:00
parent 2f08dce7b0
commit 9824c82cb6
10 changed files with 198 additions and 63 deletions


@@ -338,6 +338,35 @@ for event in client.events():
print()
```
#### Python parallel requests example
The API can handle multiple requests in parallel. ExLlamaV3 supports this out of the box; for llama.cpp, pass `--parallel N` to set the number of concurrent slots.
```python
import concurrent.futures

import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

prompts = [
    "Write a haiku about the ocean.",
    "Explain quantum computing in simple terms.",
    "Tell me a joke about programmers.",
]

def send_request(prompt):
    # Each call blocks until the server returns the full completion.
    response = requests.post(url, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    })
    return response.json()["choices"][0]["message"]["content"]

# The thread pool sends all requests at once; the server processes
# them concurrently, one per available slot.
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(send_request, prompts))

for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}\nA: {result}\n")
```
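To check that requests actually overlap rather than queue, a quick timing comparison helps. This is a minimal sketch reusing the endpoint and payload from the example above; the repeated prompt and request count are illustrative, not part of the API.
```python
import concurrent.futures
import time

import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

def timed_request(prompt):
    # Measure wall-clock time for a single request.
    start = time.perf_counter()
    response = requests.post(url, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    })
    elapsed = time.perf_counter() - start
    return elapsed, response.json()["choices"][0]["message"]["content"]

# Four identical prompts, so individual timings are comparable.
prompts = ["Write a haiku about the ocean."] * 4

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as executor:
    results = list(executor.map(timed_request, prompts))
total = time.perf_counter() - start

for elapsed, _ in results:
    print(f"request took {elapsed:.1f}s")
print(f"all {len(prompts)} requests finished in {total:.1f}s")
```
With enough slots (for llama.cpp, `--parallel` at least equal to the number of prompts), the total should be close to the slowest individual request rather than the sum of all of them; if it grows linearly with the number of prompts, the server is handling them sequentially.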
#### Python example with API key
Replace