API reference
Create chat completion
POST /v1/chat/completions
Generate a model response for the given chat conversation. The request is OpenAI-compatible; clients built against the OpenAI Python or Node SDK work by overriding base_url.
Surface: OpenAI-compatible. For the Anthropic-compatible surface, see POST /anthropic/v1/messages.
Request
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer ${API_KEY}
Required body parameters
messages
object[], required, length >= 1. The conversation history. Each message is one of four shapes, distinguished by role:
| Role | Required fields | Notes |
|---|---|---|
| system | content (string), role | Set the model's behavior. |
| user | content (string), role | End-user input. |
| assistant | content (string, nullable), role | Replay previous model output for multi-turn context. May carry tool_calls. |
| tool | content (string), role, tool_call_id (string) | Return the result of a tool the model requested. |
All four message shapes accept an optional name (string) to label the participant — useful for distinguishing multiple users or named system personas.
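The role table above can be turned into a small client-side check before a request is sent. The sketch below is illustrative only: validate_message and REQUIRED_FIELDS are hypothetical names, not part of the ezrouter API, and the rules are limited to what the table states (required fields per role, nullable assistant content, optional string name).

```python
from typing import List

# Hypothetical client-side check derived from the role table above.
REQUIRED_FIELDS = {
    "system": ("content",),
    "user": ("content",),
    "assistant": ("content",),
    "tool": ("content", "tool_call_id"),
}


def validate_message(message: dict) -> List[str]:
    """Return a list of problems; an empty list means the message looks valid."""
    role = message.get("role")
    if role not in REQUIRED_FIELDS:
        return [f"unknown role: {role!r}"]
    problems = []
    for field in REQUIRED_FIELDS[role]:
        if field not in message:
            problems.append(f"{role} message is missing {field!r}")
            continue
        value = message[field]
        # Only assistant content is documented as nullable.
        if value is None and role == "assistant" and field == "content":
            continue
        if not isinstance(value, str):
            problems.append(f"{field!r} must be a string")
    name = message.get("name")
    if name is not None and not isinstance(name, str):
        problems.append("'name' must be a string when present")
    return problems
```

Running the check over every entry in messages before sending surfaces shape errors locally instead of as a gateway error.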
model
string, required. The model ID to route the request to.
The set of accepted IDs is determined by the gateway catalog at request time. Query GET /v1/models for the current list; do not hard-code the catalog. Common families include claude, gpt, deepseek, glm, and kimi (exact IDs vary).
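As a minimal sketch of a request carrying the two required parameters, the helper below builds (but does not send) the POST using only the Python standard library. build_chat_request and CHAT_URL are hypothetical names; the URL is copied from the curl example on this page and is an assumption about your deployment.

```python
import json
import urllib.request
from typing import List

# URL taken from the curl example on this page; adjust for your deployment.
CHAT_URL = "https://www.ezrouter.dev/v1/chat/completions"


def build_chat_request(api_key: str, model: str, messages: List[dict], **params) -> urllib.request.Request:
    """Build a POST request with the required model and messages fields."""
    if not messages:
        raise ValueError("messages must contain at least one entry")
    # Optional parameters such as max_tokens are passed through unchanged.
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen with a timeout; the response body is an SSE stream, so it should be read line by line rather than decoded as one JSON document.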
Output-control parameters
max_tokens
integer, nullable. Hard cap on the number of tokens the model may generate. The total of prompt tokens plus generated tokens is bounded by the model's context window. Per-model context and output limits are listed in List models.
temperature
number, nullable. Range 0–2. Default 1. Higher values produce more varied output; lower values produce more deterministic output. Prefer adjusting either temperature or top_p, not both.
top_p
number, nullable. Range 0–1. Default 1. Nucleus-sampling threshold: the model samples only from the smallest set of most-likely tokens whose cumulative probability reaches top_p.
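To make the top_p threshold concrete, here is a toy illustration of nucleus filtering over a hand-written distribution. It shows the general technique only; nucleus_filter is a hypothetical helper, and how each upstream provider implements sampling is not specified on this page.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most-likely tokens until their cumulative mass reaches top_p, then renormalize."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    kept = {}
    cumulative = 0.0
    # Walk tokens from most to least likely.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


candidates = {"a": 0.5, "b": 0.3, "c": 0.2}
filtered = nucleus_filter(candidates, 0.7)  # keeps "a" and "b", drops "c"
```

With top_p set to 1 every candidate survives, which is why the default leaves sampling unrestricted.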
response_format
object, nullable. Constrains output format.
| Field | Type | Values | Notes |
|---|---|---|---|
| type | string | text (default), json_object | Set to json_object to require valid JSON. |
When using json_object, you must also instruct the model in a system or user message to produce JSON. Without this, the model can emit a runaway stream of whitespace until it hits max_tokens. The response may be truncated at the cutoff; check finish_reason == "length" to detect this.
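The two precautions above (instruct the model to emit JSON, and check for truncation before parsing) might look like this in client code. This is a sketch under stated assumptions: parse_json_response is a hypothetical helper, and the prompt text is only an example.

```python
import json
from typing import Any, Optional


def parse_json_response(content: str, finish_reason: Optional[str]) -> Any:
    """Parse json_object output, refusing output that was cut off at max_tokens."""
    if finish_reason == "length":
        raise ValueError("response truncated (finish_reason == 'length'); JSON is likely incomplete")
    return json.loads(content)


# Request body fragment: json_object plus an explicit instruction to produce JSON.
body = {
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "List two primary colors under the key 'colors'."},
    ],
}
```

Here content is the concatenation of all delta content fragments and finish_reason is the value from the choice's last chunk.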
stop
string or string[], nullable. Up to 16 sequences. Generation stops when any sequence is produced. The stop sequence itself is not included in the output.
Streaming parameters
stream
boolean, nullable. The OpenAI specification treats this as the toggle between a single JSON response and SSE chunks. ezrouter always returns SSE regardless of this value (see Differences from OpenAI); setting stream: true is recommended for compatibility but does not change server behavior.
stream_options
object, nullable. Only meaningful if stream: true is set.
| Field | Type | Notes |
|---|---|---|
| include_usage | boolean | If true, an additional chunk is emitted before data: [DONE] carrying the final usage object with token counts. Without this flag the usage is still present on the final non-[DONE] chunk; ezrouter does not strictly require include_usage to expose it. |
Tool-calling parameters
tools
object[], nullable. List of functions the model may call. Maximum 128 functions.
| Field | Type | Notes |
|---|---|---|
| type | string, required | Must be "function". |
| function.name | string, required | Function identifier. Pattern: [a-zA-Z0-9_-], max 64 chars. |
| function.description | string | Helps the model choose when to call this function. |
| function.parameters | object | JSON-Schema describing arguments. See JSON Schema reference. Omit to declare a no-argument function. |
| function.strict | boolean, default false | Beta. Enforce strict-mode argument validation against the schema. |
See Tool calls guide for end-to-end usage.
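A tool entry matching the field table above can be assembled with a small helper that also enforces the documented name pattern. make_tool and the get_weather function are hypothetical, chosen only for illustration.

```python
import re
from typing import Optional

# Name rule from the table above: characters [a-zA-Z0-9_-], at most 64 of them.
_NAME_RE = re.compile(r"[a-zA-Z0-9_-]{1,64}")


def make_tool(name: str, description: str = "", parameters: Optional[dict] = None) -> dict:
    """Build one entry for the tools array."""
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid function name: {name!r}")
    function = {"name": name}
    if description:
        function["description"] = description
    # Omitting parameters declares a no-argument function.
    if parameters is not None:
        function["parameters"] = parameters
    return {"type": "function", "function": function}


weather_tool = make_tool(
    "get_weather",  # hypothetical function name
    "Look up the current weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
```

The resulting dictionaries go into the tools array of the request body; keep the array at or below the 128-function limit.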
tool_choice
string or object, nullable. Controls tool dispatch.
| Value | Meaning |
|---|---|
| "none" | Do not call any tool; respond with a normal message. Default when no tools are provided. |
| "auto" | Model decides whether to call a tool. Default when tools are provided. |
| "required" | Model must call at least one tool. |
| {"type":"function","function":{"name":"..."}} | Force a specific function. |
Thinking-mode parameters
thinking
object, nullable. Toggles thinking mode on models that support it.
| Field | Type | Values | Notes |
|---|---|---|---|
| type | string | enabled (default), disabled | When disabled, the model skips the hidden reasoning step. |
Not all models support thinking mode; consult per-model capability via GET /v1/models. See Thinking mode guide for details.
reasoning_effort
string, nullable. Possible values: high (default for normal requests), max (automatic for some agentic clients). For compatibility with OpenAI's vocabulary, low and medium map to high; xhigh maps to max.
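The alias mapping described above can be mirrored client-side, for example to log the effort level a request will actually run with. effective_reasoning_effort is a hypothetical helper; raising on unknown values is an assumption, since this page does not say how the gateway treats them.

```python
# Mapping taken from the reasoning_effort description above.
_EFFORT_ALIASES = {
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def effective_reasoning_effort(value: str) -> str:
    """Return the effort level ezrouter is documented to apply for a requested value."""
    try:
        return _EFFORT_ALIASES[value]
    except KeyError:
        # Assumption: treat undocumented values as a client error.
        raise ValueError(f"unsupported reasoning_effort: {value!r}") from None
```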
Auxiliary parameters
user_id
string, nullable. Pattern [a-zA-Z0-9_-], max 512 chars. An opaque identifier you assign to distinguish end-users in your application. Do not include personally identifying information. Used by ezrouter for usage attribution and abuse-pattern detection.
logprobs
boolean, nullable. If true, the response includes the log probability of each output token in choices[].logprobs.content.
top_logprobs
integer, nullable. Range 0–20. Number of most-likely alternative tokens to return at each position with their log probabilities. Requires logprobs: true.
Deprecated parameters
| Parameter | Status |
|---|---|
frequency_penalty | Accepted but ignored. Has no effect on output. |
presence_penalty | Accepted but ignored. Has no effect on output. |
Response
200 OK — a sequence of SSE chunks terminated by data: [DONE]. Each chunk is a JSON object with object: "chat.completion.chunk".
The chunk schema below applies to every non-[DONE] chunk. Most chunks carry a partial delta; the final chunk before [DONE] carries the full usage object and an empty choices array.
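Because the body is always an SSE stream, clients need a line-oriented reader rather than a single JSON decode. The sketch below is a minimal reader, assuming each event arrives as one data: line (as in the example on this page); iter_chunks is a hypothetical name, and multi-line data fields are not handled.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded chunk objects from SSE lines, stopping at data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


sample = [
    'data: {"object":"chat.completion.chunk","choices":[{"delta":{"content":"OK"},"index":0}],"usage":null}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[],"usage":{"total_tokens":32}}',
    "data: [DONE]",
]
chunks = list(iter_chunks(sample))
```

In a real client, lines would come from the decoded HTTP response body, read incrementally so that tokens can be displayed as they arrive.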
Chunk schema
| Field | Type | Notes |
|---|---|---|
| id | string, required | Opaque chunk identifier. Same value across all chunks of one response. Format is not stable; do not pattern-match. |
| object | string, required | Always "chat.completion.chunk". |
| created | integer, required | Unix timestamp (seconds) when the chat completion started. |
| model | string, required | The model ID that handled the request. |
| system_fingerprint | string, nullable | Always null on ezrouter. OpenAI populates this for backend tracking; ezrouter does not currently surface a fingerprint. |
| choices | object[], required | Per-choice deltas. Empty array on the final usage-only chunk. |
| usage | object, nullable | null on intermediate chunks. Populated on the final usage chunk (see below). |
choices[].delta
| Field | Type | Notes |
|---|---|---|
| role | string, nullable | "assistant" on the first chunk; absent thereafter. |
| content | string, nullable | Token text fragment. |
| reasoning_content | string, nullable | Thinking-mode hidden chain-of-thought. Present only when thinking mode is active. |
| tool_calls | object[], nullable | Streamed tool-call deltas. See below. |
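
Since each delta carries only a fragment, a client has to concatenate them to recover the full message. The following sketch folds already-decoded chunk dictionaries into one message; accumulate_message is a hypothetical helper, and it only tracks the choice at index 0, which is an assumption that fits single-choice requests.

```python
from typing import Iterable


def accumulate_message(chunks: Iterable[dict]) -> dict:
    """Concatenate delta fragments for the first choice into one message."""
    message = {"role": None, "content": "", "reasoning_content": ""}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # assumption: only the first choice is of interest
            delta = choice.get("delta") or {}
            if delta.get("role"):
                message["role"] = delta["role"]
            # content and reasoning_content are nullable, so fall back to "".
            message["content"] += delta.get("content") or ""
            message["reasoning_content"] += delta.get("reasoning_content") or ""
    return message


decoded = [
    {"choices": [{"delta": {"content": "", "role": "assistant"}, "index": 0}]},
    {"choices": [{"delta": {"content": "OK"}, "index": 0}]},
    {"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"total_tokens": 32}},
]
result = accumulate_message(decoded)
```

The final usage-only chunk has an empty choices array, so it passes through the loop without affecting the message.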
choices[].finish_reason
string, nullable. null on intermediate chunks. Populated on the choice's last chunk with one of:
| Value | Meaning |
|---|---|
| stop | Natural stop or hit one of the stop sequences. |
| length | Reached max_tokens or the model's context window. |
| tool_calls | Model emitted one or more tool calls; client should execute them and reply with a tool message. |
| content_filter | Output was withheld by upstream provider content filtering. |
Cross-model variance observed:
- length is emitted by most models when truncated by max_tokens, but claude-haiku-4-5 and claude-sonnet-4-6 have been seen to return stop in cases where other models return length. Treat both as terminal; check whether the response content was truncated by comparing token counts to max_tokens rather than only inspecting finish_reason.
- The tool_calls finish_reason is emitted by 6 of 7 routable models when a tool is invocable, but only claude-opus-4-7 has been verified to also emit the structured tool_calls delta payload reliably. Other models may set finish_reason: tool_calls without emitting parseable tool-call data. See the note in §response-tool-calls and per-model guidance in Tool calls guide.
Other finish_reason values may appear depending on the upstream provider. Treat unknown values as terminal.
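Given the variance described above, a truncation check that does not rely on finish_reason alone could be sketched as follows. was_truncated is a hypothetical helper; completion_tokens is assumed to come from the usage object on the final chunk, and max_tokens is the value sent in the request.

```python
from typing import Optional


def was_truncated(finish_reason: Optional[str], completion_tokens: int, max_tokens: Optional[int]) -> bool:
    """Report whether the output was probably cut off by the token cap."""
    if finish_reason == "length":
        return True
    # Some models report "stop" even when cut off, so also compare against the cap.
    return max_tokens is not None and completion_tokens >= max_tokens
```

This errs on the side of caution: a response that ends naturally at exactly max_tokens is also reported as truncated.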
choices[].logprobs
object, nullable. Present only when logprobs: true was requested.
| Field | Type | Notes |
|---|---|---|
| content | object[], required | Per-output-token logprob records. |
| reasoning_content | object[], nullable | Per-reasoning-token logprob records (thinking mode only). |
Each logprob record has the shape:
| Field | Type | Notes |
|---|---|---|
| token | string | The token. |
| logprob | number | Log probability. -9999.0 indicates the token fell outside the top-20 candidates. |
| bytes | integer[], nullable | UTF-8 byte sequence of the token. null if the token has no byte representation. |
| top_logprobs | object[] | Alternative tokens with their logprobs (when top_logprobs was requested). Each entry has token, logprob, bytes. |
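
A logprob record can be converted to a plain probability with exp, taking care of the -9999.0 sentinel, and the bytes arrays can be joined to rebuild text when a multi-byte character is split across tokens. Both helpers below are hypothetical sketches based only on the record shape in the table above.

```python
import math
from typing import Iterable, Optional

OUT_OF_RANGE = -9999.0  # sentinel documented in the table above


def token_probability(record: dict) -> Optional[float]:
    """Convert a record's logprob to a probability, or None for the sentinel."""
    logprob = record["logprob"]
    if logprob == OUT_OF_RANGE:
        return None
    return math.exp(logprob)


def decode_bytes(records: Iterable[dict]) -> str:
    """Join the bytes arrays of consecutive records and decode them as UTF-8."""
    data = bytearray()
    for record in records:
        data.extend(record.get("bytes") or [])  # bytes may be null
    return data.decode("utf-8", errors="replace")
```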
choices[].tool_calls
object[], present when the model invokes tools.
| Field | Type | Notes |
|---|---|---|
| id | string, required | Tool-call identifier; pass back as tool_call_id in the follow-up tool message. |
| type | string, required | Always "function". |
| function.name | string, required | Function the model wants to invoke. |
| function.arguments | string, required | JSON-encoded arguments. The model may produce invalid JSON or hallucinate fields; validate before executing. |
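
Because function.arguments is an untrusted string, it is worth decoding and checking it before running anything. The sketch below assumes a fully assembled tool call object; parse_tool_arguments, tool_result_message, and the allowed_keys allow-list are hypothetical, and a production client might validate against the full JSON Schema instead.

```python
import json
from typing import Set


def parse_tool_arguments(tool_call: dict, allowed_keys: Set[str]) -> dict:
    """Decode function.arguments and reject invalid JSON or unexpected fields."""
    raw = tool_call["function"]["arguments"]
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must decode to a JSON object")
    unexpected = set(args) - allowed_keys
    if unexpected:
        raise ValueError(f"unexpected argument fields: {sorted(unexpected)}")
    return args


def tool_result_message(tool_call: dict, result: str) -> dict:
    """Build the follow-up tool message that echoes the call's id."""
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
```

The returned tool message is appended to messages, after the assistant message that carried the tool call, for the next request.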
usage
Present on the final chunk before [DONE]. ezrouter extends the OpenAI shape with internal accounting fields:
| Field | Type | Notes |
|---|---|---|
| prompt_tokens | integer | Tokens billed for the input. Authoritative. |
| completion_tokens | integer | Tokens billed for the output. Authoritative. |
| total_tokens | integer | prompt_tokens + completion_tokens. |
| prompt_tokens_details | object | Sub-breakdown: cached_tokens, text_tokens, audio_tokens, image_tokens. |
| completion_tokens_details | object | Sub-breakdown: text_tokens, audio_tokens, reasoning_tokens. |
| input_tokens | integer | Anthropic-style alias of prompt_tokens. Prefer the OpenAI-named field. |
| output_tokens | integer | Anthropic-style alias. Currently observed as 0 regardless of actual output; rely on completion_tokens. |
| claude_cache_creation_5_m_tokens | integer | Claude-prompt-cache billing bucket (5-minute TTL). Zero unless prompt caching is in use. |
| claude_cache_creation_1_h_tokens | integer | Claude-prompt-cache billing bucket (1-hour TTL). Zero unless prompt caching is in use. |
| usage_semantic | string | Internal tag. Typically "openai". |
| usage_source | string | Internal tag identifying the upstream provider, e.g. "anthropic". |
Only prompt_tokens / completion_tokens / total_tokens and the _details breakdowns should be relied on by client code. The alias and Claude-specific fields are surfaced for observability and may change.
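One way to follow this guidance is to copy only the stable fields out of the usage object before storing or reporting it. reliable_usage and RELIABLE_KEYS are hypothetical names; the sample values are taken from the example response on this page.

```python
# Fields this page documents as safe for client code to rely on.
RELIABLE_KEYS = (
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
    "prompt_tokens_details",
    "completion_tokens_details",
)


def reliable_usage(usage: dict) -> dict:
    """Return only the usage fields that client code should depend on."""
    return {key: usage[key] for key in RELIABLE_KEYS if key in usage}


sample_usage = {
    "prompt_tokens": 28,
    "completion_tokens": 4,
    "total_tokens": 32,
    "input_tokens": 28,
    "output_tokens": 0,  # alias observed as 0; not copied
    "usage_semantic": "openai",
    "usage_source": "anthropic",
}
stable = reliable_usage(sample_usage)
```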
Custom response headers
Every response includes these ezrouter-specific headers:
| Header | Format | Purpose |
|---|---|---|
| x-new-api-version | YYYYMMDD-HHMMSS | Gateway build stamp at the moment the request was served. |
| x-oneapi-request-id | opaque string | Request-tracing identifier. Include this verbatim when filing support tickets. |
Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted under any condition. See Differences from OpenAI.
Example
A request to claude-sonnet-4-6:
curl -sS https://www.ezrouter.dev/v1/chat/completions \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"messages": [
{"role": "user", "content": "Reply with only the word OK."}
],
"max_tokens": 10
}'
Response (abbreviated; same shape whether or not stream: true was set):
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"OK"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{},"index":0,"finish_reason":"stop","logprobs":null}],"usage":null}
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[],"usage":{"prompt_tokens":28,"completion_tokens":4,"total_tokens":32,"prompt_tokens_details":{"cached_tokens":0,"text_tokens":0,"audio_tokens":0,"image_tokens":0},"completion_tokens_details":{"text_tokens":0,"audio_tokens":0,"reasoning_tokens":0},"input_tokens":28,"output_tokens":0,"usage_semantic":"openai","usage_source":"anthropic","claude_cache_creation_5_m_tokens":0,"claude_cache_creation_1_h_tokens":0}}
data: [DONE]
End-to-end SDK examples (Python, Node, curl with streaming parsers) live in the Cookbook.
Differences from OpenAI
ezrouter is OpenAI-compatible at the request-shape level but diverges in seven observable ways. Client code written for OpenAI generally works, but the assumptions below are unsafe:
- The server always returns SSE. Setting stream: false does not produce a single JSON body; the response is still a data: ... chunk sequence. Clients that call response.json() will fail; use an SSE parser.
- id format is opaque. Identifiers may begin with msg_01 (when passing through Anthropic-backed models) or msg_<random>_<hash> (when ezrouter synthesizes one). Never pattern-match.
- system_fingerprint is always null. OpenAI uses this for backend version tracking; ezrouter does not.
- usage includes extra fields. Anthropic-style aliases (input_tokens, output_tokens) and Claude-cache billing buckets appear on the final chunk. output_tokens is currently unreliable; read completion_tokens instead.
- Custom headers. x-new-api-version and x-oneapi-request-id replace OpenAI's request-tracking headers.
- Error envelope differs. See Error codes for the ezrouter envelope shape. Notably, error.type is not the OpenAI typed taxonomy in all cases, and error.message for some classes is in Chinese.
- No rate-limit headers, ever. Retry-After, X-RateLimit-*, and RateLimit-* are never emitted. Under load the gateway absorbs backpressure via upstream queueing rather than returning 429. Build clients with timeouts and latency-based backoff; do not retry on 429 (it does not occur for capacity reasons). See Rate limits.
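This page does not define "latency-based backoff" precisely, so the following is one possible interpretation rather than a prescribed algorithm: when an observed request latency exceeds a target, wait in proportion to the overshoot before sending the next request. backoff_delay and all of its default values are hypothetical.

```python
def backoff_delay(
    observed_latency_s: float,
    target_latency_s: float = 5.0,  # hypothetical latency considered healthy
    base_delay_s: float = 0.5,      # hypothetical delay at exactly 1x overshoot
    max_delay_s: float = 30.0,      # hypothetical upper bound on the wait
) -> float:
    """Seconds to wait before the next request, scaled by how far latency exceeds the target."""
    if observed_latency_s <= target_latency_s:
        return 0.0
    overshoot = observed_latency_s / target_latency_s
    return min(max_delay_s, base_delay_s * overshoot)
```

A client would measure the wall-clock time of each request (counting a timeout as the timeout duration) and sleep for backoff_delay seconds before the next call, combined with a per-request timeout on the HTTP client.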