API reference

Create chat completion

POST /v1/chat/completions

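Generate a model response for the given chat conversation. The request is OpenAI-compatible; clients built against the OpenAI Python or Node SDK work by overriding base_url, provided they consume the response as a stream (the server always returns SSE; see Differences from OpenAI).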

Surface: OpenAI-compatible. For the Anthropic-compatible surface, see POST /anthropic/v1/messages.

Request

```http
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer ${API_KEY}
```

Required body parameters

`messages`

object[], required, length >= 1. The conversation history. Each message is one of four shapes, distinguished by role:

Role	Required fields	Notes
`system`	`content` (string), `role`	Set the model's behavior.
`user`	`content` (string), `role`	End-user input.
`assistant`	`content` (string, nullable), `role`	Replay previous model output for multi-turn context. May carry `tool_calls`.
`tool`	`content` (string), `role`, `tool_call_id` (string)	Return the result of a tool the model requested.

All four message shapes accept an optional name (string) to label the participant — useful for distinguishing multiple users or named system personas.
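As a minimal sketch of the table above, the following client-side check flags messages that are missing a required field before the request is sent. The function name `validate_messages` and the wording of the problems are illustrative assumptions; the gateway's own validation may be stricter or report errors differently.

```python
# Required fields per role, transcribed from the role table above.
REQUIRED_FIELDS = {
    "system": {"role", "content"},
    "user": {"role", "content"},
    "assistant": {"role", "content"},
    "tool": {"role", "content", "tool_call_id"},
}


def validate_messages(messages):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not messages:
        problems.append("messages must contain at least one entry")
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in REQUIRED_FIELDS:
            problems.append(f"messages[{i}]: unknown role {role!r}")
            continue
        missing = REQUIRED_FIELDS[role] - set(msg)
        if missing:
            problems.append(f"messages[{i}]: missing {sorted(missing)}")
        # Only assistant messages may carry a null content.
        if "content" in msg and msg["content"] is None and role != "assistant":
            problems.append(f"messages[{i}]: content must be a string for role {role!r}")
    return problems
```

An assistant message with `content: null` and a `tool_calls` array passes this check, while a `tool` message without `tool_call_id` is reported.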

`model`

string, required. The model ID to route the request to.

The set of accepted IDs is determined by the gateway catalog at request time. Query GET /v1/models for the current list; do not hard-code the catalog. Common families include claude, gpt, deepseek, glm, and kimi (exact IDs vary).

Output-control parameters

`max_tokens`

integer, nullable. Hard cap on the number of tokens the model may generate. The total of prompt tokens plus generated tokens is bounded by the model's context window. Per-model context and output limits are listed in List models.

`temperature`

number, nullable. Range 0–2. Default 1. Higher values produce more varied output; lower values produce more deterministic output. Prefer adjusting either temperature or top_p, not both.

`top_p`

number, nullable. Range 0–1. Default 1. Nucleus-sampling threshold: the model samples only from the smallest set of most-likely tokens whose cumulative probability reaches top_p.
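To make the threshold concrete, here is a small illustration of the general nucleus-sampling technique on a toy distribution. It is a sketch of the idea only; the function name `nucleus` and the example probabilities are made up, and upstream providers may implement the cutoff with different tie-breaking or rounding.

```python
def nucleus(probs, top_p):
    """Return the tokens kept by nucleus sampling, most likely first.

    probs maps token -> probability (summing to roughly 1).
    """
    kept, cumulative = [], 0.0
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break  # the kept set now covers the requested probability mass
    return kept


toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(nucleus(toy, 0.9))  # the three most likely tokens
print(nucleus(toy, 0.5))  # only the single most likely token
```

Lower values of top_p shrink the candidate set, which is why adjusting it together with temperature makes the combined effect hard to reason about.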

`response_format`

object, nullable. Constrains output format.

Field	Type	Values	Notes
`type`	string	`text` (default), `json_object`	Set to `json_object` to require valid JSON.

When using json_object, you must also instruct the model in a system or user message to produce JSON. Without this, the model can emit a runaway stream of whitespace until it hits max_tokens. The response may be truncated at the cutoff; check finish_reason == "length" to detect this, and note that some models report "stop" on truncation (see the cross-model variance notes under `choices[].finish_reason`).
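The pairing described above (structural constraint plus an explicit instruction) can be sketched as follows. The helper names, the system-prompt wording, and the choice to return `None` on any failure are assumptions for illustration, not part of the API.

```python
import json


def json_mode_body(model, user_prompt, max_tokens):
    """Build a request body that asks for JSON both structurally and in words."""
    return {
        "model": model,
        "messages": [
            # The explicit instruction is what guards against the whitespace runaway.
            {"role": "system", "content": "Respond with a single JSON object only."},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "max_tokens": max_tokens,
        "stream": True,
    }


def parse_json_output(text, finish_reason):
    """Return the parsed object, or None if the output was cut off or invalid."""
    if finish_reason == "length":
        return None  # truncated at max_tokens; the JSON is likely incomplete
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Because truncated output usually fails to parse anyway, the `json.loads` guard also covers models that report a different finish_reason on truncation.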

`stop`

string or string[], nullable. Up to 16 sequences. Generation stops when any sequence is produced. The stop sequence itself is not included in the output.

Streaming parameters

`stream`

boolean, nullable. The OpenAI specification treats this as the toggle between a single JSON response and SSE chunks. ezrouter always returns SSE regardless of this value (see Differences from OpenAI); setting stream: true is recommended for compatibility but does not change server behavior.

`stream_options`

object, nullable. Only meaningful if stream: true is set.

Field	Type	Notes
`include_usage`	boolean	If true, an additional chunk is emitted before `data: [DONE]` carrying the final `usage` object with token counts. Without this flag the `usage` is still present on the final non-`[DONE]` chunk; ezrouter does not strictly require `include_usage` to expose it.

Tool-calling parameters

`tools`

object[], nullable. List of functions the model may call. Maximum 128 functions.

Field	Type	Notes
`type`	string, required	Must be `"function"`.
`function.name`	string, required	Function identifier. Pattern: `[a-zA-Z0-9_-]`, max 64 chars.
`function.description`	string	Helps the model choose when to call this function.
`function.parameters`	object	JSON-Schema describing arguments. See JSON Schema reference. Omit to declare a no-argument function.
`function.strict`	boolean, default `false`	Beta. Enforce strict-mode argument validation against the schema.

See Tool calls guide for end-to-end usage.

`tool_choice`

string or object, nullable. Controls tool dispatch.

Value	Meaning
`"none"`	Do not call any tool; respond with a normal message. Default when no `tools` are provided.
`"auto"`	Model decides whether to call a tool. Default when `tools` are provided.
`"required"`	Model must call at least one tool.
`{"type":"function","function":{"name":"..."}}`	Force a specific function.
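A minimal sketch of building the `tools` and `tool_choice` values described above. The helpers `function_tool` and `force_tool` and the `get_weather` function are hypothetical; the name check assumes the documented character set and 64-character limit apply to the whole name.

```python
import re

# Documented constraint on function.name: [a-zA-Z0-9_-], max 64 chars.
NAME_RE = re.compile(r"[a-zA-Z0-9_-]{1,64}")


def function_tool(name, description, parameters=None):
    """Build one entry for the `tools` array."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid function name: {name!r}")
    function = {"name": name, "description": description}
    if parameters is not None:
        function["parameters"] = parameters  # omit to declare a no-argument function
    return {"type": "function", "function": function}


def force_tool(name):
    """Build a `tool_choice` value that forces a specific function."""
    return {"type": "function", "function": {"name": name}}


weather = function_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
request_fields = {"tools": [weather], "tool_choice": force_tool("get_weather")}
```

Replacing the `tool_choice` value with the string `"auto"` or `"required"` selects the other dispatch modes from the table.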

Thinking-mode parameters

`thinking`

object, nullable. Toggles thinking mode on models that support it.

Field	Type	Values	Notes
`type`	string	`enabled` (default), `disabled`	When `disabled`, the model skips the hidden reasoning step.

Not all models support thinking mode; consult per-model capability via GET /v1/models. See Thinking mode guide for details.

`reasoning_effort`

string, nullable. Possible values: high (default for normal requests), max (automatic for some agentic clients). For compatibility with OpenAI's vocabulary, low and medium map to high; xhigh maps to max.
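The mapping above can be written down as a small lookup, which is handy when logging the effort level a request will actually run at. This is a client-side sketch; the function name is illustrative, and how the gateway treats values outside the documented vocabulary is not specified here, so the sketch rejects them.

```python
# Documented vocabulary: low and medium map to high; xhigh maps to max.
EFFORT_ALIASES = {
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def effective_reasoning_effort(value=None):
    """Return the effort level a documented value resolves to."""
    if value is None:
        return "high"  # documented default for normal requests
    try:
        return EFFORT_ALIASES[value]
    except KeyError:
        raise ValueError(f"undocumented reasoning_effort: {value!r}") from None
```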

Auxiliary parameters

`user_id`

string, nullable. Pattern [a-zA-Z0-9_-], max 512 chars. An opaque identifier you assign to distinguish end-users in your application. Do not include personally identifying information. Used by ezrouter for usage attribution and abuse-pattern detection.
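One way to satisfy both the character constraints and the no-PII guidance is to derive the identifier from an internal ID with a keyed hash. The sketch below is an assumption about how a client might do this, not something ezrouter prescribes; the `u_` prefix, the function name, and the use of HMAC-SHA256 are illustrative choices.

```python
import hashlib
import hmac


def opaque_user_id(internal_id, secret):
    """Derive a stable, non-reversible identifier from an internal user ID.

    secret is a bytes key kept on your side; the same input always yields
    the same output, so usage attribution stays consistent across requests.
    """
    digest = hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest  # 66 characters, all within [a-zA-Z0-9_-]
```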

`logprobs`

boolean, nullable. If true, the response includes the log probability of each output token in choices[].logprobs.content.

`top_logprobs`

integer, nullable. Range 0–20. Number of most-likely alternative tokens to return at each position with their log probabilities. Requires logprobs: true.

Deprecated parameters

Parameter	Status
`frequency_penalty`	Accepted but ignored. Has no effect on output.
`presence_penalty`	Accepted but ignored. Has no effect on output.

Response

200 OK — a sequence of SSE chunks terminated by data: [DONE]. Each chunk is a JSON object with object: "chat.completion.chunk".

The chunk schema below applies to every non-[DONE] chunk. Most chunks carry a partial delta; the final chunk before [DONE] carries the full usage object and an empty choices array.
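A minimal sketch of consuming the chunk sequence described above. It assumes the response body has already been split into decoded text lines, that only one choice is returned, and that every `data:` payload other than `[DONE]` is valid JSON; the function names are illustrative.

```python
import json


def iter_chunks(lines):
    """Yield decoded chunk dicts from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminator; nothing after it is read
        yield json.loads(payload)


def collect(lines):
    """Accumulate streamed text, the finish reason, and the final usage."""
    parts, finish_reason, usage = [], None, None
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage  # set only on the final chunk
    return "".join(parts), finish_reason, usage
```

With an HTTP client that exposes the body as an iterator of lines, `collect` can be fed that iterator directly; the usage-only chunk with an empty `choices` array is handled by the same loop.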

Chunk schema

Field	Type	Notes
`id`	string, required	Opaque chunk identifier. Same value across all chunks of one response. Format is not stable; do not pattern-match.
`object`	string, required	Always `"chat.completion.chunk"`.
`created`	integer, required	Unix timestamp (seconds) when the chat completion started.
`model`	string, required	The model ID that handled the request.
`system_fingerprint`	string, nullable	Always `null` on ezrouter. OpenAI populates this for backend tracking; ezrouter does not currently surface a fingerprint.
`choices`	object[], required	Per-choice deltas. Empty array on the final usage-only chunk.
`usage`	object, nullable	`null` on intermediate chunks. Populated on the final usage chunk (see below).

`choices[].delta`

Field	Type	Notes
`role`	string, nullable	`"assistant"` on the first chunk; absent thereafter.
`content`	string, nullable	Token text fragment.
`reasoning_content`	string, nullable	Thinking-mode hidden chain-of-thought. Present only when thinking mode is active.
`tool_calls`	object[], nullable	Streamed tool-call deltas. See below.

`choices[].finish_reason`

string, nullable. null on intermediate chunks. Populated on the choice's last chunk with one of:

Value	Meaning
`stop`	Natural stop or hit one of the `stop` sequences.
`length`	Reached `max_tokens` or the model's context window.
`tool_calls`	Model emitted one or more tool calls; client should execute them and reply with a `tool` message.
`content_filter`	Output was withheld by upstream provider content filtering.

Cross-model variance observed:

length is emitted by most models when truncated by max_tokens, but claude-haiku-4-5 and claude-sonnet-4-6 have been seen to return stop in cases where other models return length. Treat both as terminal; check whether the response content was truncated by comparing token counts to max_tokens rather than only inspecting finish_reason.

tool_calls finish_reason is emitted by 6 of 7 routable models when a tool is invocable, but only claude-opus-4-7 has been verified to also emit the structured tool_calls delta payload reliably. Other models may set finish_reason: tool_calls without emitting parseable tool-call data. See the `tool_calls` field reference below and the per-model guidance in Tool calls guide.

Other finish_reason values may appear depending on the upstream provider. Treat unknown values as terminal.
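The guidance above (treat every non-null value as terminal, and compare token counts instead of trusting finish_reason alone) might look like this in client code. The function names are illustrative, and the `>=` comparison is a heuristic: a reply that ends naturally at exactly max_tokens tokens would be flagged as truncated.

```python
def is_terminal(finish_reason):
    """Any non-null finish_reason ends the choice, including unknown values."""
    return finish_reason is not None


def was_truncated(finish_reason, usage, max_tokens):
    """Best-effort truncation check that does not rely on finish_reason alone."""
    if finish_reason == "length":
        return True
    if max_tokens is None or not usage:
        return False  # nothing to compare against
    # Some models report "stop" even when the cap was hit.
    return usage.get("completion_tokens", 0) >= max_tokens
```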

`choices[].logprobs`

object, nullable. Present only when logprobs: true was requested.

Field	Type	Notes
`content`	object[], required	Per-output-token logprob records.
`reasoning_content`	object[], nullable	Per-reasoning-token logprob records (thinking mode only).

Each logprob record has the shape:

Field	Type	Notes
`token`	string	The token.
`logprob`	number	Log probability. `-9999.0` indicates the token fell outside the top-20 candidates.
`bytes`	integer[], nullable	UTF-8 byte sequence of the token. `null` if the token has no byte representation.
`top_logprobs`	object[]	Alternative tokens with their logprobs (when `top_logprobs` was requested). Each entry has `token`, `logprob`, `bytes`.
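Two small helpers show how the record fields above are typically used: converting `logprob` to a probability while respecting the `-9999.0` sentinel, and rebuilding text from `bytes` when a multi-byte character is split across tokens. Both function names are illustrative, and the sample records in the usage lines are invented.

```python
import math

OUT_OF_RANGE = -9999.0  # documented sentinel: token fell outside the top-20


def to_probabilities(records):
    """Return (token, probability) pairs; probability is None for the sentinel."""
    result = []
    for rec in records:
        logprob = rec["logprob"]
        prob = None if logprob == OUT_OF_RANGE else math.exp(logprob)
        result.append((rec["token"], prob))
    return result


def text_from_bytes(records):
    """Join the UTF-8 bytes of all records and decode them as one string."""
    raw = b"".join(bytes(rec["bytes"]) for rec in records if rec.get("bytes"))
    return raw.decode("utf-8")


records = [
    {"token": "caf", "logprob": 0.0, "bytes": [99, 97, 102]},
    {"token": "\\xc3", "logprob": math.log(0.5), "bytes": [195]},
    {"token": "\\xa9", "logprob": -9999.0, "bytes": [169]},
]
print(text_from_bytes(records))  # prints "café"
```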

`choices[].tool_calls`

object[]. Carried in `choices[].delta.tool_calls` when the model invokes tools.

Field	Type	Notes
`id`	string, required	Tool-call identifier; pass back as `tool_call_id` in the follow-up `tool` message.
`type`	string, required	Always `"function"`.
`function.name`	string, required	Function the model wants to invoke.
`function.arguments`	string, required	JSON-encoded arguments. The model may produce invalid JSON or hallucinate fields; validate before executing.
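Since `function.arguments` is a model-generated string, a defensive parse before dispatch is advisable. The sketch below assumes the tool call has already been assembled into one complete object; the helper names and the allow-list approach to catching hallucinated fields are illustrative, and full JSON-Schema validation would be stricter.

```python
import json


def parse_tool_arguments(tool_call, allowed_fields):
    """Return (args, error); exactly one of the two is None."""
    raw = tool_call["function"]["arguments"]
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    unexpected = sorted(set(args) - set(allowed_fields))
    if unexpected:
        return None, f"unexpected fields: {unexpected}"
    return args, None


def tool_reply(tool_call, content):
    """Build the follow-up `tool` message for a completed call."""
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}
```

Returning the error text in the `tool` message content, rather than raising, gives the model a chance to correct its arguments on the next turn.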

`usage`

Present on the final chunk before [DONE]. ezrouter extends the OpenAI shape with internal accounting fields:

Field	Type	Notes
`prompt_tokens`	integer	Tokens billed for the input. Authoritative.
`completion_tokens`	integer	Tokens billed for the output. Authoritative.
`total_tokens`	integer	`prompt_tokens + completion_tokens`.
`prompt_tokens_details`	object	Sub-breakdown: `cached_tokens`, `text_tokens`, `audio_tokens`, `image_tokens`.
`completion_tokens_details`	object	Sub-breakdown: `text_tokens`, `audio_tokens`, `reasoning_tokens`.
`input_tokens`	integer	Anthropic-style alias of `prompt_tokens`. Prefer the OpenAI-named field.
`output_tokens`	integer	Anthropic-style alias. Currently observed as `0` regardless of actual output; rely on `completion_tokens`.
`claude_cache_creation_5_m_tokens`	integer	Claude-prompt-cache billing bucket (5-minute TTL). Zero unless prompt caching is in use.
`claude_cache_creation_1_h_tokens`	integer	Claude-prompt-cache billing bucket (1-hour TTL). Zero unless prompt caching is in use.
`usage_semantic`	string	Internal tag. Typically `"openai"`.
`usage_source`	string	Internal tag identifying the upstream provider, e.g. `"anthropic"`.

Only prompt_tokens / completion_tokens / total_tokens and the _details breakdowns should be relied on by client code. The alias and Claude-specific fields are surfaced for observability and may change.
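One way to follow this advice is to copy only the stable fields out of the usage object before storing or billing on it. The function name and the key tuple below are an illustrative sketch based on the field list above.

```python
# Fields the reference marks as safe to rely on.
STABLE_KEYS = (
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
    "prompt_tokens_details",
    "completion_tokens_details",
)


def stable_usage(usage):
    """Return a copy of usage restricted to the fields safe to depend on."""
    return {key: usage[key] for key in STABLE_KEYS if key in usage}


final_usage = {
    "prompt_tokens": 28, "completion_tokens": 4, "total_tokens": 32,
    "input_tokens": 28, "output_tokens": 0,
    "usage_semantic": "openai", "usage_source": "anthropic",
}
print(stable_usage(final_usage))
# prints {'prompt_tokens': 28, 'completion_tokens': 4, 'total_tokens': 32}
```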

Custom response headers

Every response includes these ezrouter-specific headers:

Header	Format	Purpose
`x-new-api-version`	`YYYYMMDD-HHMMSS`	Gateway build stamp at the moment the request was served.
`x-oneapi-request-id`	opaque string	Request-tracing identifier. Include this verbatim when filing support tickets.

Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted under any condition. See Differences from OpenAI.

Example

A request to claude-sonnet-4-6:

```bash
curl -sS https://www.ezrouter.dev/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [
      {"role": "user", "content": "Reply with only the word OK."}
    ],
    "max_tokens": 10
  }'
```

Response (abbreviated; same shape whether or not stream: true was set):

```text
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"OK"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{},"index":0,"finish_reason":"stop","logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[],"usage":{"prompt_tokens":28,"completion_tokens":4,"total_tokens":32,"prompt_tokens_details":{"cached_tokens":0,"text_tokens":0,"audio_tokens":0,"image_tokens":0},"completion_tokens_details":{"text_tokens":0,"audio_tokens":0,"reasoning_tokens":0},"input_tokens":28,"output_tokens":0,"usage_semantic":"openai","usage_source":"anthropic","claude_cache_creation_5_m_tokens":0,"claude_cache_creation_1_h_tokens":0}}

data: [DONE]
```

End-to-end SDK examples (Python, Node, curl with streaming parsers) live in the Cookbook.

Differences from OpenAI

ezrouter is OpenAI-compatible at the request-shape level but diverges in seven observable ways. Client code written for OpenAI generally works, but the assumptions below are unsafe:

The server always returns SSE. Setting stream: false does not produce a single JSON body; the response is still a data: ... chunk sequence. Clients that call response.json() will fail; use an SSE parser.

id format is opaque. Identifiers may begin with msg_01 (when passing through Anthropic-backed models) or msg_<random>_<hash> (when ezrouter synthesizes one). Never pattern-match.

system_fingerprint is always null. OpenAI uses this for backend version tracking; ezrouter does not.

usage includes extra fields. Anthropic-style aliases (input_tokens, output_tokens) and Claude-cache billing buckets appear on the final chunk. output_tokens is currently unreliable; read completion_tokens instead.

Custom headers. x-new-api-version and x-oneapi-request-id replace OpenAI's request-tracking headers.

Error envelope differs. See Error codes for the ezrouter envelope shape. Notably, error.type is not the OpenAI typed taxonomy in all cases, and error.message for some classes is in Chinese.

No rate-limit headers, ever. Retry-After, X-RateLimit-*, and RateLimit-* are never emitted. Under load the gateway absorbs backpressure via upstream queueing rather than returning 429. Build clients with timeouts and latency-based backoff; do not retry on 429 (it does not occur for capacity reasons). See Rate limits.
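The text above does not define latency-based backoff, so the following is only one possible reading: track a moving average of observed request latency and pause before the next request whenever that average exceeds a target. The class name, the exponentially weighted average, and all default values are assumptions chosen for illustration.

```python
class LatencyBackoff:
    """Suggest a pause before the next request based on recent latency."""

    def __init__(self, target_seconds, max_delay=30.0, alpha=0.3):
        self.target = target_seconds  # latency considered healthy
        self.max_delay = max_delay    # upper bound on any single pause
        self.alpha = alpha            # weight given to the newest sample
        self.ewma = None              # exponentially weighted moving average

    def observe(self, latency_seconds):
        """Record how long the most recent request took (or its timeout)."""
        if self.ewma is None:
            self.ewma = latency_seconds
        else:
            self.ewma = self.alpha * latency_seconds + (1 - self.alpha) * self.ewma

    def delay(self):
        """Seconds to wait before sending the next request."""
        if self.ewma is None or self.ewma <= self.target:
            return 0.0
        return min(self.max_delay, self.ewma - self.target)
```

A caller would wrap each request in a timer, pass the elapsed time (or the timeout value, on expiry) to `observe`, and sleep for `delay()` seconds before the next attempt.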