API reference

Create chat completion

POST /v1/chat/completions

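Generate a model response for the given chat conversation. The request format is OpenAI-compatible: clients built against the OpenAI Python or Node SDK can be pointed at ezrouter by overriding base_url, provided they consume the response as a stream, because ezrouter always returns SSE (see Differences from OpenAI).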

Surface: OpenAI-compatible. For the Anthropic-compatible surface, see POST /anthropic/v1/messages.

Request

```http
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer ${API_KEY}
```
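The request line and headers above map directly onto a standard-library HTTP request. This is a minimal sketch that builds the request without sending it. It assumes the base URL from the curl example further down this page. build_chat_request is a hypothetical helper name, and the fallback key is a dummy value.

```python
import json
import os
import urllib.request

# Assumption: base URL copied from the curl example on this page; adjust as needed.
BASE_URL = "https://www.ezrouter.dev"


def build_chat_request(body: dict, api_key: str) -> urllib.request.Request:
    """Build, but do not send, a POST /v1/chat/completions request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    {
        "model": "claude-sonnet-4-6",
        "messages": [{"role": "user", "content": "Reply with only the word OK."}],
        "max_tokens": 10,
    },
    # API_KEY mirrors the ${API_KEY} placeholder above; the fallback is a dummy value.
    api_key=os.environ.get("API_KEY", "sk-placeholder"),
)
print(req.get_method(), req.full_url)
```

You can pass req to urllib.request.urlopen to send it. Because ezrouter always answers with SSE, read the body line by line rather than calling json.loads on the whole response.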

Required body parameters

messages

object[], required, length >= 1. The conversation history. Each message is one of four shapes, distinguished by role:

| Role | Required fields | Notes |
|---|---|---|
| system | content (string), role | Sets the model's behavior. |
| user | content (string), role | End-user input. |
| assistant | content (string, nullable), role | Replays previous model output for multi-turn context. May carry tool_calls. |
| tool | content (string), role, tool_call_id (string) | Returns the result of a tool the model requested. |

All four message shapes accept an optional name (string) to label the participant — useful for distinguishing multiple users or named system personas.
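A client-side pre-flight check can catch missing fields before a request is sent. The sketch below encodes only the rules in the table above. The gateway performs its own validation, which this sketch does not claim to reproduce exactly. REQUIRED_FIELDS and validate_messages are hypothetical names.

```python
# Required fields per role, copied from the table above. For assistant messages
# the content key is listed as required, but its value may be None.
REQUIRED_FIELDS = {
    "system": {"role", "content"},
    "user": {"role", "content"},
    "assistant": {"role", "content"},
    "tool": {"role", "content", "tool_call_id"},
}


def validate_messages(messages: list) -> None:
    """Raise ValueError if the list is empty or a message breaks the table's rules."""
    if not messages:
        raise ValueError("messages must contain at least one message")
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in REQUIRED_FIELDS:
            raise ValueError(f"messages[{i}]: unknown role {role!r}")
        missing = REQUIRED_FIELDS[role] - set(msg)
        if missing:
            raise ValueError(f"messages[{i}] ({role}): missing {sorted(missing)}")
        content = msg["content"]
        # Only assistant messages may carry a null content.
        if not (isinstance(content, str) or (role == "assistant" and content is None)):
            raise ValueError(f"messages[{i}] ({role}): content must be a string")
        if "name" in msg and not isinstance(msg["name"], str):
            raise ValueError(f"messages[{i}]: name must be a string")


validate_messages([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi", "name": "alice"},
])
```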

model

string, required. The model ID to route the request to.

The set of accepted IDs is determined by the gateway catalog at request time. Query GET /v1/models for the current list; do not hard-code the catalog. Common families include claude, gpt, deepseek, glm, and kimi (exact IDs vary).

Output-control parameters

max_tokens

integer, nullable. Hard cap on the number of tokens the model may generate. The total of prompt tokens plus generated tokens is bounded by the model's context window. Per-model context and output limits are listed in List models.

temperature

number, nullable. Range 0 to 2. Default 1. Higher values produce more varied output; lower values produce more deterministic output. Prefer adjusting either temperature or top_p, not both.
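This page does not say how upstream providers apply temperature. The standard technique is to divide the token logits by the temperature before the softmax, and the sketch below shows that technique. Treating 0 as greedy decoding is a common convention, assumed here rather than documented.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Turn raw token scores into probabilities after dividing by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy (always pick the top token).
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so math.exp cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: most mass on the first token
print(softmax_with_temperature(logits, 2.0))  # flatter: mass spread more evenly
```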

top_p

number, nullable. Range 0 to 1. Default 1. Nucleus-sampling threshold: the model samples only from the smallest set of most-likely tokens whose cumulative probability reaches top_p.
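Nucleus filtering is easy to show on a small probability vector. The sketch below implements the textbook top-p technique on a plain list. Real providers work on full vocabularies and may differ in tie-breaking, so treat it as an illustration only.

```python
def nucleus_filter(probs: list, top_p: float) -> list:
    """Zero out all but the smallest top set of tokens whose mass reaches top_p."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)  # the most likely token is always kept
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the surviving tokens
    return filtered


print(nucleus_filter([0.5, 0.3, 0.15, 0.05], 0.7))  # only the first two tokens survive
```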

response_format

object, nullable. Constrains output format.

| Field | Type | Values | Notes |
|---|---|---|---|
| type | string | text (default), json_object | Set to json_object to require valid JSON. |

When using json_object, you must also instruct the model in a system or user message to produce JSON. Without this instruction, the model can emit a runaway stream of whitespace until it hits max_tokens. Output that reaches max_tokens is cut off and may not be valid JSON; check for finish_reason == "length" before parsing, and see the cross-model finish_reason caveats under Response.
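JSON mode needs two things: response_format and an explicit instruction in the prompt. The sketch below also shows a defensive parse. The model ID and prompt wording are examples only, and parse_json_output is a hypothetical helper.

```python
import json
from typing import Optional

# A JSON-mode request body. The model ID is only an example; check GET /v1/models.
body = {
    "model": "claude-sonnet-4-6",
    "response_format": {"type": "json_object"},
    "messages": [
        # json_object alone is not enough: the prompt must also ask for JSON.
        {"role": "system", "content": "Respond only with a single JSON object."},
        {"role": "user", "content": "Report the service status using a 'status' key."},
    ],
    "max_tokens": 200,
}


def parse_json_output(text: str, finish_reason: Optional[str]) -> dict:
    """Parse JSON-mode output, rejecting responses cut off at max_tokens."""
    if finish_reason == "length":
        raise ValueError("output hit max_tokens; the JSON is probably incomplete")
    # Some models report "stop" even when truncated, so json.loads may still
    # raise json.JSONDecodeError (a ValueError subclass) on cut-off output.
    return json.loads(text)
```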

stop

string or string[], nullable. Up to 16 sequences. Generation stops when any sequence is produced. The stop sequence itself is not included in the output.
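The server applies stop for you. Still, its documented semantics can be illustrated on a plain string: accept either form, allow at most 16 sequences, cut at the earliest match, and exclude the match itself. This might help when writing tests or post-processing text. Both helper names are hypothetical.

```python
from typing import List, Optional, Union

MAX_STOP_SEQUENCES = 16  # limit stated above


def normalize_stop(stop: Optional[Union[str, List[str]]]) -> List[str]:
    """Accept the string-or-list forms of stop and enforce the 16-sequence limit."""
    if stop is None:
        return []
    seqs = [stop] if isinstance(stop, str) else list(stop)
    if len(seqs) > MAX_STOP_SEQUENCES:
        raise ValueError(f"at most {MAX_STOP_SEQUENCES} stop sequences are allowed")
    return seqs


def apply_stop(text: str, stop: Optional[Union[str, List[str]]]) -> str:
    """Cut text at the earliest stop sequence, excluding the sequence itself."""
    cut = len(text)
    for seq in normalize_stop(stop):
        pos = text.find(seq)
        if seq and pos != -1:  # ignore empty sequences, which would match at 0
            cut = min(cut, pos)
    return text[:cut]
```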

Streaming parameters

stream

boolean, nullable. In the OpenAI specification this toggles between a single JSON response and SSE chunks. ezrouter always returns SSE regardless of this value (see Differences from OpenAI); setting stream: true is recommended for compatibility but does not change server behavior.

stream_options

object, nullable. Only meaningful if stream: true is set.

| Field | Type | Notes |
|---|---|---|
| include_usage | boolean | If true, a chunk carrying the final usage object with token counts is emitted before data: [DONE]. ezrouter includes this usage chunk even without the flag, so include_usage is not strictly required to obtain usage. |

Tool-calling parameters

tools

object[], nullable. List of functions the model may call. Maximum 128 functions.

| Field | Type | Notes |
|---|---|---|
| type | string, required | Must be "function". |
| function.name | string, required | Function identifier. Pattern: [a-zA-Z0-9_-], max 64 chars. |
| function.description | string | Helps the model choose when to call this function. |
| function.parameters | object | JSON Schema describing the arguments. See JSON Schema reference. Omit to declare a no-argument function. |
| function.strict | boolean, default false | Beta. Enforces strict-mode argument validation against the schema. |
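The limits above (128 functions, name pattern, 64-character cap, optional parameters) can be checked before the request goes out. In this sketch, get_weather and ping are hypothetical functions and the helper names are illustrative.

```python
import re
from typing import Optional

TOOL_NAME = re.compile(r"[a-zA-Z0-9_-]{1,64}")  # pattern and 64-char cap from the table
MAX_TOOLS = 128


def make_tool(name: str, description: str, parameters: Optional[dict] = None) -> dict:
    """Build one entry for the tools array."""
    if not TOOL_NAME.fullmatch(name):
        raise ValueError(f"invalid function name: {name!r}")
    function = {"name": name, "description": description}
    if parameters is not None:
        function["parameters"] = parameters  # omit entirely for a no-argument function
    return {"type": "function", "function": function}


def validate_tools(tools: list) -> None:
    """Check the documented limits before sending the request."""
    if len(tools) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} functions are allowed")
    for tool in tools:
        if tool.get("type") != "function":
            raise ValueError('each tool must have type "function"')
        if not TOOL_NAME.fullmatch(tool["function"]["name"]):
            raise ValueError(f"invalid function name: {tool['function']['name']!r}")


# Hypothetical functions, for illustration only.
weather = make_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
ping = make_tool("ping", "Health check that takes no arguments.")
validate_tools([weather, ping])
```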

See Tool calls guide for end-to-end usage.

tool_choice

string or object, nullable. Controls tool dispatch.

| Value | Meaning |
|---|---|
| "none" | Do not call any tool; respond with a normal message. Default when no tools are provided. |
| "auto" | Model decides whether to call a tool. Default when tools are provided. |
| "required" | Model must call at least one tool. |
| {"type":"function","function":{"name":"..."}} | Force a specific function. |

Thinking-mode parameters

thinking

object, nullable. Toggles thinking mode on models that support it.

| Field | Type | Values | Notes |
|---|---|---|---|
| type | string | enabled (default), disabled | When disabled, the model skips the hidden reasoning step. |

Not all models support thinking mode; consult per-model capability via GET /v1/models. See Thinking mode guide for details.

reasoning_effort

string, nullable. Possible values: high (default for normal requests), max (automatic for some agentic clients). For compatibility with OpenAI's vocabulary, low and medium map to high; xhigh maps to max.
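The thinking toggle and the reasoning_effort vocabulary mapping can be captured in a couple of helpers. The sketch encodes only the mapping stated above. Two choices are assumptions, not documented gateway behavior: it treats a missing value as the normal-request default (high), and it rejects unknown values client-side. It does not model the automatic max for agentic clients.

```python
from typing import Optional

# Mapping stated above: OpenAI's low/medium become high; xhigh becomes max.
_EFFORT_ALIASES = {"low": "high", "medium": "high", "high": "high", "xhigh": "max", "max": "max"}


def effective_reasoning_effort(requested: Optional[str]) -> str:
    """Return the effort level the documentation says will apply."""
    if requested is None:
        return "high"  # documented default for normal requests
    try:
        return _EFFORT_ALIASES[requested]
    except KeyError:
        raise ValueError(f"unsupported reasoning_effort: {requested!r}") from None


def thinking_options(enabled: bool, effort: Optional[str] = None) -> dict:
    """Request-body fragment toggling thinking mode; merge it into the body."""
    opts = {"thinking": {"type": "enabled" if enabled else "disabled"}}
    if effort is not None:
        opts["reasoning_effort"] = effort
    return opts
```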

Auxiliary parameters

user_id

string, nullable. Pattern [a-zA-Z0-9_-], max 512 chars. An opaque identifier you assign to distinguish end-users in your application. Do not include personally identifying information. Used by ezrouter for usage attribution and abuse-pattern detection.
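One way to produce a stable, non-identifying user_id is to key an HMAC-SHA256 of your own account ID with a server-side secret. ezrouter does not require this. It is simply an approach that satisfies the pattern and avoids sending PII. The account ID and secret below are hypothetical.

```python
import hashlib
import hmac
import re

USER_ID = re.compile(r"[a-zA-Z0-9_-]{1,512}")  # pattern and length cap from above


def opaque_user_id(internal_id: str, secret: bytes) -> str:
    """Derive a stable user_id that does not reveal the underlying account."""
    digest = hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest  # 66 characters drawn from [a-f0-9_u], inside the allowed pattern


# Hypothetical account ID and secret; keep the real secret out of client code.
user_id = opaque_user_id("account-8812", b"keep-this-secret-server-side")
```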

logprobs

boolean, nullable. If true, the response includes the log probability of each output token in choices[].logprobs.content.

top_logprobs

integer, nullable. Range 0 to 20. Number of most-likely alternative tokens to return at each position with their log probabilities. Requires logprobs: true.

Deprecated parameters

| Parameter | Status |
|---|---|
| frequency_penalty | Accepted but ignored. Has no effect on output. |
| presence_penalty | Accepted but ignored. Has no effect on output. |

Response

200 OK — a sequence of SSE chunks terminated by data: [DONE]. Each chunk is a JSON object with object: "chat.completion.chunk".

The chunk schema below applies to every non-[DONE] chunk. Most chunks carry a partial delta; the final chunk before [DONE] carries the full usage object and an empty choices array.
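Every response is an SSE stream, so clients need a line-oriented parser rather than a single json.loads call. This standard-library sketch makes two simplifying assumptions: each event is one data: line, as in the example on this page, and there is a single choice. iter_sse_chunks and collect_text are hypothetical helper names. A general SSE parser must also join multi-line data fields.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each chat.completion.chunk from SSE lines, stopping at data: [DONE].

    Simplification: assumes every event is a single data: line.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank event separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Concatenate delta.content and return (text, finish_reason, usage).

    Assumes a single choice; with several choices the deltas would be mixed.
    """
    parts, finish_reason, usage = [], None, None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        if chunk.get("usage"):
            usage = chunk["usage"]  # the usage-only chunk has an empty choices array
    return "".join(parts), finish_reason, usage
```

With urllib.request.urlopen, the response object yields raw byte lines. You could decode them on the fly, for example collect_text(line.decode("utf-8") for line in resp).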

Chunk schema

| Field | Type | Notes |
|---|---|---|
| id | string, required | Opaque chunk identifier. Same value across all chunks of one response. Format is not stable; do not pattern-match. |
| object | string, required | Always "chat.completion.chunk". |
| created | integer, required | Unix timestamp (seconds) when the chat completion started. |
| model | string, required | The model ID that handled the request. |
| system_fingerprint | string, nullable | Always null on ezrouter. OpenAI populates this for backend tracking; ezrouter does not currently surface a fingerprint. |
| choices | object[], required | Per-choice deltas. Empty array on the final usage-only chunk. |
| usage | object, nullable | null on intermediate chunks. Populated on the final usage chunk (see below). |

choices[].delta

| Field | Type | Notes |
|---|---|---|
| role | string, nullable | "assistant" on the first chunk; absent thereafter. |
| content | string, nullable | Token text fragment. |
| reasoning_content | string, nullable | Thinking-mode hidden chain-of-thought. Present only when thinking mode is active. |
| tool_calls | object[], nullable | Streamed tool-call deltas. See below. |

choices[].finish_reason

string, nullable. null on intermediate chunks. Populated on the choice's last chunk with one of:

| Value | Meaning |
|---|---|
| stop | Natural stop, or one of the stop sequences was produced. |
| length | Reached max_tokens or the model's context window. |
| tool_calls | Model emitted one or more tool calls; the client should execute them and reply with a tool message. |
| content_filter | Output was withheld by upstream provider content filtering. |

Cross-model variance observed:

  • length is emitted by most models when a response is truncated by max_tokens, but claude-haiku-4-5 and claude-sonnet-4-6 have been seen to return stop in cases where other models return length. Treat both as terminal, and detect truncation by comparing completion_tokens to max_tokens rather than inspecting finish_reason alone.
  • finish_reason: tool_calls is emitted by 6 of 7 routable models when a tool is invocable, but only claude-opus-4-7 has been verified to also emit the structured tool_calls delta payload reliably. Other models may set finish_reason: tool_calls without emitting parseable tool-call data. See choices[].tool_calls below and the per-model guidance in Tool calls guide.

Other finish_reason values may appear depending on the upstream provider. Treat unknown values as terminal.
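The guidance above can be folded into one truncation check that does not rely on finish_reason alone. This is a heuristic sketch. It can report a false positive when a response ends naturally at exactly max_tokens, and looks_truncated is a hypothetical name.

```python
from typing import Optional


def looks_truncated(
    finish_reason: Optional[str],
    completion_tokens: Optional[int],
    max_tokens: Optional[int],
) -> bool:
    """Heuristic truncation check that does not trust finish_reason alone."""
    if finish_reason == "length":
        return True
    # Some models report "stop" when cut off at max_tokens, so compare counts too.
    if max_tokens is not None and completion_tokens is not None:
        return completion_tokens >= max_tokens
    return False
```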

choices[].logprobs

object, nullable. Present only when logprobs: true was requested.

| Field | Type | Notes |
|---|---|---|
| content | object[], required | Per-output-token logprob records. |
| reasoning_content | object[], nullable | Per-reasoning-token logprob records (thinking mode only). |

Each logprob record has the shape:

| Field | Type | Notes |
|---|---|---|
| token | string | The token. |
| logprob | number | Log probability. -9999.0 indicates the token fell outside the top-20 candidates. |
| bytes | integer[], nullable | UTF-8 byte sequence of the token. null if the token has no byte representation. |
| top_logprobs | object[] | Alternative tokens with their logprobs (when top_logprobs was requested). Each entry has token, logprob, bytes. |
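Two details of the record shape are easy to get wrong. First, -9999.0 is a sentinel, not a real log probability. Second, a multi-byte character can be split across tokens, so text should be rebuilt from the concatenated bytes. This sketch handles both. The helper names are hypothetical.

```python
import math

OUT_OF_TOP_20 = -9999.0  # documented sentinel: token was outside the top-20 candidates


def token_probabilities(records: list) -> list:
    """Map logprob records to (token, probability), with None for the sentinel."""
    result = []
    for rec in records:
        lp = rec["logprob"]
        result.append((rec["token"], None if lp == OUT_OF_TOP_20 else math.exp(lp)))
    return result


def join_token_bytes(records: list) -> str:
    """Rebuild text from per-token UTF-8 bytes.

    A multi-byte character can be split across tokens, so concatenate the bytes
    first and decode once; fall back to the token string when bytes is null.
    """
    buf = bytearray()
    for rec in records:
        if rec.get("bytes") is not None:
            buf.extend(rec["bytes"])
        else:
            buf.extend(rec["token"].encode("utf-8"))
    return buf.decode("utf-8", errors="replace")
```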

choices[].tool_calls

object[], nullable. Delivered in choices[].delta.tool_calls on streamed chunks when the model invokes tools.

| Field | Type | Notes |
|---|---|---|
| id | string, required | Tool-call identifier; pass back as tool_call_id in the follow-up tool message. |
| type | string, required | Always "function". |
| function.name | string, required | Function the model wants to invoke. |
| function.arguments | string, required | JSON-encoded arguments. The model may produce invalid JSON or hallucinate fields; validate before executing. |
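Tool calls arrive as streamed deltas, so a client has to merge the fragments, then validate the arguments before running anything. This page does not document the fragment layout. The sketch ASSUMES the OpenAI streaming convention: an index per call, id and name sent once, and arguments split into string pieces. run_tool stands in for your own hypothetical dispatcher.

```python
import json


def accumulate_tool_calls(fragments: list) -> list:
    """Merge streamed delta.tool_calls fragments into complete tool calls.

    ASSUMPTION (OpenAI-style streaming): each fragment carries an index; id
    and function.name arrive once, while function.arguments arrives as
    string pieces that must be concatenated in order.
    """
    calls = {}
    for frag in fragments:
        call = calls.setdefault(
            frag.get("index", 0),
            {"id": None, "type": "function", "function": {"name": "", "arguments": ""}},
        )
        if frag.get("id"):
            call["id"] = frag["id"]
        fn = frag.get("function") or {}
        if fn.get("name"):
            call["function"]["name"] = fn["name"]
        call["function"]["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def tool_result_message(call: dict, run_tool) -> dict:
    """Validate the arguments, run the tool, and build the follow-up tool message."""
    try:
        args = json.loads(call["function"]["arguments"] or "{}")
        if not isinstance(args, dict):
            raise ValueError("arguments must be a JSON object")
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        content = json.dumps({"error": f"invalid arguments: {exc}"})
    else:
        content = json.dumps(run_tool(call["function"]["name"], args))
    return {"role": "tool", "tool_call_id": call["id"], "content": content}
```

Two follow-up steps apply. The next request replays the assistant message carrying tool_calls, followed by one tool message per call. If finish_reason is tool_calls but no fragments were received, which the variance notes above say can happen, accumulate_tool_calls returns an empty list and the client must decide how to proceed.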

usage

Present on the final chunk before [DONE]. ezrouter extends the OpenAI shape with internal accounting fields:

| Field | Type | Notes |
|---|---|---|
| prompt_tokens | integer | Tokens billed for the input. Authoritative. |
| completion_tokens | integer | Tokens billed for the output. Authoritative. |
| total_tokens | integer | prompt_tokens + completion_tokens. |
| prompt_tokens_details | object | Sub-breakdown: cached_tokens, text_tokens, audio_tokens, image_tokens. |
| completion_tokens_details | object | Sub-breakdown: text_tokens, audio_tokens, reasoning_tokens. |
| input_tokens | integer | Anthropic-style alias of prompt_tokens. Prefer the OpenAI-named field. |
| output_tokens | integer | Anthropic-style alias of completion_tokens. Currently observed as 0 regardless of actual output; rely on completion_tokens. |
| claude_cache_creation_5_m_tokens | integer | Claude prompt-cache billing bucket (5-minute TTL). Zero unless prompt caching is in use. |
| claude_cache_creation_1_h_tokens | integer | Claude prompt-cache billing bucket (1-hour TTL). Zero unless prompt caching is in use. |
| usage_semantic | string | Internal tag. Typically "openai". |
| usage_source | string | Internal tag identifying the upstream provider, e.g. "anthropic". |

Only prompt_tokens / completion_tokens / total_tokens and the _details breakdowns should be relied on by client code. The alias and Claude-specific fields are surfaced for observability and may change.
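A small filter can enforce the rule above in client code, keeping only the stable fields and sanity-checking the total. The sketch uses values from the example response on this page. reliable_usage is a hypothetical name.

```python
RELIABLE_USAGE_KEYS = (
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
    "prompt_tokens_details",
    "completion_tokens_details",
)


def reliable_usage(usage: dict) -> dict:
    """Keep only the usage fields this page says client code may rely on."""
    kept = {key: usage[key] for key in RELIABLE_USAGE_KEYS if key in usage}
    expected = kept.get("prompt_tokens", 0) + kept.get("completion_tokens", 0)
    if "total_tokens" in kept and kept["total_tokens"] != expected:
        raise ValueError("total_tokens != prompt_tokens + completion_tokens")
    return kept


# Values from the example response on this page.
example = {
    "prompt_tokens": 28, "completion_tokens": 4, "total_tokens": 32,
    "input_tokens": 28, "output_tokens": 0, "usage_source": "anthropic",
}
print(reliable_usage(example))  # prompt/completion/total only; aliases dropped
```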

Custom response headers

Every response includes these ezrouter-specific headers:

| Header | Format | Purpose |
|---|---|---|
| x-new-api-version | YYYYMMDD-HHMMSS | Gateway build stamp at the moment the request was served. |
| x-oneapi-request-id | opaque string | Request-tracing identifier. Include this verbatim when filing support tickets. |

Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted under any condition. See Differences from OpenAI.
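Capturing both headers makes logs and support tickets easier to correlate. The sketch reads them case-insensitively from any header mapping and parses the build stamp using the documented YYYYMMDD-HHMMSS format. The stamp's time zone is not documented, so it is returned as a naive datetime. gateway_metadata is a hypothetical name.

```python
from datetime import datetime
from typing import Mapping


def gateway_metadata(headers: Mapping[str, str]) -> dict:
    """Pull ezrouter's tracing headers out of a response-header mapping."""
    lowered = {k.lower(): v for k, v in headers.items()}
    stamp = lowered.get("x-new-api-version")
    return {
        "request_id": lowered.get("x-oneapi-request-id"),  # quote verbatim in tickets
        # Documented format YYYYMMDD-HHMMSS; time zone unspecified, so left naive.
        "build": datetime.strptime(stamp, "%Y%m%d-%H%M%S") if stamp else None,
    }
```

The headers attribute of a urllib response supports items(), so it can be passed in directly.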

Example

A request to claude-sonnet-4-6:

```bash
curl -sS https://www.ezrouter.dev/v1/chat/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [
      {"role": "user", "content": "Reply with only the word OK."}
    ],
    "max_tokens": 10
  }'
```

Response (abbreviated; same shape whether or not stream: true was set):

```text
data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{"content":"OK"},"index":0,"finish_reason":null,"logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[{"delta":{},"index":0,"finish_reason":"stop","logprobs":null}],"usage":null}

data: {"id":"msg_01...","object":"chat.completion.chunk","created":1779738598,"model":"claude-sonnet-4-6","system_fingerprint":null,"choices":[],"usage":{"prompt_tokens":28,"completion_tokens":4,"total_tokens":32,"prompt_tokens_details":{"cached_tokens":0,"text_tokens":0,"audio_tokens":0,"image_tokens":0},"completion_tokens_details":{"text_tokens":0,"audio_tokens":0,"reasoning_tokens":0},"input_tokens":28,"output_tokens":0,"usage_semantic":"openai","usage_source":"anthropic","claude_cache_creation_5_m_tokens":0,"claude_cache_creation_1_h_tokens":0}}

data: [DONE]
```

End-to-end SDK examples (Python, Node, curl with streaming parsers) live in the Cookbook.

Differences from OpenAI

ezrouter is OpenAI-compatible at the request-shape level but diverges in seven observable ways. Client code written for OpenAI generally works, but it must not rely on the following OpenAI behaviors:

  1. The server always returns SSE. Setting stream: false does not produce a single JSON body; the response is still a data: ... chunk sequence. Clients that call response.json() will fail; use an SSE parser.

  2. id format is opaque. Identifiers may begin with msg_01 (when passing through Anthropic-backed models) or msg_<random>_<hash> (when ezrouter synthesizes one). Never pattern-match.

  3. system_fingerprint is always null. OpenAI uses this for backend version tracking; ezrouter does not.

  4. usage includes extra fields. Anthropic-style aliases (input_tokens, output_tokens) and Claude-cache billing buckets appear on the final chunk. output_tokens is currently unreliable; read completion_tokens instead.

  5. Custom headers. x-new-api-version and x-oneapi-request-id replace OpenAI's request-tracking headers.

  6. Error envelope differs. See Error codes for the ezrouter envelope shape. Notably, error.type does not always follow OpenAI's typed taxonomy, and error.message for some error classes is in Chinese.

  7. No rate-limit headers, ever. Retry-After, X-RateLimit-*, and RateLimit-* are never emitted. Under load, the gateway absorbs backpressure by queueing requests upstream rather than returning 429. Build clients with timeouts and latency-based backoff; do not retry on 429 (it does not occur for capacity reasons). See Rate limits.
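Item 7 has no 429 or Retry-After to react to, so the client has to bound each request with a timeout and slow down when requests stall. This sketch reads "latency-based backoff" as retrying on transport timeouts with jittered exponential delays. That interpretation is an assumption, not ezrouter's prescribed algorithm. It also assumes your send function raises TimeoutError or ConnectionError on stalls. Adapt the exception tuple to your HTTP client; urllib, for instance, can wrap timeouts in URLError. HTTP error statuses are deliberately not retried.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[float], T],
    *,
    attempts: int = 4,
    timeout: float = 60.0,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a request on transport timeouts with jittered exponential backoff.

    send(timeout) is your own function that performs one request using that
    timeout. HTTP error statuses (including 429) are not retried here.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send(timeout)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... scaled by random jitter so clients do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError("unreachable")
```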