Getting started

Rate limits and capacity

ezrouter applies fair-use capacity limits at the gateway's discretion. The contract is:

We do not publish specific request-per-minute, tokens-per-minute, or concurrency numbers. Effective capacity varies with operational conditions and may change at any time without prior notice.

Capacity is governed by the Terms of Service, §4 (Reasonable use clause), which reserves the right to throttle, apply quotas to, or suspend accounts whose usage patterns materially exceed bona fide single-account application use.

The gateway absorbs backpressure via upstream queueing rather than returning explicit 429 responses. Latency grows under load rather than requests being rejected.
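Because load shows up as added latency rather than as 429 responses, one way to notice it is to track request latency on the client side. The following is a minimal sketch, not part of ezrouter: the LatencyWindow class, its window size, and the thresholds are all hypothetical choices. It keeps the most recent samples and reports a nearest-rank 95th percentile.

```python
from collections import deque
from typing import Optional


class LatencyWindow:
    """Hypothetical rolling window of request latencies, in seconds."""

    def __init__(self, size: int = 100) -> None:
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples = deque(maxlen=size)

    def record(self, seconds: float) -> None:
        self._samples.append(seconds)

    def p95(self) -> Optional[float]:
        """Nearest-rank 95th percentile, or None if nothing was recorded."""
        n = len(self._samples)
        if n == 0:
            return None
        ordered = sorted(self._samples)
        rank = (95 * n + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def is_degraded(self, threshold_seconds: float) -> bool:
        """True when the recent p95 latency exceeds the chosen threshold."""
        value = self.p95()
        return value is not None and value > threshold_seconds
```

In use, you would call record() with the measured duration of each completed request and consult is_degraded() before deciding whether to shed load or switch models. The threshold is application-specific and has to be chosen from your own baseline.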

What this means for clients

Build defensively. Specifically:

Set sensible request timeouts on your HTTP client. Suggested starting point: 60 seconds for typical chat completions, proportionally higher for very long responses or thinking-mode requests.

Back off on actual 5xx and network errors, not on 429, which ezrouter does not emit for capacity reasons (see Error codes).

Detect slow responses with latency metrics, not with status codes or response headers. Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted by ezrouter under any condition.

If a request is taking unexpectedly long, prefer cancelling and retrying with the same or a different model rather than waiting indefinitely. Different models route through different upstream providers with different load characteristics at any given moment.
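The retry advice above can be sketched as a small helper. This is an illustration under stated assumptions, not an ezrouter API: TransientError, call_with_retries, and the send callable are hypothetical names. send is assumed to perform one request with its own timeout and to raise TransientError on a 5xx response, a timeout, or a network error. The helper rotates through a list of models and sleeps with jittered exponential backoff between attempts.

```python
import random
import time
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical marker for a 5xx response, a timeout, or a network error."""


def call_with_retries(
    send: Callable[[str], dict],
    models: Sequence[str],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send(model), retrying transient failures with jittered backoff.

    Each attempt uses the next model in `models`, wrapping around, so a
    slow or failing upstream is not retried back to back when an
    alternative model is listed.
    """
    if not models:
        raise ValueError("models must not be empty")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        model = models[attempt % len(models)]
        try:
            return send(model)
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Full jitter: sleep a random time up to the capped
                # exponential ceiling (1s, 2s, 4s, ... by default).
                ceiling = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, ceiling))
    raise last_error
```

A call might look like call_with_retries(send, ["claude-sonnet-4-6", "your-fallback-model"]), where the second model name is a placeholder. Passing sleep as a parameter keeps the helper testable without real waiting.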

Per-end-user attribution

You can pass an optional user_id parameter on chat-completion requests to attribute traffic to individual end-users in your application. This helps ezrouter distinguish per-user usage patterns when investigating abuse reports tied to your account.

```json
{
  "model": "claude-sonnet-4-6",
  "messages": [{"role": "user", "content": "Hello!"}],
  "user_id": "your_internal_user_id"
}
```

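The OpenAI Node SDK has no extra_body option and sends extra fields in the request object as-is, so there you can pass user_id at the top level. If you are using the OpenAI Python SDK, place user_id under extra_body: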

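```python
# `client` is an already configured OpenAI SDK client pointed at ezrouter.
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    # extra_body merges additional fields into the JSON request body.
    extra_body={"user_id": "your_internal_user_id"},
)
```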

Constraints on the value:

Pattern: [a-zA-Z0-9_-], max 512 characters
Do not include personally identifying information such as names, emails, or phone numbers. Use an opaque identifier from your side (e.g. a hashed user ID).

user_id is informational only; it does not unlock additional capacity or change how requests are queued.
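One way to produce an opaque value that satisfies these constraints is to derive it from your internal ID with a keyed hash. The sketch below is an assumption-laden illustration: opaque_user_id and is_valid_user_id are hypothetical helpers, the secret key is yours to manage, and the check assumes the pattern means between 1 and 512 characters drawn from the listed character class.

```python
import hashlib
import hmac
import re

# Assumed reading of the documented constraint: 1 to 512 characters
# from [a-zA-Z0-9_-].
USER_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]{1,512}")


def opaque_user_id(internal_id: str, secret: bytes) -> str:
    """Derive a stable, non-reversible identifier from an internal ID.

    HMAC-SHA256 with a private key avoids exposing raw IDs and resists
    guessing of small ID spaces; the hex digest is 64 characters of
    [0-9a-f], which fits the allowed character set.
    """
    return hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_user_id(value: str) -> bool:
    """Check a candidate user_id against the assumed pattern and length."""
    return USER_ID_PATTERN.fullmatch(value) is not None
```

The same internal ID and key always yield the same user_id, so per-user attribution stays consistent across requests.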

Request keep-alive behavior

For requests that take significant time to begin generating, the ezrouter gateway holds the HTTP connection open and may send keep-alive padding so intermediate proxies do not close the connection. You may observe:

Streaming requests: SSE keep-alive comments of the form ": keep-alive" interleaved with the actual data: {...} chunks. Standard SSE parsers ignore comment lines starting with ":"; if you are writing a parser by hand, make sure yours does too.

Non-streaming requests: blank lines may appear before the actual JSON body. The OpenAI surface always returns SSE regardless of the stream value (see POST /v1/chat/completions → Differences from OpenAI), so this point is mostly historical for ezrouter: every chat response should be parsed as SSE.
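For hand-written parsers, the handling described above can be sketched as follows. This is a simplified illustration, not a complete SSE implementation: iter_sse_data is a hypothetical helper, it assumes each event carries a single data: line containing JSON, and it assumes the stream ends with the OpenAI-style data: [DONE] sentinel, which this section does not confirm for ezrouter.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines, skipping keep-alives."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ":" are SSE
        # comments, which is the form the keep-alive padding takes.
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue  # Ignore other fields such as "event:" or "id:".
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE strips one leading space.
        if payload == "[DONE]":  # Assumed OpenAI-style end marker.
            return
        yield json.loads(payload)
```

The function accepts any iterable of text lines, so it can be fed from a streaming HTTP response's line iterator or from a list of strings in tests.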

If a request has not begun producing any output after roughly 10 minutes, the gateway will close the connection. Clients should treat this as a transient failure and retry.