Getting started

Rate limits and capacity

ezrouter applies fair-use capacity limits at the gateway's discretion. The contract is:

  • We do not publish specific requests-per-minute, tokens-per-minute, or concurrency numbers. Effective capacity varies with operational conditions and may change at any time without prior notice.

  • Capacity is governed by the Terms of Service §4, "Reasonable use" clause, which reserves the right to throttle, quota, or suspend accounts exhibiting usage patterns that materially exceed bona fide single-account application use.

  • The gateway absorbs backpressure via upstream queueing rather than returning explicit 429 responses. Under load, latency grows; requests are not rejected.

What this means for clients

Build defensively. Specifically:

  • Set sensible request timeouts on your HTTP client. Suggested starting point: 60 seconds for typical chat completions, proportionally higher for very long responses or thinking-mode requests.

  • Back off on actual 5xx and network errors, not on 429 (which ezrouter does not emit for capacity reasons; see Error codes).

  • Detect slow responses with latency metrics, not with status codes or response headers. Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted by ezrouter under any condition.

  • If a request is taking unexpectedly long, prefer cancelling and retrying with the same or a different model rather than waiting indefinitely. Different models route through different upstream providers with different load characteristics at any given moment.
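
The guidance above can be sketched as a small retry wrapper. This is a minimal illustration under stated assumptions, not an ezrouter-provided client: call_with_fallback, TransientError, send, and the backoff constants are hypothetical names and values chosen for this example. send(model, timeout) stands in for whatever HTTP call your application makes and is assumed to raise TransientError on 5xx responses and network failures, and TimeoutError when the client-side timeout elapses.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: raised by `send` for a 5xx response or a network failure."""


def call_with_fallback(send, models, timeout=60.0, max_attempts=4,
                       base_delay=1.0, sleep=time.sleep, on_latency=None):
    """Call `send(model, timeout)`, retrying on transient failures.

    - TimeoutError: retry immediately, moving to the next model in `models`.
    - TransientError: retry the same model after exponential backoff with jitter.
    - `on_latency(model, seconds)` is called after every attempt, so slowness
      is observed through timing rather than status codes or headers.
    """
    if not models or max_attempts < 1:
        raise ValueError("need at least one model and one attempt")
    model_index = 0
    backoffs = 0
    error = None
    for attempt in range(max_attempts):
        model = models[model_index % len(models)]
        started = time.monotonic()
        try:
            result = send(model, timeout)
        except TimeoutError as exc:
            error, timed_out = exc, True
        except TransientError as exc:
            error, timed_out = exc, False
        else:
            error = None
        if on_latency is not None:
            on_latency(model, time.monotonic() - started)
        if error is None:
            return result
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep before giving up
        if timed_out:
            model_index += 1  # slow request: try a different model
        else:
            delay = base_delay * (2 ** backoffs)
            sleep(delay + random.uniform(0, delay / 2))
            backoffs += 1
    raise error
```

Treating timeouts and 5xx errors differently is a design choice for this sketch: a timeout suggests one upstream is slow, so switching models may help, while a 5xx or network error is retried after a pause. Passing a single-element models list retries the same model throughout.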

Per-end-user attribution

You can pass an optional user_id parameter on chat-completion requests to attribute traffic to individual end-users in your application. This helps ezrouter distinguish per-user usage patterns when investigating abuse reports tied to your account.

```json
{
  "model": "claude-sonnet-4-6",
  "messages": [{"role": "user", "content": "Hello!"}],
  "user_id": "your_internal_user_id"
}
```

If you are using the OpenAI Python or Node SDK, place user_id under extra_body:

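```python
# `client` is an already-configured OpenAI SDK client pointed at ezrouter.
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    # user_id is not a standard SDK argument, so it goes in extra_body.
    extra_body={"user_id": "your_internal_user_id"},
)
```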

Constraints on the value:

  • Pattern: [a-zA-Z0-9_-], max 512 characters
  • Do not include personally identifying information such as names, emails, or phone numbers. Use an opaque identifier from your side (e.g. a hashed user ID).

user_id is informational only; it does not unlock additional capacity or change how requests are queued.
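
One way to produce an opaque identifier that satisfies the constraints above is a keyed hash of your internal ID. The sketch below is illustrative only: opaque_user_id, is_valid_user_id, and the secret value are hypothetical, and the regular expression assumes the documented pattern applies to every character with a length of 1 to 512 (whether an empty value is accepted is not stated here).

```python
import hashlib
import hmac
import re

# Assumption: each character must match [a-zA-Z0-9_-], length 1 to 512.
USER_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]{1,512}")


def opaque_user_id(internal_id: str, secret: bytes) -> str:
    """Derive a stable, non-reversible identifier from an internal user ID.

    HMAC-SHA256 with an application-side secret yields 64 hex characters,
    which fits the allowed character set and length.
    """
    return hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_user_id(value: str) -> bool:
    """Check a candidate user_id against the assumed pattern."""
    return USER_ID_PATTERN.fullmatch(value) is not None
```

A keyed hash is used rather than a plain hash so that identifiers cannot be recomputed from a guessed email or username without the secret. The same internal ID always maps to the same value, which keeps per-user attribution stable across requests.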

Request keep-alive behavior

For requests that take significant time to begin generating, the ezrouter gateway holds the HTTP connection open and may send keep-alive padding so intermediate proxies do not close the connection. You may observe:

  • Streaming requests: SSE keep-alive comments of the form ": keep-alive", interleaved with the actual data: {...} chunks. Standard SSE parsers ignore comment lines starting with ":"; if you are writing a parser by hand, ensure yours does too.

  • Non-streaming requests: blank lines may appear before the actual JSON body. The OpenAI surface always returns SSE regardless of stream value (see POST /v1/chat/completions → Differences from OpenAI), so this point is mostly historical for ezrouter: every chat response should be parsed as SSE.
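
For hand-written parsers, the comment-skipping behavior described above can be sketched as follows. This is a simplified illustration, not a full SSE implementation: iter_sse_data is a hypothetical helper, it assumes each data: line carries one complete JSON object, and it assumes the stream ends with the OpenAI-style "data: [DONE]" sentinel, which this page does not confirm for ezrouter.

```python
import json


def iter_sse_data(lines):
    """Yield parsed JSON payloads from an iterable of SSE text lines.

    Blank lines and comment lines (those starting with ":"), such as
    ": keep-alive", are skipped. Fields other than "data:" are ignored.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # event separator or keep-alive comment
        if not line.startswith("data:"):
            continue  # e.g. "event:" or "id:" fields, unused in this sketch
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)
```

With an HTTP client that exposes the response body line by line, the decoded lines can be passed straight to iter_sse_data. A parser that must handle multi-line data: fields should buffer lines until the blank-line event separator instead.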

If a request has not begun producing any output after roughly 10 minutes, the gateway will close the connection. Clients should treat this as a transient failure and retry.