Rate limits and capacity
ezrouter applies fair-use capacity limits at the gateway's discretion. The contract is:
- We do not publish specific request-per-minute, tokens-per-minute, or concurrency numbers. Effective capacity varies with operational conditions and may change at any time without prior notice.
- Capacity is governed by the Terms of Service §4 (Reasonable use) clause, which reserves the right to throttle, quota, or suspend accounts exhibiting usage patterns that materially exceed bona fide single-account application use.
- The gateway absorbs backpressure via upstream queueing rather than returning explicit 429 responses. Under load, latency grows; requests are not rejected.
What this means for clients
Build defensively. Specifically:
- Set sensible request timeouts on your HTTP client. Suggested starting point: 60 seconds for typical chat completions, proportionally higher for very long responses or thinking-mode requests.
- Back off on actual 5xx responses and network errors, not on 429 (which ezrouter does not emit for capacity reasons; see Error codes).
- Detect slow responses with latency metrics, not with status codes or response headers. Standard rate-limit headers (X-RateLimit-*, RateLimit-*, Retry-After) are not emitted by ezrouter under any condition.
- If a request is taking unexpectedly long, prefer cancelling and retrying with the same or a different model rather than waiting indefinitely. Different models route through different upstream providers with different load characteristics at any given moment.
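The guidance above could be combined into a small retry wrapper along the following lines. This is a minimal sketch, not an ezrouter-provided client: `send`, `TransientError`, `backoff_delay`, and `complete_with_retries` are hypothetical names, and `send` is assumed to perform one request with the given timeout and raise `TransientError` on 5xx responses, network errors, and timeouts. Exponential backoff with jitter is one common choice here, not something the gateway requires.

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical: raised by `send` for 5xx responses, network errors, timeouts."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def complete_with_retries(
    send: Callable[[str, float], dict],
    models: Sequence[str],
    timeout_s: float = 60.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    on_latency: Optional[Callable[[str, float], None]] = None,
) -> dict:
    """Call send(model, timeout_s), rotating through `models` on transient failures."""
    if not models:
        raise ValueError("models must not be empty")
    last_error = None
    for attempt in range(max_attempts):
        # Rotate models so a retry may land on a different upstream provider.
        model = models[attempt % len(models)]
        started = time.monotonic()
        try:
            result = send(model, timeout_s)
        except TransientError as exc:
            last_error = exc
        else:
            return result
        finally:
            # Record latency for every attempt; slowness is the only load signal.
            if on_latency is not None:
                on_latency(model, time.monotonic() - started)
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

In a real client, `send` would wrap your HTTP library's call, pass `timeout_s` as its request timeout, and translate 5xx statuses and connection errors into `TransientError`. The `on_latency` hook is where a metrics counter or histogram would go.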
Per-end-user attribution
You can pass an optional user_id parameter on chat-completion requests to attribute traffic to individual end-users in your application. This helps ezrouter distinguish per-user usage patterns when investigating abuse reports tied to your account.
{
  "model": "claude-sonnet-4-6",
  "messages": [{"role": "user", "content": "Hello!"}],
  "user_id": "your_internal_user_id"
}

If you are using the OpenAI Python or Node SDK, place user_id under extra_body:
# `client` is an already-configured OpenAI SDK client pointed at ezrouter.
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"user_id": "your_internal_user_id"},
)

Constraints on the value:
- Pattern: [a-zA-Z0-9_-], max 512 characters
- Do not include personally identifying information: names, emails, phone numbers, etc. Use an opaque identifier from your side (e.g. a hashed user ID).
user_id is informational only; it does not unlock additional capacity or change how requests are queued.
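One way to produce a value that satisfies these constraints is to derive it from your internal identifier with a keyed hash and validate it before sending. The sketch below is illustrative only: `is_valid_user_id` and `opaque_user_id` are hypothetical helpers, HMAC-SHA256 is just one reasonable hashing choice, and the regex assumes the documented pattern applies to the whole value and that empty strings are not accepted.

```python
import hashlib
import hmac
import re

# Assumption: the pattern covers the entire value, 1 to 512 characters.
USER_ID_RE = re.compile(r"[a-zA-Z0-9_-]{1,512}")


def is_valid_user_id(value: str) -> bool:
    """Return True if `value` fits the documented character set and length."""
    return USER_ID_RE.fullmatch(value) is not None


def opaque_user_id(internal_id: str, secret: bytes) -> str:
    """Derive a stable, non-identifying ID (64 hex characters) from an internal ID."""
    return hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()


uid = opaque_user_id("alice@example.com", b"server-side-secret")
assert is_valid_user_id(uid)
assert not is_valid_user_id("alice@example.com")  # '@' and '.' are outside the pattern
```

Using a keyed hash rather than a bare hash means the mapping cannot be recomputed by someone who only knows the original identifier, while the same end-user still maps to the same user_id across requests.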
Request keep-alive behavior
For requests that take significant time to begin generating, the ezrouter gateway holds the HTTP connection open and may send keep-alive padding so intermediate proxies do not close the connection. You may observe:
- Streaming requests: SSE keep-alive comments of the form ": keep-alive" interleaved with the actual data: {...} chunks. Standard SSE parsers ignore comment lines starting with a colon (:); if you are writing a parser by hand, ensure yours does too.
- Non-streaming requests: blank lines may appear before the actual JSON body. The OpenAI surface always returns SSE regardless of the stream value (see POST /v1/chat/completions → Differences from OpenAI), so this point is mostly historical for ezrouter: every chat response is parsed as SSE.
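For a hand-written parser, skipping keep-alive comments could look roughly like the sketch below. `iter_sse_data` is a hypothetical helper, and it makes two assumptions not stated on this page: each data: line carries one complete JSON object, and the stream ends with the OpenAI-style data: [DONE] sentinel. Multi-line SSE events are not handled.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each data: line, skipping comments and blanks."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ':' are SSE comments
        # such as ": keep-alive" and carry no data.
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue  # other SSE fields (event:, id:, retry:) are ignored here
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE strips one optional leading space
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            return
        yield json.loads(payload)


stream = [
    ": keep-alive\n",
    "\n",
    'data: {"n": 1}\n',
    "\n",
    ": keep-alive\n",
    'data: {"n": 2}\n',
    "\n",
    "data: [DONE]\n",
]
chunks = list(iter_sse_data(stream))
assert chunks == [{"n": 1}, {"n": 2}]
```

The same function can be fed the line iterator of a streaming HTTP response, as long as the lines are decoded to text first.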
If a request has not begun producing any output after roughly 10 minutes, the gateway will close the connection. Clients should treat this as a transient failure and retry.