Getting started

Tokens and token usage

Tokens are the units that large language models read and write — and the units ezrouter bills on. A token is usually a short sub-word: a common English word fits in one token, a less common word splits across two or three, and most punctuation is a token of its own.

Different upstream providers tokenize differently, so the same input string yields a different token count on claude-sonnet-4-6 than on deepseek-v4-pro. The number you are billed for is the count the gateway records in the response usage block, not any estimate from your client.

Reading the `usage` block

Every chat completion response (the final SSE chunk) carries a usage object. The fields you actually need are the standard OpenAI three:

json

{
  "usage": {
    "prompt_tokens": 1840,
    "completion_tokens": 412,
    "total_tokens": 2252
  }
}

prompt_tokens — everything you sent: system prompt, message

history, tool definitions, and the current user message.

completion_tokens — what the model generated. Counted before any

client-side trimming.

total_tokens — the sum. Use this for back-of-envelope cost

estimates: total_tokens × rate-per-1M-tokens ÷ 1_000_000.

The same final chunk carries ezrouter-specific extensions (prompt_tokens_details, usage_semantic, Anthropic-aliased fields like input_tokens / output_tokens). Treat the OpenAI-named fields as authoritative. The Anthropic-aliased output_tokens field has been observed to read 0 even when completion_tokens > 0 — do not substitute it. See chat completions divergences for the full surface delta.

Cached tokens

When a prompt is served from prompt cache, the gateway reports the cached portion separately:

json

{
  "usage": {
    "prompt_tokens": 25231,
    "completion_tokens": 96,
    "total_tokens": 25327,
    "prompt_tokens_details": {
      "cached_tokens": 24901
    }
  }
}

The cached portion is billed at a discounted rate (see pricing). On claude-* models, prompt_tokens may be inflated relative to the unique prompt size because Anthropic's billing accounting includes cumulative cache reads. Derive cost from prompt_tokens_details.cached_tokens plus the non-cached remainder, not from prompt_tokens × input_rate alone. See KV cache for the per-model behavior matrix.

Rough character-to-token ratios

For sizing prompts before you send them, these ratios are good enough:

1 English character ≈ 0.3 token
1 Chinese character ≈ 0.6 token
1 line of code ≈ 4–8 tokens, depending on density
A typical 1,000-word English document ≈ 1,300 tokens

These are averages across modern BPE tokenizers. The actual count for your specific input depends on the model. When the answer matters (billing reconciliation, context-window planning), send a one-token request (max_tokens: 1) and read usage.prompt_tokens from the response.

Context window and max output

Every model has two distinct limits:

Context window — the total prompt_tokens + completion_tokens

the model can hold in one request.

Max output — the most completion_tokens the model will produce

in one response.

The current limits for each routed model are in the API reference. Sending a request whose prompt is larger than the model's context window returns upstream_error. Sending a request that would generate more than the model's max_output quietly truncates at the output cap — finish_reason reads length on most models (see finish_reason variance).

Estimating before you send

For client-side estimation, use the upstream tokenizer for the model you are calling. ezrouter does not currently ship its own tokenizer library; the upstream tokenizers are correct:

claude-* models — Anthropic's count_tokens endpoint or the

@anthropic-ai/tokenizer package.

gpt-* models — OpenAI's tiktoken library

(cl100k_base / o200k_base).

deepseek-* models — the upstream vendor's BPE tokenizer

package (matches the V4 family routed by ezrouter).

glm- and kimi- models — vendor-published tokenizers.

If you do not want to maintain a tokenizer per family, the safe strategy is a 20% safety margin against the published context window and a recheck against usage.prompt_tokens after the first response.

Pricing — per-million-token rates by model.
Rate limits — separate from token budgets;

governs request frequency, not token count.

KV cache — when cached tokens kick in.