Tokens and token usage

Tokens are the units that large language models read and write — and the units ezrouter bills on. A token is usually a short sub-word: a common English word fits in one token, a less common word splits across two or three, and most punctuation is a token of its own.

Different upstream providers tokenize differently, so the same input string yields a different token count on claude-sonnet-4-6 than on deepseek-v4-pro. The number you are billed for is the count the gateway records in the response usage block, not any estimate from your client.

Reading the usage block

Every chat completion response carries a usage object; when streaming, it arrives in the final SSE chunk. The fields you actually need are the standard OpenAI three:

json
{
  "usage": {
    "prompt_tokens": 1840,
    "completion_tokens": 412,
    "total_tokens": 2252
  }
}

  • prompt_tokens: everything you sent, including the system prompt, message history, tool definitions, and the current user message.
  • completion_tokens: what the model generated, counted before any client-side trimming.
  • total_tokens: the sum of the two. For a back-of-envelope cost estimate, compute total_tokens × rate-per-1M-tokens ÷ 1_000_000.
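As a minimal sketch, the back-of-envelope formula can be written in Python as follows. The function name rough_cost and the rate of 3.00 per million tokens are illustrative assumptions, not ezrouter prices, and a single blended rate ignores any difference between input and output pricing.

```python
def rough_cost(usage: dict, rate_per_million: float) -> float:
    """Back-of-envelope cost: total_tokens x rate-per-1M-tokens / 1_000_000."""
    return usage["total_tokens"] * rate_per_million / 1_000_000


# Usage block from the sample response above; the rate is a made-up placeholder.
usage = {"prompt_tokens": 1840, "completion_tokens": 412, "total_tokens": 2252}
estimate = rough_cost(usage, rate_per_million=3.00)
print(f"{estimate:.6f}")
```

For anything beyond a rough figure, look up the per-model rates on the pricing page and price prompt and completion tokens separately.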

The same final chunk carries ezrouter-specific extensions (prompt_tokens_details, usage_semantic, Anthropic-aliased fields like input_tokens / output_tokens). Treat the OpenAI-named fields as authoritative. The Anthropic-aliased output_tokens field has been observed to read 0 even when completion_tokens > 0 — do not substitute it. See chat completions divergences for the full surface delta.

Cached tokens

When a prompt is served from prompt cache, the gateway reports the cached portion separately:

json
{
  "usage": {
    "prompt_tokens": 25231,
    "completion_tokens": 96,
    "total_tokens": 25327,
    "prompt_tokens_details": {
      "cached_tokens": 24901
    }
  }
}

The cached portion is billed at a discounted rate (see pricing). On claude-* models, prompt_tokens may be inflated relative to the unique prompt size because Anthropic's billing accounting includes cumulative cache reads. Derive cost from prompt_tokens_details.cached_tokens plus the non-cached remainder, not from prompt_tokens × input_rate alone. See KV cache for the per-model behavior matrix.
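One way to split the prompt cost is sketched below. It assumes that cached_tokens is a subset of prompt_tokens, as the numbers in the sample response suggest, and that rates are quoted per 1M tokens. The function name and both rates are hypothetical placeholders, not ezrouter prices.

```python
def prompt_cost(usage: dict, input_rate: float, cached_rate: float) -> float:
    """Prompt-side cost with cached and non-cached tokens priced separately.

    Assumes cached_tokens is included in prompt_tokens. Rates are per 1M tokens.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    # Guard against a negative remainder if the two counters ever disagree.
    uncached = max(usage["prompt_tokens"] - cached, 0)
    return (cached * cached_rate + uncached * input_rate) / 1_000_000


usage = {
    "prompt_tokens": 25231,
    "completion_tokens": 96,
    "total_tokens": 25327,
    "prompt_tokens_details": {"cached_tokens": 24901},
}
# Placeholder rates: 3.00 for regular input, 0.30 for cached input.
split = prompt_cost(usage, input_rate=3.00, cached_rate=0.30)
naive = usage["prompt_tokens"] * 3.00 / 1_000_000
print(f"split={split:.7f} naive={naive:.6f}")
```

With these placeholder rates, the naive prompt_tokens × input_rate figure is roughly nine times the split figure, which is why the distinction matters for reconciliation.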

Rough character-to-token ratios

For sizing prompts before you send them, these ratios are good enough:

  • 1 English character ≈ 0.3 token
  • 1 Chinese character ≈ 0.6 token
  • 1 line of code ≈ 4–8 tokens, depending on density
  • A typical 1,000-word English document ≈ 1,300 tokens

These are averages across modern BPE tokenizers. The actual count for your specific input depends on the model. When the answer matters (billing reconciliation, context-window planning), send a one-token request (max_tokens: 1) and read usage.prompt_tokens from the response.
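For quick sizing, the character ratios listed above can be turned into a small estimator. This is only a sketch: the function name is hypothetical, Chinese text is detected with the basic CJK Unified Ideographs block alone, and every other character is treated as English.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate: 0.6 per Chinese character, 0.3 per other character."""
    # Basic CJK Unified Ideographs block only (U+4E00 to U+9FFF).
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    # Round up so the estimate errs on the side of a larger prompt.
    return math.ceil(other * 0.3 + cjk * 0.6)


print(estimate_tokens("hello world"))
print(estimate_tokens("你好"))
```

The result is a planning number only; the billed count is still the one in usage.prompt_tokens.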

Context window and max output

Every model has two distinct limits:

  • Context window: the total prompt_tokens + completion_tokens the model can hold in one request.
  • Max output: the most completion_tokens the model will produce in one response.

The current limits for each routed model are in the API reference. A request whose prompt is larger than the model's context window returns upstream_error. A request that would generate more than the model's max output is silently truncated at the output cap; finish_reason reads length on most models (see finish_reason variance).
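A client can check both limits before sending. The sketch below is one possible pre-flight check; ModelLimits, plan_request, and the limit values are hypothetical, and the real numbers for each model come from the API reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    context_window: int  # total prompt + completion tokens per request
    max_output: int      # most completion tokens per response


def plan_request(prompt_tokens: int, wanted_output: int, limits: ModelLimits) -> int:
    """Return a max_tokens value that respects both limits.

    Raises ValueError when the prompt leaves no room for output, since such
    a request would fail or produce nothing useful.
    """
    if prompt_tokens >= limits.context_window:
        raise ValueError("prompt leaves no room in the context window")
    room = limits.context_window - prompt_tokens
    return min(wanted_output, limits.max_output, room)


# Placeholder limits, not the published values for any routed model.
limits = ModelLimits(context_window=200_000, max_output=8_192)
print(plan_request(1_840, 16_000, limits))    # capped by max_output
print(plan_request(199_000, 4_000, limits))   # capped by remaining context
```

Clamping max_tokens on the client makes truncation an explicit choice instead of something discovered later through finish_reason.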

Estimating before you send

For client-side estimation, use the upstream tokenizer for the model you are calling. ezrouter does not currently ship its own tokenizer library; the upstream tokenizers are the authoritative source:

  • claude-* models: Anthropic's count_tokens endpoint or the @anthropic-ai/tokenizer package.
  • gpt-* models: OpenAI's tiktoken library (cl100k_base / o200k_base).
  • deepseek-* models: the upstream vendor's BPE tokenizer package (matches the V4 family routed by ezrouter).
  • glm-* and kimi-* models: vendor-published tokenizers.

If you do not want to maintain a tokenizer per family, the safe strategy is a 20% safety margin against the published context window and a recheck against usage.prompt_tokens after the first response.
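The margin-and-recheck strategy might look like the sketch below. It reads "20% safety margin" as keeping the estimated prompt under 80% of the published context window, which is one interpretation; the function names and the 200,000-token window are hypothetical.

```python
def within_margin(estimated_prompt_tokens: int, context_window: int,
                  margin_percent: int = 20) -> bool:
    """True if the estimate fits inside the window minus the safety margin."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    budget = context_window * (100 - margin_percent) // 100
    return estimated_prompt_tokens <= budget


def estimate_drift(estimated_prompt_tokens: int, usage: dict) -> int:
    """Tokens by which the gateway's count exceeds the client-side estimate."""
    return usage["prompt_tokens"] - estimated_prompt_tokens


# Before sending: check a rough estimate against a placeholder window.
print(within_margin(150_000, context_window=200_000))
# After the first response: compare the estimate with the recorded count.
print(estimate_drift(1_500, {"prompt_tokens": 1_840}))
```

A consistently positive drift suggests the estimator undercounts for that model, in which case a wider margin or a per-family tokenizer is the safer choice.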

  • Pricing: per-million-token rates by model.
  • Rate limits: separate from token budgets; they govern request frequency, not token count.
  • KV cache: when cached tokens kick in.