
Multi-round chat

The /v1/chat/completions endpoint is stateless: the gateway does not remember previous turns between requests. To carry context across turns, your client must include the full conversation history in the messages array of every call.

The pattern

python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["EZROUTER_API_KEY"],
    base_url="https://www.ezrouter.dev/v1",
)

messages = [{"role": "user", "content": "What's the highest mountain in the world?"}]

# Round 1
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=messages,
)
messages.append(response.choices[0].message)

# Round 2 — append the new user turn, send the full history
messages.append({"role": "user", "content": "What is the second?"})
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=messages,
)
messages.append(response.choices[0].message)

After round 2 the messages array contains:

json
[
  {"role": "user", "content": "What's the highest mountain in the world?"},
  {"role": "assistant", "content": "Mount Everest."},
  {"role": "user", "content": "What is the second?"},
  {"role": "assistant", "content": "K2."}
]

That same array (plus the next user turn) becomes the input to round 3.

Growing context, growing cost

Each request bills for every prompt token, so the cost of round N scales with the total length of all prior turns. Two ways to keep this bounded:

  • Trim old turns. Drop the earliest user/assistant pairs once the conversation passes a token budget. For most assistants, keeping the last 10–20 turns is enough.

  • Lean on prompt cache. If the first part of the conversation is stable (system prompt, persistent context), the gateway will cache it after the first call and bill subsequent calls at the cached rate for that prefix. See KV cache for the per-model behavior and pricing implications.
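As a rough illustration of the trimming option, the sketch below drops the earliest user/assistant pairs until an estimated token total fits a budget. The function names and the four-characters-per-token heuristic are assumptions made for this example, not gateway behavior, and the code assumes every message is a plain dict with a string `content`. System messages are kept, which also leaves a stable prefix in place.

```python
def estimate_tokens(message):
    # Crude heuristic (assumption): roughly 4 characters per token.
    # Swap in a real tokenizer if you need accurate counts.
    return len(message["content"]) // 4 + 1


def trim_to_budget(messages, budget, count=estimate_tokens):
    """Return a new list with the oldest user/assistant pairs dropped
    until the estimated total is within `budget`."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Stop while at least the most recent message remains.
    while len(turns) > 2 and sum(count(m) for m in system + turns) > budget:
        turns = turns[2:]  # drop the oldest user/assistant pair
    return system + turns
```

Calling `trim_to_budget(messages, budget=8000)` before each request would keep the prompt near the chosen budget; the figure 8000 is arbitrary here.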

System prompts

Put long-lived instructions in a role: "system" message at the start of the messages array. The system prompt counts toward prompt_tokens like any other message, but by convention it stays unchanged from turn to turn.

python
messages = [
    {"role": "system", "content": "You are a terse code reviewer. Reply in fewer than 100 words."},
    {"role": "user", "content": "Review this function: def add(a, b): return a + b"},
]

Context window awareness

The conversation cannot exceed the model's context window. When a request goes over the limit, the upstream returns an upstream_error. The window sizes for each routed model are listed in the API reference.

A defensive pattern: track usage.prompt_tokens after each call. When it crosses 80% of the model's context window, drop the oldest turns before sending the next request.
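One way to sketch this check is shown below. The helper name `trim_if_near_limit` and its parameters are hypothetical, the context window size has to be supplied by the caller from the API reference, and the code assumes messages are plain dicts. It reacts to the token count reported by the previous response rather than estimating tokens itself.

```python
def trim_if_near_limit(messages, prompt_tokens, context_window,
                       threshold=0.8, drop_pairs=1):
    """Drop the oldest user/assistant pairs once the reported prompt size
    crosses `threshold` of the context window; otherwise return the
    history unchanged."""
    if prompt_tokens <= threshold * context_window:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Never drop everything: keep at least the most recent message.
    drop = min(2 * drop_pairs, max(len(turns) - 1, 0))
    drop -= drop % 2  # only remove whole user/assistant pairs
    return system + turns[drop:]
```

After each call, something like `messages = trim_if_near_limit(messages, response.usage.prompt_tokens, context_window)` would apply the rule, where `context_window` is the size you looked up for the model in use.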

Thinking-mode caveat

For claude-* models on the OpenAI surface with thinking enabled, the model's internal reasoning_content from one turn does not feed back into the next turn through this multi-round pattern, because the reasoning buffer is not exposed as a re-injectable field. If you need the model to keep its chain of thought across turns, use the Anthropic surface instead, which round-trips the thinking blocks natively.
