Guides
Thinking mode
Thinking mode (also called extended thinking, reasoning mode, or chain-of-thought) lets a model produce hidden reasoning tokens before its final answer. The reasoning is exposed through a separate reasoning_content field so you can display the thinking process if you choose, or discard it.
ezrouter exposes thinking through both surfaces, but with important caveats about which models support it and how reliably the reasoning round-trips across turns.
Which models support thinking
| Model | Thinking support | Surface notes |
|---|---|---|
claude-opus-4-7 | ✓ | Adaptive thinking; works on both surfaces. Best round-trip on Anthropic surface. |
claude-sonnet-4-6 | ✓ | Extended thinking; same surface caveats. |
claude-haiku-4-5 | ✓ | Extended thinking; same surface caveats. |
deepseek-v4-pro | (upstream supports) | Behavior across the ezrouter gateway not fully verified for v1. |
deepseek-v4-flash | (upstream supports) | Same. |
gpt-, glm-, kimi-* | family-dependent | Not verified for v1; treat as best-effort. |
For mission-critical reasoning workloads, use the claude family.
Enabling thinking
OpenAI surface
The OpenAI surface accepts a thinking field in extra_body and a top-level reasoning_effort:
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ["EZROUTER_API_KEY"],
base_url="https://www.ezrouter.dev/v1",
)
resp = client.chat.completions.create(
model="claude-opus-4-7",
messages=[{"role": "user", "content": "Is 9.11 greater than 9.8?"}],
reasoning_effort="high",
extra_body={"thinking": {"type": "enabled"}},
)
reasoning = getattr(resp.choices[0].message, "reasoning_content", None)
answer = resp.choices[0].message.content
if reasoning:
print("THINKING:", reasoning)
print("ANSWER: ", answer)reasoning_effort accepts:
| Value | Meaning |
|---|---|
low / medium | Mapped to high by the gateway for compatibility. |
high | Standard. The default when thinking is enabled. |
max | Deepest reasoning; slowest, most expensive. |
xhigh | Mapped to max. |
Anthropic surface
The Anthropic surface uses a native thinking parameter:
import anthropic, os
client = anthropic.Anthropic(
base_url="https://www.ezrouter.dev/anthropic",
api_key=os.environ["EZROUTER_API_KEY"],
)
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
thinking={"type": "enabled"},
messages=[{"role": "user", "content": "Is 9.11 greater than 9.8?"}],
)
for block in resp.content:
if block.type == "thinking":
print("THINKING:", block.thinking)
elif block.type == "text":
print("ANSWER: ", block.text)Anthropic's budget_tokens modifier on thinking is currently ignored by the gateway; the effort is controlled via the surface's own output_config.effort field (mapped from reasoning_effort on the OpenAI surface).
Parameters disabled in thinking mode
When thinking is enabled, the following parameters have no effect even though they are accepted:
temperaturetop_ppresence_penaltyfrequency_penalty
This is for compatibility — setting them does not error, but the model ignores them. The reasoning process and final output are controlled by the model itself, not by sampling parameters.
The cross-turn round-trip problem
The chain-of-thought content from one turn is not reliably preserved across turns on the OpenAI surface. This is the GW-003 gateway bug.
On the OpenAI surface:
- Between two
usermessages, the intermediateassistant's
reasoning_content is not re-injected into the next request. Even if you include it in messages, the gateway strips it before forwarding to the upstream.
- This means a multi-turn thinking-mode conversation loses the
reasoning context between turns.
On the Anthropic surface:
- Thinking blocks round-trip correctly when included as content
blocks in the next request.
- This is the preferred surface for any multi-turn workflow that
depends on reasoning continuity.
If your application needs the model to "remember its earlier reasoning" across turns, use the Anthropic surface today.
Multi-turn example (Anthropic surface)
messages = []
def ask(user_text: str):
messages.append({"role": "user", "content": user_text})
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
thinking={"type": "enabled"},
messages=messages,
)
# append the full assistant content (including thinking blocks)
# so the next turn keeps the reasoning context
messages.append({"role": "assistant", "content": resp.content})
return resp
ask("9.11 and 9.8 — which is greater?")
ask("How many Rs are there in 'strawberry'?")Each turn's assistant message preserves the thinking blocks; the next turn re-injects them so the model can build on its earlier reasoning.
Thinking with tool calls
Thinking mode and tool calls compose:
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=4096,
thinking={"type": "enabled"},
tools=tools,
messages=messages,
)The model may emit thinking blocks and tool_use blocks in the same turn. Your client should iterate over resp.content, running any tool_use blocks and appending their results before the next request. Preserve the thinking blocks in the message history for the reasoning to carry forward.
For the OpenAI surface plus tool calls, both GW-001 (tool-call deltas) and GW-003 (reasoning round-trip) bite. Use Anthropic surface for thinking + agent workflows.
Token costs
Thinking tokens count toward completion_tokens in the OpenAI surface response. The Anthropic surface accounts them separately as thinking blocks but bills them at the output rate. Verify expectations against pricing before deploying a thinking-mode workload at scale.
When to use thinking
Good fits:
- Math, logic, or step-wise problems where chain-of-thought
improves correctness.
- Code refactoring or complex agent planning.
- Any task where reasoning quality outranks latency.
Bad fits:
- High-volume cheap tasks (classification, extraction). The extra
thinking tokens inflate cost and latency without much accuracy gain.
- Strict-latency interactive UI. Thinking adds seconds.
Related
- Tool calls — thinking + tools.
- Anthropic API — surface used for reliable
thinking round-trip.
- Multi-round chat — conversation history
structure.