Thinking mode

Thinking mode (also called extended thinking, reasoning mode, or chain-of-thought) lets a model produce reasoning tokens before its final answer. The reasoning is returned separately from the answer (in a reasoning_content field on the OpenAI surface, or as thinking blocks on the Anthropic surface), so you can display the thinking process or discard it.

ezrouter exposes thinking through both surfaces (OpenAI and Anthropic), with important caveats about which models support it and how reliably the reasoning round-trips across turns.

Which models support thinking

| Model | Thinking support | Surface notes |
| --- | --- | --- |
| claude-opus-4-7 | Yes | Adaptive thinking; works on both surfaces. Best round-trip on Anthropic surface. |
| claude-sonnet-4-6 | Yes | Extended thinking; same surface caveats. |
| claude-haiku-4-5 | Yes | Extended thinking; same surface caveats. |
| deepseek-v4-pro | (upstream supports) | Behavior across the ezrouter gateway not fully verified for v1. |
| deepseek-v4-flash | (upstream supports) | Same. |
| gpt-*, glm-*, kimi-* | family-dependent | Not verified for v1; treat as best-effort. |

For mission-critical reasoning workloads, use the Claude family.

Enabling thinking

OpenAI surface

The OpenAI surface accepts a thinking field in extra_body and a top-level reasoning_effort:

python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["EZROUTER_API_KEY"],
    base_url="https://www.ezrouter.dev/v1",
)

resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Is 9.11 greater than 9.8?"}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

reasoning = getattr(resp.choices[0].message, "reasoning_content", None)
answer = resp.choices[0].message.content

if reasoning:
    print("THINKING:", reasoning)
print("ANSWER:  ", answer)

reasoning_effort accepts:

| Value | Meaning |
| --- | --- |
| low / medium | Mapped to high by the gateway for compatibility. |
| high | Standard. The default when thinking is enabled. |
| max | Deepest reasoning; slowest, most expensive. |
| xhigh | Mapped to max. |
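
The mapping in the table above can be mirrored on the client, for example to log which effort level a request will actually run at. This is only a sketch: the function name is hypothetical, and the gateway applies the real mapping server-side.

```python
# Client-side mirror of the documented reasoning_effort mapping.
# Hypothetical helper; the gateway performs the real mapping itself.
_EFFORT_MAP = {
    "low": "high",     # mapped up for compatibility
    "medium": "high",  # mapped up for compatibility
    "high": "high",
    "max": "max",
    "xhigh": "max",    # alias of max
}


def effective_effort(value: str) -> str:
    """Return the effort level the gateway is documented to apply."""
    try:
        return _EFFORT_MAP[value]
    except KeyError:
        raise ValueError(f"unsupported reasoning_effort: {value!r}") from None
```

Calling effective_effort("medium") returns "high", which makes it clear that low and medium do not reduce cost on this gateway.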

Anthropic surface

The Anthropic surface uses a native thinking parameter:

python
import anthropic, os

client = anthropic.Anthropic(
    base_url="https://www.ezrouter.dev/anthropic",
    api_key=os.environ["EZROUTER_API_KEY"],
)

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    thinking={"type": "enabled"},
    messages=[{"role": "user", "content": "Is 9.11 greater than 9.8?"}],
)

for block in resp.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking)
    elif block.type == "text":
        print("ANSWER:  ", block.text)

Anthropic's budget_tokens setting on thinking is currently ignored by the gateway. Effort is instead controlled through the Anthropic surface's own output_config.effort field, which is also what reasoning_effort on the OpenAI surface maps to.
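
As a rough sketch, the effort could be passed on the Anthropic surface as shown below. It assumes the gateway reads a request-body field shaped like {"output_config": {"effort": ...}}, which is inferred from the field name above and not verified. The helper name is hypothetical; extra_body is the Anthropic SDK's general mechanism for adding fields to the request body.

```python
from typing import Any


def thinking_kwargs(effort: str = "high") -> dict[str, Any]:
    """Build keyword arguments for client.messages.create() with thinking on.

    Assumption: the gateway accepts effort as output_config.effort in the
    JSON body. budget_tokens is left out because the gateway ignores it.
    """
    return {
        "thinking": {"type": "enabled"},
        # Assumed body shape; sent verbatim alongside the typed parameters.
        "extra_body": {"output_config": {"effort": effort}},
    }
```

The result can be splatted into a call, for example client.messages.create(model="claude-opus-4-7", max_tokens=2048, messages=messages, **thinking_kwargs("max")).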

Parameters disabled in thinking mode

When thinking is enabled, the following parameters have no effect even though they are accepted:

  • temperature
  • top_p
  • presence_penalty
  • frequency_penalty

This is for compatibility: setting them does not raise an error, but the model ignores them. The reasoning process and final output are controlled by the model itself, not by sampling parameters.
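
Because the gateway accepts these parameters silently, a client may prefer to drop them before sending, so that nobody assumes they took effect. The helper below is a minimal sketch with a hypothetical name; it only filters a dictionary of request arguments.

```python
# Parameters the guide lists as having no effect in thinking mode.
IGNORED_IN_THINKING = (
    "temperature",
    "top_p",
    "presence_penalty",
    "frequency_penalty",
)


def drop_ignored_sampling_params(kwargs: dict) -> tuple[dict, list[str]]:
    """Return (cleaned kwargs, names that were dropped).

    The input dictionary is not modified.
    """
    dropped = [name for name in IGNORED_IN_THINKING if name in kwargs]
    cleaned = {k: v for k, v in kwargs.items() if k not in IGNORED_IN_THINKING}
    return cleaned, dropped
```

The list of dropped names can be logged as a warning so the caller knows which settings were not applied.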

The cross-turn round-trip problem

The chain-of-thought content from one turn is not reliably preserved across turns on the OpenAI surface. This is the GW-003 gateway bug.

On the OpenAI surface:

  • Between two user messages, the intermediate assistant's reasoning_content is not re-injected into the next request. Even if you include it in messages, the gateway strips it before forwarding to the upstream.
  • This means a multi-turn thinking-mode conversation loses the reasoning context between turns.

On the Anthropic surface:

  • Thinking blocks round-trip correctly when included as content blocks in the next request.
  • This is the preferred surface for any multi-turn workflow that depends on reasoning continuity.

If your application needs the model to "remember its earlier reasoning" across turns, use the Anthropic surface today.
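
To reason about the effect of GW-003 in local tests, the stripping can be approximated as below. This is a simplified model, assuming the gateway simply removes the reasoning_content key from each message; the actual gateway behavior may differ, and the function name is hypothetical.

```python
def as_forwarded_by_openai_surface(messages: list[dict]) -> list[dict]:
    """Approximate what reaches the upstream on the OpenAI surface.

    Assumption: GW-003 behaves like dropping the reasoning_content key.
    Returns new dictionaries; the caller's history is left untouched.
    """
    return [
        {key: value for key, value in message.items() if key != "reasoning_content"}
        for message in messages
    ]
```

Comparing the history you send with the output of this function shows exactly which context the model will not see on the next turn.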

Multi-turn example (Anthropic surface)

python
messages = []

def ask(user_text: str):
    messages.append({"role": "user", "content": user_text})
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        thinking={"type": "enabled"},
        messages=messages,
    )
    # append the full assistant content (including thinking blocks)
    # so the next turn keeps the reasoning context
    messages.append({"role": "assistant", "content": resp.content})
    return resp

ask("9.11 and 9.8 — which is greater?")
ask("How many Rs are there in 'strawberry'?")

Each turn's assistant message preserves the thinking blocks; the next turn re-injects them so the model can build on its earlier reasoning.

Thinking with tool calls

Thinking mode and tool calls compose:

python
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    thinking={"type": "enabled"},
    tools=tools,
    messages=messages,
)

The model may emit thinking blocks and tool_use blocks in the same turn. Your client should iterate over resp.content, running any tool_use blocks and appending their results before the next request. Preserve the thinking blocks in the message history for the reasoning to carry forward.
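
The loop described above might look like the following sketch. It assumes content blocks expose type, id, name, and input attributes, as the Anthropic SDK's response blocks do, and that handlers is your own mapping from tool name to a Python callable. Both the helper name and handlers are hypothetical.

```python
from typing import Any, Callable


def run_tool_calls(
    content: list[Any],
    handlers: dict[str, Callable[..., Any]],
) -> list[dict]:
    """Run every tool_use block and return matching tool_result blocks."""
    results = []
    for block in content:
        if block.type != "tool_use":
            continue  # thinking and text blocks stay in the history as-is
        output = handlers[block.name](**block.input)
        results.append(
            {
                "type": "tool_result",
                "tool_use_id": block.id,  # ties the result to its call
                "content": str(output),
            }
        )
    return results
```

After each response, append the full assistant content (thinking blocks included) to messages, then append a user message whose content is the list returned by run_tool_calls, and send the next request.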

On the OpenAI surface, combining thinking with tool calls runs into both GW-001 (tool-call deltas) and GW-003 (reasoning round-trip). Use the Anthropic surface for thinking + agent workflows.

Token costs

Thinking tokens count toward completion_tokens in the OpenAI surface response. The Anthropic surface accounts them separately as thinking blocks but bills them at the output rate. Verify expectations against pricing before deploying a thinking-mode workload at scale.
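
A quick estimate of output cost can be computed from the usage figures on a response. The sketch below is plain arithmetic; the price is a placeholder you must take from the pricing page, and it assumes the token count you pass in already includes thinking tokens (as completion_tokens does on the OpenAI surface).

```python
def estimate_output_cost(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate output cost in USD for a token count and a per-million rate.

    The rate is a caller-supplied placeholder, not an ezrouter price.
    """
    return output_tokens / 1_000_000 * usd_per_million_tokens
```

On the OpenAI surface the count would come from resp.usage.completion_tokens. On the Anthropic surface resp.usage.output_tokens is the closest equivalent, though whether it includes thinking tokens through the gateway should be checked.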

When to use thinking

Good fits:

  • Math, logic, or step-wise problems where chain-of-thought improves correctness.

  • Code refactoring or complex agent planning.
  • Any task where reasoning quality outranks latency.

Bad fits:

  • High-volume cheap tasks (classification, extraction). The extra thinking tokens inflate cost and latency without much accuracy gain.

  • Strict-latency interactive UI. Thinking adds seconds.
