# Provider Experiments
This page summarizes local provider experiments run against the SDK integration layer, with a narrow focus on two things:
- reasoning output shape
- tool-calling reliability
The goal was not to compare answer quality. It was to understand which provider paths keep the app-facing schema stable and which paths force provider-specific branching into runtime code.
## Scope
The experiments covered these routes:
- direct OpenAI via AI SDK
- direct OpenRouter via AI SDK
- Bifrost -> OpenAI
- Bifrost -> Gemini
- Bifrost -> OpenRouter -> OpenAI
- Bifrost -> OpenRouter -> Gemini
The common reasoning prompt used in the comparison runs was:

```text
Think deeply and solve this carefully. You have 25 horses. You can race 5 horses at a time. There is no stopwatch, so you only know the finishing order within each race. What is the minimum number of races needed to determine the 1st, 2nd, and 3rd fastest horses? Explain briefly and end with exactly one line in the format: Final answer: <number> races.
```

All successful runs answered:

```text
Final answer: 7 races.
```

## Findings
### Direct OpenAI is the clean baseline
Direct OpenAI reasoning stayed on the standard AI SDK reasoning event family:
`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`
This is the cleanest baseline from the experiments. Reasoning metadata was also minimal compared with OpenRouter-backed runs.
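The baseline family above is small enough to check with a simple guard. The sketch below is illustrative app-side code, not SDK code; the event names are the ones observed in the direct OpenAI runs.

```typescript
// The standard AI SDK reasoning event family observed on the direct
// OpenAI route in these experiments.
const STANDARD_REASONING_EVENTS = new Set([
  'reasoning',
  'reasoning-delta',
  'reasoning-start',
  'reasoning-end',
]);

// Returns true when a streamed part stays inside the baseline family,
// i.e. no provider-specific branching is needed to handle it.
function isStandardReasoningEvent(type: string): boolean {
  return STANDARD_REASONING_EVENTS.has(type);
}
```

A guard like this is useful in tests: a route passes the "clean baseline" bar only if every streamed reasoning part satisfies it.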
### Direct OpenRouter varies by routed model
Direct OpenRouter did not keep reasoning events uniform across the models tested.
For `openrouter/openai/gpt-5.4-nano`, the streamed reasoning output included:

`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`, `reasoning.summary`, `reasoning.encrypted`
For `openrouter/google/gemini-3-flash-preview`, the streamed reasoning output included:

`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`, `reasoning.text`, `reasoning.encrypted`
This is the main schema issue that leaked into app code. Even though OpenRouter presents one Responses API, the inner reasoning event taxonomy still differed across routed providers.
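To make the branching pressure concrete, here is a minimal sketch of the kind of normalizer app code ends up writing when it consumes direct OpenRouter streams. The event names come from the runs above; the mapping policy itself (folding both variants into `reasoning-delta`) is a hypothetical choice, not something the SDK or OpenRouter prescribes.

```typescript
// Fold OpenRouter's per-model reasoning variants back into the
// standard family. `reasoning.summary` appeared on the GPT route,
// `reasoning.text` on the Gemini route.
function normalizeReasoningEventType(type: string): string {
  switch (type) {
    case 'reasoning.summary': // seen on openrouter/openai/* runs
    case 'reasoning.text':    // seen on openrouter/google/* runs
      return 'reasoning-delta';
    case 'reasoning.encrypted':
      // Opaque payload with nothing renderable; treat it as a bare
      // reasoning event in this sketch.
      return 'reasoning';
    default:
      return type; // already in the standard family
  }
}
```

Every such switch is a branch that has to be kept in sync with whatever OpenRouter routes next.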
### Bifrost normalizes reasoning shape well
Bifrost produced a stable reasoning event family across the tested provider paths.
For:

- Bifrost -> OpenAI
- Bifrost -> Gemini
- Bifrost -> OpenRouter -> OpenAI
- Bifrost -> OpenRouter -> Gemini
the reasoning stream stayed on the same family:
`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`
This was the strongest normalization result in the experiments. In practice, Bifrost hid the `reasoning.summary` vs `reasoning.text` split that appeared in direct OpenRouter runs.
## Reasoning Comparison

### What matters for app code
At the AI SDK surface, all routes still returned the expected high-level fields such as:
`text`, `reasoning`, `usage`
The divergence was in lower-level streaming events and provider-specific details. For reasoning-heavy code paths, the important difference was:
- direct OpenAI stayed stable
- direct OpenRouter changed reasoning event taxonomy by routed model
- Bifrost normalized those differences back into one app-facing reasoning family
### Practical implication
If runtime code needs to parse or stream reasoning, direct OpenRouter increases branching pressure. Bifrost reduces that pressure by normalizing reasoning before the rest of the app sees it.
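As a sketch of what "reduced branching pressure" buys: with a normalized stream, one accumulator covers all four Bifrost routes. The part shape here is a simplified stand-in for AI SDK stream parts, not the SDK's actual types.

```typescript
// Simplified stand-in for a streamed reasoning part.
type ReasoningPart = { type: string; delta?: string };

// Accumulate reasoning text assuming only the normalized event family.
// Behind Bifrost, this single code path covered all four tested routes;
// behind direct OpenRouter it would need per-model branches.
function collectReasoning(parts: ReasoningPart[]): string {
  let text = '';
  for (const part of parts) {
    if (part.type === 'reasoning-delta' && part.delta) {
      text += part.delta;
    }
  }
  return text;
}
```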
## Tool Calling
Reasoning normalization is only half of the story. Tool-calling reliability matters more in production.
### OpenRouter-backed models were safer on Chat than Responses
An app-level experiment temporarily changed `bifrostChatModel(...)` to use the Bifrost Responses path instead of Chat Completions for non-OpenAI models.
That change caused real failures:
- Studio execution-plan e2e failed because a `consultSpecialist` tool call arrived with malformed input and the run stayed incomplete.
- A tool e2e path failed when `researchTopic` eventually used `openrouter/stepfun/step-3.5-flash` through `/v1/responses` and the upstream returned `400 Invalid Responses API request`.
- A direct Studio chat request using `openrouter/google/gemini-3-flash-preview` started, made one tool call, then failed mid-stream with `400 Invalid Responses API request`.
The important point is that these were not theoretical schema issues. They were app-visible failures in real turn execution.
### Why this happens
The local Bifrost source explains the behavior:
- OpenRouter chat requests are forwarded to `/v1/chat/completions`.
- OpenRouter responses requests are forwarded to `/v1/responses`.
- For OpenRouter Responses, Bifrost uses the generic OpenAI Responses request builder.
- Bifrost does not automatically downgrade a failing Responses request back to Chat Completions.
That means `openrouter/*` models are only as good as OpenRouter's Responses compatibility for that exact routed model.
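The forwarding rule can be written down as a pure function. This is an illustrative model of the behavior described above, not Bifrost source code (Bifrost itself is a separate gateway); the key property it encodes is that the mapping is static, with no fallback from Responses to Chat Completions on failure.

```typescript
type UpstreamApi = 'chat' | 'responses';

// Static mapping: Bifrost forwards OpenRouter chat requests to
// /v1/chat/completions and responses requests to /v1/responses.
// There is no automatic downgrade, so a model whose Responses
// compatibility is broken upstream fails rather than retrying on chat.
function openRouterUpstreamPath(api: UpstreamApi): string {
  return api === 'chat' ? '/v1/chat/completions' : '/v1/responses';
}
```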
By contrast, direct providers like Gemini have provider-specific Responses conversion inside Bifrost, which is a stronger path than treating them as generic OpenAI-compatible providers.
## Recommendations

### Keep this routing rule for now
- Use `bifrostModel(...)` for providers and models with validated Responses support.
- Keep `openrouter/*` on `bifrostChatModel(...)`.
That is the right production tradeoff today. It keeps reasoning normalized and avoids the OpenRouter Responses failures seen in tool-heavy app flows.
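The routing rule above can be kept in one selector so it is applied consistently. `bifrostModel(...)` and `bifrostChatModel(...)` are this codebase's own helpers; the function below only decides which of the two paths to take, using the `openrouter/` prefix check as its assumption.

```typescript
// Encode the production routing rule: openrouter/* models stay on the
// Chat Completions helper; everything else may use the Responses helper
// (for providers with validated Responses support).
function chooseBifrostPath(modelId: string): 'responses' | 'chat' {
  return modelId.startsWith('openrouter/') ? 'chat' : 'responses';
}
```

Centralizing the decision means a later policy change (for example, when OpenRouter Responses support matures) touches one function instead of every call site.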
### Do not treat Chat Completions as second-class
For this codebase, Chat Completions is not a downgrade. It is the more reliable path for OpenRouter-backed models.
The underlying model is the same on either path. The difference is transport and payload compatibility:
- Chat worked reliably in the app
- Responses caused model-specific breakage on OpenRouter-backed routes
### Re-test OpenRouter Responses later
OpenRouter documents its Responses API as beta and warns about breaking changes. If Responses support matures for the models used in this repo, this should be retested. Until then, the safer production rule is:
- direct providers with validated Responses support: use Responses when needed
- OpenRouter-backed models: keep using Chat Completions
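One way to make that re-test cheap is an explicit allow-list: models default to Chat Completions until their Responses path has been validated. The list contents and function names here are hypothetical, shown only to illustrate the rule.

```typescript
// Models whose Responses path has been validated end-to-end.
// Empty by default; add entries only after re-testing, and keep
// openrouter/* models out while the OpenRouter Responses API is beta.
const VALIDATED_RESPONSES_MODELS = new Set<string>([]);

// Safer production rule: Responses only for validated models,
// Chat Completions for everything else.
function preferredApi(modelId: string): 'responses' | 'chat' {
  return VALIDATED_RESPONSES_MODELS.has(modelId) ? 'responses' : 'chat';
}
```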