# Provider Experiments
This page summarizes local provider experiments run against the SDK integration layer, with a narrow focus on two things:
- reasoning output shape
- tool-calling reliability
The goal was not to compare answer quality. It was to understand which provider paths keep the app-facing schema stable and which paths force provider-specific branching into runtime code.
## Scope
The experiments covered these routes:
- direct OpenAI via AI SDK
- direct OpenRouter via AI SDK
- Bifrost -> OpenAI
- Bifrost -> Gemini
- Bifrost -> OpenRouter -> OpenAI
- Bifrost -> OpenRouter -> Gemini
The common reasoning prompt used in the comparison runs was:

```text
Think deeply and solve this carefully. You have 25 horses. You can race 5 horses at a time. There is no stopwatch, so you only know the finishing order within each race. What is the minimum number of races needed to determine the 1st, 2nd, and 3rd fastest horses? Explain briefly and end with exactly one line in the format: Final answer: <number> races.
```

All successful runs answered:

```text
Final answer: 7 races.
```

## Findings
### Direct OpenAI is the clean baseline
Direct OpenAI reasoning stayed on the standard AI SDK reasoning event family:
`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`
This is the cleanest baseline from the experiments. Reasoning metadata was also minimal compared with OpenRouter-backed runs.
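The baseline family above is small enough to check with a simple guard. The sketch below is illustrative app-side code, not SDK code; the event names are the ones observed in the direct OpenAI runs.

```typescript
// The standard AI SDK reasoning event family observed on the direct
// OpenAI route in these experiments.
const STANDARD_REASONING_EVENTS = new Set([
  'reasoning',
  'reasoning-delta',
  'reasoning-start',
  'reasoning-end',
]);

// Returns true when a streamed part stays inside the baseline family,
// i.e. no provider-specific branching is needed to handle it.
function isStandardReasoningEvent(type: string): boolean {
  return STANDARD_REASONING_EVENTS.has(type);
}
```

A guard like this is useful in tests: a route passes the "clean baseline" bar only if every streamed reasoning part satisfies it.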
### Direct OpenRouter varies by routed model
Direct OpenRouter did not keep reasoning events uniform across the models tested.
For `openrouter/openai/gpt-5.4-nano`, the streamed reasoning output included:

`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`, `reasoning.summary`, `reasoning.encrypted`
For `openrouter/google/gemini-3-flash-preview`, the streamed reasoning output included:

`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`, `reasoning.text`, `reasoning.encrypted`
This is the main schema issue that leaked into app code. Even though OpenRouter presents one Responses API, the inner reasoning event taxonomy still differed across routed providers.
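To make the branching pressure concrete, here is a minimal sketch of the kind of normalizer app code ends up writing when it consumes direct OpenRouter streams. The event names come from the runs above; the mapping policy itself (folding both variants into `reasoning-delta`) is a hypothetical choice, not something the SDK or OpenRouter prescribes.

```typescript
// Fold OpenRouter's per-model reasoning variants back into the
// standard family. `reasoning.summary` appeared on the GPT route,
// `reasoning.text` on the Gemini route.
function normalizeReasoningEventType(type: string): string {
  switch (type) {
    case 'reasoning.summary': // seen on openrouter/openai/* runs
    case 'reasoning.text':    // seen on openrouter/google/* runs
      return 'reasoning-delta';
    case 'reasoning.encrypted':
      // Opaque payload with nothing renderable; treat it as a bare
      // reasoning event in this sketch.
      return 'reasoning';
    default:
      return type; // already in the standard family
  }
}
```

Every such switch is a branch that has to be kept in sync with whatever OpenRouter routes next.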
### Bifrost normalizes reasoning shape well
Bifrost produced a stable reasoning event family across the tested provider paths.
For:

- Bifrost -> OpenAI
- Bifrost -> Gemini
- Bifrost -> OpenRouter -> OpenAI
- Bifrost -> OpenRouter -> Gemini
the reasoning stream stayed on the same family:
`reasoning`, `reasoning-delta`, `reasoning-start`, `reasoning-end`
This was the strongest normalization result in the experiments. In practice, Bifrost hid the `reasoning.summary` vs `reasoning.text` split that appeared in direct OpenRouter runs.
## Reasoning Comparison

### What matters for app code
At the AI SDK surface, all routes still returned the expected high-level fields such as:
`text`, `reasoning`, `usage`
The divergence was in lower-level streaming events and provider-specific details. For reasoning-heavy code paths, the important difference was:
- direct OpenAI stayed stable
- direct OpenRouter changed reasoning event taxonomy by routed model
- Bifrost normalized those differences back into one app-facing reasoning family
### Practical implication
If runtime code needs to parse or stream reasoning, direct OpenRouter increases branching pressure. Bifrost reduces that pressure by normalizing reasoning before the rest of the app sees it.
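As a sketch of what "reduced branching pressure" buys: with a normalized stream, one accumulator covers all four Bifrost routes. The part shape here is a simplified stand-in for AI SDK stream parts, not the SDK's actual types.

```typescript
// Simplified stand-in for a streamed reasoning part.
type ReasoningPart = { type: string; delta?: string };

// Accumulate reasoning text assuming only the normalized event family.
// Behind Bifrost, this single code path covered all four tested routes;
// behind direct OpenRouter it would need per-model branches.
function collectReasoning(parts: ReasoningPart[]): string {
  let text = '';
  for (const part of parts) {
    if (part.type === 'reasoning-delta' && part.delta) {
      text += part.delta;
    }
  }
  return text;
}
```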
## Tool Calling
Reasoning normalization is only half of the story. Tool-calling reliability matters more in production.
### OpenRouter-backed models were safer on Chat than Responses
An app-level experiment temporarily changed `bifrostChatModel(...)` to use the Bifrost Responses path instead of Chat Completions for non-OpenAI models.
That change caused real failures:
- Studio execution-plan e2e failed because a `consultSpecialist` tool call arrived with malformed input and the run stayed incomplete.
- A tool e2e path failed when `researchTopic` eventually used `openrouter/stepfun/step-3.5-flash` through `/v1/responses` and the upstream returned `400 Invalid Responses API request`.
- A direct Studio chat request using `openrouter/google/gemini-3-flash-preview` started, made one tool call, then failed mid-stream with `400 Invalid Responses API request`.
The important point is that these were not theoretical schema issues. They were app-visible failures in real turn execution.
### Why this happens
The local Bifrost source explains the behavior:
- OpenRouter chat requests are forwarded to `/v1/chat/completions`.
- OpenRouter responses requests are forwarded to `/v1/responses`.
- For OpenRouter Responses, Bifrost uses the generic OpenAI Responses request builder.
- Bifrost does not automatically downgrade a failing Responses request back to Chat Completions.
That means `openrouter/*` models are only as good as OpenRouter's Responses compatibility for that exact routed model.
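The forwarding rule can be written down as a pure function. This is an illustrative model of the behavior described above, not Bifrost source code (Bifrost itself is a separate gateway); the key property it encodes is that the mapping is static, with no fallback from Responses to Chat Completions on failure.

```typescript
type UpstreamApi = 'chat' | 'responses';

// Static mapping: Bifrost forwards OpenRouter chat requests to
// /v1/chat/completions and responses requests to /v1/responses.
// There is no automatic downgrade, so a model whose Responses
// compatibility is broken upstream fails rather than retrying on chat.
function openRouterUpstreamPath(api: UpstreamApi): string {
  return api === 'chat' ? '/v1/chat/completions' : '/v1/responses';
}
```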
By contrast, direct providers like Gemini have provider-specific Responses conversion inside Bifrost, which is a stronger path than treating them as generic OpenAI-compatible providers.
## Recommendations

### Keep this routing rule for now
- Use `bifrostModel(...)` for providers and models with validated Responses support.
- Keep `openrouter/*` on `bifrostChatModel(...)`.
That is the right production tradeoff today. It keeps reasoning normalized and avoids the OpenRouter Responses failures seen in tool-heavy app flows.
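The routing rule above can be kept in one selector so it is applied consistently. `bifrostModel(...)` and `bifrostChatModel(...)` are this codebase's own helpers; the function below only decides which of the two paths to take, using the `openrouter/` prefix check as its assumption.

```typescript
// Encode the production routing rule: openrouter/* models stay on the
// Chat Completions helper; everything else may use the Responses helper
// (for providers with validated Responses support).
function chooseBifrostPath(modelId: string): 'responses' | 'chat' {
  return modelId.startsWith('openrouter/') ? 'chat' : 'responses';
}
```

Centralizing the decision means a later policy change (for example, when OpenRouter Responses support matures) touches one function instead of every call site.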
### Do not treat Chat Completions as second-class
For this codebase, Chat Completions is not a downgrade. It is the more reliable path for OpenRouter-backed models.
The underlying model is the same on either path. The difference is transport and payload compatibility:
- Chat worked reliably in the app
- Responses caused model-specific breakage on OpenRouter-backed routes
### Re-test OpenRouter Responses later
OpenRouter documents its Responses API as beta and warns about breaking changes. If Responses support matures for the models used in this repo, this should be retested. Until then, the safer production rule is:
- direct providers with validated Responses support: use Responses when needed
- OpenRouter-backed models: keep using Chat Completions
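One way to make that re-test cheap is an explicit allow-list: models default to Chat Completions until their Responses path has been validated. The list contents and function names here are hypothetical, shown only to illustrate the rule.

```typescript
// Models whose Responses path has been validated end-to-end.
// Empty by default; add entries only after re-testing, and keep
// openrouter/* models out while the OpenRouter Responses API is beta.
const VALIDATED_RESPONSES_MODELS = new Set<string>([]);

// Safer production rule: Responses only for validated models,
// Chat Completions for everything else.
function preferredApi(modelId: string): 'responses' | 'chat' {
  return VALIDATED_RESPONSES_MODELS.has(modelId) ? 'responses' : 'chat';
}
```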