Guilherme Costa

Founding AI Engineer · Copenhagen

Model provider variance in structured extraction

May 2026 · Technical note

TL;DR

We spent several hours debugging what looked like a prompt-engineering issue in a structured extraction pipeline. This turned out not to be a prompt issue.

The same OpenRouter model (google/gemini-3-flash-preview) was being routed to different upstream providers, and one provider consistently failed to extract certain fields when assistant history contained list-shaped content.

Practical lesson:

  • log the routed provider
  • treat provider identity as part of the request fingerprint
  • be careful with latency-based routing on structured-output workloads

For structured-output workloads, provider choice can be part of correctness, not just latency or cost.

We run an LLM workflow that extracts structured constraints from conversation history and feeds them into a planning state machine.

One of the user flows kept failing. The conversation looked roughly like this:

User: "I want to go to Brazil in two weeks."

Assistant: suggests destination cities.

User: "Let's go for Rio! We're two adults. No specific dates. Budget less than 5000 euros."

At that point the planner should advance toward itinerary generation, given that it had captured all necessary constraints: destinations, travellers, budget and date flexibility.

Instead, the extraction layer returned:

{
  "destinations": [{ "type": "fixed", "options": ["Rio"] }],
  "travellers": { "adults": 2 }
}

The budget and date flexibility were missing.

The state machine interpreted this as incomplete planning state and routed the conversation back into information gathering.
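The gating behaviour described above can be sketched roughly as follows. This is a minimal illustration, not our actual state machine: the `PlanningState` type, the `REQUIRED_FIELDS` list, and the step names are assumptions that mirror the fields shown in the JSON above.

```typescript
// Hypothetical shape of the extracted planning state; field names follow the JSON above.
interface PlanningState {
  destinations?: { type: string; options: string[] }[];
  travellers?: { adults: number };
  budget?: string;
  flexible_dates?: boolean;
}

// Assumed list of constraints the planner needs before leaving information gathering.
const REQUIRED_FIELDS = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type RequiredField = (typeof REQUIRED_FIELDS)[number];

// Returns the required fields that the extraction layer did not populate.
export function missingFields(state: PlanningState): RequiredField[] {
  return REQUIRED_FIELDS.filter((field) => state[field] === undefined);
}

// Any missing field sends the conversation back to information gathering.
export function nextStep(state: PlanningState): "gather_information" | "generate_itinerary" {
  return missingFields(state).length === 0 ? "generate_itinerary" : "gather_information";
}
```

With the partial extraction shown above, `missingFields` would report `budget` and `flexible_dates`, which is why the planner looped back instead of advancing.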

Theory 1: prompt specificity. Our first assumption was that the prompt was not explicit enough. The extraction prompt mentioned budget mostly in negative examples (such as "don't infer budget from destination"), so we added explicit extraction guidance:

Set budget whenever the user explicitly states a price expectation, whether numeric or qualitative.

No meaningful improvement.

Theory 2: schema shape. Maybe the optional-field schema was making it too easy for the model to emit a minimally valid object and stop early. We tried two schema changes:

  1. Add a mandatory field-scan checklist, in case the model was skipping optional fields too quickly.
  2. Add a wrapper schema forcing every field into either extracted or not_mentioned, so that no field could be silently omitted.

Neither fixed it. The wrapper actually made things worse.

In some runs the model confidently classified explicitly stated fields as "not mentioned":

{
  "extracted": {
    "destinations": [{ "type": "fixed", "options": ["Rio"] }],
    "travellers": { "adults": 2 }
  },
  "not_mentioned": [
    "budget",
    "flexible_dates"
  ]
}

Theory 3: conversation history. Maybe assistant history was contaminating extraction. Removing assistant history made the extraction succeed reliably. Reintroducing assistant history caused failures to return.

Initially we thought the model was being confused by numerical values in the assistant response, such as:

  • flight prices
  • EUR amounts
  • durations

But the pattern was more specific than that.

Assistant history                           Result
Full destination suggestions with prices    Fail
Trailing conversational question only       Pass
Numbered destination list without prices    Fail
No assistant history                        Pass

The failure now looked more correlated with the shape of the assistant message than with its size or numerical content: a long conversational paragraph passed, while a short numbered list failed.

We were close to implementing an architectural workaround, stripping suggestion lists before extraction, but first we reran the failing case several times to confirm the pattern. It did not hold. One run passed unexpectedly, then failed again, then passed. Something outside the prompt, schema and visible message history had to be changing.
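For reference, a minimal sketch of what the list-stripping workaround mentioned above could have looked like. It was never shipped, and everything here is an assumption for illustration: the `Message` type, the `LIST_LINE` pattern, and the rule that two or more list lines count as "list-shaped".

```typescript
type Role = "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Matches lines that start with "1." / "2)" or a bullet marker such as "-", "*" or "•".
const LIST_LINE = /^\s*(?:\d+[.)]|[-*•])\s+\S/;

// Treats a message as list-shaped when it contains at least `minItems` list lines.
export function isListShaped(content: string, minItems = 2): boolean {
  const listLines = content.split("\n").filter((line) => LIST_LINE.test(line));
  return listLines.length >= minItems;
}

// Removes list lines from assistant messages only; user messages are left untouched.
export function stripListLines(messages: Message[]): Message[] {
  return messages.map((message) => {
    if (message.role !== "assistant") {
      return message;
    }
    const kept = message.content.split("\n").filter((line) => !LIST_LINE.test(line));
    return { ...message, content: kept.join("\n").trim() };
  });
}
```

A heuristic like this would have hidden the symptom on the failing calls without explaining why identical calls sometimes passed.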


After the prompt and schema theories failed, we started looking for anything that could differ between otherwise identical extraction calls. The application was using OpenRouter with this model:

model: "google/gemini-3-flash-preview"

and latency-based routing:

sort: "latency"

That meant two calls with the same model name, prompt, schema and temperature could still be routed to different upstream providers depending on real-time latency. We were not logging the routed provider (in retrospect, we should have been). We added a small helper:

export function extractProvider(result: unknown): string | null {
  const r = result as {
    providerMetadata?: { openrouter?: { providerName?: string } };
    response?: { body?: { provider?: string } };
  };

  return (
    r?.providerMetadata?.openrouter?.providerName ??
    r?.response?.body?.provider ??
    null
  );
}

Once provider information was visible in logs, the behavior became reproducible almost immediately.
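One way to make the provider part of the request fingerprint is sketched below. This is an illustration under assumptions rather than our production logging: `CallInfo`, `promptVersion`, `schemaVersion` and `toLogRecord` are hypothetical names, and the provider value is expected to come from a helper like the one above.

```typescript
// Hypothetical description of one extraction call; not tied to any SDK type.
interface CallInfo {
  model: string;
  provider: string | null; // routed upstream provider, or null if it could not be read
  promptVersion: string;   // assumed identifier for the prompt revision
  schemaVersion: string;   // assumed identifier for the schema revision
  temperature: number;
}

// Builds a stable key from everything that can change model behaviour,
// including the routed provider. JSON.stringify on an array keeps the
// field order fixed and avoids delimiter ambiguity.
export function requestFingerprint(call: CallInfo): string {
  return JSON.stringify([
    call.model,
    call.provider ?? "unknown",
    call.promptVersion,
    call.schemaVersion,
    call.temperature,
  ]);
}

// Shapes a log record so pass/fail results can later be grouped by fingerprint.
export function toLogRecord(call: CallInfo, passed: boolean) {
  return {
    fingerprint: requestFingerprint(call),
    provider: call.provider ?? "unknown",
    passed,
  };
}
```

Two calls that differ only in routed provider then produce different fingerprints, so they no longer look like reruns of the same request in the logs.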

We reran the evaluation:

  • 5 runs per history shape on each provider (4 shapes, 20 runs per provider)
  • temperature 0
  • identical prompt and schema

Provider              Pass rate
Google AI Studio      20 / 20
Google (Vertex AI)    10 / 20

The Vertex breakdown was more interesting:

Assistant history            Vertex pass rate
Full suggestion list         0 / 5
Trailing question only       5 / 5
Numbered destination list    0 / 5
No assistant history         5 / 5
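Breaking results down by provider and then by history shape, as in the two tables above, takes very little code once the provider is logged. The sketch below is a hypothetical tally helper; `RunRecord` and `tallyBy` are illustrative names, not part of our evaluation harness.

```typescript
// One evaluation run; the field names are assumptions for this sketch.
interface RunRecord {
  provider: string;
  historyShape: string;
  passed: boolean;
}

interface Tally {
  passed: number;
  total: number;
}

// Groups runs by an arbitrary key and counts passes and totals per group.
export function tallyBy(
  runs: RunRecord[],
  key: (run: RunRecord) => string
): Record<string, Tally> {
  const tallies: Record<string, Tally> = {};
  for (const run of runs) {
    const groupKey = key(run);
    const tally = tallies[groupKey] ?? { passed: 0, total: 0 };
    tally.total += 1;
    if (run.passed) {
      tally.passed += 1;
    }
    tallies[groupKey] = tally;
  }
  return tallies;
}

// Formats a tally in the "passed / total" style used in the tables.
export function formatTally(tally: Tally): string {
  return `${tally.passed} / ${tally.total}`;
}
```

Calling `tallyBy(runs, (r) => r.provider)` gives the per-provider view, and `tallyBy(runs, (r) => r.provider + " | " + r.historyShape)` gives the per-shape breakdown.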

So the earlier "history contamination" hypothesis was not entirely wrong; it simply held on only one provider. The same prompt, schema, messages and model identifier produced materially different structured outputs depending on routing. A sanitized failing response looked like this:

{
  "provider": "Google",
  "output": {
    "destinations": [{ "type": "fixed", "options": ["Rio"] }],
    "travellers": { "adults": 2 }
  }
}

The corresponding successful run:

{
  "provider": "Google AI Studio",
  "output": {
    "destinations": [{ "type": "fixed", "options": ["Rio"] }],
    "travel_period": "no specific dates",
    "travellers": { "adults": 2 },
    "budget": "less than 5000 euros",
    "flexible_dates": true
  }
}

The production fix was provider selection, not prompt tuning:

provider: {
  require_parameters: true,
  order: ["Google AI Studio"],
  allow_fallbacks: true,
}

We removed latency-based routing for this extraction workload and explicitly preferred the provider that behaved reliably.

No prompt changes were needed in the final version.
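For context, here is a rough sketch of how the provider preferences above might sit in a full request body. It assumes the OpenRouter chat completions body accepts a `provider` object with these fields; `buildExtractionRequest` and the interfaces are hypothetical names, and an SDK may expose the same options under a different surface.

```typescript
// Subset of provider routing preferences used in this note.
interface ProviderPreferences {
  order?: string[];
  allow_fallbacks?: boolean;
  require_parameters?: boolean;
  sort?: "price" | "throughput" | "latency";
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ExtractionRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
  provider: ProviderPreferences;
}

// Builds the request body for the extraction call. Note that `sort` is
// deliberately left unset, so latency-based routing no longer applies.
export function buildExtractionRequest(messages: ChatMessage[]): ExtractionRequest {
  return {
    model: "google/gemini-3-flash-preview",
    temperature: 0,
    messages,
    provider: {
      require_parameters: true,
      order: ["Google AI Studio"],
      allow_fallbacks: true,
    },
  };
}
```

Because `allow_fallbacks` stays true, a request can presumably still land on another provider when the preferred one is unavailable, so logging the routed provider remains worthwhile even after this change.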


A few things stood out afterwards.

  1. "Same model" is not necessarily the same behavior. Once routing layers are involved, a model name becomes more like an abstract interface than a single inference implementation. Differences in templating, structured-output handling, or decoding behavior can matter a lot more than expected.
  2. Provider observability mattered more than prompt iteration. Most of the debugging time was spent modifying prompts and schemas because we assumed the inference path was stable, when it wasn't.
  3. Preview-tier models seem especially susceptible to this kind of variance. We observed this on a preview Gemini release, and OpenRouter has already written publicly about provider variance, and about Exacto and Auto Exacto, partly for this reason.

The more important observation is simply that provider-level behavioral variance exists at all, and that it can remain invisible unless you log for it explicitly.