Does Gemini 2.5 Pro earn its keep on photo OCR?
First production run of the F1–F6 routing infrastructure. Same prompts, same Lambdas, same iOS app —
only modelId and promptId change between paired sweeps.
Three single-photo fixtures × 2 models × 2 prompts (normal · strict) × 3 iterations +
one 3-photo collection × 2 models × 3 iterations = 42 runs.
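The run-count arithmetic can be sanity-checked directly (a minimal sketch; variable names are illustrative):

```python
# Stage 1 + 1b run count: photo matrix plus the normal-only collection pair.
photo_runs = 3 * 2 * 2 * 3    # fixtures x models x prompts x iterations
collection_runs = 1 * 2 * 3   # collections x models x iterations (normal prompt only)
total = photo_runs + collection_runs
print(total)  # 42
```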
Stage 1b (the strict dimension) was added after a routing bug was found and fixed: iOS was previously
stamping the normal promptId on every request even when strict mode was on.
The bridge fix lives in bridgeLaunchArgs + S3UploadService.
Verdict
- Flash stays default.
- Strict ≠ silver bullet on this fixture.
- Pro+strict is deterministic.

Recommended action: keep photo and collection on gemini-2.5-flash + photo-ocr-normal-v1 in ocr-routing.json.
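For context, the routing lookup the recommendation refers to can be sketched as a small table keyed by flow and mode. This is a hypothetical schema (the actual shape of ocr-routing.json is not shown in this report); only the model and prompt IDs come from the text above:

```python
# Hypothetical sketch of the ocr-routing.json lookup; the real schema may differ.
ROUTING = {
    ("photo", "normal"):      {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"},
    ("photo", "strict"):      {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-strict-v1"},
    ("collection", "normal"): {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"},
}

def resolve_route(flow: str, mode: str = "normal") -> dict:
    """Resolve (flow, mode) to a model/prompt pair; sweep overrides replace this."""
    return ROUTING[(flow, mode)]

print(resolve_route("photo", "strict")["promptId"])  # photo-ocr-strict-v1
```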
Strict mode stays opt-in via the iOS Settings toggle. The new finding (stage 1b, below) is that strict
doesn't kill the luggage hallucination on either model — but it does buy full determinism on Pro, which
is a property worth surfacing for users who want auditable repeatability.
Every run carries Lambda-emitted provenance now (F1) and the routing override flowed through iOS launch
args (F4). 0 silent fallbacks, 0 misrouted runs. Stage 1b additionally validates that the
photoOCRStrictMode bridge correctly partitions the prompt SHA space —
strict and normal runs carry different SHAs with zero leakage.
modelId matches the requested override on every run.
Strict prompt SHA: c60fcca3dad4…
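The SHA-partition property described above (every strict run carries one prompt SHA, every normal run the other, zero leakage) can be checked mechanically over the run records. A minimal sketch; the record shape is an assumption, and the prompt bodies below are placeholders:

```python
import hashlib

def prompt_sha256(prompt_text: str) -> str:
    """SHA-256 hex digest of the prompt body (mirrors the Lambda-emitted promptSha256)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def assert_sha_partition(runs: list) -> None:
    """Strict and normal runs must each carry exactly one SHA, with zero overlap."""
    strict_shas = {r["promptSha256"] for r in runs if r["mode"] == "strict"}
    normal_shas = {r["promptSha256"] for r in runs if r["mode"] == "normal"}
    assert len(strict_shas) == 1 and len(normal_shas) == 1
    assert strict_shas.isdisjoint(normal_shas)

runs = [
    {"mode": "strict", "promptSha256": prompt_sha256("strict prompt body")},
    {"mode": "normal", "promptSha256": prompt_sha256("normal prompt body")},
    {"mode": "strict", "promptSha256": prompt_sha256("strict prompt body")},
]
assert_sha_partition(runs)  # passes: two distinct SHAs, zero leakage
```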
Each fixture stresses a different OCR axis. The interesting Pro behaviour shows up in the shape of the text it returns, not the count.
Code screens — same words, more structure
AAAA0011 · code_screen_picture.JPG
Both models return exactly 108 words per run — perfect recall parity on a 30-line Python function. But Pro preserves indentation and blank lines that Flash drops (+31 chars). For code OCR specifically, that's the difference between "extract the words" and "extract the source."
Flash:

def is_prime(n):
"""
Checks if an integer 'n' is a prime number.
A prime number is a natural number greater than 1
that has no positive divisors other than 1 and itself.
"""
if n <= 1:
return Fal…

Pro:

def is_prime(n):
    """
    Checks if an integer 'n' is a prime number.
    A prime number is a natural number greater than 1
    that has no positive divisors other than 1 and itself.
    """
    i…

Adversarial — Pro hallucinates less even in normal mode
AAAA0013 · IMG_8213.JPG (luggage box)
This fixture is the strict-mode regression case: a luggage box photo where Gemini fabricates safety-warning text that isn't there. Run in normal mode, both models hallucinate — but Pro fabricates noticeably less. Flash invents 6+ treadmill warnings; Pro invents 2.
Flash: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING Warning Do not step on the side rails Keep the treadmill stable Do not allow children to operate Keep hands away from moving parts Redlir…
Pro: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING DO NOT STEP ON THE EDGE OF THE BELT KEEP THE MACHINE LEVEL ▲ Warning Redliro
Both outputs are still wrong (the box has luggage warnings, not treadmill warnings). The strict-mode section below tests the obvious next hypothesis: does strict mode actually kill the hallucination? Spoiler — not on this fixture.
Multilingual — Pro reads Chinese product copy slightly more thoroughly
AAAA0003 · labubu_box_picture.JPG
Bilingual Chinese + English product label. Pro lifts word count from 152 → 158 (+3.9%) and char count by +34 chars. Marginal — but it's the one fixture where Pro picks up text Flash misses.
Same three photo fixtures, same 3 iterations, but with photoOCRStrictMode = YES:
iOS sends promptId=photo-ocr-strict-v1 + mode=strict,
and the strict-capable Lambda resolves a different prompt. Verified end-to-end via Lambda-emitted
promptSha256 — every stage 1b run carries the strict prompt SHA, every
stage 1 run carries the normal SHA. Two different SHAs, zero leakage.
For each fixture below, the four cells are: Flash · normal, Pro · normal, Flash · strict, Pro · strict. The bottom number is mean word count across 3 iterations; the deviation pill flags which cells produced identical output every iteration (the determinism win we expected from strict).
Pro returns the same word count every iteration in 5 of 6 cells (deterministic pill). Strict didn't cut Pro recall on Labubu (158 → 158) or code (108 → 108). On the luggage adversarial fixture, strict cut Pro from 28.7 → 27.0 mean words, and made every iteration identical. Flash + strict on luggage is more variable than Pro + strict — strict alone isn't enough on Flash.
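The "deterministic cell" check behind that pill is simple to state: a cell is deterministic when all of its iterations produced byte-identical output. A minimal sketch (record shape and the sample values are illustrative):

```python
from collections import defaultdict

def deterministic_cells(runs):
    """Return the set of (fixture, model, prompt) cells whose iterations
    all produced byte-identical output text."""
    by_cell = defaultdict(set)
    for r in runs:
        by_cell[(r["fixture"], r["model"], r["prompt"])].add(r["output"])
    return {cell for cell, outs in by_cell.items() if len(outs) == 1}

# Illustrative runs: Pro+strict identical every iteration, Flash+strict varying.
runs = [
    {"fixture": "AAAA0013", "model": "pro", "prompt": "strict", "output": "same text"},
    {"fixture": "AAAA0013", "model": "pro", "prompt": "strict", "output": "same text"},
    {"fixture": "AAAA0013", "model": "flash", "prompt": "strict", "output": "variant a"},
    {"fixture": "AAAA0013", "model": "flash", "prompt": "strict", "output": "variant b"},
]
print(deterministic_cells(runs))  # {('AAAA0013', 'pro', 'strict')}
```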
Strict didn't kill the hallucination — but Pro+strict made it deterministic
AAAA0013 · IMG_8213.JPG
The going-in hypothesis was that the strict prompt would suppress fabricated treadmill warnings on
this luggage box photo. It didn't. All four cells still invent some variant of "do not step on the
rails." Strict did change the picture in two real ways: (a) it surfaces more real ink the
normal prompt skipped (TSA007, 222),
and (b) it makes Pro fully deterministic across 3 iterations — same exact text, same word count,
every time. Useful in pipelines where reproducibility matters even when content is wrong.
Flash · normal: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING Warning Do not step on the side rails Keep the treadmill stable Do not allow children to operate Keep hands away from moving parts Redliro
Pro · normal: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING DO NOT STEP ON THE EDGE OF THE BELT KEEP THE MACHINE LEVEL ▲ Warning Redliro
Flash · strict: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING TSA007 A Warning Do not step on the side rails when the machine is running. Keep children and pets away from the machine. Redliro
Pro · strict: NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING TSA007 Warning Do not step on the edge of the running belt. Keep the treadmill flat. Redliro
Highlighted iteration is iter 01 from each cell — strict outputs are bordered in violet. The strict
variant's gain is real but narrow: it picks up the TSA007 tag the
normal prompt skipped, but still invents safety language. The conclusion isn't "strict mode failed" —
it's "this fixture is harder than strict mode, and Pro + strict is the right floor for hallucination
reduction here."
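One way to quantify "still invents safety language" is a crude token-level proxy: flag output tokens that never appear in a reference transcript of the real ink. This is a sketch, not the harness's actual scorer, and the reference string below is illustrative rather than the fixture's full transcript:

```python
import re

def fabricated_tokens(output, ground_truth):
    """Naive hallucination proxy: lowercase word tokens in the OCR output
    absent from the reference transcript. A real scorer would need fuzzy
    matching and alignment; this only flags candidates."""
    tok = lambda s: set(re.findall(r"[a-z0-9']+", s.lower()))
    return tok(output) - tok(ground_truth)

# Illustrative reference built from the real ink named above (not the full transcript).
truth = "NEED MORE BOXES? SCAN HERE OR VISIT US AT WALMART.COM/MOVING TSA007 222 Redliro"

print(sorted(fabricated_tokens("Keep the treadmill flat", truth)))
# ['flat', 'keep', 'the', 'treadmill']
```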
Total iOS-side flow time, min–max range across 3 iterations per cell. Flash bars are green, Pro bars are purple. The wider the Pro bar, the more variable the response time — and on the dense code screen we see Pro variance balloon to 36s in the worst case.
Bars span min–max latency. White tick = mean. The collection mean for Flash includes a slow-network sweep (avg 65s); the per-photo Lambda calls were actually quicker on Pro this round, an artefact at n=3. Photo flows are the cleaner signal.
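The per-cell summary those bars encode is just min, max, and mean over the iteration samples. A minimal sketch; the sample values are illustrative numbers chosen to be consistent with the 25s table mean and 36s worst case called out above (the real per-iteration values aren't listed):

```python
from statistics import mean

def latency_bar(samples_s):
    """Summarise one model/fixture cell the way the chart draws it:
    a min-max span with a mean tick."""
    return {"min": min(samples_s), "max": max(samples_s), "mean": round(mean(samples_s), 1)}

# Illustrative Pro samples on the code screen fixture.
print(latency_bar([14.0, 25.0, 36.0]))  # {'min': 14.0, 'max': 36.0, 'mean': 25.0}
```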
X axis: $/scan. Y axis: words extracted. Each dot is one fixture × model average. The Pro cluster sits to the right (more expensive per-token) and at roughly the same Y (no recall lift). The cost gap shrinks under image-token amortization but never disappears.
| Seed · fixture | Flash·N WC | Pro·N WC | Flash·S WC | Pro·S WC | Flash latency | Pro latency | Pro/Flash $ |
|---|---|---|---|---|---|---|---|
| AAAA0003 Product label (Chinese + English) | 152 | 158 | 151 | 158 | 10s | 13s | 1.02× |
| AAAA0011 Code screen (Python prime checker) | 108 | 108 | 108 | 108 | 7.0s | 25s | 1.03× |
| AAAA0013 Luggage box (hallucination regression) | 33 | 29 | 31 | 27 | 9.6s | 16s | 0.94× |
| CCCC0001 Mixed collection (code · grocery · word doc) | 365 | 377 | — | — | 65s | 22s | 1.00× |
WC = mean word count across 3 iterations per cell. ·S = strict prompt
(photo-ocr-strict-v1). Collection (CCCC0001) only ran the normal pair —
strict on collection is queued for stage 2.
Keep Flash + normal as the default
- No recall regression vs Pro to fix.
- 2.2× faster on photos — meaningful at the user's keyboard.
- Cost barely differs at our scale, so cost isn't the trade.
- Strict on Flash didn't kill the luggage hallucination — strict alone isn't enough.
- F3 routing config stays as it is. No re-upload of ocr-routing.json needed.
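The 2.2× figure is consistent with the mean of the per-fixture Pro/Flash latency ratios from the results table; a minimal sketch under that reading:

```python
from statistics import mean

# Per-fixture mean photo latencies from the results table (seconds).
flash_s = {"AAAA0003": 10.0, "AAAA0011": 7.0, "AAAA0013": 9.6}
pro_s   = {"AAAA0003": 13.0, "AAAA0011": 25.0, "AAAA0013": 16.0}

# Mean of per-fixture Pro/Flash ratios (1.3x, ~3.6x, ~1.7x).
slowdown = mean(pro_s[k] / flash_s[k] for k in flash_s)
print(round(slowdown, 1))  # 2.2
```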
Where Pro + strict has a real role
- Determinism. Pro+strict produced byte-identical output on every iteration of the luggage adversarial. That's a property worth offering to power users who care about reproducibility (sweep harness, audit logs).
- Adversarial inputs only. On normal photos (Labubu, code) strict didn't change anything Pro already did. The win is concentrated on hallucination-prone fixtures.
- Code OCR. Pro's whitespace fidelity is real (Window 1) and survives strict mode. If we ever expose a "preserve formatting" mode, Pro is the right backend.
Net: the routing infra (F1–F6) is doing its job — every stage 1b run carries the strict prompt SHA, every stage 1 run carries the normal SHA, and the iOS-side bridge bug that was masking strict mode is fixed and verified. The story Stage 1b told us is honest: strict mode is not a magic anti-hallucination switch on hard fixtures, but it does buy determinism on Pro and surfaces real ink the normal prompt skips.
The original Workstream F7 plan was video-polish (Flash vs Pro on the polish stage only). Stage 1 + 1b gives us a base rate: parity recall, structural fidelity, latency cost, and a precise read on what strict mode does and doesn't buy. Open questions for video:
- Does Pro+strict reduce additional_notes hallucination? The polish stage is exactly the place where Stage 1b's determinism win could matter — polish has to not invent corrections.
- Latency — already a video pain point. Polish runs after frame OCR; an extra 2× there could push end-to-end past push-notification SLOs.
- Cost — different math. Polish is a text-only stage (no image tokens). Pro's 16× input + 16× output rates would actually bite, unlike here.
Recommend: stage 2 = 1 video × 3 iters × 4 configs (Flash·normal, Flash·strict, Pro·normal, Pro·strict) on the polish stage = 12 runs. Mirrors the stage-1b matrix on a single video and lets us compare polish behaviour symmetrically.