Zapcopy QA Stage 1 — Photo + Collection: Flash vs Gemini 2.5 Pro
2026-04-25 · workstream f7 · stage 1 + 1b

Does Gemini 2.5 Pro earn its keep on photo OCR?

First production run of the F1–F6 routing infrastructure. Same prompts, same Lambdas, same iOS app — only modelId and promptId change between paired sweeps. Three single-photo fixtures × 2 models × 2 prompts (normal · strict) × 3 iterations + one 3-photo collection × 2 models × 3 iterations = 42 runs. Stage 1b (the strict dimension) was added after a routing bug was found and fixed: iOS was previously stamping the normal promptId on every request even when strict mode was on. The bridge fix lives in bridgeLaunchArgs + S3UploadService.

Verdict

Flash stays default · Strict ≠ silver bullet on this fixture · Pro+strict is deterministic

recall (words extracted)
No recall lift
Pro extracts effectively the same words as Flash on every fixture. No measurable recall lift.

fidelity (chars · indentation)
Chars +6.0%
Pro preserves whitespace and structure that matters for code OCR: same words, more characters.

latency · cost
2.2× slower
on photo flows. Cost barely moves (1.00×) because image-token cost dominates at our scale.

Recommended action: keep photo and collection on gemini-2.5-flash + photo-ocr-normal-v1 in ocr-routing.json. Strict mode stays opt-in via the iOS Settings toggle. The new finding (stage 1b, below) is that strict doesn't kill the luggage hallucination on either model — but it does buy full determinism on Pro, which is a property worth surfacing for users who want auditable repeatability.
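For illustration, the recommended default could be expressed as in the sketch below. The real ocr-routing.json schema isn't reproduced in this report; `routes`, `modelId`, and `promptId` as field names (and the normal→strict promptId derivation) are assumptions, though the model and prompt values are the ones named above.

```python
# Hypothetical sketch of the recommended ocr-routing.json default.
# Field names are assumed, not taken from the real schema.
import json

ROUTING_JSON = """
{
  "routes": {
    "photo":      {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"},
    "collection": {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"}
  }
}
"""

def resolve(flow: str, strict: bool = False) -> dict:
    """Resolve a flow to its (modelId, promptId) pair; strict stays opt-in."""
    route = dict(json.loads(ROUTING_JSON)["routes"][flow])
    if strict:
        # Assumption for illustration: strict swaps only the prompt, never the model.
        route["promptId"] = route["promptId"].replace("normal", "strict")
    return route
```

With this shape, `resolve("photo")` keeps the Flash + normal default, and the Settings toggle only flips the promptId.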

F-infrastructure proof

Every run carries Lambda-emitted provenance now (F1) and the routing override flowed through iOS launch args (F4). 0 silent fallbacks, 0 misrouted runs. Stage 1b additionally validates that the photoOCRStrictMode bridge correctly partitions the prompt SHA space — strict and normal runs carry different SHAs with zero leakage.

routing fidelity
✓ 42/42
Lambda-emitted modelId matches the requested override on every run.
prompt SHA partition
1 SHA per mode, no overlap
normal: 5e132458afce…
strict: c60fcca3dad4…
lambda version
rev2
v2-json-strict-capable-rev2
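The partition claim reduces to a disjointness check over the Lambda-emitted promptSha256 values. A minimal sketch, with the run-record shape assumed from the provenance fields named in this report (the truncated SHAs above are used as placeholder data):

```python
# Sketch of the prompt-SHA partition check: strict and normal runs must
# each carry exactly one SHA, with zero leakage between the two modes.
# The record shape is an assumption modelled on the F1 provenance fields.
def sha_partition_ok(runs: list[dict]) -> bool:
    normal = {r["promptSha256"] for r in runs if r["mode"] == "normal"}
    strict = {r["promptSha256"] for r in runs if r["mode"] == "strict"}
    return len(normal) == 1 and len(strict) == 1 and normal.isdisjoint(strict)

runs = (
    [{"mode": "normal", "promptSha256": "5e132458afce"}] * 21
    + [{"mode": "strict", "promptSha256": "c60fcca3dad4"}] * 21
)
```

A single leaked record (a strict run carrying the normal SHA) fails the check, which is what the old iOS stamping bug would have produced.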
Three windows into Pro

Each fixture stresses a different OCR axis. The interesting Pro behaviour shows up in the shape of the text it returns, not the count.

Window 1

Code screens — same words, more structure

AAAA0011 · code_screen_picture.JPG

Both models return exactly 108 words per run — perfect recall parity on a 30-line Python function. But Pro returns +31 chars of indentation and blank-line preservation. For code OCR specifically, that's the difference between "extract the words" and "extract the source."

flash · 2.5 — flat
def is_prime(n):
"""
Checks if an integer 'n' is a prime number.

A prime number is a natural number greater than 1
that has no positive divisors other than 1 and itself.
"""
if n <= 1:
    return Fal…
pro · 2.5 — preserved
def is_prime(n):
    """
    Checks if an integer 'n' is a prime number.

    A prime number is a natural number greater than 1
    that has no positive divisors other than 1 and itself.
    """
    i…
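The gap above is exactly what a split()-based recall metric misses. A minimal sketch of the two measurements, using abbreviated stand-ins for the flat and preserved outputs:

```python
# Word-level recall treats the flat and indented outputs as identical;
# only a character-level count sees the preserved structure. These strings
# are abbreviated stand-ins for the fixture outputs, not the full OCR text.
flat = 'def is_prime(n):\n"""\nChecks if n is prime.\n"""\nif n <= 1:\n    return False\n'
indented = 'def is_prime(n):\n    """\n    Checks if n is prime.\n    """\n    if n <= 1:\n        return False\n'

def words(text: str) -> int:
    # Whitespace-insensitive: collapses all indentation away.
    return len(text.split())

def chars(text: str) -> int:
    # Whitespace-sensitive: indentation and blank lines count.
    return len(text)
```

Both variants score identically on `words`, while `chars` separates them, which is why the report tracks both.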
Window 2

Adversarial — Pro hallucinates less even in normal mode

AAAA0013 · IMG_8213.JPG (luggage box)

This fixture is the strict-mode regression case: a luggage box photo where Gemini fabricates safety-warning text that isn't there. Run in normal mode, both models hallucinate — but Pro fabricates noticeably less. Flash invents 6+ treadmill warnings; Pro invents 2.

flash — 33 avg words · expansive hallucination
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

Warning
Do not step on the side rails
Keep the treadmill stable
Do not allow children to operate
Keep hands away from moving parts

Redlir…
pro — 29 avg words · restrained
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

DO NOT STEP ON THE EDGE OF THE BELT
KEEP THE MACHINE LEVEL
▲ Warning

Redliro

Both outputs are still wrong (the box has luggage warnings, not treadmill warnings). The strict-mode section below tests the obvious next hypothesis: does strict mode actually kill the hallucination? Spoiler — not on this fixture.

Window 3

Multilingual — Pro reads Chinese product copy slightly more thoroughly

AAAA0003 · labubu_box_picture.JPG

Bilingual Chinese + English product label. Pro lifts word count from 152 → 158 (+3.9%) and char count by +34 chars. Marginal — but it's the one fixture where Pro picks up text Flash misses.

Stage 1b — strict mode × model

Same three photo fixtures, same 3 iterations, but with photoOCRStrictMode = YES: iOS sends promptId=photo-ocr-strict-v1 + mode=strict, and the strict-capable Lambda resolves a different prompt. Verified end-to-end via Lambda-emitted promptSha256 — every stage 1b run carries the strict prompt SHA, every stage 1 run carries the normal SHA. Two different SHAs, zero leakage.

For each fixture below, the four cells are Flash · normal, Pro · normal, Flash · strict, Pro · strict. Each cell shows the mean word count across 3 iterations plus the min–max range; a deterministic flag marks cells that produced identical output every iteration (the determinism win we expected from strict).

word count by cell — mean (min–max range) across 3 iterations; ✓ = deterministic, identical output every iteration

Seed · fixture            Flash · normal    Pro · normal      Flash · strict    Pro · strict
AAAA0003 Product label    152 (151–154)     158 (158–158) ✓   151 (151–151) ✓   158 (158–158) ✓
AAAA0011 Code screen      108 (108–108) ✓   108 (108–108) ✓   108 (108–108) ✓   108 (108–108) ✓
AAAA0013 Luggage box      33 (30–35)        29 (26–30)        31 (26–34)        27 (27–27) ✓

Pro returns the same word count every iteration in 5 of 6 cells. Strict didn't cut Pro recall on Labubu (158 → 158) or code (108 → 108). On the luggage adversarial fixture, strict cut Pro from 28.7 → 27.0 mean words and made every iteration identical. Flash + strict on luggage is more variable than Pro + strict: strict alone isn't enough on Flash.
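The deterministic flag reduces to a range check over the per-iteration word counts. A minimal sketch (byte-identical output is the stronger claim the report makes for Pro+strict; equal word counts are only the proxy shown in the grid):

```python
# Per-cell stats as used in the grid: mean word count, min-max range, and a
# determinism proxy (same count every iteration). Iteration values here are
# illustrative, chosen to match the reported means and ranges.
def cell_stats(word_counts: list[int]) -> dict:
    return {
        "mean": round(sum(word_counts) / len(word_counts), 1),
        "range": (min(word_counts), max(word_counts)),
        "deterministic": min(word_counts) == max(word_counts),
    }

# Pro · strict on the luggage fixture: 27 words in all three iterations.
pro_strict_luggage = cell_stats([27, 27, 27])
# Flash · normal on the same fixture: variable across iterations.
flash_normal_luggage = cell_stats([30, 34, 35])
```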

Luggage adversarial — 2×2

Strict didn't kill the hallucination — but Pro+strict made it deterministic

AAAA0013 · IMG_8213.JPG

The going-in hypothesis was that the strict prompt would suppress fabricated treadmill warnings on this luggage box photo. It didn't. All four cells still invent some variant of "do not step on the rails." Strict did change the picture in two real ways: (a) it surfaces more real ink the normal prompt skipped (TSA007, 222), and (b) it makes Pro fully deterministic across 3 iterations — same exact text, same word count, every time. Useful in pipelines where reproducibility matters even when content is wrong.

Flash · normal — 33 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

Warning
Do not step on the side rails
Keep the treadmill stable
Do not allow children to operate
Keep hands away from moving parts

Redliro
Pro · normal — 29 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

DO NOT STEP ON THE EDGE OF THE BELT
KEEP THE MACHINE LEVEL
▲ Warning

Redliro
Flash · strict — 31 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING
TSA007
A Warning
Do not step on the side rails
when the machine is running.
Keep children and pets away
from the machine.
Redliro
Pro · strict — 27 avg words deterministic
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING
TSA007
Warning
Do not step on the edge of the
running belt.
Keep the treadmill flat.
Redliro

Highlighted iteration is iter 01 from each cell. The strict variant's gain is real but narrow: it picks up the TSA007 tag the normal prompt skipped, but still invents safety language. The conclusion isn't "strict mode failed"; it's "this fixture is harder than strict mode, and Pro + strict is the right floor for hallucination reduction here."

strict mode impact across all photo fixtures
Flash · strict − Flash · normal
-2.2%
mean word-count change across 3 fixtures. Negative = strict cuts hallucination volume.
Pro · strict − Pro · normal
-1.9%
Pro is already restrained in normal mode, so strict's word-count effect is small except on the luggage adversarial.
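Both pills are means of per-fixture percent changes, computed from the per-cell means above. A sketch; the last decimal is sensitive to whether rounded or raw luggage means are fed in, so small differences from the pills are expected:

```python
# Mean word-count change, strict vs normal, averaged over the three photo
# fixtures. Inputs are (normal, strict) per-cell means from the grid above;
# the luggage cells use the raw means where the report states them.
def mean_pct_change(pairs: list[tuple[float, float]]) -> float:
    return round(sum((s - n) / n for n, s in pairs) / len(pairs) * 100, 1)

flash = mean_pct_change([(152, 151), (108, 108), (33, 31)])
pro = mean_pct_change([(158, 158), (108, 108), (28.7, 27.0)])
```

As expected, the whole effect for both models comes from the luggage adversarial; the other two fixtures contribute zero or near-zero.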
The latency tax

Total iOS-side flow time, min–max range across 3 iterations per cell. Flash bars are green, Pro bars are purple. The wider the Pro bar, the more variable the response time; on the dense code screen, Pro's worst case balloons to 36s.

AAAA0003
Pro/Flash = 1.30×
flash
10s
pro
13s
AAAA0011
Pro/Flash = 3.60×
flash
7.0s
pro
25s
AAAA0013
Pro/Flash = 1.71×
flash
9.6s
pro
16s
CCCC0001 · collection
Pro/Flash = 0.34×
flash
65s
pro
22s

Bars span min–max latency. White tick = mean. The collection mean for Flash includes a slow-network sweep (avg 65s); the per-photo Lambda calls were actually quicker on Pro this round, an artefact at n=3. Photo flows are the cleaner signal.
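Each ratio is a straight division of mean flow times. A sketch, using the rounded means shown above:

```python
# Pro/Flash latency ratio from mean flow times. The collection row flips
# below 1.0 only because the Flash sweep hit a slow network; at n=3 that is
# noise, not a Pro speed win.
def ratio(flash_s: float, pro_s: float) -> float:
    return round(pro_s / flash_s, 2)

labubu = ratio(10, 13)      # AAAA0003, photo flow
collection = ratio(65, 22)  # CCCC0001, skewed by the slow-network sweep
```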

Cost-per-scan vs recall

X axis: $/scan. Y axis: words extracted. Each dot is one fixture × model average. The Pro cluster sits to the right (more expensive per-token) and at roughly the same Y (no recall lift). The cost gap shrinks under image-token amortization but never disappears.

[Scatter: $/scan (x, $0.00000–$0.00017) vs words extracted (y, 0–415); one dot per fixture × model (AAAA0003, AAAA0011, AAAA0013, CCCC0001), Flash vs Pro 2.5.]
All numbers
Seed · fixture                                            Flash·N WC   Pro·N WC   Flash·S WC   Pro·S WC   Flash t   Pro t   Pro/Flash $
AAAA0003 · Product label (Chinese + English)              152          158        151          158        10s       13s     1.02×
AAAA0011 · Code screen (Python prime checker)             108          108        108          108        7.0s      25s     1.03×
AAAA0013 · Luggage box (hallucination regression)         33           29         31           27         9.6s      16s     0.94×
CCCC0001 · Mixed collection (code · grocery · word doc)   365          377        n/a          n/a        65s       22s     1.00×

WC = mean word count across 3 iterations per cell. ·S = strict prompt (photo-ocr-strict-v1). Collection (CCCC0001) only ran the normal pair — strict on collection is queued for stage 2.

Decision + next step

Keep Flash + normal as the default

  • No recall regression vs Pro to fix.
  • 2.2× faster on photos — meaningful at the user's keyboard.
  • Cost barely differs at our scale, so cost isn't the trade.
  • Strict on Flash didn't kill the luggage hallucination — strict alone isn't enough.
  • F3 routing config stays as it is. No re-upload of ocr-routing.json needed.

Where Pro + strict has a real role

  • Determinism. Pro+strict produced byte-identical output on every iteration of the luggage adversarial. That's a property worth offering to power users who care about reproducibility (sweep harness, audit logs).
  • Adversarial inputs only. On normal photos (Labubu, code) strict didn't change anything Pro already did. The win is concentrated on hallucination-prone fixtures.
  • Code OCR. Pro's whitespace fidelity is real (Window 1) and survives strict mode. If we ever expose a "preserve formatting" mode, Pro is the right backend.

Net: the routing infra (F1–F6) is doing its job — every stage 1b run carries the strict prompt SHA, every stage 1 run carries the normal SHA, and the iOS-side bridge bug that was masking strict mode is fixed and verified. The story Stage 1b told us is honest: strict mode is not a magic anti-hallucination switch on hard fixtures, but it does buy determinism on Pro and surfaces real ink the normal prompt skips.

Stage 2 — onward to video

The original Workstream F7 plan was video-polish (Flash vs Pro on the polish stage only). Stage 1 + 1b gives us a base rate: parity recall, structural fidelity, latency cost, and a precise read on what strict mode does and doesn't buy. Open questions for video:

  • Does Pro+strict reduce additional_notes hallucination? The polish stage is exactly the place where Stage 1b's determinism win could matter: polish must not invent corrections.
  • Latency — already a video pain point. Polish runs after frame OCR; an extra 2× there could push end-to-end past push-notification SLOs.
  • Cost — different math. Polish is a text-only stage (no image tokens). Pro's 16× input + 16× output rates would actually bite, unlike here.

Recommend: stage 2 = 1 video × 3 iters × 4 configs (Flash·normal, Flash·strict, Pro·normal, Pro·strict) on the polish stage = 12 runs. Mirrors the stage-1b matrix on a single video and lets us compare polish behaviour symmetrically.
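The proposed matrix is a full cross of model × prompt × iteration on a single video. A sketch of the enumeration (model IDs from this report; the iteration bookkeeping is illustrative):

```python
# Stage-2 sweep matrix: 1 video x 3 iterations x 4 configs = 12 runs,
# mirroring the stage-1b photo matrix on the polish stage.
from itertools import product

MODELS = ["gemini-2.5-flash", "gemini-2.5-pro"]
PROMPTS = ["normal", "strict"]
ITERATIONS = 3

runs = [
    {"model": model, "prompt": prompt, "iter": i}
    for model, prompt, i in product(MODELS, PROMPTS, range(1, ITERATIONS + 1))
]
```

Enumerating the matrix up front keeps the harness symmetric with stage 1b, so the four polish cells can be compared cell-for-cell against the photo results.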