Zapcopy QA Stage 1 — Photo + Collection: Flash vs Gemini 2.5 Pro
2026-04-25 · workstream f7 · stage 1 + 1b

Does Gemini 2.5 Pro earn its keep on photo OCR?

First production run of the F1–F6 routing infrastructure. Same prompts, same Lambdas, same iOS app — only modelId and promptId change between paired sweeps. Three single-photo fixtures × 2 models × 2 prompts (normal · strict) × 3 iterations + one 3-photo collection × 2 models × 3 iterations = 42 runs. Stage 1b (the strict dimension) was added after a routing bug was found and fixed: iOS was previously stamping the normal promptId on every request even when strict mode was on. The bridge fix lives in bridgeLaunchArgs + S3UploadService.

Verdict

Flash stays default · Strict ≠ silver bullet on this fixture · Pro+strict is deterministic

recall (words extracted)
No recall lift
Pro extracts effectively the same words as Flash on every fixture. No measurable recall lift.

fidelity (chars · indentation)
Chars +6.0%
Pro preserves whitespace and structure that matters for code OCR: same words, more characters.

latency · cost
2.2× slower
on photo flows. Cost barely moves (1.00×) because image-token cost dominates at our scale.

Recommended action: keep photo and collection on gemini-2.5-flash + photo-ocr-normal-v1 in ocr-routing.json. Strict mode stays opt-in via the iOS Settings toggle. The new finding (stage 1b, below) is that strict doesn't kill the luggage hallucination on either model — but it does buy full determinism on Pro, which is a property worth surfacing for users who want auditable repeatability.
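For illustration, the recommended default could be expressed as in the sketch below. The real ocr-routing.json schema isn't reproduced in this report; `routes`, `modelId`, and `promptId` as field names (and the normal→strict promptId derivation) are assumptions, though the model and prompt values are the ones named above.

```python
# Hypothetical sketch of the recommended ocr-routing.json default.
# Field names are assumed, not taken from the real schema.
import json

ROUTING_JSON = """
{
  "routes": {
    "photo":      {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"},
    "collection": {"modelId": "gemini-2.5-flash", "promptId": "photo-ocr-normal-v1"}
  }
}
"""

def resolve(flow: str, strict: bool = False) -> dict:
    """Resolve a flow to its (modelId, promptId) pair; strict stays opt-in."""
    route = dict(json.loads(ROUTING_JSON)["routes"][flow])
    if strict:
        # Assumption for illustration: strict swaps only the prompt, never the model.
        route["promptId"] = route["promptId"].replace("normal", "strict")
    return route
```

With this shape, `resolve("photo")` keeps the Flash + normal default, and the Settings toggle only flips the promptId.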

F-infrastructure proof

Every run carries Lambda-emitted provenance now (F1) and the routing override flowed through iOS launch args (F4). 0 silent fallbacks, 0 misrouted runs. Stage 1b additionally validates that the photoOCRStrictMode bridge correctly partitions the prompt SHA space — strict and normal runs carry different SHAs with zero leakage.

routing fidelity
✓ 42/42
Lambda-emitted modelId matches the requested override on every run.
prompt SHA partition
1 SHA per mode, no overlap
normal: 5e132458afce…
strict: c60fcca3dad4…
lambda version
rev2
v2-json-strict-capable-rev2
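The partition claim reduces to a disjointness check over the Lambda-emitted promptSha256 values. A minimal sketch, with the run-record shape assumed from the provenance fields named in this report (the truncated SHAs above are used as placeholder data):

```python
# Sketch of the prompt-SHA partition check: strict and normal runs must
# each carry exactly one SHA, with zero leakage between the two modes.
# The record shape is an assumption modelled on the F1 provenance fields.
def sha_partition_ok(runs: list[dict]) -> bool:
    normal = {r["promptSha256"] for r in runs if r["mode"] == "normal"}
    strict = {r["promptSha256"] for r in runs if r["mode"] == "strict"}
    return len(normal) == 1 and len(strict) == 1 and normal.isdisjoint(strict)

runs = (
    [{"mode": "normal", "promptSha256": "5e132458afce"}] * 21
    + [{"mode": "strict", "promptSha256": "c60fcca3dad4"}] * 21
)
```

A single leaked record (a strict run carrying the normal SHA) fails the check, which is what the old iOS stamping bug would have produced.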
Three windows into Pro

Each fixture stresses a different OCR axis. The interesting Pro behaviour shows up in the shape of the text it returns, not the count.

Window 1

Code screens — same words, more structure

AAAA0011 · code_screen_picture.JPG

Both models return exactly 108 words per run — perfect recall parity on a 30-line Python function. But Pro returns +31 chars of indentation and blank-line preservation. For code OCR specifically, that's the difference between "extract the words" and "extract the source."

flash · 2.5 — flat
def is_prime(n):
"""
Checks if an integer 'n' is a prime number.

A prime number is a natural number greater than 1
that has no positive divisors other than 1 and itself.
"""
if n <= 1:
    return Fal…
pro · 2.5 — preserved
def is_prime(n):
    """
    Checks if an integer 'n' is a prime number.

    A prime number is a natural number greater than 1
    that has no positive divisors other than 1 and itself.
    """
    i…
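The gap above is exactly what a split()-based recall metric misses. A minimal sketch of the two measurements, using abbreviated stand-ins for the flat and preserved outputs:

```python
# Word-level recall treats the flat and indented outputs as identical;
# only a character-level count sees the preserved structure. These strings
# are abbreviated stand-ins for the fixture outputs, not the full OCR text.
flat = 'def is_prime(n):\n"""\nChecks if n is prime.\n"""\nif n <= 1:\n    return False\n'
indented = 'def is_prime(n):\n    """\n    Checks if n is prime.\n    """\n    if n <= 1:\n        return False\n'

def words(text: str) -> int:
    # Whitespace-insensitive: collapses all indentation away.
    return len(text.split())

def chars(text: str) -> int:
    # Whitespace-sensitive: indentation and blank lines count.
    return len(text)
```

Both variants score identically on `words`, while `chars` separates them, which is why the report tracks both.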
Window 2

Adversarial — Pro hallucinates less even in normal mode

AAAA0013 · IMG_8213.JPG (luggage box)

This fixture is the strict-mode regression case: a luggage box photo where Gemini fabricates safety-warning text that isn't there. Run in normal mode, both models hallucinate — but Pro fabricates noticeably less. Flash invents 6+ treadmill warnings; Pro invents 2.

flash — 33 avg words · expansive hallucination
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

Warning
Do not step on the side rails
Keep the treadmill stable
Do not allow children to operate
Keep hands away from moving parts

Redlir…
pro — 29 avg words · restrained
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

DO NOT STEP ON THE EDGE OF THE BELT
KEEP THE MACHINE LEVEL
▲ Warning

Redliro

Both outputs are still wrong (the box has luggage warnings, not treadmill warnings). The strict-mode section below tests the obvious next hypothesis: does strict mode actually kill the hallucination? Spoiler — not on this fixture.

Window 3

Multilingual — Pro reads Chinese product copy slightly more thoroughly

AAAA0003 · labubu_box_picture.JPG

Bilingual Chinese + English product label. Pro lifts word count from 152 → 158 (+3.9%) and char count by +34 chars. Marginal — but it's the one fixture where Pro picks up text Flash misses.

Stage 1b — strict mode × model

Same three photo fixtures, same 3 iterations, but with photoOCRStrictMode = YES: iOS sends promptId=photo-ocr-strict-v1 + mode=strict, and the strict-capable Lambda resolves a different prompt. Verified end-to-end via Lambda-emitted promptSha256 — every stage 1b run carries the strict prompt SHA, every stage 1 run carries the normal SHA. Two different SHAs, zero leakage.

For each fixture below, the four cells are Flash · normal, Pro · normal, Flash · strict, Pro · strict. Each cell shows the mean word count across 3 iterations plus the min–max range; a deterministic flag marks cells that produced identical output every iteration (the determinism win we expected from strict).

word count by cell — mean (min–max range) across 3 iterations; ✓ = deterministic, identical output every iteration

Seed · fixture            Flash · normal    Pro · normal      Flash · strict    Pro · strict
AAAA0003 Product label    152 (151–154)     158 (158–158) ✓   151 (151–151) ✓   158 (158–158) ✓
AAAA0011 Code screen      108 (108–108) ✓   108 (108–108) ✓   108 (108–108) ✓   108 (108–108) ✓
AAAA0013 Luggage box      33 (30–35)        29 (26–30)        31 (26–34)        27 (27–27) ✓

Pro returns the same word count every iteration in 5 of 6 cells. Strict didn't cut Pro recall on Labubu (158 → 158) or code (108 → 108). On the luggage adversarial fixture, strict cut Pro from 28.7 → 27.0 mean words and made every iteration identical. Flash + strict on luggage is more variable than Pro + strict: strict alone isn't enough on Flash.
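The deterministic flag reduces to a range check over the per-iteration word counts. A minimal sketch (byte-identical output is the stronger claim the report makes for Pro+strict; equal word counts are only the proxy shown in the grid):

```python
# Per-cell stats as used in the grid: mean word count, min-max range, and a
# determinism proxy (same count every iteration). Iteration values here are
# illustrative, chosen to match the reported means and ranges.
def cell_stats(word_counts: list[int]) -> dict:
    return {
        "mean": round(sum(word_counts) / len(word_counts), 1),
        "range": (min(word_counts), max(word_counts)),
        "deterministic": min(word_counts) == max(word_counts),
    }

# Pro · strict on the luggage fixture: 27 words in all three iterations.
pro_strict_luggage = cell_stats([27, 27, 27])
# Flash · normal on the same fixture: variable across iterations.
flash_normal_luggage = cell_stats([30, 34, 35])
```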

Luggage adversarial — 2×2

Strict didn't kill the hallucination — but Pro+strict made it deterministic

AAAA0013 · IMG_8213.JPG

The going-in hypothesis was that the strict prompt would suppress fabricated treadmill warnings on this luggage box photo. It didn't. All four cells still invent some variant of "do not step on the rails." Strict did change the picture in two real ways: (a) it surfaces more real ink the normal prompt skipped (TSA007, 222), and (b) it makes Pro fully deterministic across 3 iterations — same exact text, same word count, every time. Useful in pipelines where reproducibility matters even when content is wrong.

Flash · normal — 33 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

Warning
Do not step on the side rails
Keep the treadmill stable
Do not allow children to operate
Keep hands away from moving parts

Redliro
Pro · normal — 29 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING

DO NOT STEP ON THE EDGE OF THE BELT
KEEP THE MACHINE LEVEL
▲ Warning

Redliro
Flash · strict — 31 avg words
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING
TSA007
A Warning
Do not step on the side rails
when the machine is running.
Keep children and pets away
from the machine.
Redliro
Pro · strict — 27 avg words deterministic
NEED MORE BOXES?
SCAN HERE OR VISIT US AT
WALMART.COM/MOVING
TSA007
Warning
Do not step on the edge of the
running belt.
Keep the treadmill flat.
Redliro

Highlighted iteration is iter 01 from each cell. The strict variant's gain is real but narrow: it picks up the TSA007 tag the normal prompt skipped, but still invents safety language. The conclusion isn't "strict mode failed"; it's "this fixture is harder than strict mode, and Pro + strict is the right floor for hallucination reduction here."

strict mode impact across all photo fixtures
Flash · strict − Flash · normal
-2.2%
mean word-count change across 3 fixtures. Negative = strict cuts hallucination volume.
Pro · strict − Pro · normal
-1.9%
Pro is already restrained in normal mode, so strict's word-count effect is small except on the luggage adversarial.
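Both pills are means of per-fixture percent changes, computed from the per-cell means above. A sketch; the last decimal is sensitive to whether rounded or raw luggage means are fed in, so small differences from the pills are expected:

```python
# Mean word-count change, strict vs normal, averaged over the three photo
# fixtures. Inputs are (normal, strict) per-cell means from the grid above;
# the luggage cells use the raw means where the report states them.
def mean_pct_change(pairs: list[tuple[float, float]]) -> float:
    return round(sum((s - n) / n for n, s in pairs) / len(pairs) * 100, 1)

flash = mean_pct_change([(152, 151), (108, 108), (33, 31)])
pro = mean_pct_change([(158, 158), (108, 108), (28.7, 27.0)])
```

As expected, the whole effect for both models comes from the luggage adversarial; the other two fixtures contribute zero or near-zero.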
The latency tax

Total iOS-side flow time, min–max range across 3 iterations per cell. Flash bars are green, Pro bars are purple. The wider the Pro bar, the more variable the response time; on the dense code screen, Pro's worst case balloons to 36s.

AAAA0003
Pro/Flash = 1.30×
flash
10s
pro
13s
AAAA0011
Pro/Flash = 3.60×
flash
7.0s
pro
25s
AAAA0013
Pro/Flash = 1.71×
flash
9.6s
pro
16s
CCCC0001 · collection
Pro/Flash = 0.34×
flash
65s
pro
22s

Bars span min–max latency. White tick = mean. The collection mean for Flash includes a slow-network sweep (avg 65s); the per-photo Lambda calls were actually quicker on Pro this round, an artefact at n=3. Photo flows are the cleaner signal.
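Each ratio is a straight division of mean flow times. A sketch, using the rounded means shown above:

```python
# Pro/Flash latency ratio from mean flow times. The collection row flips
# below 1.0 only because the Flash sweep hit a slow network; at n=3 that is
# noise, not a Pro speed win.
def ratio(flash_s: float, pro_s: float) -> float:
    return round(pro_s / flash_s, 2)

labubu = ratio(10, 13)      # AAAA0003, photo flow
collection = ratio(65, 22)  # CCCC0001, skewed by the slow-network sweep
```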

Cost-per-scan vs recall

X axis: $/scan. Y axis: words extracted. Each dot is one fixture × model average. The Pro cluster sits to the right (more expensive per-token) and at roughly the same Y (no recall lift). The cost gap shrinks under image-token amortization but never disappears.

[Scatter: $/scan (x, $0.00000–$0.00017) vs words extracted (y, 0–415); one dot per fixture × model (AAAA0003, AAAA0011, AAAA0013, CCCC0001), Flash vs Pro 2.5.]
All numbers
Seed · fixture                                            Flash·N WC   Pro·N WC   Flash·S WC   Pro·S WC   Flash t   Pro t   Pro/Flash $
AAAA0003 · Product label (Chinese + English)              152          158        151          158        10s       13s     1.02×
AAAA0011 · Code screen (Python prime checker)             108          108        108          108        7.0s      25s     1.03×
AAAA0013 · Luggage box (hallucination regression)         33           29         31           27         9.6s      16s     0.94×
CCCC0001 · Mixed collection (code · grocery · word doc)   365          377        n/a          n/a        65s       22s     1.00×

WC = mean word count across 3 iterations per cell. ·S = strict prompt (photo-ocr-strict-v1). Collection (CCCC0001) only ran the normal pair — strict on collection is queued for stage 2.

Decision + next step

Keep Flash + normal as the default

  • No recall regression vs Pro to fix.
  • 2.2× faster on photos — meaningful at the user's keyboard.
  • Cost barely differs at our scale, so cost isn't the trade.
  • Strict on Flash didn't kill the luggage hallucination — strict alone isn't enough.
  • F3 routing config stays as it is. No re-upload of ocr-routing.json needed.

Where Pro + strict has a real role

  • Determinism. Pro+strict produced byte-identical output on every iteration of the luggage adversarial. That's a property worth offering to power users who care about reproducibility (sweep harness, audit logs).
  • Adversarial inputs only. On normal photos (Labubu, code) strict didn't change anything Pro already did. The win is concentrated on hallucination-prone fixtures.
  • Code OCR. Pro's whitespace fidelity is real (Window 1) and survives strict mode. If we ever expose a "preserve formatting" mode, Pro is the right backend.

Net: the routing infra (F1–F6) is doing its job — every stage 1b run carries the strict prompt SHA, every stage 1 run carries the normal SHA, and the iOS-side bridge bug that was masking strict mode is fixed and verified. The story Stage 1b told us is honest: strict mode is not a magic anti-hallucination switch on hard fixtures, but it does buy determinism on Pro and surfaces real ink the normal prompt skips.

Stage 2 — onward to video

The original Workstream F7 plan was video-polish (Flash vs Pro on the polish stage only). Stage 1 + 1b gives us a base rate: parity recall, structural fidelity, latency cost, and a precise read on what strict mode does and doesn't buy. Open questions for video:

  • Does Pro+strict reduce additional_notes hallucination? The polish stage is exactly the place where Stage 1b's determinism win could matter: polish must not invent corrections.
  • Latency — already a video pain point. Polish runs after frame OCR; an extra 2× there could push end-to-end past push-notification SLOs.
  • Cost — different math. Polish is a text-only stage (no image tokens). Pro's 16× input + 16× output rates would actually bite, unlike here.

Recommend: stage 2 = 1 video × 3 iters × 4 configs (Flash·normal, Flash·strict, Pro·normal, Pro·strict) on the polish stage = 12 runs. Mirrors the stage-1b matrix on a single video and lets us compare polish behaviour symmetrically.
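The proposed matrix is a full cross of model × prompt × iteration on a single video. A sketch of the enumeration (model IDs from this report; the iteration bookkeeping is illustrative):

```python
# Stage-2 sweep matrix: 1 video x 3 iterations x 4 configs = 12 runs,
# mirroring the stage-1b photo matrix on the polish stage.
from itertools import product

MODELS = ["gemini-2.5-flash", "gemini-2.5-pro"]
PROMPTS = ["normal", "strict"]
ITERATIONS = 3

runs = [
    {"model": model, "prompt": prompt, "iter": i}
    for model, prompt, i in product(MODELS, PROMPTS, range(1, ITERATIONS + 1))
]
```

Enumerating the matrix up front keeps the harness symmetric with stage 1b, so the four polish cells can be compared cell-for-cell against the photo results.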