Zapcopy QA Report — v1 vs v2 OCR paired sweep

Photo-only paired comparison between the legacy Gemini OCR Lambda (free-form text response) and the new v2 Lambda (structured JSON via Gemini responseSchema). Same 4 photos — two baseline (AAAA0003, AAAA0004) and two fence-prone adversarial fixtures (AAAA0011 code screen, AAAA0012 word-doc screenshot) — run 3 iterations each through both Lambdas.

Batches: v1 2026-04-21T14-32Z, v2 2026-04-21T14-35Z. Both tagged *-paired-adversarial. 12 runs per variant, 24 total.

| Check | Result | Note |
| --- | --- | --- |
| v2 raw JSON captured | 12/12 | every v2 run emits a full Gemini envelope; v1 has no such field |
| v2 rawResponse parseable | 12/12 | 100% parse as JSON with ocr_text key — structural guarantee holds |
| Fence leaks in either variant | 0/24 | end-to-end defense (sanitizer + responseSchema) both holding |
| Word-count parity | 1.4% | mean \|Δ words\| v2-vs-v1 across 4 seeds — no quality regression |

1. Aggregate metrics

| Metric | v1 | v2 | Reading |
| --- | --- | --- | --- |
| Runs | 12 | 12 | 3 iters × 4 photos each |
| lambdaVariant distribution | {"v1":12} | {"v2":12} | ingest-time prefix detection works |
| rawResponse present | 0/12 | 12/12 | v1 Lambda doesn't emit one; v2 always does |
| rawResponse parseable | n/a | 12/12 | JSON.parse succeeds + contains ocr_text |
| ocrText contains triple-backtick | 0/12 | 0/12 | client-side OCRSanitizer defense holding on both paths |
| rows with failing invariants | 0/12 | 0/12 | 0 failures across all 24 flows |
| avg flow elapsed | 8.81s | 7.63s | v2 trending faster; weak signal at n=12 |
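
The two parse-level checks in the table (rawResponse present and parseable, with the required ocr_text key) amount to one small structural invariant. A minimal sketch in TypeScript, with illustrative names rather than the actual dashboard code:

```typescript
// Sketch of the rawResponse-parseable invariant: the raw Gemini
// envelope must be valid JSON, must be an object, and must carry a
// string ocr_text key. Function and type names are illustrative.
interface ParseCheck {
  parseable: boolean;
  reason?: string;
}

function checkRawResponseParseable(rawResponse: string): ParseCheck {
  let envelope: unknown;
  try {
    envelope = JSON.parse(rawResponse);
  } catch {
    return { parseable: false, reason: "JSON.parse failed" };
  }
  if (typeof envelope !== "object" || envelope === null) {
    return { parseable: false, reason: "envelope is not an object" };
  }
  const ocrText = (envelope as Record<string, unknown>)["ocr_text"];
  if (typeof ocrText !== "string") {
    return { parseable: false, reason: "missing ocr_text key" };
  }
  return { parseable: true };
}
```

A run fails the invariant on any of the three exits, which is exactly the "12/12 parseable" condition the sweep verified.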
2. Per-seed paired comparison

| Seed | v1 avg words | v2 avg words | Δ words | v1 fence rate | v2 fence rate | v1 elapsed | v2 elapsed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AAAA0003 | 158 | 152 | -3.6% | 0/3 | 0/3 | 9.68s | 9.59s |
| AAAA0004 | 80 | 78 | -2.1% | 0/3 | 0/3 | 9.00s | 8.50s |
| AAAA0011 | 108 | 108 | 0 | 0/3 | 0/3 | 7.95s | 5.04s |
| AAAA0012 | 257 | 257 | 0 | 0/3 | 0/3 | 8.62s | 7.37s |

Word counts are within a few percent per seed — no quality regression. Adversarial seeds (AAAA0011 code screen, AAAA0012 word doc) extract identical word counts across both variants, suggesting structured output mode hasn't cost us any recall on fence-prone content.
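
For clarity, the per-seed Δ-words column and the headline 1.4% parity figure follow from a simple calculation: per-seed Δ% is (v2 - v1) / v1, and the summary metric is the mean of the absolute per-seed deltas. A sketch with hypothetical helper names:

```typescript
// Per-seed word-count delta as a percentage of the v1 average.
function deltaPercent(v1Words: number, v2Words: number): number {
  return ((v2Words - v1Words) / v1Words) * 100;
}

// Headline parity metric: mean of |Δ%| across all seeds.
function meanAbsDelta(deltas: number[]): number {
  const total = deltas.reduce((sum, d) => sum + Math.abs(d), 0);
  return total / deltas.length;
}
```

Plugging in the reported per-seed deltas (-3.6, -2.1, 0, 0) gives 1.425, which rounds to the 1.4% figure in the summary.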

3. Observability layer (v2 only)

The whole reason for persisting rawResponse is so future regressions are observable — if Gemini ever produces a commentary-wrapped envelope or drops the required ocr_text key, the response-parseable invariant fails and we see it in the dashboard. Today, every v2 run looks like this:

v2 rawResponse (AAAA0011 — the prime-number Python leak case, first iter)
{"ocr_text": "def is_prime(n):\n\"\"\"\nChecks if an integer 'n' is a prime number.\n\nA prime number is a natural number greater than 1\nthat has no positive divisors other than 1 and itself.\n\"\"\"\nif n <= 1:\n    return False\n# Check for factors from 2 up to the square root of n\nfor i in range(2, int(math.sqrt(n)) + 1):\n    if n % i == 0:\n        return False\nreturn True\n\ndef find_first_n_primes(count):\n\"\"\"\nFinds a specified count of the first prime numbers.\n\"\"\"\nprimes = []…

The envelope contains exactly the two keys the schema promised: ocr_text (raw extraction) and notes (moderation/context). No commentary preamble, no fencing, no apologies — the structural guarantee is working at the source, which is why the client-side sanitizer has no cleanup to do on v2.
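
A stricter form of that check, verifying the parsed envelope carries exactly the two promised keys and nothing else, could look like the following sketch (an illustrative validator, not the deployed one; only the key names ocr_text and notes come from the report):

```typescript
// Sketch: a v2 envelope matches the responseSchema contract when it
// has exactly the keys ocr_text and notes, both strings. Any extra,
// missing, or non-string key is a schema drift signal.
function matchesSchema(envelope: Record<string, unknown>): boolean {
  const keys = Object.keys(envelope).sort();
  return (
    keys.length === 2 &&
    keys[0] === "notes" &&
    keys[1] === "ocr_text" &&
    typeof envelope["ocr_text"] === "string" &&
    typeof envelope["notes"] === "string"
  );
}
```

Wiring a check like this into ingest would turn a silent Gemini drift (a new key, a dropped key) into a visible invariant failure rather than a downstream surprise.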

4. What this sweep can't prove
  • v1's actual fence-leak rate. The iOS OCRSanitizer strips fences at the decode site before the text reaches the dashboard, so the no-markdown-fence invariant can only fire if the sanitizer missed a case. We'd need to either sample s3://…/ocr/*.txt directly or temporarily disable the sanitizer to quantify the raw v1 fence rate. The historical one-off (prime-number Python screenshot) remains the only known leak event.
  • Adversarial coverage is shallow. Two code-heavy fixtures (AAAA0011, AAAA0012) × 3 iters = 6 v2 runs on fence-prone content. No regression observed, but a large-scale sweep (50+ iters) would be needed to make a statistical claim.
  • Video pipeline. Video OCR stays on v1 per the scope cut in the current PR; its rawResponse capture + parse invariants are tracked in the follow-up plan.
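
For context on what the decode-site defense does, a fence stripper in the spirit of the iOS OCRSanitizer can be sketched as follows (an approximation in TypeScript; the real sanitizer is Swift and its exact rules aren't documented here):

```typescript
// Sketch of a decode-site fence stripper: if the model wrapped its
// output in a Markdown code fence (``` or ```lang on the first line,
// ``` on the last), drop the fence lines and return the inner text.
// Approximation only; the production OCRSanitizer may cover more cases.
function stripMarkdownFence(text: string): string {
  const trimmed = text.trim();
  const lines = trimmed.split("\n");
  if (
    lines.length >= 2 &&
    lines[0].startsWith("```") &&
    lines[lines.length - 1].trim() === "```"
  ) {
    return lines.slice(1, -1).join("\n");
  }
  return trimmed;
}
```

This is also why the sweep can't measure v1's raw fence rate from the dashboard alone: a stripper like this removes the evidence before the invariant ever sees it.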
5. Takeaway

v2 ships the structural defense cleanly: 12/12 runs produce parseable JSON, zero quality regression vs v1 in word counts, zero invariant failures. The observability layer now makes any future Gemini drift immediately visible in the run page. The sanitizer + responseSchema + parseability invariant form the three-layer defense the design called for.