v1 vs v2 OCR Lambda — paired photo sweep
Photo-only paired comparison between the legacy Gemini OCR Lambda (free-form text response) and the
new v2 Lambda (structured JSON via Gemini responseSchema).
The same 4 photos — two baseline (AAAA0003, AAAA0004) and two fence-prone
adversarial fixtures (AAAA0011 code screen, AAAA0012 word-doc screenshot) —
were run for 3 iterations each through both Lambdas.
Batches: v1 2026-04-21T14-32Z,
v2 2026-04-21T14-35Z.
Both tagged *-paired-adversarial. 12 runs per variant, 24 total.
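For context on what "structured JSON via responseSchema" means in practice, here is a minimal sketch of such a call. It assumes the @google/generative-ai Node SDK; the model name, prompt text, and required list are illustrative placeholders — only the two response keys (ocr_text and notes, described in the envelope section below) come from this report, and the actual v2 Lambda may differ.

```ts
import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

// Placeholder model name and API-key handling; the real Lambda's config may differ.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");

const model = genAI.getGenerativeModel({
  model: "gemini-1.5-flash",
  generationConfig: {
    // Forces a JSON body shaped by the schema instead of free-form text.
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        ocr_text: { type: SchemaType.STRING }, // raw extraction
        notes: { type: SchemaType.STRING },    // moderation/context
      },
      required: ["ocr_text"],
    },
  },
});

// Hypothetical helper: send one base64-encoded JPEG and return the raw JSON envelope,
// which is what v2 persists as rawResponse.
export async function ocrPhoto(base64Jpeg: string): Promise<string> {
  const result = await model.generateContent([
    { inlineData: { data: base64Jpeg, mimeType: "image/jpeg" } },
    { text: "Extract all text visible in this image." },
  ]);
  return result.response.text();
}
```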
ocr_text key — structural guarantee holds

| Metric | v1 | v2 | Reading |
|---|---|---|---|
| Runs | 12 | 12 | 3 iters × 4 photos each |
| lambdaVariant distribution | {"v1":12} | {"v2":12} | ingest-time prefix detection works |
| rawResponse present | 0/12 | 12/12 | v1 Lambda doesn't emit one; v2 always does |
| rawResponse parseable | – | 12/12 | JSON.parse succeeds + contains ocr_text |
| ocrText contains triple-backtick | 0/12 | 0/12 | client-side OCRSanitizer defense holding on both paths |
| rows with failing invariants | 0/12 | 0/12 | 0 failures across all 24 flows |
| avg flow elapsed | 8.81s | 7.63s | v2 trending faster; weak signal at n=12 |

Per-seed word counts and timing

| Seed | v1 avg words | v2 avg words | Δ words (%) | v1 fence rate | v2 fence rate | v1 avg elapsed | v2 avg elapsed |
|---|---|---|---|---|---|---|---|
| AAAA0003 | 158 | 152 | -3.6% | 0/3 | 0/3 | 9.68s | 9.59s |
| AAAA0004 | 80 | 78 | -2.1% | 0/3 | 0/3 | 9.00s | 8.50s |
| AAAA0011 | 108 | 108 | 0 | 0/3 | 0/3 | 7.95s | 5.04s |
| AAAA0012 | 257 | 257 | 0 | 0/3 | 0/3 | 8.62s | 7.37s |
Word counts are within a few percent per seed — no quality regression. Adversarial seeds
(AAAA0011 code screen, AAAA0012 word doc) extract identical word counts
across both variants, suggesting structured output mode hasn't cost us any recall on fence-prone content.
The whole reason for persisting rawResponse is so future regressions are observable
— if Gemini ever produces a commentary-wrapped envelope or drops the required ocr_text key,
the response-parseable invariant fails and we see it in the dashboard. Today, every v2 run
looks like this:
{"ocr_text": "def is_prime(n):\n\"\"\"\nChecks if an integer 'n' is a prime number.\n\nA prime number is a natural number greater than 1\nthat has no positive divisors other than 1 and itself.\n\"\"\"\nif n <= 1:\n return False\n# Check for factors from 2 up to the square root of n\nfor i in range(2, int(math.sqrt(n)) + 1):\n if n % i == 0:\n return False\nreturn True\n\ndef find_first_n_primes(count):\n\"\"\"\nFinds a specified count of the first prime numbers.\n\"\"\"\nprimes = []…
The envelope contains exactly the two keys the schema promised: ocr_text (raw extraction)
and notes (moderation/context). No commentary preamble, no fencing, no apologies — the
structural guarantee is working at the source, which is why the client-side sanitizer has no cleanup
to do on v2.
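The two invariants referenced above (response-parseable and no-markdown-fence) reduce to a small check per run row. The sketch below shows one way to express it; field and function names (checkOcrInvariants, rawResponse, ocrText) are illustrative rather than the dashboard's actual schema.

```ts
// Sketch of the per-run invariant evaluation; names are illustrative.
interface OcrRun {
  lambdaVariant: "v1" | "v2";
  ocrText: string;       // sanitized text shown in the dashboard
  rawResponse?: string;  // v2 only: the unmodified Gemini body
}

interface InvariantResult {
  rawResponseParseable: boolean; // v2: JSON.parse succeeds and ocr_text is a string
  noMarkdownFence: boolean;      // both variants: no triple-backtick in ocrText
}

function checkOcrInvariants(run: OcrRun): InvariantResult {
  let rawResponseParseable = true; // vacuously true for v1, which has no rawResponse
  if (run.lambdaVariant === "v2") {
    rawResponseParseable = false;
    if (run.rawResponse) {
      try {
        const parsed = JSON.parse(run.rawResponse);
        // A commentary-wrapped envelope fails JSON.parse; a dropped key fails this check.
        rawResponseParseable = typeof parsed?.ocr_text === "string";
      } catch {
        rawResponseParseable = false;
      }
    }
  }
  return {
    rawResponseParseable,
    noMarkdownFence: !run.ocrText.includes("```"),
  };
}
```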
Open items:
- v1's actual fence-leak rate. The iOS `OCRSanitizer` strips fences at the decode site before the text reaches the dashboard, so the `no-markdown-fence` invariant can only fire if the sanitizer missed a case (a sketch of that stripping step follows this list). We'd need to either sample `s3://…/ocr/*.txt` directly or temporarily disable the sanitizer to quantify the raw v1 fence rate. The historical one-off (prime-number Python screenshot) remains the only known leak event.
- Adversarial coverage is shallow. Two code-heavy fixtures (`AAAA0011`, `AAAA0012`) × 3 iters = 6 v2 runs on fence-prone content. No regression observed, but a large-scale sweep (50+ iters) would be needed to make a statistical claim.
- Video pipeline. Video OCR stays on v1 per the scope cut in the current PR; its `rawResponse` capture + parse invariants are tracked in the follow-up plan.
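For reference, the fence-stripping step the iOS OCRSanitizer applies at the decode site amounts to roughly the following. This is a TypeScript sketch for illustration only; the real implementation is Swift and its exact rules may be stricter.

```ts
// Simplified sketch: if the whole payload is wrapped in a fence (``` ... ```),
// drop the opening and closing fence lines and keep the inner content.
// Anything not fully wrapped passes through untouched.
function stripMarkdownFence(text: string): string {
  const lines = text.trim().split("\n");
  const opensWithFence = lines[0]?.startsWith("```");
  const closesWithFence = lines.length > 1 && lines[lines.length - 1].trim() === "```";
  return opensWithFence && closesWithFence ? lines.slice(1, -1).join("\n") : text;
}
```

With responseSchema holding at the source, this path should stay idle on v2; it remains the safety net for v1 and for any future drift.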
v2 ships the structural defense cleanly: 12/12 runs produce parseable JSON, zero quality regression vs v1 in word counts, zero invariant failures. The observability layer now makes any future Gemini drift immediately visible in the run page. The sanitizer + responseSchema + parseability invariant form the three-layer defense the design called for.