Zapcopy QA Report — v1 vs v2 OCR paired sweep

Photo-only paired comparison between the legacy Gemini OCR Lambda (free-form text response) and the new v2 Lambda (structured JSON via Gemini responseSchema). Same 4 photos — two baseline (AAAA0003, AAAA0004) and two fence-prone adversarial fixtures (AAAA0011 code screen, AAAA0012 word-doc screenshot) — run 3 iterations each through both Lambdas.

Batches: v1 2026-04-21T14-32Z, v2 2026-04-21T14-35Z. Both tagged *-paired-adversarial. 12 runs per variant, 24 total.

| Check | Result | Note |
| --- | --- | --- |
| v2 raw JSON captured | 12/12 | every v2 run emits a full Gemini envelope; v1 has no such field |
| v2 rawResponse parseable | 12/12 | 100% parse as JSON with ocr_text key — structural guarantee holds |
| Fence leaks in either variant | 0/24 | end-to-end defense (sanitizer + responseSchema) both holding |
| Word-count parity | 1.4% | mean \|Δ words\| v2-vs-v1 across 4 seeds — no quality regression |

1. Aggregate metrics

| Metric | v1 | v2 | Reading |
| --- | --- | --- | --- |
| Runs | 12 | 12 | 3 iters × 4 photos each |
| lambdaVariant distribution | {"v1":12} | {"v2":12} | ingest-time prefix detection works |
| rawResponse present | 0/12 | 12/12 | v1 Lambda doesn't emit one; v2 always does |
| rawResponse parseable | n/a | 12/12 | JSON.parse succeeds + contains ocr_text |
| ocrText contains triple-backtick | 0/12 | 0/12 | client-side OCRSanitizer defense holding on both paths |
| rows with failing invariants | 0/12 | 0/12 | 0 failures across all 24 flows |
| avg flow elapsed | 8.81s | 7.63s | v2 trending faster; weak signal at n=12 |
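
The two parse-level checks in the table (rawResponse present and parseable, with the required ocr_text key) amount to one small structural invariant. A minimal sketch in TypeScript, with illustrative names rather than the actual dashboard code:

```typescript
// Sketch of the rawResponse-parseable invariant: the raw Gemini
// envelope must be valid JSON, must be an object, and must carry a
// string ocr_text key. Function and type names are illustrative.
interface ParseCheck {
  parseable: boolean;
  reason?: string;
}

function checkRawResponseParseable(rawResponse: string): ParseCheck {
  let envelope: unknown;
  try {
    envelope = JSON.parse(rawResponse);
  } catch {
    return { parseable: false, reason: "JSON.parse failed" };
  }
  if (typeof envelope !== "object" || envelope === null) {
    return { parseable: false, reason: "envelope is not an object" };
  }
  const ocrText = (envelope as Record<string, unknown>)["ocr_text"];
  if (typeof ocrText !== "string") {
    return { parseable: false, reason: "missing ocr_text key" };
  }
  return { parseable: true };
}
```

A run fails the invariant on any of the three exits, which is exactly the "12/12 parseable" condition the sweep verified.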
2. Per-seed paired comparison

| Seed | v1 avg words | v2 avg words | Δ words | v1 fence rate | v2 fence rate | v1 elapsed | v2 elapsed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AAAA0003 | 158 | 152 | -3.6% | 0/3 | 0/3 | 9.68s | 9.59s |
| AAAA0004 | 80 | 78 | -2.1% | 0/3 | 0/3 | 9.00s | 8.50s |
| AAAA0011 | 108 | 108 | 0 | 0/3 | 0/3 | 7.95s | 5.04s |
| AAAA0012 | 257 | 257 | 0 | 0/3 | 0/3 | 8.62s | 7.37s |

Word counts are within a few percent per seed — no quality regression. Adversarial seeds (AAAA0011 code screen, AAAA0012 word doc) extract identical word counts across both variants, suggesting structured output mode hasn't cost us any recall on fence-prone content.
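
For clarity, the per-seed Δ-words column and the headline 1.4% parity figure follow from a simple calculation: per-seed Δ% is (v2 - v1) / v1, and the summary metric is the mean of the absolute per-seed deltas. A sketch with hypothetical helper names:

```typescript
// Per-seed word-count delta as a percentage of the v1 average.
function deltaPercent(v1Words: number, v2Words: number): number {
  return ((v2Words - v1Words) / v1Words) * 100;
}

// Headline parity metric: mean of |Δ%| across all seeds.
function meanAbsDelta(deltas: number[]): number {
  const total = deltas.reduce((sum, d) => sum + Math.abs(d), 0);
  return total / deltas.length;
}
```

Plugging in the reported per-seed deltas (-3.6, -2.1, 0, 0) gives 1.425, which rounds to the 1.4% figure in the summary.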

3. Observability layer (v2 only)

The whole reason for persisting rawResponse is so future regressions are observable — if Gemini ever produces a commentary-wrapped envelope or drops the required ocr_text key, the response-parseable invariant fails and we see it in the dashboard. Today, every v2 run looks like this:

v2 rawResponse (AAAA0011 — the prime-number Python leak case, first iter)
{"ocr_text": "def is_prime(n):\n\"\"\"\nChecks if an integer 'n' is a prime number.\n\nA prime number is a natural number greater than 1\nthat has no positive divisors other than 1 and itself.\n\"\"\"\nif n <= 1:\n    return False\n# Check for factors from 2 up to the square root of n\nfor i in range(2, int(math.sqrt(n)) + 1):\n    if n % i == 0:\n        return False\nreturn True\n\ndef find_first_n_primes(count):\n\"\"\"\nFinds a specified count of the first prime numbers.\n\"\"\"\nprimes = []…

The envelope contains exactly the two keys the schema promised: ocr_text (raw extraction) and notes (moderation/context). No commentary preamble, no fencing, no apologies — the structural guarantee is working at the source, which is why the client-side sanitizer has no cleanup to do on v2.
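
A stricter form of that check, verifying the parsed envelope carries exactly the two promised keys and nothing else, could look like the following sketch (an illustrative validator, not the deployed one; only the key names ocr_text and notes come from the report):

```typescript
// Sketch: a v2 envelope matches the responseSchema contract when it
// has exactly the keys ocr_text and notes, both strings. Any extra,
// missing, or non-string key is a schema drift signal.
function matchesSchema(envelope: Record<string, unknown>): boolean {
  const keys = Object.keys(envelope).sort();
  return (
    keys.length === 2 &&
    keys[0] === "notes" &&
    keys[1] === "ocr_text" &&
    typeof envelope["ocr_text"] === "string" &&
    typeof envelope["notes"] === "string"
  );
}
```

Wiring a check like this into ingest would turn a silent Gemini drift (a new key, a dropped key) into a visible invariant failure rather than a downstream surprise.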

4. What this sweep can't prove
  • v1's actual fence-leak rate. The iOS OCRSanitizer strips fences at the decode site before the text reaches the dashboard, so the no-markdown-fence invariant can only fire if the sanitizer missed a case. We'd need to either sample s3://…/ocr/*.txt directly or temporarily disable the sanitizer to quantify the raw v1 fence rate. The historical one-off (prime-number Python screenshot) remains the only known leak event.
  • Adversarial coverage is shallow. Two code-heavy fixtures (AAAA0011, AAAA0012) × 3 iters = 6 v2 runs on fence-prone content. No regression observed, but a large-scale sweep (50+ iters) would be needed to make a statistical claim.
  • Video pipeline. Video OCR stays on v1 per the scope cut in the current PR; its rawResponse capture + parse invariants are tracked in the follow-up plan.
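
For context on what the decode-site defense does, a fence stripper in the spirit of the iOS OCRSanitizer can be sketched as follows (an approximation in TypeScript; the real sanitizer is Swift and its exact rules aren't documented here):

```typescript
// Sketch of a decode-site fence stripper: if the model wrapped its
// output in a Markdown code fence (``` or ```lang on the first line,
// ``` on the last), drop the fence lines and return the inner text.
// Approximation only; the production OCRSanitizer may cover more cases.
function stripMarkdownFence(text: string): string {
  const trimmed = text.trim();
  const lines = trimmed.split("\n");
  if (
    lines.length >= 2 &&
    lines[0].startsWith("```") &&
    lines[lines.length - 1].trim() === "```"
  ) {
    return lines.slice(1, -1).join("\n");
  }
  return trimmed;
}
```

This is also why the sweep can't measure v1's raw fence rate from the dashboard alone: a stripper like this removes the evidence before the invariant ever sees it.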
5. Takeaway

v2 ships the structural defense cleanly: 12/12 runs produce parseable JSON, zero quality regression vs v1 in word counts, zero invariant failures. The observability layer now makes any future Gemini drift immediately visible in the run page. The sanitizer + responseSchema + parseability invariant form the three-layer defense the design called for.