Zapcopy QA Workstream F — plan breakdown
2026-04-24 · plan breakdown

Workstream F — hot-swappable prompts & Gemini model variants

High-level breakdown of the next cycle. Why we want it, how complex each piece is, the safest order to ship, and what we test at every gate. This is the executive view — the line-by-line plan lives in ~/.claude/plans/wild-tumbling-glade.md.

Why now

Today every OCR prompt is a string literal baked into a Lambda zip; every Gemini model ID is a hardcoded constant. When Google ships Gemini 3.x, or we want to try Gemini 2.5 Pro on polish only, our only knob is "redeploy."

We lose two things from that:

  • Agility — any prompt or model change requires a Lambda redeploy (and, for client-side defaults, an app release) before it reaches users.
  • Provenance — runs don't record which prompt and model actually produced them, so every comparison relies on inferring configuration from YAML.

Scope this cycle. Gemini family only — 2.5 Flash, 2.5 Pro, future 3.x. Cross-provider adapters (OpenAI, Anthropic) are explicitly deferred until we've proven the hot-swap infrastructure end-to-end. iOS Settings picker + S3 remote config drive the switching. Per-flow granularity (4 independent picks: photo / collection / video-frame / video-polish) so we can, e.g., pin frame OCR to Flash for speed but promote polish to Pro for accuracy.

What this unlocks

Operational

  • Flip prompt or model in S3 → propagates to all clients in ≤10 min, no app release.
  • Per-stage tuning: cheap Flash on the hot path (frame OCR, called per-frame) and Pro on the polish stage where accuracy compounds across frames.
  • Safe rollback: if a new prompt blows up parse-rate, edit one JSON file and we're back to baseline within the cache TTL.

Observability

  • Every run carries Lambda-emitted provenance — no more inferring from YAML.
  • SHA-drift detection: dashboard turns yellow if Lambda's prompt SHA ≠ app_defaults.yaml SHA.
  • Model-mix tile per flow: at-a-glance "did our remote-config flip actually propagate?"
  • Cost tracking: separate Pro / Flash-Lite line items in the per-scan cost estimate.

Sub-steps — complexity & impact

F1 — Lambda self-identification

low complexity unlocks F5/F6

Add an _identity() helper to each v2 Lambda (moderation_ocr, frame_ocr, video_polish, video_aggregator) that emits {provider, modelId, promptId, promptSha256, lambdaVersion} in every response. Bump lambdaVersion suffix to -rev2 so the dashboard can tell pre- vs post-F1 runs apart.

Risk: minimal — additive fields. Old clients ignore them. Test: deploy + curl, confirm fields in response body; ingest one run, confirm they land in data/runs/<id>.json.

F2 — Prompt externalization + request-parameterized model

medium complexity core enabler

Lift each prompt out of Python into a named YAML entry (lambdas/v2/prompts.yaml). At cold start, load the YAML and compute SHA-256 per entry. Lambda accepts modelId + promptId on the request body; unknown promptId → 400; missing modelId → default to gemini-2.5-flash; validate against {2.5-flash, 2.5-pro, 2.5-flash-lite}.
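A sketch of the validation contract, assuming the YAML has already been parsed at cold start (the dict below stands in for lambdas/v2/prompts.yaml; entry names and the default promptId are illustrative):

```python
import hashlib

# Stand-in for the parsed contents of lambdas/v2/prompts.yaml.
PROMPTS = {
    "frame_ocr_normal": "Transcribe every line of visible text...",
    "frame_ocr_strict": "Transcribe only text you are certain of...",
}
# Cold start: compute SHA-256 once per entry.
PROMPT_SHAS = {pid: hashlib.sha256(t.encode()).hexdigest() for pid, t in PROMPTS.items()}

ALLOWED_MODELS = {"gemini-2.5-flash", "gemini-2.5-pro", "gemini-2.5-flash-lite"}
DEFAULT_MODEL = "gemini-2.5-flash"

def resolve(body: dict) -> tuple[str, str]:
    """Apply the F2 contract: unknown promptId -> 400, missing modelId -> Flash."""
    prompt_id = body.get("promptId", "frame_ocr_normal")  # assumed default entry
    if prompt_id not in PROMPTS:
        raise ValueError(f"400: unknown promptId {prompt_id!r}")
    model_id = body.get("modelId") or DEFAULT_MODEL
    if model_id not in ALLOWED_MODELS:
        raise ValueError(f"400: unsupported modelId {model_id!r}")
    return model_id, prompt_id
```

Computing the SHAs once at cold start keeps the hot path free of hashing work and gives F6's drift detection a stable value to compare against.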

Risk: moderate — touches every OCR Lambda's hot path, but video_polish_v2 already implements the same pattern (MODEL_MAPPING lookup). We're cloning a working template, not inventing one. Test: pytest suite under lambdas/v2/tests/ — 4 cases per Lambda (default resolves to Flash + normal prompt; explicit Pro reaches Gemini correctly; bad promptId → 400; identity fields populated).

F3 — S3 remote config

low complexity

Static JSON at s3://qr-uploads-sup/config/ocr-routing.json, public-read, 10-min cache. No new Lambda. Schema is forward-compatible with future rolloutPct fields for canary rollouts (deferred).
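One plausible shape for the routing file, with a validator the deploy script could run before uploading — field names beyond the four per-flow picks are assumptions, not the committed schema:

```python
import json

# Assumed shape of s3://qr-uploads-sup/config/ocr-routing.json.
EXAMPLE = json.loads("""
{
  "version": 1,
  "flows": {
    "photo":       {"modelId": "gemini-2.5-flash", "promptId": "photo_normal"},
    "collection":  {"modelId": "gemini-2.5-flash", "promptId": "collection_normal"},
    "videoFrame":  {"modelId": "gemini-2.5-flash", "promptId": "frame_ocr_normal"},
    "videoPolish": {"modelId": "gemini-2.5-flash", "promptId": "polish_normal"}
  }
}
""")

REQUIRED_FLOWS = {"photo", "collection", "videoFrame", "videoPolish"}

def validate(cfg: dict) -> bool:
    """All four flows present, each carrying a modelId + promptId pair."""
    flows = cfg.get("flows", {})
    return REQUIRED_FLOWS <= flows.keys() and all(
        {"modelId", "promptId"} <= f.keys() for f in flows.values()
    )
```

Keeping `version` at the top level is what makes the schema forward-compatible: a future `rolloutPct` per flow is purely additive and old clients can ignore it.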

Risk: low — just an S3 object + a deploy script. Test: upload the file, curl the public URL, confirm Cache-Control: max-age=600 + valid JSON.

F4 — iOS routing service + Settings picker

medium complexity user-visible

New OCRRoutingConfig actor fetches the JSON at cold start, falls back to baked-in defaults until it lands. 4 dev-override pickers in Settings — Default (server-configured) / 2.5 Flash / 2.5 Pro / 2.5 Flash-Lite. Resolved modelId + promptId stamp the request body in S3UploadService + VideoOCRService.

Risk: moderate — touches the request-body builder for every OCR call. Cold-start fetch must be non-blocking so a slow S3 doesn't gate the first scan. Test: launch debug build, grep debug_log.txt for RoutingConfig: fetched v1 flows=…; flip polish picker to Pro, run a video, confirm polishModelId=gemini-2.5-pro in the request body; reset → reverts.

F5 — A/B sweep harness

low complexity

Extend run_sweep.sh with --model + --prompt-id flags passed through to ocr_auto_test.sh as launch args (mirrors the existing USE_V2 / PHOTO_STRICT_MODE pattern). Cookie-cutter model-compare report template, modeled after the v1-vs-v2 page that already exists.

Risk: low — sweep harness only, no production code path. Test: run the harness with --model gemini-2.5-pro, confirm every iteration in the resulting batch carries modelId: "gemini-2.5-pro".

F6 — Cost + dashboard provenance

medium complexity

Add Pro + Flash-Lite entries to gemini_pricing.yaml (_cost.py reads from this — no code change). Schema gains optional provenance fields. compute_drift flags Lambda-emitted SHA ≠ YAML SHA. New Model-mix tile per flow page.
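A sketch of the drift check, assuming run records carry the F1 provenance fields under these (illustrative) names:

```python
def compute_drift(run: dict, yaml_shas: dict) -> bool:
    """True when the Lambda-emitted prompt SHA disagrees with app_defaults.yaml.

    Runs without provenance (pre-F1) report no drift, matching the plan's
    fallback to YAML inference for old runs.
    """
    emitted = run.get("promptSha256")
    if emitted is None:
        return False  # old run: nothing to compare
    expected = yaml_shas.get(run.get("promptId"))
    return expected is not None and emitted != expected
```

Treating "no provenance" and "unknown promptId" as no-drift keeps the dashboard from turning yellow on historical data the moment the schema lands.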

Risk: moderate — schema evolution, but additive. Old runs without provenance fall back to YAML inference (we keep the old code path). Test: astro check clean; force a Lambda/YAML SHA mismatch (temporarily edit YAML) and confirm a yellow drift pill appears.

F7 — Ship + initial Flash-vs-Pro polish sweep

low complexity decision gate

The first real experiment on the new infrastructure. 3 seeds × 3 iters × 2 configs = 18 video runs comparing baseline (all Flash) vs polish-on-Pro. Compare accuracy on source-code edge cases, latency, and cost (Pro is ~10× Flash on output tokens). Decision: keep Flash everywhere, or promote Pro to default on polish.
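The run matrix the sweep enumerates can be sketched as follows (seed names are placeholders):

```python
from itertools import product

seeds = ["seed-a", "seed-b", "seed-c"]   # placeholder seed identifiers
iterations = range(3)
configs = ["all-flash", "polish-on-pro"]

# Full cross product: 3 seeds × 3 iters × 2 configs = 18 video runs.
runs = [
    {"seed": s, "iter": i, "config": c}
    for s, i, c in product(seeds, iterations, configs)
]
```

Pairing every seed/iteration with both configs means each comparison is within-seed, so word-delta differences come from the model swap rather than input variance.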

Risk: none for production — this is a sweep, not a default flip. The decision lands in a follow-up commit that updates ocr-routing.json. Test: sweep + report at /reports/2026-05-XX-gemini-flash-vs-pro-polish.

Safest, most efficient order

The sequence is built around two principles: ship infrastructure that is invisible to users before shipping anything that changes user-visible behavior, and gate every step on automated tests we can re-run cheaply.

  1. F1 + F2 deploy together (Lambda rev2). Both are additive at the wire — old clients keep working. After deploy, ingest one paired sweep to confirm the new provenance fields land. Pytest must be green before the AWS push.
  2. F3 (S3 config) goes up next, with v1 defaults. All flows still pinned to Flash + normal prompts. This is a no-op rollout but it lets us verify the cache + URL contract before any client reads it.
  3. F4 (iOS) ships with the picker hidden behind a debug build first. Validate against the live S3 config on simulator. Picker becomes visible to all users only after F6's drift detection is live — that way, if a dev override flips a model in production, the dashboard surfaces the divergence immediately.
  4. F5 + F6 ship together. The harness is useless without the dashboard surfaces; the dashboard surfaces have nothing to display without the harness. Pair them.
  5. F7 is the first real experiment. Run the Flash-vs-Pro sweep, write the decision report, then (and only then) flip ocr-routing.json to promote Pro on polish if the report supports it. The promotion is a one-line JSON edit + an S3 upload — fully reversible inside 10 min.

Every step lands as its own commit. If any step regresses production behavior we revert that single commit; F7's promotion is just a config flip, not a code change.

Verification at every gate

The whole point of this workstream is to make experiments safer; the rollout itself must hold the same standard. Each gate is automated except where noted.

  • After F1+F2 deploy — verify: pytest lambdas/v2/tests/ + curl the deployed URL with a real S3 key. Pass: all tests green; response carries the 5 identity fields.
  • After F3 upload — verify: curl the public URL, validate JSON + Cache-Control header. Pass: 200 + 10-min cache + valid JSON.
  • After F4 iOS install — verify: cold-launch a debug build; grep debug_log.txt for the routing fetch line; one paired test scan. Pass: routing fetched; Default picker → empty body field; Pro picker → Pro reaches the Lambda.
  • After F5+F6 — verify: astro check + astro build; ingest one new sweep; visual check of the Model-mix tile. Pass: build clean; tile shows correct distribution; drift pill triggers when forced.
  • Before F7 promotion — verify: read the F7 report end-to-end; sanity-check word-deltas against known-good seeds. Pass (manual): Pro improvement must be measurable AND worth the cost delta.

Explicit non-goals this cycle