Workstream F — hot-swappable prompts & Gemini model variants
High-level breakdown of the next cycle. Why we want it, how complex each piece is, the
safest order to ship, and what we test at every gate. This is the executive view —
the line-by-line plan lives in ~/.claude/plans/wild-tumbling-glade.md.
Why now
Today every OCR prompt is a string literal baked into a Lambda zip; every Gemini model ID is a hardcoded constant. When Google ships Gemini 3.x, or we want to try Gemini 2.5 Pro on polish only, our only knob is "redeploy."
That costs us two things:
- Server-side rollout control. Can't ramp a new model to 10% of calls without an iOS release. App-store latency > experiment latency.
- Ground-truth provenance. The dashboard infers provider / model / promptSha256 from app_defaults.yaml. If a Lambda drifts (someone hot-edits the prompt, AWS rolls back to a stale zip) we won't notice — the dashboard will keep cheerfully reporting the YAML's belief.
Scope this cycle. Gemini family only — 2.5 Flash, 2.5 Pro, future 3.x. Cross-provider adapters (OpenAI, Anthropic) are explicitly deferred until we've proven the hot-swap infrastructure end-to-end. iOS Settings picker + S3 remote config drive the switching. Per-flow granularity (4 independent picks: photo / collection / video-frame / video-polish) so we can e.g. pin frame OCR to Flash for speed but promote polish to Pro for accuracy.
What this unlocks
Operational
- Flip prompt or model in S3 → propagates to all clients in ≤10 min, no app release.
- Per-stage tuning: cheap Flash on the hot path (frame OCR, called per-frame) and Pro on the polish stage where accuracy compounds across frames.
- Safe rollback: if a new prompt blows up parse-rate, edit one JSON file and we're back to baseline within the cache TTL.
Observability
- Every run carries Lambda-emitted provenance — no more inferring from YAML.
- SHA-drift detection: dashboard turns yellow if the Lambda's prompt SHA ≠ the app_defaults.yaml SHA.
- Model-mix tile per flow: at-a-glance "did our remote-config flip actually propagate?"
- Cost tracking: separate Pro / Flash-Lite line items in the per-scan cost estimate.
Sub-steps — complexity & impact
F1 — Lambda self-identification
low complexity · unlocks F5/F6
Add an _identity() helper to each v2 Lambda (moderation_ocr, frame_ocr, video_polish, video_aggregator) that emits {provider, modelId, promptId, promptSha256, lambdaVersion} in every response. Bump the lambdaVersion suffix to -rev2 so the dashboard can tell pre- vs post-F1 runs apart.
Test: ingest one run and confirm the identity fields land in data/runs/<id>.json.
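A minimal sketch of the helper, assuming it simply assembles the five fields named above; the module-level constant and its exact value are illustrative:

```python
# Sketch of F1's _identity() helper. The -rev2 suffix is per the plan;
# the constant name and example value are illustrative.
LAMBDA_VERSION = "frame_ocr_v2-rev2"

def _identity(model_id: str, prompt_id: str, prompt_sha256: str) -> dict:
    """Provenance block merged into every Lambda response."""
    return {
        "provider": "gemini",
        "modelId": model_id,
        "promptId": prompt_id,
        "promptSha256": prompt_sha256,
        "lambdaVersion": LAMBDA_VERSION,
    }
```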
F2 — Prompt externalization + request-parameterized model
medium complexity · core enabler
Lift each prompt out of Python into a named YAML entry (lambdas/v2/prompts.yaml). At cold start, load the YAML and compute SHA-256 per entry. The Lambda accepts modelId + promptId on the request body; unknown promptId → 400; missing modelId → default to gemini-2.5-flash; validate against {2.5-flash, 2.5-pro, 2.5-flash-lite}.
video_polish_v2 already implements the same pattern (MODEL_MAPPING lookup).
We're cloning a working template, not inventing one.
Test: pytest suite under lambdas/v2/tests/ — 4 cases per Lambda (default resolves to Flash + normal prompt; explicit Pro reaches Gemini correctly; bad promptId → 400; identity fields populated).
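A sketch of the cold-start load and request validation under those rules; load_prompts and resolve_request are hypothetical names, and the prompts.yaml shape (prompt id mapping to prompt text) is an assumption:

```python
# Illustrative F2 cold-start load + request validation. Function names,
# the "normal" default promptId, and the YAML shape are assumptions.
import hashlib
import yaml  # PyYAML

ALLOWED_MODELS = {"gemini-2.5-flash", "gemini-2.5-pro", "gemini-2.5-flash-lite"}
DEFAULT_MODEL = "gemini-2.5-flash"

def load_prompts(path: str = "prompts.yaml") -> dict:
    """Load named prompts once at cold start; compute a SHA-256 per entry."""
    with open(path) as f:
        prompts = yaml.safe_load(f)
    return {
        prompt_id: {
            "text": text,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }
        for prompt_id, text in prompts.items()
    }

PROMPTS = load_prompts()

def resolve_request(body: dict) -> tuple[str, str, dict]:
    """Validate modelId/promptId from the request body."""
    prompt_id = body.get("promptId", "normal")
    if prompt_id not in PROMPTS:
        raise ValueError(f"unknown promptId: {prompt_id}")  # surfaced as a 400
    model_id = body.get("modelId") or DEFAULT_MODEL
    if model_id not in ALLOWED_MODELS:
        raise ValueError(f"unsupported modelId: {model_id}")
    return model_id, prompt_id, PROMPTS[prompt_id]
```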
F3 — S3 remote config
low complexity
Static JSON at s3://qr-uploads-sup/config/ocr-routing.json, public-read, 10-min cache. No new Lambda. Schema is forward-compatible with future rolloutPct fields for canary rollouts (deferred).
Test: curl the public URL; expect Cache-Control: max-age=600 + valid JSON.
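An illustrative shape for the config plus an upload that satisfies the cache contract; the field names (configVersion, flows, per-flow modelId/promptId) are assumptions, not the shipped schema:

```python
# Hypothetical ocr-routing.json payload and its S3 upload. Field names are
# illustrative; only the bucket/key and cache contract come from the plan.
import json
import boto3

ROUTING_CONFIG = {
    "configVersion": 1,
    "flows": {
        "photo":       {"modelId": "gemini-2.5-flash", "promptId": "normal"},
        "collection":  {"modelId": "gemini-2.5-flash", "promptId": "normal"},
        "videoFrame":  {"modelId": "gemini-2.5-flash", "promptId": "normal"},
        "videoPolish": {"modelId": "gemini-2.5-flash", "promptId": "normal"},
        # schema leaves room for a future rolloutPct per flow (deferred)
    },
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="qr-uploads-sup",
    Key="config/ocr-routing.json",
    Body=json.dumps(ROUTING_CONFIG, indent=2).encode("utf-8"),
    ContentType="application/json",
    CacheControl="max-age=600",  # the 10-min client cache contract
    ACL="public-read",
)
```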
F4 — iOS routing service + Settings picker
medium complexity · user-visible
New OCRRoutingConfig actor fetches the JSON at cold start, falls back to baked-in defaults until it lands. 4 dev-override pickers in Settings — Default (server-configured) / 2.5 Flash / 2.5 Pro / 2.5 Flash-Lite. Resolved modelId + promptId stamp the request body in S3UploadService + VideoOCRService.
Test: grep debug_log.txt for RoutingConfig: fetched v1 flows=…; flip the polish picker to Pro, run a video, confirm polishModelId=gemini-2.5-pro in the request body; reset → reverts.
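The shipped OCRRoutingConfig is a Swift actor; this Python sketch only pins down the intended precedence (dev override, then fetched remote config, then baked-in default), with illustrative names throughout:

```python
# Resolution-order sketch only; not the iOS implementation.
BAKED_DEFAULTS = {
    "videoPolish": {"modelId": "gemini-2.5-flash", "promptId": "normal"},
    # ...one entry per flow (photo, collection, videoFrame)
}

def resolve_flow(flow: str, remote_config: dict | None, dev_override: dict | None) -> dict:
    if dev_override:                              # Settings picker set explicitly
        return dev_override
    if remote_config and flow in remote_config.get("flows", {}):
        return remote_config["flows"][flow]       # S3 config fetched at cold start
    return BAKED_DEFAULTS[flow]                   # until the first fetch lands
```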
F5 — A/B sweep harness
low complexity
Extend run_sweep.sh with --model + --prompt-id flags passed through to ocr_auto_test.sh as launch args (mirrors the existing USE_V2 / PHOTO_STRICT_MODE pattern). Cookie-cutter model-compare report template, modeled after the v1-vs-v2 page that already exists.
Test: run one sweep with --model gemini-2.5-pro, confirm every iteration in the resulting batch carries modelId: "gemini-2.5-pro".
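A hypothetical post-sweep assertion for that check; the batch directory layout and where modelId sits in each run JSON are assumptions:

```python
# Hypothetical F5 verification: every run in a batch carries the expected modelId.
import json
from pathlib import Path

def assert_batch_model(batch_dir: str, expected: str = "gemini-2.5-pro") -> None:
    for run_file in sorted(Path(batch_dir).glob("*.json")):
        run = json.loads(run_file.read_text())
        got = run.get("modelId")  # field placement in the run JSON is assumed
        assert got == expected, f"{run_file.name}: modelId={got!r}, want {expected!r}"
```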
F6 — Cost + dashboard provenance
medium complexity
Add Pro + Flash-Lite entries to gemini_pricing.yaml (_cost.py reads from this — no code change). Schema gains optional provenance fields. compute_drift flags Lambda-emitted SHA ≠ YAML SHA. New Model-mix tile on each flow page.
Test: astro check clean; force a Lambda/YAML SHA mismatch (temporarily edit the YAML) and confirm a yellow drift pill appears.
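compute_drift is named in the plan; this body is a sketch of the comparison it needs to make, with illustrative field names on both sides:

```python
# Sketch of the F6 drift check; record shapes are assumptions.
def compute_drift(run: dict, yaml_defaults: dict) -> bool:
    """True when the Lambda-emitted prompt SHA disagrees with app_defaults.yaml."""
    emitted = run.get("promptSha256")             # Lambda-emitted provenance (F1)
    expected = yaml_defaults.get("promptSha256")  # the dashboard's YAML belief
    # Pre-rev2 runs carry no emitted SHA; only flag when both sides exist.
    return bool(emitted and expected and emitted != expected)
```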
F7 — Ship + initial Flash-vs-Pro polish sweep
low complexity · decision gate
Real first experiment of the new infrastructure. 3 seeds × 3 iters × 2 configs = 18 video runs comparing baseline (all Flash) vs polish-on-Pro. Compare accuracy on source-code edge cases, latency, and cost (Pro is ~10× Flash on output tokens). Decision: keep Flash everywhere, or promote Pro to default on polish. If promoted, the flip is a one-line edit to ocr-routing.json.
Test: sweep + report at /reports/2026-05-XX-gemini-flash-vs-pro-polish.
Safest, most efficient order
The sequence is built around two principles: ship infrastructure that is invisible to users before shipping anything that changes user-visible behavior, and gate every step on automated tests we can re-run cheaply.
- F1 + F2 deploy together (Lambda rev2). Both are additive at the wire — old clients keep working. After deploy, ingest one paired sweep to confirm the new provenance fields land. Pytest must be green before the AWS push.
- F3 (S3 config) goes up next, with v1 defaults. All flows still pinned to Flash + normal prompts. This is a no-op rollout but it lets us verify the cache + URL contract before any client reads it.
- F4 (iOS) ships with the picker hidden behind a debug build first. Validate against the live S3 config on simulator. Picker becomes visible to all users only after F6's drift detection is live — that way, if a dev override flips a model in production, the dashboard surfaces the divergence immediately.
- F5 + F6 ship together. The harness is useless without the dashboard surfaces; the dashboard surfaces have nothing to display without the harness. Pair them.
- F7 is the first real experiment. Run the Flash-vs-Pro sweep, write the decision report, then (and only then) flip ocr-routing.json to promote Pro on polish if the report supports it. The promotion is a one-line JSON edit + an S3 upload — fully reversible inside 10 min.
Every step lands as its own commit. If any step regresses production behavior we revert that single commit; F7's promotion is just a config flip, not a code change.
Verification at every gate
The whole point of this workstream is to make experiments safer; the rollout itself must hold the same standard. Each gate is automated except where noted.
| Gate | How we verify | Pass criterion |
|---|---|---|
| After F1+F2 deploy | pytest lambdas/v2/tests/ + curl the deployed URL with a real S3 key | All tests green; response carries the 5 identity fields |
| After F3 upload | curl public URL, validate JSON + Cache-Control header | 200 + 10-min cache + parses |
| After F4 iOS install | debug build cold-launch; grep debug_log.txt for routing fetch line; one paired test scan | Routing fetched, Default picker → empty body field, Pro picker → Pro reaches Lambda |
| After F5+F6 | astro check + astro build; ingest one new sweep; visual check of Model-mix tile | Build clean; tile shows correct distribution; drift pill triggers when forced |
| Before F7 promotion | Read the F7 report end-to-end; sanity-check word-deltas against known-good seeds | Manual — Pro improvement must be measurable AND worth the cost delta |
Explicit non-goals this cycle
- OpenAI / Anthropic adapters. Cross-provider work waits until the Gemini-family swap pattern is proven in production.
- True no-redeploy prompt edits. Prompts.yaml is in the Lambda zip; full hot-reload (S3-hosted prompts + polling) is a future enhancement.
- Canary / percentage rollouts. Schema in ocr-routing.json leaves room (rolloutPct field), but we ship instant rollouts only this cycle.
- Drift-alert cron. F6 surfaces a yellow pill in the dashboard synchronously; nightly Slack alerts are queued.
- v1 Lambda retirement. Earliest 2026-05-21, unchanged from prior plan.
- Workstream E (zapcopy.app landing page). Separate cycle — not bundled into F.