Full-coverage sweep analysis

Analysis of batch 2026-04-20T16-21Z — the first clean full sweep (5 iterations × 6 asset/flow pairs = 30 runs) after the post-sweep fixes landed. Every run parses, every invariant passes, so the interesting signal is in what the outputs look like, not in whether the pipeline survived.

Observed leak events

1 / 10

video runs on BBBB0003 leaked the raw JSON envelope to the user's clipboard

Push notifications

unverified

device registers + Lambda promises one; no evidence the app ever receives it

Photo + collection determinism

< 1.5%

char-count variation across 5 iterations at temperature=0

1. Timing & variability

Flow	Asset	n	Avg time	Time CV	Avg chars	Char CV	Char range
collection	CCCC0001	5	12.4s	10.3%	2788	0.2%	17
collection	CCCC0002	5	10.6s	6.4%	2792	0.2%	20
single-photo	AAAA0003	5	9.9s	25.4%	1108	0.4%	12
single-photo	AAAA0004	5	11.8s	46.6%	422	1.4%	16
video	BBBB0003	5	140.2s	15.8%	5466	13.6%	2,043
video	BBBB0004	5	96.1s	13.0%	3549	0.4%	38

Photos + collections are nearly perfectly deterministic — CV under 1.5% on char count means Gemini at temperature=0 is essentially reproducing the same output. The tiny deltas are whitespace-only (e.g. [ ] vs ☐ for the same checkbox, 2-space vs 4-space indentation).

Video BBBB0003 is the outlier at 13.6% char CV and a 2,043-char range across 5 runs. Four of the five runs cluster near 5,000 chars; the fifth came in at 6,925 chars — and that one is the leak event described below.

Video time dominates the pipeline — ~83% of total video time is cloud-processing (Gemini polish: avg 116s, max 174s). s3-upload averages 2.3s; frame-extraction logs 0ms (instrumentation gap, not a real 0).

2. Observed leak event — video polish JSON envelope

Raw JSON envelope reached the user

1 of 10 video runs on asset BBBB0003 · 2026-04-20T16-38-17Z__iter_04

severity: leak

Gemini's polish call on this run returned the expected JSON wrapped in a markdown code fence (```json ... ```). iOS's JSONDecoder rejects the fence wrapper, so it fell through to the plain-text fallback path — meaning the user's clipboard received 6,925 characters of raw JSON including Gemini's process commentary under the additional_notes key. This is the exact failure mode the user has described seeing in production.

What reached the clipboard (first ~500 chars)

```json
{
  "stitched_response": "import * as vscode from \"vscode\";\nimport * as aws from \"aws-sdk\";\n\nexport function activate(context: vscode.ExtensionContext) {\n  const disposable = vscode.commands.registerCommand(\n    \"qrcode-extension.openQRCodePanel\",\n    () => {\n      // ...",
  "additional_notes": "The document was reconstructed by identifying overlapping lines between consecutive OCR frames. Indentation was corrected and normalized across all stitched content..."
}
```

5,237 chars

2026-04-20T16-44-35Z__iter_05

leak

6,925 chars

2026-04-20T16-38-17Z__iter_04

4,882 chars

2026-04-20T16-32-39Z__iter_03

5,234 chars

2026-04-20T16-27-35Z__iter_02

5,052 chars

2026-04-20T16-22-25Z__iter_01

Four clean runs landed in the 4,882–5,237 range (same 164 lines of VS Code extension code); the leak run ballooned to 6,925 chars.

3. Recommendation — Proposal A (strict JSON responseSchema)

Priority 1 — Video polish

Ship responseSchema immediately

Current prompt says "your final output must be a single valid JSON object and nothing else" — the leak event proves that's insufficient. Gemini decided to helpfully wrap its JSON in markdown. True generationConfig.responseSchema makes that structurally impossible at the API layer.

+ closes the observed leak path

+ ~20 extra output tokens, ~5% latency bump

~ Lambda-only change; iOS contract unchanged

Priority 2 — Photo + collection

Ship responseSchema after video lands

No leaks observed in this sweep's 20 photo/collection runs — but the sample is narrow (2 assets × 5 iters). Historical production evidence from the user remains the stronger signal. Applying the same structural safeguard here closes the last remaining plain-text return path.

+ makes leaks structurally impossible on every flow

+ enables a real parseability invariant

~ one Lambda edit covers both (EXTENDED_LAMBDA_OCR)

Short-term hotfix (can ship today, independent of Proposal A)

Strip ```json and ``` fences from the polish response on the iOS side before JSONDecoder.decode(). Five-line change in VideoOCRService. Catches the specific markdown-fence failure mode observed here, even without touching the Lambda. Think of this as a seatbelt while Proposal A is the airbag.

Full design in docs/superpowers/specs/2026-04-20-ios-json-response-and-parseability.md — the sequencing there was A→B with photo/collection first. The leak evidence flips the ordering: video polish now leads, photo/collection follows.

4. Push notifications — probably broken

Every video in the sweep follows the same pattern: the backend says a push will be sent, iOS registers with SNS, and then iOS polls S3 for the result. There's no evidence the push is ever delivered.

✓ Observed

Push permission granted
APNS device token registered
SNS platform endpoint ARN returned
Lambda accepts: "You will receive a push notification when complete"

✗ Not observed

No didReceiveRemoteNotification events
No willPresent events
Every video completes via pollElapsed=~100s (polling, not push)

Three candidate causes — investigate in order

Lambda doesn't publish. Check CloudWatch for video_ocr_polishing_lambda SNS publish attempts at the end of a run. If absent, the completion handler is the break.
APNS sandbox ↔ simulator delivery is flaky. Run the same test on a physical device to rule this in or out.
iOS-side handler isn't firing. Add a print("📬 push received:", userInfo) at the top of every UNUserNotificationCenterDelegate method.

The polling fallback is masking the problem — users get their result in ~100s either way, but the "You'll get a push when it's done" message on screen may be a promise we're not keeping.

5. Suggested next tests — ranked by value

Ship Proposal A for video_polish + re-sweep 10 BBBB0003 runs

critical effort: small

One-Lambda change that would block the observed leak. Compare leak rate pre/post on the exact asset that failed.

Markdown-fence hotfix in VideoOCRService

high value effort: tiny

Five-line iOS change that strips ```json wrappers before JSONDecoder. Ships today, independent of Proposal A.

Push-notification end-to-end verification on a physical device

high value effort: small

Either confirms the path works (removes the FUD) or localizes the actual break. Low-cost, high-information.

Widen the asset corpus

high value effort: medium

Two videos isn’t enough. Add: handwritten notes, multi-column PDF, RTL script (Arabic/Hebrew), a video with UI chrome the prompt says to strip. Each surfaces a different failure mode.

Temperature=0.5 baseline sweep

medium effort: small

A/B reference for how much variability is model-inherent vs. our T=0 ceiling. Useful for deciding when future regressions are "real" vs. noise.

Instrument frame-extraction timing on iOS

medium effort: tiny

Currently logs 0ms. Unblocks full pipeline-stage analysis and credit-per-frame validation.

Proposal B — persist raw Gemini JSON (after A lands)

medium effort: medium

Once iOS stops discarding the envelope, the dashboard can measure parse-rate over time. The invariant I tried to add in Phase C.5 finally has data to check.

6. TL;DR

✓ Pipeline is clean — 30/30 runs parse, all invariants pass, dashboard loads every run.
! Direct evidence Proposal A is needed — 1 of 10 video runs leaked the raw JSON envelope + model commentary to the user via a markdown-fence parse failure. Photo + collection clean in this sample but with a narrow corpus.
? Push notifications appear not to be delivering — every video falls back to polling. Needs investigation on a physical device.
→ Most urgent: video polish responseSchema + an iOS markdown-fence stripper. Both close the leak path.