AI Generation Retry & Recovery

Server-side retry cron, lifecycle tracking, and admin Generation dashboard (ZLY-1099)

Overview

AI Generation Retry & Recovery (ZLY-1099) closes the gap where a shopper leaves their email and walks away while AI art generation runs server-side. Before this change, generation was purely synchronous: the client POSTs /ai/generate and waits up to 300 seconds. On failure, the only recovery was a manual page reload in the foreground; there was no server-side retry and no persisted failure state. If generation failed on the delayed-email path, no "your design is ready" email was ever sent, so the shopper never received their artwork and was never notified.

This feature adds:

  • Lifecycle tracking on merch_session — status, attempt count, last attempt time, and last error.
  • A cron every 5 minutes that retries failed or stale generations (max 3 attempts, 5 min apart) by re-kicking the existing generate route.
  • A Generation dashboard in zooly-stats with KPI cards, filters, and manual Regenerate / Resend email actions.

Where to Access

| Surface | URL / path | Who |
|---|---|---|
| Generation dashboard | zooly-stats → Merch → Generation (/merch/generation) | Admin only |
| Cron endpoint | GET /api/merch/cron/generation-retry on zooly-app | Bearer CRON_SECRET (Vercel cron) |
| Manual regenerate | POST /api/merch/admin/generation/{sessionId}/retry on zooly-app | Merch admin |
| Manual resend email | POST /api/merch/admin/generation/{sessionId}/resend on zooly-app | Merch admin |

Local dev defaults:

| App | Port |
|---|---|
| zooly-stats | 3010 |
| zooly-app | 3000 (API default 3004 in stats client env) |
| Merch SPA | 3008 |

The Problem (Before ZLY-1099)

sequenceDiagram
    participant Shopper
    participant Create as create-page
    participant Gen as POST /ai/generate
    participant AI as AI provider
    participant Email as sendMerchDelayedReadyEmail
    Shopper->>Create: Leaves email, closes tab
    Create->>Gen: POST (sync, up to 300s)
    Gen->>AI: generateConsistentAIArt()
    alt success
        AI-->>Gen: image
        Gen->>Email: "your design is ready"
        Email-->>Shopper: Email with result link
    else failure
        AI-->>Gen: error
        Note over Gen: Clears analytics lock only<br/>No email, no retry, no failed state
        Note over Shopper: Shopper never notified
    end

Foreground shoppers who stay on the page can tap Retry (page reload). The delayed-email shopper has no recovery path.

Foreground generation (unchanged)

The create page still drives generation synchronously — POST /api/merch/ai/generate (up to 300s) while polling GET /api/merch/session/{id}/generation-status. Client orchestration lives in create-page.tsx; the poll/race helper is packages/merch/client/src/lib/ai-generation.ts (STALE_IN_PROGRESS_MS = 5 min, matching the server lock).

sequenceDiagram
    participant Client as create-page.tsx
    participant Gen as POST /ai/generate
    participant Status as GET /generation-status
    participant Analytics as merch_session_analytics
    participant Session as merch_session
    Client->>Gen: POST (blocking)
    Gen->>Analytics: generationStartedAt = now
    Gen->>Session: markGenerationAttemptStarted (ZLY-1099)
    alt duplicate within 5 min
        Gen-->>Client: 409 in_progress
        Client->>Status: poll until complete
    else success
        Gen->>Session: markGenerationSucceeded
        Gen-->>Client: 200 + art
    else failure
        Gen->>Session: markGenerationFailed
        Gen-->>Client: 500
        Client->>Client: handleRetry (reload)
    end

ZLY-1099 does not replace this foreground path — it adds server-side retry for sessions that fail while the shopper is gone.


Design Decisions

| Question | Decision | Rationale |
|---|---|---|
| Which failures does the cron retry? | Primary target: delayed-email shoppers who left before art completed | Highest missed-revenue case; foreground users already self-serve via reload |
| How does the cron invoke generation? | Reconcile kick: POST /ai/generate with CRON_SECRET | Reuses full path including ready email; matches offers crons |
| Where is status stored? | merch_session columns (not analytics sidecar) | Ticket requirement; simpler cron + dashboard queries |
| Manual dashboard actions? | Regenerate + resend ready email | Fulfillment push already exists via supplier assign |
| Retry cadence | Every 5 min, max 3 attempts | Per ticket; then failed → dashboard |

Implementation note: the implementation plan originally scoped the DB candidate query to delayed_email IS NOT NULL. The shipped query is broader: it matches any session with a tracked status, no ai_art_key, and a design, and resolveRetryTargets then narrows the set to personalized designs in open campaigns. Foreground failures without a delayed email can therefore enter the cron set if they have a tracked generation_status.

Attempt counter: markGenerationAttemptStarted runs on every /ai/generate call, including foreground shopper retries. All attempts count toward the 3-try cron cap (not gated to delayed_email only).


End-to-End Flow (After ZLY-1099)

flowchart TD
    Cron[Cron every 5 min] --> Scan{Art missing, attempted,<br/>failed / pending / stale in_progress,<br/>attempts < 3?}
    Scan -->|No| Skip[Skip]
    Scan -->|Yes| Filter{Personalized design<br/>AND store open?}
    Filter -->|No| Skip
    Filter -->|Yes| Kick[Re-kick POST /ai/generate]
    Kick --> Start[markGenerationAttemptStarted<br/>attempts++]
    Start --> Gen{AI generation}
    Gen -->|Success| Ok[markGenerationSucceeded<br/>+ delayed-ready email]
    Gen -->|Failure| Fail[markGenerationFailed + error]
    Fail --> Cap{attempts >= 3?}
    Cap -->|No| Cron
    Cap -->|Yes| Dash[(Generation dashboard<br/>status = failed)]
    Ok --> Dash
    Dash --> Admin{Admin manual action}
    Admin -->|Regenerate| Reset[reset attempts + re-kick]
    Admin -->|Resend email| Resend[Re-send ready email]
    Reset --> Kick

Lifecycle states

| Status | Meaning |
|---|---|
| pending | Generation expected but not yet running (e.g. after admin reset) |
| in_progress | A generation attempt is running (lock window: 5 min) |
| succeeded | Art produced (ai_art_key set); session removed from cron set |
| failed | Last attempt failed; may still be retried until attempt cap |

/ai/generate writes the lifecycle:

  • Start → markGenerationAttemptStarted (increments generation_attempts, sets in_progress)
  • Success (including cache hit) → markGenerationSucceeded
  • Failure → markGenerationFailed (records error message, does not reset attempts)
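
The three lifecycle writes can be modelled as pure state transitions. The following is a minimal sketch, not the actual @zooly/db helpers: the GenerationState shape and the function names attemptStarted, succeeded, and failed are hypothetical, and whether the real helpers clear generation_error on start or success is not stated here, so the sketch leaves it untouched.

```typescript
// Hypothetical in-memory model of the generation_* columns on merch_session.
type GenerationStatus = "pending" | "in_progress" | "succeeded" | "failed";

interface GenerationState {
  status: GenerationStatus | null; // null = session never hit /ai/generate
  attempts: number;
  lastAttemptAt: Date | null;
  error: string | null;
}

// generation_error is documented as capped at 1000 characters.
const MAX_ERROR_CHARS = 1000;

// Start: every call counts as an attempt, including foreground retries.
function attemptStarted(state: GenerationState, now: Date): GenerationState {
  return {
    ...state,
    status: "in_progress",
    attempts: state.attempts + 1,
    lastAttemptAt: now,
  };
}

// Success (including cache hit): only the status changes in this sketch.
function succeeded(state: GenerationState): GenerationState {
  return { ...state, status: "succeeded" };
}

// Failure: record the message but do not reset the attempt counter.
function failed(state: GenerationState, message: string): GenerationState {
  return { ...state, status: "failed", error: message.slice(0, MAX_ERROR_CHARS) };
}
```

Because failed() never touches attempts, a session that fails three times stays at attempts = 3 until something explicitly resets the counter.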

Delayed-Email Path

flowchart LR
    Selfie[Selfie captured] --> Email[Shopper enters email]
    Email --> Walk[Shopper leaves]
    Walk --> Gen[Server runs /ai/generate]
    Gen -->|Success| Ready[Ready email sent]
    Gen -->|Fail| Cron[Retry cron picks up]
    Cron -->|Retry succeeds| Ready
    Cron -->|3 failures| Admin[Admin dashboard triage]
    Admin --> Regen[Manual regenerate]
    Admin --> Resend[Manual resend email]

The cron reuses the existing generate path, which already calls sendMerchDelayedReadyEmail when delayed_email is set and art was missing at the start of the run. A successful retry therefore sends the same email the shopper would have received on first success.


In-Flight Lock (409)

Two layers prevent duplicate concurrent generation for the same session:

| Layer | Mechanism | Stale after |
|---|---|---|
| Analytics lock | merch_session_analytics.generation_started_at set, generation_completed_at null | 5 min → next request proceeds |
| Session status | merch_session.generation_status = in_progress | Cron treats as stale after 5 min (STALE_LOCK_MS) |

If /ai/generate is called while analytics shows an in-flight run younger than 5 min, it returns 409 { inProgress: true }. The cron and admin retry treat 409 as healthy (generation already running), not an error.
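
As an illustration of the lock check and of how callers interpret the 409, here is a small sketch. The AnalyticsLock shape and both function names are assumptions made for the example; only the 5-minute window and the "409 is healthy" rule come from the description above.

```typescript
// 5-minute lock window, as described for the analytics lock.
const LOCK_WINDOW_MS = 5 * 60 * 1000;

// Hypothetical projection of the two merch_session_analytics timestamps.
interface AnalyticsLock {
  generationStartedAt: Date | null;
  generationCompletedAt: Date | null;
}

// True when a run started less than 5 minutes ago and has not completed.
// In that case /ai/generate would answer 409 { inProgress: true }.
function isGenerationInFlight(lock: AnalyticsLock, now: Date): boolean {
  if (lock.generationStartedAt === null) return false;
  if (lock.generationCompletedAt !== null) return false;
  return now.getTime() - lock.generationStartedAt.getTime() < LOCK_WINDOW_MS;
}

// Cron and admin retry treat 409 like success: a run is already in flight.
function isHealthyKickStatus(httpStatus: number): boolean {
  return (httpStatus >= 200 && httpStatus < 300) || httpStatus === 409;
}
```

A lock older than the window simply stops blocking, which is what lets the next request or the next cron tick proceed after a crashed run.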


Ready Email

On successful generation, /ai/generate may send the "your design is ready" email via sendMerchDelayedReadyEmail (@zooly/merch-fulfillment-srv).

Send conditions (all must hold):

  • delayed_email is set on the session (re-fetched after generation so a PATCH during the run is picked up)
  • Art was missing at the start of this request (!session.aiArtKey at entry) — prevents duplicate sends when a second kick runs after art already exists
  • Not a free digital experience pending item (ZLY-1294 suppresses this generic email; experience flow sends its own)
  • NEXT_PUBLIC_MERCH_URL (or NEXT_PUBLIC_APP_URL/merch) is configured
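
The four conditions above amount to a single predicate. A minimal sketch follows; the ReadyEmailInput shape and the name shouldSendReadyEmail are hypothetical and only mirror the list, they are not taken from the route.

```typescript
// Hypothetical inputs gathered by the generate route before sending.
interface ReadyEmailInput {
  delayedEmail: string | null;             // re-fetched after generation
  hadArtAtEntry: boolean;                  // session.aiArtKey was set when the request began
  isFreeDigitalExperiencePending: boolean; // ZLY-1294: the experience flow sends its own email
  merchBaseUrl: string | undefined;        // resolved from the env configuration
}

// All four documented conditions must hold for the ready email to go out.
function shouldSendReadyEmail(input: ReadyEmailInput): boolean {
  if (!input.delayedEmail) return false;                  // no address to send to
  if (input.hadArtAtEntry) return false;                  // avoid duplicate sends on a second kick
  if (input.isFreeDigitalExperiencePending) return false; // generic email suppressed
  if (!input.merchBaseUrl) return false;                  // cannot build the result link
  return true;
}
```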

Link target: {merchHost}/{talentSlug}/{campaignSlug}/create?s={sessionId}
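
A sketch of how the link could be assembled, including the NEXT_PUBLIC_APP_URL fallback mentioned in the send conditions. The helper names resolveMerchHost and buildResultUrl are made up for this example, and trimming trailing slashes and URL-encoding the session id are assumptions, not confirmed behaviour.

```typescript
// Drop trailing slashes so joined paths do not contain "//".
function trimTrailingSlash(url: string): string {
  return url.replace(/\/+$/, "");
}

// Prefer NEXT_PUBLIC_MERCH_URL; otherwise fall back to NEXT_PUBLIC_APP_URL + "/merch".
function resolveMerchHost(env: Record<string, string | undefined>): string | undefined {
  const merchUrl = env["NEXT_PUBLIC_MERCH_URL"];
  if (merchUrl) return trimTrailingSlash(merchUrl);
  const appUrl = env["NEXT_PUBLIC_APP_URL"];
  if (appUrl) return `${trimTrailingSlash(appUrl)}/merch`;
  return undefined; // not configured: the ready email is not sent
}

// {merchHost}/{talentSlug}/{campaignSlug}/create?s={sessionId}
function buildResultUrl(
  merchHost: string,
  talentSlug: string,
  campaignSlug: string,
  sessionId: string,
): string {
  return `${merchHost}/${talentSlug}/${campaignSlug}/create?s=${encodeURIComponent(sessionId)}`;
}
```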

i18n keys (default-merch-strings.ts): email_delayed_subject, email_delayed_header, email_delayed_body, email_delayed_cta — overridden per campaign via createTranslator(campaign.i18n, defaultMerchStrings).

Manual Resend in the dashboard repeats this send for sessions that already have ai_art_key + delayed_email.


Architecture

apps/zooly-app/
  app/api/merch/cron/generation-retry/route.ts          5-min cron (thin router)
  app/api/merch/ai/generate/route.ts                    Lifecycle writes + generation
  app/api/merch/admin/generation/[sessionId]/retry/route.ts   Manual regenerate
  app/api/merch/admin/generation/[sessionId]/resend/route.ts  Manual resend email
  vercel.json                                           schedule: */5 * * * *

packages/merch/srv/ (@zooly/merch-srv)
  generation-retry.ts                                   resolveRetryTargets()

packages/db/ (@zooly/db)
  access/merch/merch-session.ts                         listRetryableGenerations, mark*, resetGenerationAttempts
  schema/merchEnums.ts                                  merch_generation_status enum
  schema/merchTables.ts                                 generation_* columns on merch_session

apps/zooly-stats/
  app/merch/generation/                                 Admin dashboard (read-only DB)
  app/api/admin/merch/generation/route.ts               List + KPI counts API

Read vs write split

zooly-stats uses a read-only production DB connection. The dashboard reads session rows via stats APIs and writes manual actions through cross-origin zooly-app admin routes (credentials: include + CORS).
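
For illustration, the cross-origin write from the dashboard could be prepared like this. buildAdminActionRequest is a hypothetical helper; the path shape and credentials: include come from this page, while the base URL handling is an assumption.

```typescript
type AdminAction = "retry" | "resend";

// Build the URL and fetch options for a manual dashboard action.
// credentials: "include" sends the admin cookie cross-origin, so the
// zooly-app CORS allow-list must contain the zooly-stats origin.
function buildAdminActionRequest(apiBase: string, sessionId: string, action: AdminAction) {
  const base = apiBase.replace(/\/+$/, "");
  return {
    url: `${base}/api/merch/admin/generation/${encodeURIComponent(sessionId)}/${action}`,
    init: { method: "POST" as const, credentials: "include" as const },
  };
}

// Usage in the browser (not executed here):
//   const { url, init } = buildAdminActionRequest(apiBase, sessionId, "retry");
//   const res = await fetch(url, init);
```

Keeping reads on the stats side and writes on zooly-app means the read-only DB connection never needs write privileges.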

sequenceDiagram
    participant Admin
    participant Stats as zooly-stats
    participant App as zooly-app
    participant DB as PostgreSQL
    Admin->>Stats: Load /merch/generation
    Stats->>DB: Read-only SQL (merch_session + campaign)
    DB-->>Stats: Generation rows + KPI counts
    Stats-->>Admin: Render table
    Admin->>Stats: Click Regenerate or Resend
    Stats->>App: POST .../admin/generation/{sessionId}/retry|resend
    App->>DB: resetGenerationAttempts / sendMerchDelayedReadyEmail
    App->>App: Re-kick /ai/generate (retry only)
    App-->>Stats: OK
    Stats-->>Admin: Refresh table

Reconcile-kick pattern

The cron does not duplicate generation logic. It lists candidates, then POSTs /api/merch/ai/generate server-to-server with Authorization: Bearer $CRON_SECRET — the same pattern as offers crons (image-candidates/reconcile). Benefits:

  • Reuses the full generate path (upload, session write, delayed email).
  • Idempotent: 409 while a run is in flight is treated as healthy, not an error.
  • Kicks are sequential (batch limit 50) to avoid piling AI load.
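
The sequential kick loop might look roughly like the sketch below. The Kick callback stands in for the server-to-server POST to /api/merch/ai/generate and is an assumption, as is counting a 409 toward kicked; only the summary field names, the batch limit, and the 409 rule come from this page.

```typescript
// Shape of the cron response: { pickedUp, kicked, errors[] }.
interface KickSummary {
  pickedUp: number;
  kicked: number;
  errors: string[];
}

// Hypothetical transport: performs the POST and resolves to the HTTP status.
type Kick = (sessionId: string) => Promise<number>;

async function kickSequentially(
  sessionIds: string[],
  kick: Kick,
  batchLimit = 50,
): Promise<KickSummary> {
  const batch = sessionIds.slice(0, batchLimit);
  const summary: KickSummary = { pickedUp: batch.length, kicked: 0, errors: [] };
  // One at a time on purpose: parallel kicks would pile load on the AI provider.
  for (const id of batch) {
    try {
      const status = await kick(id);
      const healthy = (status >= 200 && status < 300) || status === 409;
      if (healthy) summary.kicked += 1;
      else summary.errors.push(`${id}: HTTP ${status}`);
    } catch (err) {
      // A thrown error is recorded and the loop moves on to the next session.
      summary.errors.push(`${id}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return summary;
}
```

One failing session does not abort the tick; its error is collected and the remaining sessions are still kicked.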

Cron Details

Route: GET /api/merch/cron/generation-retry

Auth: Authorization: Bearer $CRON_SECRET

Schedule: every 5 minutes (*/5 * * * * in apps/zooly-app/vercel.json)
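
The bearer check for the cron route can be sketched as below. isAuthorizedCron is a hypothetical name; failing closed when CRON_SECRET is unset is an assumption rather than documented behaviour, and a production check may prefer a constant-time comparison.

```typescript
// Compare the Authorization header against "Bearer <CRON_SECRET>".
function isAuthorizedCron(
  authorization: string | null,
  cronSecret: string | undefined,
): boolean {
  // Assumption: with no secret configured, reject everything rather than allow all.
  if (!cronSecret) return false;
  return authorization === `Bearer ${cronSecret}`;
}
```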

Constants:

| Constant | Value | Purpose |
|---|---|---|
| RETRY_AFTER_MS | 5 min | Minimum gap between attempts |
| STALE_LOCK_MS | 5 min | in_progress older than this → treat as dead |
| MAX_ATTEMPTS | 3 | Cron stops retrying after this |
| BATCH_LIMIT | 50 | Max sessions per tick |
| maxDuration | 300s | Matches generate route timeout |

Candidate query (listRetryableGenerations):

  • ai_art_key IS NULL — art never completed
  • design_id IS NOT NULL — needs a design to generate
  • generation_attempts < maxAttempts
  • Session not expired
  • Status is failed, pending, or stale in_progress
  • Last attempt older than retryAfterMs (or never recorded)
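
Expressed as an in-memory predicate, the candidate conditions look roughly like this. The real listRetryableGenerations is a SQL query; the SessionRow shape, the expiresAt field, and the strict "older than" comparison are assumptions made for the sketch.

```typescript
type GenerationStatus = "pending" | "in_progress" | "succeeded" | "failed";

// Hypothetical projection of the merch_session columns the query inspects.
interface SessionRow {
  aiArtKey: string | null;
  designId: string | null;
  generationStatus: GenerationStatus | null;
  generationAttempts: number;
  generationLastAttemptAt: Date | null;
  expiresAt: Date | null; // assumed column name for session expiry
}

interface RetryOptions {
  maxAttempts: number;  // MAX_ATTEMPTS = 3
  retryAfterMs: number; // RETRY_AFTER_MS = 5 min
  staleLockMs: number;  // STALE_LOCK_MS = 5 min
}

function isRetryable(row: SessionRow, now: Date, opts: RetryOptions): boolean {
  if (row.aiArtKey !== null) return false;                       // art already exists
  if (row.designId === null) return false;                       // nothing to generate
  if (row.generationAttempts >= opts.maxAttempts) return false;  // cap reached
  if (row.expiresAt !== null && row.expiresAt.getTime() <= now.getTime()) return false;

  // A missing timestamp counts as "infinitely old" (never recorded).
  const ageMs =
    row.generationLastAttemptAt === null
      ? Number.POSITIVE_INFINITY
      : now.getTime() - row.generationLastAttemptAt.getTime();

  const status = row.generationStatus;
  const statusEligible =
    status === "failed" ||
    status === "pending" ||
    (status === "in_progress" && ageMs > opts.staleLockMs);
  if (!statusEligible) return false;

  return ageMs > opts.retryAfterMs; // respect the minimum gap between attempts
}
```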

Service filter (resolveRetryTargets):

  • Skips branded designs (designMode === "branded") — no AI, template image only
  • Skips closed/expired campaigns (isMerchCampaignOpen)
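
The service-level filter then narrows the DB candidates. In this sketch the Candidate shape is hypothetical, and the campaignOpen flag stands in for the result of isMerchCampaignOpen, whose signature is not shown on this page.

```typescript
// Hypothetical candidate after joining session, design, and campaign.
interface Candidate {
  sessionId: string;
  designMode: string;    // "branded" designs use a template image only, no AI
  campaignOpen: boolean; // stand-in for isMerchCampaignOpen(campaign)
}

// Keep only sessions worth re-kicking: AI-backed design in an open campaign.
function filterRetryTargets(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (c) => c.designMode !== "branded" && c.campaignOpen,
  );
}
```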

Response: { pickedUp, kicked, errors[] }

Env:

| Variable | Purpose |
|---|---|
| CRON_SECRET | Bearer token for cron auth |
| NEXT_PUBLIC_MERCH_URL | Base URL for ready-email / resend links |
| NEXT_PUBLIC_MERCH_API_URL | Stats client base for cross-origin admin actions (default http://localhost:3004) |
| ALLOWED_DOMAINS_CORS | Must include the zooly-stats origin (e.g. http://localhost:3010) so dashboard Regenerate/Resend calls succeed |

Local test:

curl -H "Authorization: Bearer $CRON_SECRET" \
  http://localhost:3000/api/merch/cron/generation-retry

Admin Dashboard

Path: /merch/generation in zooly-stats (linked from Merch → Stats nav tabs).

Access: cookie auth via getVerifiedUserInfo; unauthenticated users redirect to SSO with returnTo; non-admins redirect to /denied.

Default filter: Failed. Only sessions with generation_status IS NOT NULL appear (a session that never hit /ai/generate is invisible here). List capped at 200 rows, ordered by generation_last_attempt_at DESC.
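
The listing rules above can be sketched in memory as follows. The real list is a read-only SQL query; DashboardRow and listDashboardRows are hypothetical, and placing rows with no recorded attempt last is an assumption (SQL NULL ordering may differ).

```typescript
type GenerationStatus = "pending" | "in_progress" | "succeeded" | "failed";

interface DashboardRow {
  sessionId: string;
  generationStatus: GenerationStatus | null;
  generationLastAttemptAt: Date | null;
}

// Tracked sessions only, optional status filter (default "failed"),
// newest attempt first, capped at 200 rows.
function listDashboardRows(
  rows: DashboardRow[],
  statusFilter: GenerationStatus | null = "failed",
  limit = 200,
): DashboardRow[] {
  // Assumption: a missing timestamp sorts as the oldest possible value.
  const time = (r: DashboardRow) => r.generationLastAttemptAt?.getTime() ?? 0;
  return rows
    .filter((r) => r.generationStatus !== null) // never hit /ai/generate: invisible
    .filter((r) => statusFilter === null || r.generationStatus === statusFilter)
    .sort((a, b) => time(b) - time(a))
    .slice(0, limit);
}
```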

KPI cards

Counts across all tracked sessions (status filter does not affect KPIs):

  • Failed
  • Pending
  • In progress
  • Succeeded
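
A sketch of the KPI tally, assuming the counts are computed over every tracked session regardless of the active status filter. countByStatus is a hypothetical helper, not the stats API implementation.

```typescript
type GenerationStatus = "pending" | "in_progress" | "succeeded" | "failed";

// Count tracked sessions per status; untracked (null) sessions are ignored.
function countByStatus(
  rows: { generationStatus: GenerationStatus | null }[],
): Record<GenerationStatus, number> {
  const counts: Record<GenerationStatus, number> = {
    failed: 0,
    pending: 0,
    in_progress: 0,
    succeeded: 0,
  };
  for (const row of rows) {
    if (row.generationStatus !== null) counts[row.generationStatus] += 1;
  }
  return counts;
}
```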

Table columns

| Column | Source |
|---|---|
| Store | Campaign talent_name / slug |
| Email | merch_session.delayed_email |
| Status | generation_status pill |
| Attempts | generation_attempts |
| Last attempt | generation_last_attempt_at |
| Error | generation_error (truncated) |

Filters: status and store (campaign).

Manual actions

| Action | API | When enabled |
|---|---|---|
| Regenerate | POST .../retry | Always (resets attempt counter, re-kicks generate with forceRegenerate: true) |
| Resend email | POST .../resend | Only when ai_art_key and delayed_email are both set |

Both routes require requireAdmin() from @zooly/merch-admin-srv.

Resend builds the result URL as {merchHost}/{talentSlug}/{campaignSlug}/create?s={sessionId} and calls sendMerchDelayedReadyEmail with campaign i18n.


Database

merch_generation_status enum

pending · in_progress · succeeded · failed

merch_session (new columns)

| Column | Type | Purpose |
|---|---|---|
| generation_status | enum | Current lifecycle state |
| generation_attempts | integer (default 0) | Total attempts (cron + foreground) |
| generation_last_attempt_at | timestamptz | Last kick timestamp |
| generation_error | text | Last failure message (max 1000 chars) |

Index: merch_session_generation_status_idx on generation_status.

Migration: packages/db/drizzle/0125_merch_session_generation_status.sql (idempotent).

Relationship to analytics milestones

merch_session_analytics still tracks generation_started_at / generation_completed_at for funnel analytics. ZLY-1099 adds session-level status for cron eligibility and admin triage — the ticket explicitly asked to track generation on the session.

flowchart TB
    subgraph Session["merch_session (ZLY-1099)"]
        GS[generation_status]
        GA[generation_attempts]
        GL[generation_last_attempt_at]
        GE[generation_error]
    end
    subgraph Analytics["merch_session_analytics (existing)"]
        GStart[generation_started_at]
        GEnd[generation_completed_at]
    end
    GenRoute["POST /ai/generate"] --> Session
    GenRoute --> Analytics
    Cron["generation-retry cron"] -->|reads| Session
    Dashboard["Generation dashboard"] -->|reads| Session

Local Verification

Schema check:

set -a && source .env.local
psql "$DATABASE_URL" -c "\d merch_session" | grep generation
# expect: generation_status, generation_attempts, generation_last_attempt_at, generation_error

Seed a failed generation:

UPDATE merch_session
SET delayed_email = 'you@example.com',
    ai_art_key = NULL,
    generation_status = 'failed',
    generation_attempts = 1,
    generation_last_attempt_at = now() - interval '10 minutes'
WHERE id = '<session-with-personalized-design>';

Trigger cron:

curl -s -H "Authorization: Bearer $CRON_SECRET" \
  http://localhost:3004/api/merch/cron/generation-retry | jq
# expect: { "pickedUp": >=1, "kicked": >=1, "errors": [] }

After success: generation_status = succeeded, ai_art_key set, ready email sent. After 3 failures the row stays failed and is no longer picked up until an admin clicks Regenerate (resets attempts).

Dashboard: open http://localhost:3010/merch/generation with an admin cookie. Confirm KPI cards, status/store filters, and manual actions. Verify CORS if Regenerate/Resend fail cross-origin.


Out of Scope

  • Turning foreground generation into a full async job queue (foreground users still use client retry / reload).
  • Retrying branded designs (no AI involved).
  • Per-(design, product) tuple retry for multi-product journeys — the success-only tuple cache merch_session_generation is not used for failure recovery; cron targets the session's primary generation.
  • Manual push to fulfillment (already covered by PATCH /orders/{id}/items/{itemId}/supplier).
  • Automatic email on failure (only success / manual resend sends the ready email).

See also: Abandoned Cart Recovery for the parallel hourly cart cron pattern.