Image Generation Benchmark

Admin tool for comparing AI models, prompts, and hyperparameters through the production merch pipeline

What is the Benchmark?

The Image Generation Benchmark is an admin-only tool inside Merch Admin for running controlled experiments on merch AI art generation. Each benchmark session sweeps a cartesian product of:

selfies × prompts × models × quality × input fidelity
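
As a rough illustration of how the sweep expands, the sketch below enumerates every combination and counts the resulting generations. The `SweepInput` and `Combination` shapes and both function names are hypothetical; they are not taken from create-session.ts.

```typescript
// Hypothetical shapes for illustration only (not the real create-session.ts types).
interface SweepInput {
  selfies: string[];
  prompts: string[];
  models: string[];
  qualities: string[];
  inputFidelities: string[];
}

interface Combination {
  selfie: string;
  prompt: string;
  model: string;
  quality: string;
  inputFidelity: string;
}

// Enumerate every combination in the sweep; each one is a single generation.
function expandSweep(input: SweepInput): Combination[] {
  const out: Combination[] = [];
  for (const selfie of input.selfies) {
    for (const prompt of input.prompts) {
      for (const model of input.models) {
        for (const quality of input.qualities) {
          for (const inputFidelity of input.inputFidelities) {
            out.push({ selfie, prompt, model, quality, inputFidelity });
          }
        }
      }
    }
  }
  return out;
}

// The total is the product of the axis sizes, so each added option multiplies the run size.
function countGenerations(input: SweepInput): number {
  return (
    input.selfies.length *
    input.prompts.length *
    input.models.length *
    input.qualities.length *
    input.inputFidelities.length
  );
}
```

For example, 3 selfies, 2 prompts, 2 models, 2 qualities, and 1 fidelity setting give 24 generations.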

Every generation runs through the same production pipeline as live merch:

  1. AI generation — gateway model call
  2. Overlay composite — center-crop + template overlay
  3. Background mask — optional removal using the same modes as production render
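
The three steps above can be sketched as a small async pipeline. This is only an outline under stated assumptions: `PipelineSteps`, `runPipeline`, and the step signatures are hypothetical stand-ins and do not reflect the real @zooly/merch-img-gen API. It also assumes the mask step is skipped when the removal mode is none.

```typescript
// Hypothetical stand-ins; the real step functions live in @zooly/merch-img-gen
// and their signatures are not shown in this document.
type Image = { url: string };

interface PipelineSteps {
  generate: (selfieUrl: string, prompt: string) => Promise<Image>;
  overlay: (art: Image) => Promise<Image>;
  mask: (art: Image) => Promise<Image>;
}

interface PipelineResult {
  artUrl: string; // post-overlay image
  maskedArtUrl: string | null; // post-mask image, or null when masking is skipped
}

async function runPipeline(
  steps: PipelineSteps,
  selfieUrl: string,
  prompt: string,
  bgRemovalMode: string,
): Promise<PipelineResult> {
  const raw = await steps.generate(selfieUrl, prompt); // 1. AI generation
  const art = await steps.overlay(raw); // 2. overlay composite
  // 3. background mask (assumption: skipped entirely when the mode is "none")
  const masked = bgRemovalMode === "none" ? null : await steps.mask(art);
  return { artUrl: art.url, maskedArtUrl: masked ? masked.url : null };
}
```

Passing the steps in as functions keeps the ordering visible and lets the sketch run against stubs.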

Results are stored in PostgreSQL and uploaded to S3 so admins can compare outputs side-by-side, like/dislike individual cells, and rerun past sessions.

Where to Access

| Environment | URL |
| --- | --- |
| Local dev | http://localhost:3004/admin/merch/benchmark |
| Production | Merch Admin → Benchmark tab in the top nav |

The Benchmark tab is admin-only. Supplier users are routed to /orders and cannot access it.

Routes:

| Page | Path |
| --- | --- |
| New Benchmark (setup) | /benchmark |
| Run history | /benchmark/results |
| Session detail | /benchmark/results/:id |

Architecture

apps/zooly-merch-admin/ (Vite SPA — port 3004)
  Benchmark tab mounts pages from @zooly/merch-benchmark-client
  BenchmarkPathsProvider base="/benchmark"

packages/merch-benchmark/client/ (@zooly/merch-benchmark-client)
  dashboard-page.tsx       Setup form + sticky Run Benchmark bar
  results-page.tsx         Session list
  session-detail-page.tsx  Comparison grid, config, rerun
  components/              Image upload, selfie groups modal, results grid

packages/merch-benchmark/srv/ (@zooly/merch-benchmark-srv)
  create-session.ts        Cartesian product → DB rows
  run-generation.ts        Full pipeline: generate → overlay → mask → S3
  selfies.ts               Western / Japan selfie corpus helpers

packages/merch/img-gen/ (@zooly/merch-img-gen)
  generateAIArt, downloadAndProcessAIImage, resolveBackgroundMask, applyMaskToImage
  Shared with production /api/merch/ai/generate and render steps

apps/zooly-app/app/api/merch/admin/benchmark/
  sessions/                GET list, POST create
  sessions/[id]/           GET detail, PATCH rename
  generate/                POST run one generation
  generations/[id]/like/   PATCH like/dislike
  selfies/                 GET selfie corpus

packages/db/
  src/schema/merchBenchmarkTables.ts
  src/access/merch/merch-benchmark.ts

Auth

Same pattern as the rest of merch admin:

  • Client calls /api/merch/admin/* endpoints with credentials: "include"
  • Server routes use requireAdmin() from @zooly/merch-admin-srv
  • Unauthenticated users redirect to zooly-auth SSO (VITE_AUTH_URL)

Client-side generation loop

After creating a session, the dashboard fires a client-side job queue (run-loop.ts) that POSTs to /api/merch/admin/benchmark/generate with bounded concurrency. The session detail page polls every 3 seconds while status is running.
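
A bounded-concurrency queue of the kind described above could look like the following. This is a generic sketch, not the contents of run-loop.ts; `runWithConcurrency` and `Outcome` are hypothetical names.

```typescript
// Generic sketch of bounded concurrency; not the actual run-loop.ts implementation.
type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Run `worker` over `items` with at most `limit` calls in flight at once.
// A failing item is recorded and does not stop the remaining jobs.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<Outcome<R>[]> {
  const results = new Array<Outcome<R>>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an item synchronously, before any await
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let n = 0; n < laneCount; n++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}
```

In the benchmark's case, the items would be pending generation ids, the limit would be the session's concurrency setting, and the worker would POST `{ generationId }` to /api/merch/admin/benchmark/generate.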


Setup Page

Session name

Defaults to an auto-generated, timestamp-based name, which can be edited. A name is required to run.

Selfies

Two corpora:

| Set | Count | Notes |
| --- | --- | --- |
| Western | 110 | Default dry-run pool |
| Japan | 442 | Larger corpus |

Selection modes:

| Mode | Behavior |
| --- | --- |
| all | Every selfie in the corpus |
| range | Inclusive index range (e.g. 1–5) |
| random | N random selfies at run time |
| group | Selfies from a saved test group; can mix Western, Japan, and uploaded selfies |

Selfie groups are shared with the Simulation tab. Use Manage groups to create, edit, or delete groups from the full pool (Western + Japan + uploaded selfies). The group editor also supports uploading new selfies — they are stored on S3 and become available to every group.
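
The four selection modes can be sketched as a single resolver function. The `Selection` and `Selfie` types and `selectSelfies` are hypothetical. The sketch assumes 1-based selfie indices (matching the 1–5 example) and assumes group members are renumbered in order, neither of which the source confirms.

```typescript
// Hypothetical types; assumes selfie indices are 1-based, as in the "1–5" example.
interface Selfie {
  index: number;
  url: string;
}

type Selection =
  | { mode: "all" }
  | { mode: "range"; from: number; to: number } // inclusive on both ends
  | { mode: "random"; count: number }
  | { mode: "group"; urls: string[] };

function selectSelfies(
  corpus: Selfie[],
  sel: Selection,
  rand: () => number = Math.random,
): Selfie[] {
  switch (sel.mode) {
    case "all":
      return corpus.slice();
    case "range":
      return corpus.filter((s) => s.index >= sel.from && s.index <= sel.to);
    case "random": {
      // Partial Fisher-Yates shuffle: picks `count` distinct selfies.
      const pool = corpus.slice();
      const n = Math.min(sel.count, pool.length);
      for (let i = 0; i < n; i++) {
        const j = i + Math.floor(rand() * (pool.length - i));
        const tmp = pool[i];
        pool[i] = pool[j];
        pool[j] = tmp;
      }
      return pool.slice(0, n);
    }
    case "group":
      // Assumption: group members are renumbered in the order they are stored.
      return sel.urls.map((url, i) => ({ index: i + 1, url }));
  }
}
```

Injecting `rand` makes the random mode reproducible in tests; by default it uses `Math.random`, so a rerun may pick different selfies.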

Prompts

One or more named prompt variants. At least one prompt with non-empty text is required.

Template, overlay, and mask

Drag-and-drop image upload fields (or paste a URL). These map to production design assets:

  • Template — AI template image passed to the model
  • Overlay — composited on cropped art after generation
  • Background removal mode: none, edge-only, flood-fill+edge, or mask-image
  • Mask image — required when mode is mask-image

Models and hyperparameters

Models come from GET /api/merch/admin/models (same catalog as production). Only active models appear.

| Setting | Options |
| --- | --- |
| Quality | low, medium, high |
| Input fidelity | low, high |
| Concurrency | Parallel generation jobs (default 5) |

Run summary bar

A sticky bar at the bottom shows total generations, estimated cost, and a Run Benchmark button. The button is disabled with a hint when required fields are missing.
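
The disabled-button logic can be pictured as a function that returns the list of hints for missing fields. The `SetupForm` shape, `missingFields`, and the hint wording are hypothetical. The name, prompt, and mask-image rules come from this page; the selfie and model checks are assumptions.

```typescript
// Hypothetical form shape; the real dashboard state is not shown in this document.
interface SetupForm {
  name: string;
  prompts: { name: string; text: string }[];
  selfieCount: number;
  modelCount: number;
  bgRemovalMode: "none" | "edge-only" | "flood-fill+edge" | "mask-image";
  bgMaskImageUrl: string | null;
}

// Returns one hint per missing requirement; an empty list means the run can start.
function missingFields(form: SetupForm): string[] {
  const hints: string[] = [];
  if (form.name.trim() === "") hints.push("Session name is required");
  if (!form.prompts.some((p) => p.text.trim() !== "")) {
    hints.push("Add at least one prompt with non-empty text");
  }
  if (form.bgRemovalMode === "mask-image" && !form.bgMaskImageUrl) {
    hints.push("Mask image is required when the mode is mask-image");
  }
  // Assumption: an empty selfie or model selection also blocks the run.
  if (form.selfieCount < 1) hints.push("Select at least one selfie");
  if (form.modelCount < 1) hints.push("Select at least one model");
  return hints;
}
```

The Run Benchmark button would then be enabled only when `missingFields(form).length === 0`.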


Results

Run history (/benchmark/results)

Lists past sessions with name, date, status, and models used. New Benchmark returns to the setup page.

Session detail (/benchmark/results/:id)

Header

  • Click session name to rename
  • Status badge: running, completed, or failed
  • Rerun with these settings — opens setup pre-filled from this session (?from=sessionId)

Stats

Shows progress, failed count, cost, and average duration. A progress bar is displayed while the session is running.
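
One way these stats could be derived from the generation rows is sketched below. `GenerationStats`, `SessionStats`, and `summarize` are hypothetical, and the definitions used here (progress counts both completed and failed rows; cost and average duration cover completed rows only) are assumptions rather than the tool's confirmed behavior.

```typescript
// Hypothetical row shape, loosely mirroring merch_benchmark_generation.
interface GenerationStats {
  status: "pending" | "running" | "completed" | "failed";
  durationMs: number | null;
  costPerImage: number | null;
}

interface SessionStats {
  progress: number; // fraction of rows that have finished, 0..1
  failed: number;
  cost: number;
  avgDurationMs: number | null;
}

function summarize(rows: GenerationStats[]): SessionStats {
  let completed = 0;
  let failed = 0;
  let cost = 0;
  let durationSum = 0;
  let durationCount = 0;
  for (const row of rows) {
    if (row.status === "failed") failed++;
    if (row.status !== "completed") continue;
    completed++;
    if (row.costPerImage !== null) cost += row.costPerImage;
    if (row.durationMs !== null) {
      durationSum += row.durationMs;
      durationCount++;
    }
  }
  return {
    // Assumption: failed rows count as finished for progress purposes.
    progress: rows.length === 0 ? 0 : (completed + failed) / rows.length,
    failed,
    cost,
    avgDurationMs: durationCount === 0 ? null : durationSum / durationCount,
  };
}
```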

Configuration

Expandable section showing prompts (read-only textarea, max ~500px height), template/overlay/mask thumbnails, and run parameters.

Results grid

  • Rows = selfies; columns = model / prompt / quality / fidelity (selectable via column field dropdown)
  • Show masked toggle — switches between post-overlay (artUrl) and post-mask (maskedArtUrl) images
  • Column filters — checkboxes to show/hide specific column values
  • Click an image for fullscreen view
  • Thumbs up / down on each completed cell (persisted via API)
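
The grid behaviour above amounts to pivoting generations by selfie and by the chosen column field. The sketch below is an illustration with hypothetical names (`GridCell`, `buildGrid`, `imageFor`), not the actual results-grid component.

```typescript
// Hypothetical cell shape for illustration; field names are not from the real component.
type ColumnField = "model" | "prompt" | "quality" | "fidelity";

interface GridCell {
  selfieIndex: number;
  model: string;
  prompt: string;
  quality: string;
  fidelity: string;
  artUrl: string | null; // post-overlay image
  maskedArtUrl: string | null; // post-mask image
}

// Rows are keyed by selfie index, columns by the value of the chosen field.
// A slot holds a list because several generations can share one column value.
type Grid = Record<number, Record<string, GridCell[]>>;

function buildGrid(
  cells: GridCell[],
  field: ColumnField,
  hiddenValues: string[] = [],
): Grid {
  const grid: Grid = {};
  for (const cell of cells) {
    const column = cell[field];
    if (hiddenValues.indexOf(column) !== -1) continue; // column filter unchecked
    if (!grid[cell.selfieIndex]) grid[cell.selfieIndex] = {};
    const row = grid[cell.selfieIndex];
    if (!row[column]) row[column] = [];
    row[column].push(cell);
  }
  return grid;
}

// The "Show masked" toggle picks which stored image a cell displays.
function imageFor(cell: GridCell, showMasked: boolean): string | null {
  return showMasked ? cell.maskedArtUrl : cell.artUrl;
}
```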

Database Schema

Defined in packages/db/src/schema/merchBenchmarkTables.ts. Migration: 0128_merch_benchmark_tables.sql.

merch_benchmark_session

One row per benchmark run.

| Column | Purpose |
| --- | --- |
| prompts | JSON array of { name, text } |
| template_image_url, overlay_url, bg_mask_image_url | Asset URLs |
| bg_removal_mode | Background removal mode |
| selfie_test_set | western or japan |
| total_generations, completed_generations, failed_generations | Counters |
| status | running, completed, or failed |
| estimated_cost, actual_cost | Cost tracking |

merch_benchmark_generation

One row per (selfie, prompt, model, quality, fidelity) combination.

| Column | Purpose |
| --- | --- |
| selfie_url, selfie_index | Input selfie |
| model_endpoint, model_display_name | From model catalog |
| prompt, prompt_name | Prompt text and variant name |
| params | { quality, inputFidelity, selfieFilter } |
| art_url | Post-overlay result (S3) |
| masked_art_url | Post-mask result (S3) |
| liked | true / false / null (human review) |
| status | pending, running, completed, or failed |
| duration_ms, cost_per_image, error | Metrics |

Access layer: packages/db/src/access/merch/merch-benchmark.ts


API Reference

All routes under apps/zooly-app/app/api/merch/admin/benchmark/. All require admin auth + CORS.

GET /api/merch/admin/benchmark/sessions

List all sessions with aggregated model names.

POST /api/merch/admin/benchmark/sessions

Create a session and bulk-insert pending generation rows.

Body (key fields):

{
  "name": "Benchmark 2026-06-10",
  "prompts": [{ "name": "Default", "text": "..." }],
  "selfies": [{ "index": 1, "url": "https://..." }],
  "models": [{ "endpointId": "gpt-image", "displayName": "GPT" }],
  "templateImageUrl": null,
  "overlayUrl": null,
  "bgRemovalMode": "none",
  "bgMaskImageUrl": null,
  "concurrency": 5,
  "selfieTestSet": "western",
  "qualities": ["low"],
  "inputFidelities": ["high"],
  "selfieFilter": null
}
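
A typed request builder for this body might look like the sketch below. The `CreateSessionBody` interface simply mirrors the key fields shown above. `buildCreateSessionRequest`, the `Content-Type` header, and the `unknown` type for `selfieFilter` are assumptions.

```typescript
// Mirrors the key fields shown above; the server may accept more fields.
interface CreateSessionBody {
  name: string;
  prompts: { name: string; text: string }[];
  selfies: { index: number; url: string }[];
  models: { endpointId: string; displayName: string }[];
  templateImageUrl: string | null;
  overlayUrl: string | null;
  bgRemovalMode: string;
  bgMaskImageUrl: string | null;
  concurrency: number;
  selfieTestSet: "western" | "japan";
  qualities: string[];
  inputFidelities: string[];
  selfieFilter: unknown; // shape not documented here; null in the example
}

interface JsonRequest {
  url: string;
  init: {
    method: "POST";
    credentials: "include"; // sends the admin session cookie cross-origin
    headers: { "Content-Type": string };
    body: string;
  };
}

// Builds the URL and fetch options without sending anything.
function buildCreateSessionRequest(appUrl: string, body: CreateSessionBody): JsonRequest {
  return {
    url: appUrl.replace(/\/+$/, "") + "/api/merch/admin/benchmark/sessions",
    init: {
      method: "POST",
      credentials: "include",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

The result can be passed straight to `fetch(request.url, request.init)`. Keeping the builder free of network calls makes it easy to unit test.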

GET /api/merch/admin/benchmark/sessions/[id]

Session + all generations.
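
The session detail page polls this endpoint every 3 seconds while the status is running. A minimal polling loop under that description could look like this; `pollSession` and its `load` callback are hypothetical names, and the real page may use a different mechanism.

```typescript
// Hypothetical sketch of the polling behaviour; not the actual page code.
type SessionStatus = "running" | "completed" | "failed";

interface SessionSnapshot {
  status: SessionStatus;
}

// Calls `load` repeatedly until the session leaves the "running" state,
// reporting every snapshot through `onUpdate`.
async function pollSession<T extends SessionSnapshot>(
  load: () => Promise<T>,
  onUpdate: (snapshot: T) => void,
  intervalMs: number = 3000,
): Promise<T> {
  for (;;) {
    const snapshot = await load();
    onUpdate(snapshot);
    if (snapshot.status !== "running") return snapshot;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Here `load` would wrap a GET to /api/merch/admin/benchmark/sessions/[id] with credentials included, and `onUpdate` would refresh the stats and grid.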

PATCH /api/merch/admin/benchmark/sessions/[id]

Rename session: { "name": "..." }

POST /api/merch/admin/benchmark/generate

Run one generation: { "generationId": "..." }. The route sets maxDuration = 300 (seconds), so a single generation can run for up to five minutes.

PATCH /api/merch/admin/benchmark/generations/[id]/like

Set review: { "liked": true | false | null }

GET /api/merch/admin/benchmark/selfies?set=western|japan

Returns the selfie corpus for the chosen test set.



### Environment

Merch admin reads `VITE_APP_URL` and `VITE_AUTH_URL` (defaults to `localhost:3004` and `localhost:3003` in dev).

Ensure `ALLOWED_DOMAINS_CORS` in `apps/zooly-app/.env.local` includes `http://localhost:3004`.

Run migration `0128_merch_benchmark_tables` against your local database before first use.



---

## Related Documentation

- [Merch Overview](/docs/merch/overview) — system components
- [Architecture](/docs/merch/architecture) — stack and package layout
- [Database Schema](/docs/merch/database-schema) — all merch tables
- [Environment Setup](/docs/merch/environment-setup) — local dev prerequisites