Admin tool for comparing AI models, prompts, and hyperparameters through the production merch pipeline
## What is the Benchmark?

The Image Generation Benchmark is an admin-only tool inside Merch Admin for running controlled experiments on merch AI art generation. Each benchmark session sweeps a Cartesian product of:
selfies × prompts × models × quality × input fidelity
Every generation runs through the same production pipeline as live merch: generate, overlay, mask, then upload to S3.
Results are stored in PostgreSQL and uploaded to S3 so admins can compare outputs side-by-side, like/dislike individual cells, and rerun past sessions.
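The sweep described above can be sketched as a plain nested expansion. This is a minimal illustration and not the code in `create-session.ts`; the `SweepInput` and `Cell` types and the `expandSweep` name are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real session payload may differ.
type Quality = "low" | "medium" | "high";
type Fidelity = "low" | "high";

interface SweepInput {
  selfies: string[]; // selfie URLs
  prompts: string[]; // prompt variant names
  models: string[]; // model endpoint ids
  qualities: Quality[];
  inputFidelities: Fidelity[];
}

interface Cell {
  selfie: string;
  prompt: string;
  model: string;
  quality: Quality;
  inputFidelity: Fidelity;
}

// Expand every combination: selfies × prompts × models × quality × input fidelity.
function expandSweep(input: SweepInput): Cell[] {
  const cells: Cell[] = [];
  for (const selfie of input.selfies) {
    for (const prompt of input.prompts) {
      for (const model of input.models) {
        for (const quality of input.qualities) {
          for (const inputFidelity of input.inputFidelities) {
            cells.push({ selfie, prompt, model, quality, inputFidelity });
          }
        }
      }
    }
  }
  return cells;
}

const cells = expandSweep({
  selfies: ["s1.jpg", "s2.jpg"],
  prompts: ["Default", "Variant B"],
  models: ["model-a", "model-b", "model-c"],
  qualities: ["low", "high"],
  inputFidelities: ["high"],
});
console.log(cells.length); // 2 × 2 × 3 × 2 × 1 = 24
```

Because the count is multiplicative, adding one more model or quality level multiplies the number of generations rather than adding to it.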
## Where to Access

| Environment | URL |
|---|---|
| Local dev | http://localhost:3004/admin/merch/benchmark |
| Production | Merch Admin → Benchmark tab in the top nav |
The Benchmark tab is admin-only. Supplier users are routed to /orders and cannot access it.
Routes:
| Page | Path |
|---|---|
| New Benchmark (setup) | /benchmark |
| Run history | /benchmark/results |
| Session detail | /benchmark/results/:id |
## Architecture

- `apps/zooly-merch-admin/` (Vite SPA, port 3004)
  - Benchmark tab mounts pages from `@zooly/merch-benchmark-client`
  - `BenchmarkPathsProvider base="/benchmark"`
- `packages/merch-benchmark/client/` (`@zooly/merch-benchmark-client`)
  - `dashboard-page.tsx`: setup form + sticky Run Benchmark bar
  - `results-page.tsx`: session list
  - `session-detail-page.tsx`: comparison grid, config, rerun
  - `components/`: image upload, selfie groups modal, results grid
- `packages/merch-benchmark/srv/` (`@zooly/merch-benchmark-srv`)
  - `create-session.ts`: Cartesian product → DB rows
  - `run-generation.ts`: full pipeline (generate → overlay → mask → S3)
  - `selfies.ts`: Western / Japan selfie corpus helpers
- `packages/merch/img-gen/` (`@zooly/merch-img-gen`)
  - `generateAIArt`, `downloadAndProcessAIImage`, `resolveBackgroundMask`, `applyMaskToImage`
  - Shared with production `/api/merch/ai/generate` and render steps
- `apps/zooly-app/app/api/merch/admin/benchmark/`
  - `sessions/`: `GET` list, `POST` create
  - `sessions/[id]/`: `GET` detail, `PATCH` rename
  - `generate/`: `POST` run one generation
  - `generations/[id]/like/`: `PATCH` like/dislike
  - `selfies/`: `GET` selfie corpus
- `packages/db/`
  - `schema/merchBenchmarkTables.ts`
  - `access/merch/merch-benchmark.ts`
### Auth

Same pattern as the rest of merch admin:

- `GET /api/merch/admin/*` with `credentials: "include"`
- `requireAdmin()` from `@zooly/merch-admin-srv`
- Auth service URL from `VITE_AUTH_URL`

### Client-side generation loop

After creating a session, the dashboard starts a client-side job queue (`run-loop.ts`) that POSTs to `/api/merch/admin/benchmark/generate` with bounded concurrency. The session detail page polls every 3 seconds while the status is `running`.
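A bounded-concurrency queue of the kind described above might look like the following sketch. It is not the contents of `run-loop.ts`: the `runWithConcurrency` and `postGeneration` names, the `baseUrl` parameter, and the choice to record failures per item instead of aborting are all assumptions for illustration.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Failures are captured per item so one bad generation does not stop the queue.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: "fulfilled", value: await worker(items[index]) };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Hypothetical worker: POST one generation id to the generate endpoint.
async function postGeneration(baseUrl: string, generationId: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/merch/admin/benchmark/generate`, {
    method: "POST",
    credentials: "include",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ generationId }),
  });
  if (!res.ok) throw new Error(`generate failed: ${res.status}`);
  return res.json();
}
```

A caller could then run something like `runWithConcurrency(ids, 5, (id) => postGeneration(appUrl, id))`, where `5` mirrors the default concurrency setting and `ids` and `appUrl` are placeholders.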
## Setup Page

### Session name

The name defaults to an auto-generated timestamp and can be edited. A name is required to run.
### Selfies

Two corpora:
| Set | Count | Notes |
|---|---|---|
| Western | 110 | Default dry-run pool |
| Japan | 442 | Larger corpus |
Selection modes:
| Mode | Behavior |
|---|---|
| all | Every selfie in the corpus |
| range | Inclusive index range (e.g. 1–5) |
| random | N random selfies at run time |
| group | Selfies from a saved test group — can mix Western, Japan, and uploaded selfies |
Selfie groups are shared with the Simulation tab. Use Manage groups to create, edit, or delete groups from the full pool (Western + Japan + uploaded selfies). The group editor also supports uploading new selfies — they are stored on S3 and become available to every group.
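The four selection modes can be modeled as a discriminated union. The sketch below is illustrative only: the `Selection` and `Selfie` types, the `selectSelfies` name, the 1-based index convention, and the injectable `rng` parameter are assumptions, not the behavior of `selfies.ts`.

```typescript
interface Selfie {
  index: number; // assumed 1-based position in the corpus
  url: string;
}

type Selection =
  | { mode: "all" }
  | { mode: "range"; from: number; to: number } // inclusive on both ends
  | { mode: "random"; count: number }
  | { mode: "group"; selfies: Selfie[] }; // a saved group may mix corpora and uploads

function selectSelfies(
  corpus: Selfie[],
  selection: Selection,
  rng: () => number = Math.random,
): Selfie[] {
  switch (selection.mode) {
    case "all":
      return [...corpus];
    case "range": {
      const { from, to } = selection;
      return corpus.filter((s) => s.index >= from && s.index <= to);
    }
    case "random": {
      // Partial Fisher-Yates shuffle: pick `count` distinct selfies.
      const pool = [...corpus];
      const n = Math.min(selection.count, pool.length);
      for (let i = 0; i < n; i++) {
        const j = i + Math.floor(rng() * (pool.length - i));
        [pool[i], pool[j]] = [pool[j], pool[i]];
      }
      return pool.slice(0, n);
    }
    case "group":
      // Groups carry their own selfies, so the corpus is not consulted.
      return [...selection.selfies];
  }
}
```

Passing a deterministic `rng` makes the random mode reproducible in tests; the default draws a fresh sample on every call.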
### Prompts

One or more named prompt variants. At least one prompt with non-empty text is required.
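### Template, overlay, and mask

Drag-and-drop image upload fields (or paste a URL). These map to production design assets:

- Template image
- Overlay image
- Background removal mode: `none`, `edge-only`, `flood-fill+edge`, or `mask-image`
- Mask image, used with `mask-image`

### Models and hyperparameters

Models come from `GET /api/merch/admin/models` (same catalog as production). Only active models appear.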
| Setting | Options |
|---|---|
| Quality | low, medium, high |
| Input fidelity | low, high |
| Concurrency | Parallel generation jobs (default 5) |
### Run summary bar

A sticky bar at the bottom shows total generations, estimated cost, and a Run Benchmark button. The button is disabled with a hint when required fields are missing.
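The state of the summary bar can be derived from the form values. The following sketch assumes a `SetupState` shape and helper names that do not come from the source. The source names only the session name and a non-empty prompt as required, so the selfie and model checks, and the decision to ignore blank prompts in the total, are assumptions.

```typescript
interface SetupState {
  name: string;
  prompts: { name: string; text: string }[];
  selfieCount: number;
  modelCount: number;
  qualities: string[];
  inputFidelities: string[];
}

// Prompts with blank text are ignored (assumption for this sketch).
function usablePrompts(state: SetupState): { name: string; text: string }[] {
  return state.prompts.filter((p) => p.text.trim() !== "");
}

function missingFields(state: SetupState): string[] {
  const missing: string[] = [];
  if (state.name.trim() === "") missing.push("session name");
  if (usablePrompts(state).length === 0) missing.push("prompt text");
  // Assumed checks; the source does not list these explicitly.
  if (state.selfieCount === 0) missing.push("selfies");
  if (state.modelCount === 0) missing.push("models");
  return missing;
}

function totalGenerations(state: SetupState): number {
  return (
    state.selfieCount *
    usablePrompts(state).length *
    state.modelCount *
    state.qualities.length *
    state.inputFidelities.length
  );
}

function runBar(state: SetupState): { total: number; disabled: boolean; hint: string | null } {
  const missing = missingFields(state);
  return {
    total: totalGenerations(state),
    disabled: missing.length > 0,
    hint: missing.length > 0 ? `Missing: ${missing.join(", ")}` : null,
  };
}
```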
## Results

### Run history (`/benchmark/results`)

Lists past sessions with name, date, status, and models used. New Benchmark returns to the setup page.

### Session detail (`/benchmark/results/:id`)

#### Header

- Status: `running`, `completed`, or `failed`
- Rerun (`?from=sessionId`)

#### Stats
Progress, failed count, cost, average duration. Progress bar while running.
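The stats listed above could be derived from the generation rows roughly as follows. This is a sketch under assumptions: the `GenerationMetrics` shape uses hypothetical camelCase field names, progress counts failed rows as finished, and cost and average duration are taken over completed rows only. The real page may compute these differently.

```typescript
type GenerationStatus = "pending" | "running" | "completed" | "failed";

interface GenerationMetrics {
  status: GenerationStatus;
  durationMs: number | null;
  costPerImage: number | null;
}

interface SessionStats {
  progress: number; // fraction of rows that have finished, 0..1
  failed: number;
  cost: number;
  averageDurationMs: number | null; // null when nothing has completed yet
}

function computeStats(rows: GenerationMetrics[]): SessionStats {
  const completed = rows.filter((r) => r.status === "completed");
  const failed = rows.filter((r) => r.status === "failed").length;
  const durations = completed
    .map((r) => r.durationMs)
    .filter((d): d is number => d !== null);
  const cost = completed.reduce((sum, r) => sum + (r.costPerImage ?? 0), 0);
  return {
    progress: rows.length === 0 ? 0 : (completed.length + failed) / rows.length,
    failed,
    cost,
    averageDurationMs:
      durations.length === 0
        ? null
        : durations.reduce((a, b) => a + b, 0) / durations.length,
  };
}
```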
#### Configuration
Expandable section showing prompts (read-only textarea, max ~500px height), template/overlay/mask thumbnails, and run parameters.
#### Results grid

- Each cell shows the post-overlay (`artUrl`) and post-mask (`maskedArtUrl`) images

## Database Schema

Defined in `packages/db/src/schema/merchBenchmarkTables.ts`. Migration: `0128_merch_benchmark_tables.sql`.
### `merch_benchmark_session`

One row per benchmark run.

| Column | Purpose |
|---|---|
| `prompts` | JSON array of `{ name, text }` |
| `template_image_url`, `overlay_url`, `bg_mask_image_url` | Asset URLs |
| `bg_removal_mode` | Background removal mode |
| `selfie_test_set` | `western` or `japan` |
| `total_generations`, `completed_generations`, `failed_generations` | Counters |
| `status` | `running`, `completed`, or `failed` |
| `estimated_cost`, `actual_cost` | Cost tracking |
### `merch_benchmark_generation`

One row per (selfie, prompt, model, quality, fidelity) combination.

| Column | Purpose |
|---|---|
| `selfie_url`, `selfie_index` | Input selfie |
| `model_endpoint`, `model_display_name` | From model catalog |
| `prompt`, `prompt_name` | Prompt text and variant name |
| `params` | `{ quality, inputFidelity, selfieFilter }` |
| `art_url` | Post-overlay result (S3) |
| `masked_art_url` | Post-mask result (S3) |
| `liked` | Human review: `true`, `false`, or `null` |
| `status` | `pending`, `running`, `completed`, or `failed` |
| `duration_ms`, `cost_per_image`, `error` | Metrics |
Access layer: packages/db/src/access/merch/merch-benchmark.ts
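To show how the generation columns relate to what the results grid displays, here is a hypothetical mapping from a database row to a grid cell. The column names come from the table above; the TypeScript types, nullability, and the `toGridCell` and `reviewLabel` helpers are assumptions and are not taken from the access layer.

```typescript
// Column names follow the merch_benchmark_generation table; types are assumed.
interface GenerationRow {
  selfie_index: number;
  model_display_name: string;
  prompt_name: string;
  art_url: string | null; // post-overlay result
  masked_art_url: string | null; // post-mask result
  liked: boolean | null; // null means not reviewed yet
  status: "pending" | "running" | "completed" | "failed";
  error: string | null;
}

interface GridCell {
  label: string;
  artUrl: string | null;
  maskedArtUrl: string | null;
  review: "liked" | "disliked" | "unreviewed";
}

// The tri-state `liked` column distinguishes a dislike from no review at all.
function reviewLabel(liked: boolean | null): GridCell["review"] {
  if (liked === null) return "unreviewed";
  return liked ? "liked" : "disliked";
}

function toGridCell(row: GenerationRow): GridCell {
  return {
    label: `#${row.selfie_index} · ${row.prompt_name} · ${row.model_display_name}`,
    artUrl: row.art_url,
    maskedArtUrl: row.masked_art_url,
    review: reviewLabel(row.liked),
  };
}
```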
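## API Reference

All routes live under `apps/zooly-app/app/api/merch/admin/benchmark/`. All require admin auth and CORS.

### `GET /api/merch/admin/benchmark/sessions`

List all sessions with aggregated model names.

### `POST /api/merch/admin/benchmark/sessions`

Create a session and bulk-insert pending generation rows.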
Body (key fields):

    {
      "name": "Benchmark 2026-06-10",
      "prompts": [{ "name": "Default", "text": "..." }],
      "selfies": [{ "index": 1, "url": "https://..." }],
      "models": [{ "endpointId": "gpt-image", "displayName": "GPT" }],
      "templateImageUrl": null,
      "overlayUrl": null,
      "bgRemovalMode": "none",
      "bgMaskImageUrl": null,
      "concurrency": 5,
      "selfieTestSet": "western",
      "qualities": ["low"],
      "inputFidelities": ["high"],
      "selfieFilter": null
    }

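As a rough illustration of calling this endpoint from the admin client, the sketch below builds the request for the body shown above. The field names mirror that example; the `CreateSessionBody` type, the `buildCreateSessionRequest` helper, and the `appUrl` parameter are hypothetical, and the response shape is not covered because the source does not document it.

```typescript
// Field names follow the example body; the exact types are assumptions.
interface CreateSessionBody {
  name: string;
  prompts: { name: string; text: string }[];
  selfies: { index: number; url: string }[];
  models: { endpointId: string; displayName: string }[];
  templateImageUrl: string | null;
  overlayUrl: string | null;
  bgRemovalMode: "none" | "edge-only" | "flood-fill+edge" | "mask-image";
  bgMaskImageUrl: string | null;
  concurrency: number;
  selfieTestSet: "western" | "japan";
  qualities: ("low" | "medium" | "high")[];
  inputFidelities: ("low" | "high")[];
  selfieFilter: unknown;
}

interface JsonRequestInit {
  method: "POST";
  credentials: "include"; // send the admin session cookie cross-origin
  headers: Record<string, string>;
  body: string;
}

function buildCreateSessionRequest(
  appUrl: string,
  body: CreateSessionBody,
): { url: string; init: JsonRequestInit } {
  return {
    // Trim trailing slashes so the path is joined with exactly one "/".
    url: `${appUrl.replace(/\/+$/, "")}/api/merch/admin/benchmark/sessions`,
    init: {
      method: "POST",
      credentials: "include",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

The returned pair can be passed straight to `fetch(url, init)`. Keeping the builder free of network calls makes it easy to unit test.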
### `GET /api/merch/admin/benchmark/sessions/[id]`

Returns the session and all of its generations.
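The session detail page polls this endpoint every 3 seconds while the session is running. A minimal polling loop under that description might look like this; `pollUntilDone`, the injected `fetchSession` and `sleep` functions, and the `SessionSnapshot` shape are assumptions for the sketch and do not describe the page's actual implementation.

```typescript
type SessionStatus = "running" | "completed" | "failed";

interface SessionSnapshot {
  status: SessionStatus;
}

// Fetch, report, and repeat until the session leaves the "running" state.
async function pollUntilDone<S extends SessionSnapshot>(
  fetchSession: () => Promise<S>,
  onUpdate: (session: S) => void,
  intervalMs: number = 3000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<S> {
  for (;;) {
    const session = await fetchSession();
    onUpdate(session);
    if (session.status !== "running") return session;
    await sleep(intervalMs);
  }
}
```

Injecting `fetchSession` and `sleep` keeps the loop independent of `fetch` and real timers, so it can be exercised in tests without waiting.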
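### `PATCH /api/merch/admin/benchmark/sessions/[id]`

Rename session: `{ "name": "..." }`

### `POST /api/merch/admin/benchmark/generate`

Run one generation: `{ "generationId": "..." }`. The route sets `maxDuration = 300`.

### `PATCH /api/merch/admin/benchmark/generations/[id]/like`

Set review: `{ "liked": true | false | null }`

### `GET /api/merch/admin/benchmark/selfies?set=western|japan`

Returns the selfie corpus for the chosen test set.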
### Environment
Merch admin reads `VITE_APP_URL` and `VITE_AUTH_URL` (defaults to `localhost:3004` and `localhost:3003` in dev).
Ensure `ALLOWED_DOMAINS_CORS` in `apps/zooly-app/.env.local` includes `http://localhost:3004`.
Run migration `0128_merch_benchmark_tables` against your local database before first use.
---
## Related Documentation
- [Merch Overview](/docs/merch/overview) — system components
- [Architecture](/docs/merch/architecture) — stack and package layout
- [Database Schema](/docs/merch/database-schema) — all merch tables
- [Environment Setup](/docs/merch/environment-setup) — local dev prerequisites