Architecture & Design

Architecture and design decisions for the Likeness Search system

Architecture Overview

The Likeness Search system follows a layered architecture optimized for Vercel's serverless environment:

┌─────────────────────────────────────────────┐
│   API Layer (Next.js Route Handlers)        │
│   apps/zooly-app/app/api/indexing/          │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Likeness Search Package                   │
│   @zooly/likeness-search                    │
│   packages/likeness-search/src/             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Utility Packages                          │
│   @zooly/util-elevenlabs, @zooly/util-srv   │
│   @zooly/social-scraper                     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Database Access Layer                     │
│   packages/db/src/access/                   │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Database Schema (Drizzle ORM)             │
│   packages/db/src/schema/                   │
└─────────────────────────────────────────────┘

System Components

1. Database Schema Layer

Location: packages/db/src/schema/

Six core tables support the search system:

  • likeness_assets - Stores uploaded images and voice samples with AI-extracted tags
  • likeness_search - SQL search index with 30+ enum fields for exact filtering
  • likeness_search_vector - pgvector embeddings for semantic similarity search
  • likeness_need_indexing_queue - Event queue driving the indexing pipeline
  • account_social_links - Social media links and platform-specific follower counts
  • scrapes - Social media scraping results and retry tracking

See Database Schema for complete table definitions.

2. Database Access Layer

Location: packages/db/src/access/

Provides type-safe access functions following the project's access pattern:

  • Queue Management: likenessNeedIndexingQueue.ts - Event CRUD and status management
  • Asset Management: likenessAssets.ts - Asset CRUD and tag updates
  • Search Index: likenessSearch.ts - SQL search index upserts and queries
  • Vector Search: likenessSearchVector.ts - Vector embedding operations
  • Requirements: baseRequirementsChecker.ts - Base requirements validation
  • Sufficiency: dataSufficiencyChecker.ts - Data sufficiency evaluation
  • Social Links: accountSocialLinks.ts - Social link management
  • Scrapes: scrapes.ts - Scraping result storage

Key Principle: Tables are never exposed directly. All database access goes through access functions, ensuring consistent filtering, type safety, and data integrity.
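To make the pattern concrete, here is a minimal sketch of what a queue access function could look like. An in-memory array stands in for the Drizzle-backed likeness_need_indexing_queue table, and the function and field names are illustrative, not the actual exports:

```typescript
// Illustrative sketch only: an in-memory array stands in for the
// likeness_need_indexing_queue table; the real code queries it via Drizzle.

type QueueStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "DISCARDED" | "TIMEOUT";

interface QueueEvent {
  id: string;
  accountId: string;
  status: QueueStatus;
  createdAt: Date;
}

const table: QueueEvent[] = [];

export function enqueueEvent(event: QueueEvent): void {
  table.push(event);
}

// Callers never see the table itself, only the access function's result.
export function getOldestPendingEvent(): QueueEvent | undefined {
  return [...table]
    .filter((e) => e.status === "PENDING")
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime())[0];
}

export function markEventStatus(id: string, status: QueueStatus): void {
  const event = table.find((e) => e.id === id);
  if (event) event.status = status;
}
```

Because every caller goes through these functions, filtering and status transitions stay consistent no matter which route handler or daemon touches the queue.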

3. Likeness Search Package

Location: packages/likeness-search/src/
Package: @zooly/likeness-search

A dedicated package containing core business logic organized by responsibility:

Indexing Pipeline

  • indexingDaemon.ts - Main daemon loop that processes the queue
  • processEvent.ts - Processes individual queue events
  • upsertToIndex.ts - Upserts account data to SQL and vector indexes
  • aggregateAccountTags.ts - Aggregates tags from multiple sources

Data Collection

  • generateTagsFromImage.ts - AI image tag extraction (Gemini vision)
  • generateTagsFromVoice.ts - AI voice tag extraction (Gemini audio)
  • generateVoiceSampleText.ts - AI demo script generation
  • triggerSubProcess.ts - Fire-and-forget API triggers

Search Functions

  • searchLikeness.ts - Main search function (SQL with vector fallback)
  • vectorFallbackSearch.ts - Vector similarity search
  • formatSearchResults.ts - Result enrichment and formatting

Utilities

  • apiRetryHandler.ts - Error classification and retry logic
  • validateAudioUrl.ts - Audio URL validation
  • schemas/likenessTagsSchema.ts - Zod schemas for tag validation

4. Utility Packages

The system leverages shared utility packages:

@zooly/util-elevenlabs

Location: packages/util-elevenlabs/src/

  • createVoiceSample.ts - Complete voice sample creation workflow (ElevenLabs voice clone + TTS generation)
  • voice-management.ts - Voice management functions (create, update, delete, list)
  • elevenlabs-service.ts - Low-level ElevenLabs API operations

@zooly/util-srv

Location: packages/util-srv/src/

  • generateEmbedding.ts - OpenAI embedding generation for vector search
  • S3 operations - File storage utilities
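As a rough sketch (endpoint and payload shape per OpenAI's embeddings REST API; the actual wrapper in util-srv may differ), generateEmbedding plus the cosine-similarity metric typically used to compare such vectors could look like:

```typescript
// Sketch of embedding generation via the OpenAI REST API. Error handling
// is trimmed; the real util-srv helper may wrap this differently.
export async function generateEmbedding(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // 1536 dimensions for text-embedding-3-small
}

// Cosine similarity, the usual distance metric for comparing embeddings.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```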

@zooly/social-scraper

Location: packages/social-scraper/src/

A dedicated package for social media scraping operations:

  • processSocialScraping.ts - Main orchestration function that scrapes all social links for an account
  • parseSocialUrl.ts - URL normalization and platform detection
  • scrapers/ - Platform-specific scraper functions:
    • instagram.ts - Instagram profile scraping
    • tiktok.ts - TikTok profile scraping
    • twitter.ts - Twitter/X profile scraping
    • youtube.ts - YouTube channel scraping
    • linkedin.ts - LinkedIn profile scraping
    • index.ts - Scraper registry (platform → function mapping)
  • types.ts - Type definitions for scraping results

Purpose: Collects follower counts, profile images, and metadata from social media platforms to enhance search indexing and populate missing account data.
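The scraper registry described above (platform → function mapping in scrapers/index.ts) might be sketched like this; the ScrapeResult shape and function names are assumptions, and the stub scrapers stand in for the real Scrapfly-backed implementations:

```typescript
// Hypothetical sketch of the scraper registry: a platform → function map
// so processSocialScraping can dispatch without a switch statement.
export type Platform = "instagram" | "tiktok" | "twitter" | "youtube" | "linkedin";

export interface ScrapeResult {
  platform: Platform;
  followersCount: number | null;
  avatarUrl: string | null;
}

export type Scraper = (profileUrl: string) => Promise<ScrapeResult>;

// Each entry would call the Scrapfly API for its platform; stubs shown here.
export const scrapers: Record<Platform, Scraper> = {
  instagram: async () => ({ platform: "instagram", followersCount: null, avatarUrl: null }),
  tiktok: async () => ({ platform: "tiktok", followersCount: null, avatarUrl: null }),
  twitter: async () => ({ platform: "twitter", followersCount: null, avatarUrl: null }),
  youtube: async () => ({ platform: "youtube", followersCount: null, avatarUrl: null }),
  linkedin: async () => ({ platform: "linkedin", followersCount: null, avatarUrl: null }),
};

// Unknown platforms (detected by parseSocialUrl) simply return undefined.
export function getScraper(platform: string): Scraper | undefined {
  return scrapers[platform as Platform];
}
```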

Integration with Indexing Pipeline:

  1. Trigger Points: Social scraping is triggered when:

    • Account is missing a profile image but has social links
    • Data sufficiency check indicates missing follower count data
  2. Process Flow:

    • Indexing daemon calls triggerSocialScraping() (fire-and-forget)
    • API route /api/indexing/scrape-social receives the request
    • Calls processSocialScraping() which:
      • Fetches social links from account_social_links table
      • For each platform, checks retry limits and backoff periods
      • Scrapes via Scrapfly API (Instagram, TikTok, Twitter/X, YouTube, LinkedIn)
      • Updates account_social_links.followersCount per platform
      • Stores results in scrapes table for retry tracking
      • Uploads profile avatars to S3 if account image is missing
      • Creates likeness_assets entries from scraped avatars
  3. Data Integration:

    • Follower counts are aggregated during tag aggregation (aggregateAccountTags)
    • Profile images become likeness_assets entries for AI tag extraction
    • Creates new SOCIAL_DATA queue event to trigger re-indexing
  4. Error Handling:

    • Per-link retry tracking (max 5 attempts per link)
    • Exponential backoff per link
    • Skip logic for recently scraped links (< 24h)
    • Rate limit detection and handling
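The per-link gate implied by the error-handling rules above (max 5 attempts, 24h skip window, exponential backoff) can be sketched as a pure function; the name, base delay, and exact backoff formula are assumptions:

```typescript
// Sketch of the per-link retry gate. Thresholds mirror the rules above;
// the base delay and doubling formula are illustrative assumptions.
const MAX_ATTEMPTS = 5;
const SKIP_WINDOW_MS = 24 * 60 * 60 * 1000; // recently-scraped skip window
const BASE_BACKOFF_MS = 60_000; // 1 min, doubling per attempt

export function shouldScrapeLink(
  attemptCount: number,
  lastAttemptAt: Date | null,
  lastSuccessAt: Date | null,
  now: Date = new Date(),
): boolean {
  // Permanent skip after exhausting retries.
  if (attemptCount >= MAX_ATTEMPTS) return false;
  // Skip links scraped successfully within the last 24h.
  if (lastSuccessAt && now.getTime() - lastSuccessAt.getTime() < SKIP_WINDOW_MS) return false;
  // Exponential backoff between failed attempts.
  if (lastAttemptAt) {
    const backoff = BASE_BACKOFF_MS * 2 ** attemptCount;
    if (now.getTime() - lastAttemptAt.getTime() < backoff) return false;
  }
  return true;
}
```

Keeping this decision pure (inputs come from the scrapes table, the clock is a parameter) makes the retry policy easy to unit-test independently of Scrapfly.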

5. API Layer

Location: apps/zooly-app/app/api/indexing/

Next.js route handlers that expose the system:

  • process-queue/route.ts - Cron endpoint (GET) - Protected by CRON_SECRET

    • Main indexing daemon that processes queue events
    • Triggers sub-processes (AI generation, social scraping) when needed
  • generate-tags/route.ts - AI tag generation (POST) - Protected by CRON_SECRET

    • Processes unprocessed assets for tag generation
    • Calls generateTagsFromImage() and generateTagsFromVoice() from @zooly/likeness-search
    • Creates voice samples via createVoiceSample() from @zooly/util-elevenlabs
  • scrape-social/route.ts - Social scraping (POST) - Protected by CRON_SECRET

    • Orchestrates social media scraping for an account
    • Calls processSocialScraping() from @zooly/social-scraper
    • Updates follower counts in account_social_links table
    • Uploads profile avatars to S3 and creates likeness_assets entries
    • Creates new SOCIAL_DATA queue event for re-indexing
  • search/route.ts - Public search (GET/POST) - No authentication required

    • Main search endpoint for buyers
    • Calls searchLikeness() from @zooly/likeness-search
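The CRON_SECRET guard shared by the protected routes might look roughly like this; Next.js route handlers receive a standard web Request, and the handler body here is a stub:

```typescript
// Sketch of the shared auth guard for CRON_SECRET-protected routes.
export function isAuthorized(req: Request): boolean {
  const header = req.headers.get("authorization");
  return header === `Bearer ${process.env.CRON_SECRET}`;
}

// e.g. process-queue/route.ts — the daemon logic itself is elided.
export async function GET(req: Request): Promise<Response> {
  if (!isAuthorized(req)) {
    return new Response("Unauthorized", { status: 401 });
  }
  // ... process the queue here ...
  return Response.json({ ok: true });
}
```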

6. External Services Integration

The system integrates with several external services:

  • OpenAI - text-embedding-3-small for vector embeddings (1536 dimensions)
  • Google Gemini - gemini-2.5-flash for image and audio analysis
  • ElevenLabs - Voice cloning and text-to-speech generation
  • AWS S3 - Storage for voice samples and profile images
  • Scrapfly - Social media scraping service (used by @zooly/social-scraper)

Design Decisions

1. Dual-Index Strategy

Decision: Use both SQL and vector indexes

Rationale:

  • SQL index provides fast, exact filtering on enum fields
  • Vector index enables semantic similarity search for fuzzy queries
  • Fallback mechanism ensures no query returns empty results unnecessarily

Implementation: SQL search runs first; if it returns 0 results (and offset=0), vector search runs as a fallback.
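The fallback rule can be sketched as follows; the search functions are injected here purely for illustration, while the real searchLikeness wires in its own SQL and vector implementations:

```typescript
// Sketch of the dual-index fallback: SQL first, vector only on an
// empty first page. Parameter names are illustrative.
export interface SearchParams {
  filters: Record<string, string>;
  limit: number;
  offset: number;
}

export async function searchWithFallback<T>(
  params: SearchParams,
  sqlSearch: (p: SearchParams) => Promise<T[]>,
  vectorSearch: (p: SearchParams) => Promise<T[]>,
): Promise<{ results: T[]; usedVectorFallback: boolean }> {
  const results = await sqlSearch(params);
  // Deeper pages of an empty SQL result set stay empty, keeping
  // pagination stable instead of switching ranking mid-scroll.
  if (results.length === 0 && params.offset === 0) {
    return { results: await vectorSearch(params), usedVectorFallback: true };
  }
  return { results, usedVectorFallback: false };
}
```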

2. Queue-Based Async Processing

Decision: Event-driven queue architecture

Rationale:

  • Serverless-friendly (Vercel functions have time limits)
  • Allows independent scaling of sub-processes
  • Enables retry logic and error recovery
  • Prevents blocking user interactions

Implementation: Events are added to queue, daemon processes them via cron, sub-processes run as separate API calls.

3. Fire-and-Forget Sub-Processes

Decision: Sub-processes triggered via non-blocking fetch() calls

Rationale:

  • Each sub-process gets its own serverless function lifecycle
  • Prevents timeout issues with long-running operations
  • Allows parallel execution of multiple sub-processes
  • Simplifies error handling per sub-process

Implementation: triggerSubProcess.ts uses fetch() without await, sub-processes mark events as completed when done.
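A minimal sketch of that trigger, assuming the payload shape and logging behavior (the real triggerSubProcess.ts may differ in both):

```typescript
// Sketch of the fire-and-forget trigger: fetch() is started but never
// awaited, so the daemon moves on while the sub-process runs in its own
// serverless invocation.
export function triggerSubProcess(path: string, payload: unknown): void {
  const url = `${process.env.NEXT_PUBLIC_APP_URL}${path}`;
  // Intentionally not awaited; failures are only logged, because the
  // sub-process owns its own retry lifecycle via the queue.
  fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.CRON_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  }).catch((err) => console.error(`Sub-process trigger failed: ${path}`, err));
}
```

The trade-off is that the caller gets no success signal; completion is observed indirectly, through the event's status in the queue.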

4. Account-Centric Design

Decision: Everything keyed by accountId (not userId)

Rationale:

  • Account is the main tenant object in the system
  • User entity is only for authentication/identity
  • Simplifies multi-tenant data isolation
  • Aligns with business domain model

Implementation: All tables reference account.id, all access functions take accountId parameter.

5. Separate Vector Table

Decision: Keep vector embeddings in separate table from SQL index

Rationale:

  • Efficient CRUD on search data without fetching embeddings
  • Embeddings are large (1536 dimensions) and rarely needed
  • Drizzle natively supports pgvector (no raw SQL hacks)
  • Allows independent updates to vector vs. data

Implementation: likeness_search_vector table stores accountId, content (text), and embedding (vector).
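As a sketch of how that table might be declared with Drizzle's pgvector support (column names are assumed from the description above; the real schema lives in packages/db/src/schema/):

```typescript
import { pgTable, text, timestamp, vector } from "drizzle-orm/pg-core";

// Sketch only: column names are assumptions based on the description
// above. Keeping embeddings out of likeness_search keeps the SQL index
// row small for everyday CRUD.
export const likenessSearchVector = pgTable("likeness_search_vector", {
  accountId: text("account_id").primaryKey(),
  content: text("content").notNull(), // the text that was embedded
  embedding: vector("embedding", { dimensions: 1536 }).notNull(),
  updatedAt: timestamp("updated_at").defaultNow().notNull(),
});
```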

6. Best-Effort Indexing

Decision: Index accounts even with incomplete data if they have at least one image

Rationale:

  • Maximizes discoverability
  • Buyers can still find accounts with partial data
  • Better than returning no results
  • Quality maintained through minimum requirements

Implementation: After exhausting data collection options, if account has ≥1 image asset, index with available data.

7. Per-Record Retry Tracking

Decision: Track retry attempts per asset and per social link

Rationale:

  • Some assets/links may fail permanently while others succeed
  • Prevents retrying failed records indefinitely
  • Enables exponential backoff per record
  • Better than failing entire account indexing

Implementation: likenessAssets.tagAttemptCount and scrapes.attemptCount track individual retries.

8. Asynchronous Social Scraping

Decision: Social scraping runs as a separate, asynchronous sub-process via fire-and-forget API calls

Rationale:

  • Social scraping can be slow (multiple API calls per account)
  • Prevents blocking the main indexing daemon
  • Allows parallel execution with AI tag generation
  • Each platform scrape gets its own retry lifecycle
  • Enables per-link retry tracking independent of other links

Implementation:

  • triggerSocialScraping() makes non-blocking fetch() call to /api/indexing/scrape-social
  • Event stays IN_PROGRESS while scraping runs
  • Scraper creates new SOCIAL_DATA event on completion to trigger re-indexing
  • Each social link tracks its own scraping attempts in scrapes table
  • Links with successful scrapes within 24 hours are automatically skipped

Data Flow Diagrams

Indexing Flow

flowchart TD
    subgraph UserActions["User Actions"]
        A[Celebrity completes onboarding] --> B[Name, Image, Social Links]
        B --> C[Approves Term]
        C --> D[Uploads Assets]
    end
    subgraph EventTrigger["Event Triggers"]
        B --> E[Create Indexing Event]
        C --> E
        D --> E
    end
    E --> F[Add to Queue]
    F --> G[Indexing Daemon]
    subgraph DaemonWorkflow["Daemon Workflow"]
        G --> H{Base Requirements Met?}
        H -->|No| H2{Image Missing + Has Social Links?}
        H2 -->|Yes| N[Trigger Social Scraping]
        H2 -->|No| I[Discard Event]
        H -->|Yes| J{Sufficient Data?}
        J -->|Yes| K[Upsert to Index]
        J -->|No| L{What's Missing?}
        L -->|Tags| M[Trigger AI Generation]
        L -->|Follower Count| N
    end
    subgraph SubProcesses["Sub-Processes (Async API Calls)"]
        M --> O[AI Generates Tags]
        N --> P[Scraper Collects Data]
        P --> P2[Upload Avatar to S3]
        P2 --> P3[Set as User Image]
        P3 --> P4[Create likenessAsset]
        O --> Q[Create New Indexing Event]
        P4 --> Q
    end
    Q --> F
    subgraph StatusManagement["Event Status"]
        G --> R[Mark IN_PROGRESS]
        K --> S[Mark COMPLETED]
        I --> S
        R -.->|"> 5 min"| T[Mark TIMEOUT]
    end
    subgraph Search["Search Flow"]
        U[Buyer Submits Brief] --> V[AI Extracts Tags from Brief]
        V --> W[SQL Filter Search on Index]
        W --> X[Return Matching Celebrities]
    end
    K --> W

Search Flow

flowchart TD
    A[Buyer Submits Query] --> B{Query Type?}
    B -->|Brief Text| C[AI Extract Filters]
    B -->|Direct Filters| D[Use Filters]
    C --> D
    D --> E[SQL Search]
    E --> F{Results Found?}
    F -->|Yes| G[Format Results]
    F -->|No + First Page| H[Vector Search]
    H --> I[Format Results]
    G --> J[Return to Buyer]
    I --> J

Event Processing Flow

sequenceDiagram
    participant Queue
    participant Daemon
    participant Checker as Requirements Checker
    participant Sufficiency as Data Sufficiency Checker
    participant AI as AI Generator
    participant Scraper as Social Scraper (@zooly/social-scraper)
    participant Index as Search Index
    Queue->>Daemon: Get Oldest PENDING Event
    Daemon->>Checker: Check Base Requirements
    Checker-->>Daemon: Requirements Status
    alt Missing Requirements
        Daemon->>Queue: Mark DISCARDED
    else Requirements Met
        Daemon->>Sufficiency: Check Data Sufficiency
        Sufficiency-->>Daemon: Sufficiency Status
        alt Sufficient Data
            Daemon->>Index: Upsert Account
            Daemon->>Queue: Mark COMPLETED
        else Insufficient Data
            Daemon->>AI: Trigger Tag Generation
            Daemon->>Scraper: Trigger Social Scraping
            Note over Daemon: Event stays IN_PROGRESS
            AI->>AI: Generate Tags (Gemini)
            Scraper->>Scraper: processSocialScraping()
            Note over Scraper: 1. Fetch social links from account_social_links<br/>2. Check retry limits & backoff<br/>3. Scrape via Scrapfly (per platform)<br/>4. Update follower counts<br/>5. Upload avatars to S3 if needed<br/>6. Create likeness_assets entries
            Scraper->>Scraper: Updates account_social_links.followersCount
            Scraper->>Scraper: Stores results in scrapes table
            AI->>Queue: Create New Event
            Scraper->>Queue: Create New SOCIAL_DATA Event
            AI->>Queue: Mark Original COMPLETED
            Scraper->>Queue: Mark Original COMPLETED
        end
    end

Environment Variables

The system requires the following environment variables:

Variable                       Purpose
─────────────────────────────  ─────────────────────────────────────────────────
CRON_SECRET                    Bearer token for cron/internal API authentication
INDEXING_DAEMON_BATCH_SIZE     Max events per daemon run (default: 250)
NEXT_PUBLIC_APP_URL            Base URL for fire-and-forget API calls
OPENAI_API_KEY                 OpenAI API key for embeddings
ELEVEN_LABS_API_KEY            ElevenLabs API key for voice cloning
AWS_BUCKET_NAME                S3 bucket for voice samples
AWS_REGION                     AWS region for S3
AWS_ACCESS_KEY_ID              AWS access key
AWS_SECRET_ACCESS_KEY          AWS secret key
NEXT_PUBLIC_AWS_BUCKET_URL     Public S3 bucket URL prefix
DATABASE_URL                   PostgreSQL connection URL

Performance Considerations

Indexing Performance

  • Batch Processing: Daemon processes up to 250 events per run (configurable)
  • Delays: 1-second delay between events prevents overwhelming the system
  • Parallel Sub-Processes: AI generation and social scraping run in parallel
  • Timeout Detection: Events stuck >5 minutes are marked as timeout
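The batch loop implied by the first two bullets can be sketched as follows; the queue accessors are injected here for illustration, and the function names are assumptions (the real daemon lives in indexingDaemon.ts):

```typescript
// Sketch of the daemon batch loop: up to INDEXING_DAEMON_BATCH_SIZE
// events per run, with a pause between events to avoid overwhelming
// downstream services. Accessors are injected for illustration.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function runDaemonBatch(
  nextEvent: () => Promise<string | null>, // next PENDING event id, or null
  processEvent: (id: string) => Promise<void>,
  batchSize = Number(process.env.INDEXING_DAEMON_BATCH_SIZE ?? 250),
  delayMs = 1000,
): Promise<number> {
  let processed = 0;
  for (let i = 0; i < batchSize; i++) {
    const id = await nextEvent();
    if (id === null) break; // queue drained
    await processEvent(id);
    processed++;
    if (i < batchSize - 1) await sleep(delayMs);
  }
  return processed;
}
```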

Search Performance

  • SQL First: Fast enum-based filtering runs before vector search
  • Vector Fallback: Only used when SQL returns 0 results (first page)
  • Result Caching: Consider adding a caching layer for frequently searched filters
  • Pagination: Search supports limit/offset for pagination

Database Optimization

  • Indexes: Queue table has indexes on (status, createdAt) and (accountId, status)
  • Unique Constraints: Search index has unique constraint on accountId
  • Vector Index: pgvector similarity search requires an explicit index (HNSW or IVFFlat) on the embedding column; pgvector does not create one automatically, and unindexed queries fall back to a sequential scan

Error Handling Strategy

The system implements multi-level error handling:

  1. Event-Level: Failed sub-processes retry with exponential backoff (max 5 attempts)
  2. Per-Asset: Individual assets track attempt counts (max 5 attempts)
  3. Per-Scrape: Social links track scraping attempts separately (max 5 attempts)
  4. Timeout Detection: Stale events are automatically marked as timeout
  5. Error Classification: Errors are classified as rate-limit, temporary, or permanent
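The classification step (point 5) might look roughly like this; the matching heuristics (status codes, message substrings) are assumptions, while the three categories come from the list above:

```typescript
// Sketch of the error classifier in apiRetryHandler.ts. Heuristics are
// illustrative; only the three categories are taken from the design.
export type ErrorClass = "rate_limit" | "temporary" | "permanent";

export function classifyError(status: number | null, message: string): ErrorClass {
  // 429s and explicit rate-limit messages get their own backoff handling.
  if (status === 429 || /rate.?limit/i.test(message)) return "rate_limit";
  // 5xx and network-ish failures are worth retrying.
  if ((status !== null && status >= 500) || /timeout|ECONNRESET|fetch failed/i.test(message)) {
    return "temporary";
  }
  // Other 4xx and everything else: don't retry.
  return "permanent";
}
```

Rate-limit and temporary errors feed the exponential-backoff retry path; permanent errors count against the per-record attempt limit immediately.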

See Error Handling for detailed retry logic.