Architecture & Design

Architecture and design decisions for the Likeness Search system

Architecture Overview

The Likeness Search system follows a layered architecture optimized for Vercel's serverless environment:

┌─────────────────────────────────────────────┐
│   API Layer (Next.js Route Handlers)        │
│   apps/zooly-app/app/api/indexing/          │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Likeness Search Package                   │
│   @zooly/likeness-search                    │
│   packages/likeness-search/src/             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Utility Packages                          │
│   @zooly/util-elevenlabs, @zooly/util-srv   │
│   @zooly/social-scraper                     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Database Access Layer                     │
│   packages/db/src/access/                   │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│   Database Schema (Drizzle ORM)             │
│   packages/db/src/schema/                   │
└─────────────────────────────────────────────┘

System Components

1. Database Schema Layer

Location: packages/db/src/schema/

Six core tables support the search system:

  • likeness_assets - Stores uploaded images and voice samples with AI-extracted tags
  • likeness_search - SQL search index with 30+ enum fields for exact filtering
  • likeness_search_vector - pgvector embeddings for semantic similarity search
  • likeness_need_indexing_queue - Event queue driving the indexing pipeline
  • account_social_links - Social media links and platform-specific follower counts
  • scrapes - Social media scraping results and retry tracking

See Database Schema for complete table definitions.

2. Database Access Layer

Location: packages/db/src/access/

Provides type-safe access functions following the project's access pattern:

  • Queue Management: likenessNeedIndexingQueue.ts - Event CRUD and status management
  • Asset Management: likenessAssets.ts - Asset CRUD and tag updates
  • Search Index: likenessSearch.ts - SQL search index upserts and queries
  • Vector Search: likenessSearchVector.ts - Vector embedding operations
  • Requirements: baseRequirementsChecker.ts - Base requirements validation
  • Sufficiency: dataSufficiencyChecker.ts - Data sufficiency evaluation
  • Social Links: accountSocialLinks.ts - Social link management
  • Scrapes: scrapes.ts - Scraping result storage

Key Principle: Tables are never exposed directly. All database access goes through access functions, ensuring consistent filtering, type safety, and data integrity.
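To make the pattern concrete, here is a minimal sketch of what a queue access function could look like. An in-memory array stands in for the Drizzle-backed likeness_need_indexing_queue table, and the function and field names are illustrative, not the actual exports:

```typescript
// Illustrative sketch only: an in-memory array stands in for the
// likeness_need_indexing_queue table; the real code queries it via Drizzle.

type QueueStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "DISCARDED" | "TIMEOUT";

interface QueueEvent {
  id: string;
  accountId: string;
  status: QueueStatus;
  createdAt: Date;
}

const table: QueueEvent[] = [];

export function enqueueEvent(event: QueueEvent): void {
  table.push(event);
}

// Callers never see the table itself, only the access function's result.
export function getOldestPendingEvent(): QueueEvent | undefined {
  return [...table]
    .filter((e) => e.status === "PENDING")
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime())[0];
}

export function markEventStatus(id: string, status: QueueStatus): void {
  const event = table.find((e) => e.id === id);
  if (event) event.status = status;
}
```

Because every caller goes through these functions, filtering and status transitions stay consistent no matter which route handler or daemon touches the queue.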

3. Likeness Search Package

Location: packages/likeness-search/src/
Package: @zooly/likeness-search

A dedicated package containing core business logic organized by responsibility:

Indexing Pipeline

  • indexingDaemon.ts - Main daemon loop that processes the queue
  • processEvent.ts - Processes individual queue events
  • upsertToIndex.ts - Upserts account data to SQL and vector indexes
  • aggregateAccountTags.ts - Aggregates tags from multiple sources

Data Collection

  • generateTagsFromImage.ts - AI image tag extraction (Gemini vision)
  • generateTagsFromVoice.ts - AI voice tag extraction (Gemini audio)
  • generateVoiceSampleText.ts - AI demo script generation
  • triggerSubProcess.ts - Fire-and-forget API triggers

Search Functions

  • searchLikeness.ts - Main search function (SQL with vector fallback)
  • vectorFallbackSearch.ts - Vector similarity search
  • formatSearchResults.ts - Result enrichment and formatting

Utilities

  • apiRetryHandler.ts - Error classification and retry logic
  • validateAudioUrl.ts - Audio URL validation
  • schemas/likenessTagsSchema.ts - Zod schemas for tag validation

4. Utility Packages

The system leverages shared utility packages:

@zooly/util-elevenlabs

Location: packages/util-elevenlabs/src/

  • createVoiceSample.ts - Complete voice sample creation workflow (ElevenLabs voice clone + TTS generation)
  • voice-management.ts - Voice management functions (create, update, delete, list)
  • elevenlabs-service.ts - Low-level ElevenLabs API operations

@zooly/util-srv

Location: packages/util-srv/src/

  • generateEmbedding.ts - OpenAI embedding generation for vector search
  • S3 operations - File storage utilities
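As a rough sketch (endpoint and payload shape per OpenAI's embeddings REST API; the actual wrapper in util-srv may differ), generateEmbedding plus the cosine-similarity metric typically used to compare such vectors could look like:

```typescript
// Sketch of embedding generation via the OpenAI REST API. Error handling
// is trimmed; the real util-srv helper may wrap this differently.
export async function generateEmbedding(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // 1536 dimensions for text-embedding-3-small
}

// Cosine similarity, the usual distance metric for comparing embeddings.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```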

@zooly/social-scraper

Location: packages/social-scraper/src/

A dedicated package for social media scraping operations:

  • processSocialScraping.ts - Main orchestration function that scrapes all social links for an account
  • parseSocialUrl.ts - URL normalization and platform detection
  • scrapers/ - Platform-specific scraper functions:
    • instagram.ts - Instagram profile scraping
    • tiktok.ts - TikTok profile scraping
    • twitter.ts - Twitter/X profile scraping
    • youtube.ts - YouTube channel scraping
    • linkedin.ts - LinkedIn profile scraping
    • index.ts - Scraper registry (platform → function mapping)
  • types.ts - Type definitions for scraping results

Purpose: Collects follower counts, profile images, and metadata from social media platforms to enhance search indexing and populate missing account data.
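The scraper registry described above (platform → function mapping in scrapers/index.ts) might be sketched like this; the ScrapeResult shape and function names are assumptions, and the stub scrapers stand in for the real Scrapfly-backed implementations:

```typescript
// Hypothetical sketch of the scraper registry: a platform → function map
// so processSocialScraping can dispatch without a switch statement.
export type Platform = "instagram" | "tiktok" | "twitter" | "youtube" | "linkedin";

export interface ScrapeResult {
  platform: Platform;
  followersCount: number | null;
  avatarUrl: string | null;
}

export type Scraper = (profileUrl: string) => Promise<ScrapeResult>;

// Each entry would call the Scrapfly API for its platform; stubs shown here.
export const scrapers: Record<Platform, Scraper> = {
  instagram: async () => ({ platform: "instagram", followersCount: null, avatarUrl: null }),
  tiktok: async () => ({ platform: "tiktok", followersCount: null, avatarUrl: null }),
  twitter: async () => ({ platform: "twitter", followersCount: null, avatarUrl: null }),
  youtube: async () => ({ platform: "youtube", followersCount: null, avatarUrl: null }),
  linkedin: async () => ({ platform: "linkedin", followersCount: null, avatarUrl: null }),
};

// Unknown platforms (detected by parseSocialUrl) simply return undefined.
export function getScraper(platform: string): Scraper | undefined {
  return scrapers[platform as Platform];
}
```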

Integration with Indexing Pipeline:

  1. Trigger Points: Social scraping is triggered when:

    • Account is missing a profile image but has social links
    • Data sufficiency check indicates missing follower count data
  2. Process Flow:

    • Indexing daemon calls triggerSocialScraping() (fire-and-forget)
    • API route /api/indexing/scrape-social receives the request
    • Calls processSocialScraping() which:
      • Fetches social links from account_social_links table
      • For each platform, checks retry limits and backoff periods
      • Scrapes via Scrapfly API (Instagram, TikTok, Twitter/X, YouTube, LinkedIn)
      • Updates account_social_links.followersCount per platform
      • Stores results in scrapes table for retry tracking
      • Uploads profile avatars to S3 if account image is missing
      • Creates likeness_assets entries from scraped avatars
  3. Data Integration:

    • Follower counts are aggregated during tag aggregation (aggregateAccountTags)
    • Profile images become likeness_assets entries for AI tag extraction
    • Creates new SOCIAL_DATA queue event to trigger re-indexing
  4. Error Handling:

    • Per-link retry tracking (max 5 attempts per link)
    • Exponential backoff per link
    • Skip logic for recently scraped links (< 24h)
    • Rate limit detection and handling
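The per-link gate implied by the error-handling rules above (max 5 attempts, 24h skip window, exponential backoff) can be sketched as a pure function; the name, base delay, and exact backoff formula are assumptions:

```typescript
// Sketch of the per-link retry gate. Thresholds mirror the rules above;
// the base delay and doubling formula are illustrative assumptions.
const MAX_ATTEMPTS = 5;
const SKIP_WINDOW_MS = 24 * 60 * 60 * 1000; // recently-scraped skip window
const BASE_BACKOFF_MS = 60_000; // 1 min, doubling per attempt

export function shouldScrapeLink(
  attemptCount: number,
  lastAttemptAt: Date | null,
  lastSuccessAt: Date | null,
  now: Date = new Date(),
): boolean {
  // Permanent skip after exhausting retries.
  if (attemptCount >= MAX_ATTEMPTS) return false;
  // Skip links scraped successfully within the last 24h.
  if (lastSuccessAt && now.getTime() - lastSuccessAt.getTime() < SKIP_WINDOW_MS) return false;
  // Exponential backoff between failed attempts.
  if (lastAttemptAt) {
    const backoff = BASE_BACKOFF_MS * 2 ** attemptCount;
    if (now.getTime() - lastAttemptAt.getTime() < backoff) return false;
  }
  return true;
}
```

Keeping this decision pure (inputs come from the scrapes table, the clock is a parameter) makes the retry policy easy to unit-test independently of Scrapfly.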

5. API Layer

Location: apps/zooly-app/app/api/indexing/

Next.js route handlers that expose the system:

  • process-queue/route.ts - Cron endpoint (GET) - Protected by CRON_SECRET

    • Main indexing daemon that processes queue events
    • Triggers sub-processes (AI generation, social scraping) when needed
  • generate-tags/route.ts - AI tag generation (POST) - Protected by CRON_SECRET

    • Processes unprocessed assets for tag generation
    • Calls generateTagsFromImage() and generateTagsFromVoice() from @zooly/likeness-search
    • Creates voice samples via createVoiceSample() from @zooly/util-elevenlabs
  • scrape-social/route.ts - Social scraping (POST) - Protected by CRON_SECRET

    • Orchestrates social media scraping for an account
    • Calls processSocialScraping() from @zooly/social-scraper
    • Updates follower counts in account_social_links table
    • Uploads profile avatars to S3 and creates likeness_assets entries
    • Creates new SOCIAL_DATA queue event for re-indexing
  • search/route.ts - Public search (GET/POST) - No authentication required

    • Main search endpoint for buyers
    • Calls searchLikeness() from @zooly/likeness-search
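The CRON_SECRET guard shared by the protected routes might look roughly like this; Next.js route handlers receive a standard web Request, and the handler body here is a stub:

```typescript
// Sketch of the shared auth guard for CRON_SECRET-protected routes.
export function isAuthorized(req: Request): boolean {
  const header = req.headers.get("authorization");
  return header === `Bearer ${process.env.CRON_SECRET}`;
}

// e.g. process-queue/route.ts — the daemon logic itself is elided.
export async function GET(req: Request): Promise<Response> {
  if (!isAuthorized(req)) {
    return new Response("Unauthorized", { status: 401 });
  }
  // ... process the queue here ...
  return Response.json({ ok: true });
}
```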

6. External Services Integration

The system integrates with several external services:

  • OpenAI - text-embedding-3-small for vector embeddings (1536 dimensions)
  • Google Gemini - gemini-2.5-flash for image and audio analysis
  • ElevenLabs - Voice cloning and text-to-speech generation
  • AWS S3 - Storage for voice samples and profile images
  • Scrapfly - Social media scraping service (used by @zooly/social-scraper)

Design Decisions

1. Dual-Index Strategy

Decision: Use both SQL and vector indexes

Rationale:

  • SQL index provides fast, exact filtering on enum fields
  • Vector index enables semantic similarity search for fuzzy queries
  • Fallback mechanism ensures no query returns empty results unnecessarily

Implementation: SQL search runs first; if it returns 0 results (and offset=0), vector search runs as a fallback.
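The fallback rule can be sketched as follows; the search functions are injected here purely for illustration, while the real searchLikeness wires in its own SQL and vector implementations:

```typescript
// Sketch of the dual-index fallback: SQL first, vector only on an
// empty first page. Parameter names are illustrative.
export interface SearchParams {
  filters: Record<string, string>;
  limit: number;
  offset: number;
}

export async function searchWithFallback<T>(
  params: SearchParams,
  sqlSearch: (p: SearchParams) => Promise<T[]>,
  vectorSearch: (p: SearchParams) => Promise<T[]>,
): Promise<{ results: T[]; usedVectorFallback: boolean }> {
  const results = await sqlSearch(params);
  // Deeper pages of an empty SQL result set stay empty, keeping
  // pagination stable instead of switching ranking mid-scroll.
  if (results.length === 0 && params.offset === 0) {
    return { results: await vectorSearch(params), usedVectorFallback: true };
  }
  return { results, usedVectorFallback: false };
}
```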

2. Queue-Based Async Processing

Decision: Event-driven queue architecture

Rationale:

  • Serverless-friendly (Vercel functions have time limits)
  • Allows independent scaling of sub-processes
  • Enables retry logic and error recovery
  • Prevents blocking user interactions

Implementation: Events are added to queue, daemon processes them via cron, sub-processes run as separate API calls.

3. Fire-and-Forget Sub-Processes

Decision: Sub-processes triggered via non-blocking fetch() calls

Rationale:

  • Each sub-process gets its own serverless function lifecycle
  • Prevents timeout issues with long-running operations
  • Allows parallel execution of multiple sub-processes
  • Simplifies error handling per sub-process

Implementation: triggerSubProcess.ts uses fetch() without await, sub-processes mark events as completed when done.
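A minimal sketch of that trigger, assuming the payload shape and logging behavior (the real triggerSubProcess.ts may differ in both):

```typescript
// Sketch of the fire-and-forget trigger: fetch() is started but never
// awaited, so the daemon moves on while the sub-process runs in its own
// serverless invocation.
export function triggerSubProcess(path: string, payload: unknown): void {
  const url = `${process.env.NEXT_PUBLIC_APP_URL}${path}`;
  // Intentionally not awaited; failures are only logged, because the
  // sub-process owns its own retry lifecycle via the queue.
  fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.CRON_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  }).catch((err) => console.error(`Sub-process trigger failed: ${path}`, err));
}
```

The trade-off is that the caller gets no success signal; completion is observed indirectly, through the event's status in the queue.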

4. Account-Centric Design

Decision: Everything keyed by accountId (not userId)

Rationale:

  • Account is the main tenant object in the system
  • User entity is only for authentication/identity
  • Simplifies multi-tenant data isolation
  • Aligns with business domain model

Implementation: All tables reference account.id, all access functions take accountId parameter.

5. Separate Vector Table

Decision: Keep vector embeddings in separate table from SQL index

Rationale:

  • Efficient CRUD on search data without fetching embeddings
  • Embeddings are large (1536 dimensions) and rarely needed
  • Drizzle natively supports pgvector (no raw SQL hacks)
  • Allows independent updates to vector vs. data

Implementation: likeness_search_vector table stores accountId, content (text), and embedding (vector).
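As a sketch of how that table might be declared with Drizzle's pgvector support (column names are assumed from the description above; the real schema lives in packages/db/src/schema/):

```typescript
import { pgTable, text, timestamp, vector } from "drizzle-orm/pg-core";

// Sketch only: column names are assumptions based on the description
// above. Keeping embeddings out of likeness_search keeps the SQL index
// row small for everyday CRUD.
export const likenessSearchVector = pgTable("likeness_search_vector", {
  accountId: text("account_id").primaryKey(),
  content: text("content").notNull(), // the text that was embedded
  embedding: vector("embedding", { dimensions: 1536 }).notNull(),
  updatedAt: timestamp("updated_at").defaultNow().notNull(),
});
```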

6. Best-Effort Indexing

Decision: Index accounts even with incomplete data if they have at least one image

Rationale:

  • Maximizes discoverability
  • Buyers can still find accounts with partial data
  • Better than returning no results
  • Quality maintained through minimum requirements

Implementation: After exhausting data collection options, if account has ≥1 image asset, index with available data.

7. Per-Record Retry Tracking

Decision: Track retry attempts per asset and per social link

Rationale:

  • Some assets/links may fail permanently while others succeed
  • Prevents retrying failed records indefinitely
  • Enables exponential backoff per record
  • Better than failing entire account indexing

Implementation: likenessAssets.tagAttemptCount and scrapes.attemptCount track individual retries.

8. Asynchronous Social Scraping

Decision: Social scraping runs as a separate, asynchronous sub-process via fire-and-forget API calls

Rationale:

  • Social scraping can be slow (multiple API calls per account)
  • Prevents blocking the main indexing daemon
  • Allows parallel execution with AI tag generation
  • Each platform scrape gets its own retry lifecycle
  • Enables per-link retry tracking independent of other links

Implementation:

  • triggerSocialScraping() makes non-blocking fetch() call to /api/indexing/scrape-social
  • Event stays IN_PROGRESS while scraping runs
  • Scraper creates new SOCIAL_DATA event on completion to trigger re-indexing
  • Each social link tracks its own scraping attempts in scrapes table
  • Links with successful scrapes within 24 hours are automatically skipped

Data Flow Diagrams

Indexing Flow

flowchart TD
    subgraph UserActions["User Actions"]
        A[Celebrity completes onboarding] --> B[Name, Image, Social Links]
        B --> C[Approves Term]
        C --> D[Uploads Assets]
    end
    subgraph EventTrigger["Event Triggers"]
        B --> E[Create Indexing Event]
        C --> E
        D --> E
    end
    E --> F[Add to Queue]
    F --> G[Indexing Daemon]
    subgraph DaemonWorkflow["Daemon Workflow"]
        G --> H{Base Requirements Met?}
        H -->|No| H2{Image Missing + Has Social Links?}
        H2 -->|Yes| N[Trigger Social Scraping]
        H2 -->|No| I[Discard Event]
        H -->|Yes| J{Sufficient Data?}
        J -->|Yes| K[Upsert to Index]
        J -->|No| L{What's Missing?}
        L -->|Tags| M[Trigger AI Generation]
        L -->|Follower Count| N
    end
    subgraph SubProcesses["Sub-Processes (Async API Calls)"]
        M --> O[AI Generates Tags]
        N --> P[Scraper Collects Data]
        P --> P2[Upload Avatar to S3]
        P2 --> P3[Set as User Image]
        P3 --> P4[Create likenessAsset]
        O --> Q[Create New Indexing Event]
        P4 --> Q
    end
    Q --> F
    subgraph StatusManagement["Event Status"]
        G --> R[Mark IN_PROGRESS]
        K --> S[Mark COMPLETED]
        I --> S
        R -.->|"> 5 min"| T[Mark TIMEOUT]
    end
    subgraph Search["Search Flow"]
        U[Buyer Submits Brief] --> V[AI Extracts Tags from Brief]
        V --> W[SQL Filter Search on Index]
        W --> X[Return Matching Celebrities]
    end
    K --> W

Search Flow

flowchart TD
    A[Buyer Submits Query] --> B{Query Type?}
    B -->|Brief Text| C[AI Extract Filters]
    B -->|Direct Filters| D[Use Filters]
    C --> D
    D --> E[SQL Search]
    E --> F{Results Found?}
    F -->|Yes| G[Format Results]
    F -->|No + First Page| H[Vector Search]
    H --> I[Format Results]
    G --> J[Return to Buyer]
    I --> J

Event Processing Flow

sequenceDiagram
    participant Queue
    participant Daemon
    participant Checker as Requirements Checker
    participant Sufficiency as Data Sufficiency Checker
    participant AI as AI Generator
    participant Scraper as Social Scraper (@zooly/social-scraper)
    participant Index as Search Index
    Queue->>Daemon: Get Oldest PENDING Event
    Daemon->>Checker: Check Base Requirements
    Checker-->>Daemon: Requirements Status
    alt Missing Requirements
        Daemon->>Queue: Mark DISCARDED
    else Requirements Met
        Daemon->>Sufficiency: Check Data Sufficiency
        Sufficiency-->>Daemon: Sufficiency Status
        alt Sufficient Data
            Daemon->>Index: Upsert Account
            Daemon->>Queue: Mark COMPLETED
        else Insufficient Data
            Daemon->>AI: Trigger Tag Generation
            Daemon->>Scraper: Trigger Social Scraping
            Note over Daemon: Event stays IN_PROGRESS
            AI->>AI: Generate Tags (Gemini)
            Scraper->>Scraper: processSocialScraping()
            Note over Scraper: 1. Fetch social links from account_social_links<br/>2. Check retry limits & backoff<br/>3. Scrape via Scrapfly (per platform)<br/>4. Update follower counts<br/>5. Upload avatars to S3 if needed<br/>6. Create likeness_assets entries
            Scraper->>Scraper: Updates account_social_links.followersCount
            Scraper->>Scraper: Stores results in scrapes table
            AI->>Queue: Create New Event
            Scraper->>Queue: Create New SOCIAL_DATA Event
            AI->>Queue: Mark Original COMPLETED
            Scraper->>Queue: Mark Original COMPLETED
        end
    end

Environment Variables

The system requires the following environment variables:

Variable                       Purpose
─────────────────────────────  ─────────────────────────────────────────────────
CRON_SECRET                    Bearer token for cron/internal API authentication
INDEXING_DAEMON_BATCH_SIZE     Max events per daemon run (default: 250)
NEXT_PUBLIC_APP_URL            Base URL for fire-and-forget API calls
OPENAI_API_KEY                 OpenAI API key for embeddings
ELEVEN_LABS_API_KEY            ElevenLabs API key for voice cloning
AWS_BUCKET_NAME                S3 bucket for voice samples
AWS_REGION                     AWS region for S3
AWS_ACCESS_KEY_ID              AWS access key
AWS_SECRET_ACCESS_KEY          AWS secret key
NEXT_PUBLIC_AWS_BUCKET_URL     Public S3 bucket URL prefix
DATABASE_URL                   PostgreSQL connection URL

Performance Considerations

Indexing Performance

  • Batch Processing: Daemon processes up to 250 events per run (configurable)
  • Delays: 1-second delay between events prevents overwhelming the system
  • Parallel Sub-Processes: AI generation and social scraping run in parallel
  • Timeout Detection: Events stuck >5 minutes are marked as timeout
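The batch loop implied by the first two bullets can be sketched as follows; the queue accessors are injected here for illustration, and the function names are assumptions (the real daemon lives in indexingDaemon.ts):

```typescript
// Sketch of the daemon batch loop: up to INDEXING_DAEMON_BATCH_SIZE
// events per run, with a pause between events to avoid overwhelming
// downstream services. Accessors are injected for illustration.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function runDaemonBatch(
  nextEvent: () => Promise<string | null>, // next PENDING event id, or null
  processEvent: (id: string) => Promise<void>,
  batchSize = Number(process.env.INDEXING_DAEMON_BATCH_SIZE ?? 250),
  delayMs = 1000,
): Promise<number> {
  let processed = 0;
  for (let i = 0; i < batchSize; i++) {
    const id = await nextEvent();
    if (id === null) break; // queue drained
    await processEvent(id);
    processed++;
    if (i < batchSize - 1) await sleep(delayMs);
  }
  return processed;
}
```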

Search Performance

  • SQL First: Fast enum-based filtering runs before vector search
  • Vector Fallback: Only used when SQL returns 0 results (first page)
  • Result Caching: Consider adding a caching layer for frequently searched filters
  • Pagination: Search supports limit/offset for pagination

Database Optimization

  • Indexes: Queue table has indexes on (status, createdAt) and (accountId, status)
  • Unique Constraints: Search index has unique constraint on accountId
  • Vector Index: pgvector similarity search requires an explicit index (HNSW or IVFFlat) on the embedding column; pgvector does not create one automatically, and unindexed queries fall back to a sequential scan

Error Handling Strategy

The system implements multi-level error handling:

  1. Event-Level: Failed sub-processes retry with exponential backoff (max 5 attempts)
  2. Per-Asset: Individual assets track attempt counts (max 5 attempts)
  3. Per-Scrape: Social links track scraping attempts separately (max 5 attempts)
  4. Timeout Detection: Stale events are automatically marked as timeout
  5. Error Classification: Errors are classified as rate-limit, temporary, or permanent
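The classification step (point 5) might look roughly like this; the matching heuristics (status codes, message substrings) are assumptions, while the three categories come from the list above:

```typescript
// Sketch of the error classifier in apiRetryHandler.ts. Heuristics are
// illustrative; only the three categories are taken from the design.
export type ErrorClass = "rate_limit" | "temporary" | "permanent";

export function classifyError(status: number | null, message: string): ErrorClass {
  // 429s and explicit rate-limit messages get their own backoff handling.
  if (status === 429 || /rate.?limit/i.test(message)) return "rate_limit";
  // 5xx and network-ish failures are worth retrying.
  if ((status !== null && status >= 500) || /timeout|ECONNRESET|fetch failed/i.test(message)) {
    return "temporary";
  }
  // Other 4xx and everything else: don't retry.
  return "permanent";
}
```

Rate-limit and temporary errors feed the exponential-backoff retry path; permanent errors count against the per-record attempt limit immediately.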

See Error Handling for detailed retry logic.