Likeness Search Overview

Overview of the Likeness Search Indexing System

What is Likeness Search?

Likeness Search is a dual-index search system that enables buyers to find accounts (celebrity talent) by physical, visual, and voice characteristics. It powers the discovery experience on Zooly, allowing content creators to search for the perfect likeness match for their campaigns.

Purpose

The Likeness Search system enables:

Visual likeness discovery - Find talent matching specific physical characteristics (hair color, eye color, body type, etc.)
Voice likeness discovery - Find talent matching specific voice characteristics (accent, pitch, tone, language)
Social media filtering - Filter by follower count, platform, verification status
Semantic search - Natural language queries that understand intent even with incomplete specifications

Key Concepts

Dual-Index Architecture

The system uses two complementary search indexes:

SQL Index (likeness_search table) - Enum-based fields for exact filtering and fast SQL queries
Vector Index (likeness_search_vector table) - OpenAI embeddings for semantic similarity fallback when SQL returns no results
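
To illustrate what "semantic similarity" means for the vector index, here is a minimal sketch that ranks stored embeddings against a query embedding. The source does not state which similarity metric is used; cosine similarity is an assumption (it is a common choice for embedding search), and the names cosineSimilarity, rankBySimilarity, and IndexedVector are hypothetical.

```typescript
// Hypothetical row shape; the real likeness_search_vector columns may differ.
interface IndexedVector {
  accountId: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors (assumed metric).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "not similar".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every row against the query and return the best `limit` matches.
function rankBySimilarity(
  query: number[],
  rows: IndexedVector[],
  limit: number
): { accountId: string; score: number }[] {
  return rows
    .map((row) => ({
      accountId: row.accountId,
      score: cosineSimilarity(query, row.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

In production this ranking would normally run inside the database rather than in application code; the sketch only shows the idea.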

Event-Driven Indexing Pipeline

Indexing happens asynchronously through a queue-based system:

Events are created when accounts are updated, terms are approved, or assets are uploaded
A background daemon processes events and triggers data collection when needed
Sub-processes (AI tag generation, social scraping) run independently via API endpoints
The system automatically retries failed operations with exponential backoff
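
The retry behaviour above can be sketched as a delay calculation. The base delay, the cap, and the function names below are assumptions for illustration; the source only states that retries use exponential backoff.

```typescript
// Assumed values: the source does not specify the base delay or the cap.
const BASE_DELAY_MS = 60000; // 1 minute
const MAX_DELAY_MS = 3600000; // 1 hour

// Delay to wait after `failedAttempts` consecutive failures.
// The delay doubles with each failure and is capped at MAX_DELAY_MS.
function backoffDelayMs(failedAttempts: number): number {
  if (failedAttempts <= 0) return 0;
  const exponent = Math.floor(failedAttempts) - 1;
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, exponent));
}

// True once enough time has passed since the last failed attempt.
function isReadyForRetry(lastAttemptAt: Date, failedAttempts: number, now: Date): boolean {
  const elapsedMs = now.getTime() - lastAttemptAt.getTime();
  return elapsedMs >= backoffDelayMs(failedAttempts);
}
```

Capping the delay keeps a long-failing event from being postponed indefinitely; whether the real daemon applies a cap is not stated in the source.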

AI-Powered Data Collection

The system uses AI to extract searchable tags from:

Images - Google Gemini 2.5 Flash vision model extracts visual characteristics
Voice samples - Google Gemini 2.5 Flash audio model extracts voice characteristics
Social media - Scraping collects follower counts and profile images

Voice Sample Generation

For voice assets, the system creates AI-generated voice samples using the @zooly/util-elevenlabs package:

ElevenLabs voice cloning - Creates a voice clone from the original audio
AI-generated demo scripts - Gemini generates natural demo text matching voice characteristics (via @zooly/likeness-search)
Text-to-speech - ElevenLabs generates the final voice sample audio
S3 storage - Voice samples are stored in S3 for direct playback

The complete workflow is handled by createVoiceSample() in @zooly/util-elevenlabs.

System Architecture

The Likeness Search system consists of six main components:

1. Database Schema

Location: packages/db/src/schema/

likeness_assets - Stores uploaded images and voice samples
likeness_search - SQL search index with enum fields
likeness_search_vector - Vector embeddings for semantic search
likeness_need_indexing_queue - Event queue for indexing pipeline
account_social_links - Social media links and follower counts
scrapes - Social media scraping results

2. Access Layer

Location: packages/db/src/access/

Provides type-safe database access functions for all search-related operations. See Database Access Layer for details.

3. Likeness Search Package

Location: packages/likeness-search/src/
Package: @zooly/likeness-search

A dedicated package containing core business logic:

Indexing daemon and event processing
Tag aggregation and normalization
Search functions (SQL and vector fallback)
AI tag generation from images and voice
Error handling and retry logic
Audio URL validation utilities

Voice sample creation (createVoiceSample) is in @zooly/util-elevenlabs for reuse. Embedding generation (generateEmbedding) is in @zooly/util-srv for reuse. See the respective package documentation for details.

4. API Routes

Location: apps/zooly-app/app/api/indexing/

GET /api/indexing/process-queue - Cron endpoint for indexing daemon
POST /api/indexing/generate-tags - AI tag generation sub-process
POST /api/indexing/scrape-social - Social media scraping sub-process
GET/POST /api/indexing/search - Public search endpoint

5. Types

Location: packages/types/src/types/

LikenessSearch.ts - Search filters and result types
LikenessAssets.ts - Asset types
LikenessQueue.ts - Queue event types
AccountSocialLink.ts - Social link types

6. Utility Packages

The system leverages shared utility packages:

@zooly/util-elevenlabs - Voice cloning and text-to-speech operations
- createVoiceSample - Complete voice sample creation workflow
- addVoice, deleteVoice, getVoice - Voice management
- generateVoiceForText - Text-to-speech generation
@zooly/util-srv - Server-side utilities
- generateEmbedding - OpenAI embedding generation for vector search
- S3 operations for file storage
@zooly/social-scraper - Social media scraping operations
- processSocialScraping - Main orchestration function for scraping all social links
- Platform-specific scrapers (Instagram, TikTok, Twitter/X, YouTube, LinkedIn)
- URL parsing and platform detection
- Retry logic and error handling

Indexing Workflow

The indexing process follows these steps:

Event Creation - When an account is updated, a term is approved, or an asset is uploaded, an event is added to the queue
Base Requirements Check - Verify account has name, image, and approved term
Data Sufficiency Check - Determine if there's enough tag data to index (minimum 5 tags)
Data Collection - If insufficient, trigger AI tag generation or social scraping
Index Upsert - Once sufficient data is available, upsert to both SQL and vector indexes
Search Ready - Account is now searchable

See Indexing Pipeline for detailed workflow diagrams.
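
The checks in the steps above can be sketched as a single decision function. This is an illustration under stated assumptions, not the daemon's actual code: AccountSnapshot, NextStep, and decideNextStep are hypothetical names, and the sketch ignores retries and the best-effort path.

```typescript
// Hypothetical shapes; the real types in packages/types may differ.
type NextStep = "WAIT_FOR_BASE_REQUIREMENTS" | "COLLECT_MORE_DATA" | "UPSERT_INDEXES";

interface AccountSnapshot {
  name: string | null;
  imageUrl: string | null;
  hasApprovedTerm: boolean;
  tags: string[];
}

// Taken from the "minimum 5 tags" rule in the workflow.
const MIN_TAGS_TO_INDEX = 5;

function decideNextStep(account: AccountSnapshot): NextStep {
  // Base requirements: name, image, and an approved term.
  const hasBaseRequirements =
    Boolean(account.name) && Boolean(account.imageUrl) && account.hasApprovedTerm;
  if (!hasBaseRequirements) return "WAIT_FOR_BASE_REQUIREMENTS";

  // Data sufficiency: enough tags to make the account searchable.
  if (account.tags.length < MIN_TAGS_TO_INDEX) return "COLLECT_MORE_DATA";

  // Enough data: write to both the SQL and vector indexes.
  return "UPSERT_INDEXES";
}
```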

Social Scraping Integration

Social scraping is automatically triggered during the indexing process when additional data is needed. The @zooly/social-scraper package handles scraping social media profiles to collect follower counts, profile images, and metadata.

When Social Scraping is Triggered

Social scraping is triggered in two scenarios:

Missing Image Data - If an account is missing a profile image but has social media links, scraping is triggered to fetch profile avatars
Data Sufficiency Check - If follower count data is missing or incomplete when data sufficiency is evaluated, social scraping is triggered to collect it

Social Scraping Process

When triggered, the system runs the following steps:

Fire-and-Forget Trigger - The indexing daemon calls triggerSocialScraping() which makes a non-blocking API call to /api/indexing/scrape-social
Scraper Orchestration - The API route calls processSocialScraping() from @zooly/social-scraper which:
- Fetches all social links for the account from account_social_links table
- For each platform (Instagram, TikTok, Twitter/X, YouTube, LinkedIn):
  - Checks if recently scraped (< 24 hours) and successful → skips
  - Checks if max retry attempts (5) reached → skips
  - Checks exponential backoff period → waits if needed
  - Calls platform-specific scraper function via Scrapfly
  - Updates follower counts in account_social_links table
  - Stores scraping results in scrapes table for retry tracking
Profile Image Handling - If the account is missing an image and a scraped avatar is available:
- Uploads avatar to S3 via @zooly/util-srv
- Updates account.imageUrl
- Creates a likeness_assets record (type: IMAGE) for AI tag extraction
Re-indexing Trigger - On completion, creates a new SOCIAL_DATA queue event to trigger re-indexing with the new data
Event Completion - Marks the original event as COMPLETED
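
The per-link checks in the process above (recent success, maximum attempts, backoff) can be sketched as one decision function. The 24-hour window and the limit of 5 attempts come from the text; the backoff base delay, the LinkScrapeState shape, and the function name are assumptions.

```typescript
type ScrapeAction = "SKIP_RECENT" | "SKIP_MAX_ATTEMPTS" | "WAIT_BACKOFF" | "SCRAPE";

// Hypothetical per-link state; the real scrapes table columns may differ.
interface LinkScrapeState {
  lastScrapedAt: Date | null;
  lastSucceeded: boolean;
  failedAttempts: number;
}

const RECENT_WINDOW_MS = 24 * 60 * 60 * 1000; // "< 24 hours" from the text
const MAX_ATTEMPTS = 5; // from the text
const BACKOFF_BASE_MS = 5 * 60 * 1000; // assumed; not stated in the source

function decideScrapeAction(state: LinkScrapeState, now: Date): ScrapeAction {
  const elapsedMs =
    state.lastScrapedAt === null ? Infinity : now.getTime() - state.lastScrapedAt.getTime();

  // 1. Recently scraped and successful: nothing to do.
  if (state.lastSucceeded && elapsedMs < RECENT_WINDOW_MS) return "SKIP_RECENT";

  // 2. Too many failures: give up on this link.
  if (state.failedAttempts >= MAX_ATTEMPTS) return "SKIP_MAX_ATTEMPTS";

  // 3. Still inside the backoff period after a failure: wait.
  if (state.failedAttempts > 0) {
    const backoffMs = BACKOFF_BASE_MS * Math.pow(2, state.failedAttempts - 1);
    if (elapsedMs < backoffMs) return "WAIT_BACKOFF";
  }

  // 4. Otherwise the platform-specific scraper can run.
  return "SCRAPE";
}
```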

Data Integration

Scraped data is integrated into the search index:

Follower Counts - Per-platform follower counts are stored in account_social_links and aggregated into a total follower count during tag aggregation
Profile Images - Scraped avatars become likeness_assets entries that can be processed for AI tag extraction
Retry Tracking - Each social link tracks scraping attempts separately, allowing individual retries without affecting other links
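
As a small illustration of the follower aggregation described above, the sketch below sums per-platform counts into one total. It assumes the total is a plain sum and that links without a count contribute nothing; SocialLinkCount and totalFollowers are hypothetical names.

```typescript
// Hypothetical shape; the real account_social_links columns may differ.
interface SocialLinkCount {
  platform: string;
  followerCount: number | null; // null when the link has not been scraped yet
}

// Sum follower counts across platforms, skipping links with no count.
function totalFollowers(links: SocialLinkCount[]): number {
  return links.reduce((sum, link) => sum + (link.followerCount ?? 0), 0);
}
```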

Error Handling

Social scraping handles errors at several levels:

Rate Limit Detection - Scrapfly rate limit errors are detected and trigger exponential backoff
Per-Link Retries - Each social link tracks its own retry attempts (max 5) with exponential backoff
Skip Logic - Links that have been successfully scraped recently (< 24h) are automatically skipped
Event-Level Retries - Failed scraping operations retry at the event level with exponential backoff

The social scraping process runs asynchronously and doesn't block the main indexing pipeline, allowing parallel data collection operations.

Search Flow

When a buyer searches:

Query Processing - Extract search filters from brief text or use direct filters
SQL Search - Query the likeness_search table with enum-based filters
Vector Fallback - If SQL returns no results, use vector similarity search
Result Formatting - Enrich results with account data, images, and voice samples
Return Results - Return formatted results to the buyer

See Search Flow for the detailed search process.
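
The SQL-first, vector-fallback order described above can be sketched as follows. The two search functions are passed in as parameters so the sketch stays self-contained; their signatures, the SearchFilters fields, and the searchWithFallback name are assumptions rather than the actual API of @zooly/likeness-search.

```typescript
// Hypothetical filter and result shapes for illustration only.
interface SearchFilters {
  hairColor?: string;
  accent?: string;
  minFollowers?: number;
}

interface SearchHit {
  accountId: string;
  source: "sql" | "vector";
}

type SqlSearch = (filters: SearchFilters) => Promise<string[]>;
type VectorSearch = (queryText: string) => Promise<string[]>;

async function searchWithFallback(
  filters: SearchFilters,
  queryText: string,
  sqlSearch: SqlSearch,
  vectorSearch: VectorSearch
): Promise<SearchHit[]> {
  // 1. Try the exact, enum-based SQL index first.
  const sqlIds = await sqlSearch(filters);
  if (sqlIds.length > 0) {
    return sqlIds.map((accountId) => ({ accountId, source: "sql" as const }));
  }

  // 2. Only when SQL returns nothing, fall back to vector similarity.
  const vectorIds = await vectorSearch(queryText);
  return vectorIds.map((accountId) => ({ accountId, source: "vector" as const }));
}
```

Tagging each hit with its source is a choice made for this sketch; it makes it easy to see in logs or tests which index produced a result.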

Key Features

Automatic Tag Extraction

The system automatically extracts searchable tags from:

Account descriptions
Uploaded images (via Gemini vision)
Voice samples (via Gemini audio)
Social media profiles (via scraping)

Retry Logic

Retries are handled at multiple levels:

Event-level retries - Failed sub-processes retry with exponential backoff
Per-asset retries - Individual assets track attempt counts and backoff periods
Per-scrape retries - Social links track scraping attempts separately
Timeout detection - Events stuck in progress for more than 5 minutes are marked as timed out
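
Timeout detection can be sketched as a filter over queue events. The 5-minute threshold comes from the text; the status names other than COMPLETED, the QueueEvent shape, and findTimedOutEvents are assumptions for illustration.

```typescript
// Hypothetical statuses; only COMPLETED is named in this document.
type EventStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

// Hypothetical shape; the real likeness_need_indexing_queue columns may differ.
interface QueueEvent {
  id: string;
  status: EventStatus;
  startedAt: Date | null;
}

const TIMEOUT_MS = 5 * 60 * 1000; // "more than 5 minutes" from the text

// Return events that have been in progress for longer than the threshold.
function findTimedOutEvents(events: QueueEvent[], now: Date): QueueEvent[] {
  return events.filter(
    (event) =>
      event.status === "IN_PROGRESS" &&
      event.startedAt !== null &&
      now.getTime() - event.startedAt.getTime() > TIMEOUT_MS
  );
}
```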

Best-Effort Indexing

Even with incomplete data, the system indexes an account as long as it has at least one image asset. This keeps accounts discoverable while still enforcing a minimum quality bar.

Voice Sample Generation

For voice assets, the system creates AI-generated voice samples using @zooly/util-elevenlabs that showcase the voice characteristics. This allows buyers to preview voices before licensing. The complete workflow is handled by createVoiceSample() which orchestrates voice cloning, AI-generated demo scripts, and text-to-speech generation.

Related Documentation

Architecture & Design - System architecture and design decisions
Indexing Pipeline - Detailed indexing workflow
Search Flow - Search process and algorithms
Database Schema - Database tables and relationships
API Reference - API endpoints and usage
