Social Scraper

Social media profile scraping for likeness search indexing

Overview

The Social Scraper is a component of the Likeness Search Indexing system that automatically scrapes social media profiles to collect profile data, follower counts, and avatar images. This data enhances search indexing and populates missing account information.

Purpose

The Social Scraper collects:

  • Profile names - Display names from social media profiles
  • Avatar images - Profile pictures for accounts missing images
  • Follower counts - Per-platform follower metrics for popularity filtering
  • Raw profile data - Structured data for future use

This data is used to:

  1. Populate missing account images - Upload scraped avatars to S3 and set account.imageUrl
  2. Enhance search indexing - Follower counts are summed across platforms and indexed
  3. Create likeness assets - Scraped avatars are inserted as likeness_assets records for AI tag extraction

Architecture

Package Structure

The Social Scraper is implemented as a dedicated package:

Location: packages/social-scraper/

Key Files:

  • src/index.ts - Package exports
  • src/types.ts - TypeScript interfaces (ScrapeResult, ScraperFunction, SocialScrapingResult)
  • src/parseSocialUrl.ts - URL normalization utility
  • src/processSocialScraping.ts - Main orchestration function
  • src/scrapers/index.ts - Scraper registry
  • src/scrapers/instagram.ts - Instagram scraper
  • src/scrapers/tiktok.ts - TikTok scraper
  • src/scrapers/twitter.ts - Twitter/X scraper
  • src/scrapers/youtube.ts - YouTube scraper
  • src/scrapers/linkedin.ts - LinkedIn scraper
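
The core types in src/types.ts can be sketched from the fields described throughout this document. This is a hedged reconstruction — field names are inferred from the scrapes table and the SocialScrapingResult summary, so treat them as approximations:

```typescript
// Hedged reconstruction of the interfaces in src/types.ts; field names are inferred
// from the scrapes table and result summaries described in this document.
interface ScrapeResult {
  platform: string;
  link: string;
  name: string | null;
  followers: number | null;
  avatar: string | null;
  rawData: unknown;
  error: string | null;
}

// A platform scraper takes a profile URL and resolves to a scrape result.
type ScraperFunction = (url: string) => Promise<ScrapeResult>;

interface SocialScrapingResult {
  results: ScrapeResult[];
  bestAvatar: string | null;  // highest-quality avatar URL found across platforms
  hasRateLimitError: boolean; // true if any scrape hit a rate limit
}
```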

Integration Points

The Social Scraper integrates with:

  1. Indexing Daemon - Detects when social scraping is needed

    • Location: packages/likeness-search/src/processEvent.ts
    • Function: checkDataSufficiency() identifies un-scraped social links
  2. API Route - Handles scraping requests

    • Location: apps/zooly-app/app/api/indexing/scrape-social/route.ts
    • Endpoint: POST /api/indexing/scrape-social
  3. Database Access - Stores scraping results

    • Location: packages/db/src/access/scrapes.ts
    • Functions: upsertScrape(), incrementScrapeAttemptCount(), resetScrapeAttemptCount()
  4. S3 Service - Uploads avatars

    • Location: packages/util-srv/src/s3/s3-service.ts
    • Function: uploadImageFromUrl()

Supported Platforms

The scraper supports five social media platforms:

Instagram

Scraper: scrapeInstagram()
Location: packages/social-scraper/src/scrapers/instagram.ts

Data Extracted:

  • Name: user.full_name
  • Avatar: user.profile_pic_url_hd or user.profile_pic_url
  • Followers: user.edge_followed_by.count

URL Format: https://www.instagram.com/{username}/

TikTok

Scraper: scrapeTiktok()
Location: packages/social-scraper/src/scrapers/tiktok.ts

Data Extracted:

  • Name: userInfo.user.nickname or userInfo.user.uniqueId
  • Avatar: userInfo.user.avatarLarger or userInfo.user.avatarMedium
  • Followers: userInfo.stats.followerCount

URL Format: https://www.tiktok.com/@{username}

Twitter/X

Scraper: scrapeTwitter()
Location: packages/social-scraper/src/scrapers/twitter.ts

Data Extracted:

  • Name: userData.name or userData.legacy.name
  • Avatar: userData.profile_image_url_https or userData.legacy.profile_image_url_https
  • Followers: userData.followers_count or userData.legacy.followers_count

URL Format: https://x.com/{username} or https://twitter.com/{username}

YouTube

Scraper: scrapeYoutube()
Location: packages/social-scraper/src/scrapers/youtube.ts

Data Extracted:

  • Name: Channel title from ytInitialData
  • Avatar: Channel avatar from thumbnails
  • Followers: Subscriber count (parsed from text like "1.2M subscribers")

URL Format: https://www.youtube.com/@{username}, /channel/{id}, /c/{name}, or /user/{username}
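
Since the subscriber count is parsed from display text like "1.2M subscribers", a conversion step is needed. The helper below is hypothetical (not the actual implementation) and only illustrates the technique:

```typescript
// Hypothetical helper showing how a display string like "1.2M subscribers"
// can be converted to a numeric count. Not the actual implementation.
function parseSubscriberCount(text: string): number | null {
  const match = text.trim().match(/^([\d.,]+)\s*([KMB])?/i);
  if (!match) return null;
  const base = parseFloat(match[1].replace(/,/g, ""));
  if (Number.isNaN(base)) return null;
  const multipliers: Record<string, number> = { K: 1e3, M: 1e6, B: 1e9 };
  const suffix = (match[2] ?? "").toUpperCase();
  return Math.round(base * (multipliers[suffix] ?? 1));
}
```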

LinkedIn

Scraper: scrapeLinkedIn()
Location: packages/social-scraper/src/scrapers/linkedin.ts

Data Extracted:

  • Name: firstName + lastName or name or headline
  • Avatar: profilePicture.displayImage or profilePictureUrl
  • Followers: Connection count (LinkedIn reports "connections" rather than "followers")

URL Format: https://www.linkedin.com/in/{profileId}

Scraping Process

Main Orchestration

Function: processSocialScraping(accountId: string)
Location: packages/social-scraper/src/processSocialScraping.ts

The orchestration function:

  1. Fetches Social Links - Gets all social links for the account from account_social_links table
  2. Applies Skip Logic - For each link, checks:
    • Recently scraped (< 24 hours) and successful → Skip
    • Max attempts reached (≥ 5) → Skip
    • Still in exponential backoff period → Skip
  3. Parses URLs - Normalizes URLs using parseSocialUrl() to extract platform and identifier
  4. Selects Scraper - Maps platform to appropriate scraper function via getScraperForPlatform()
  5. Executes Scraping - Calls platform-specific scraper with Scrapfly API
  6. Saves Results - Stores results in scrapes table:
    • Success: upsertScrape() with error: null, attemptCount: 0
    • Error: upsertScrape() with error message, then incrementScrapeAttemptCount()
  7. Returns Summary - Returns SocialScrapingResult with:
    • results - Array of per-platform results
    • bestAvatar - Highest quality avatar URL found
    • hasRateLimitError - Whether any scrape hit rate limits
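
The loop in steps 1–7 can be condensed into a dependency-injected sketch. The collaborator signatures (shouldSkip, scrape, save) are assumptions made for illustration, and "best avatar" selection is simplified:

```typescript
// Condensed sketch of the orchestration loop; collaborator signatures are assumptions.
interface LinkOutcome {
  link: string;
  followers: number | null;
  avatar: string | null;
  error: string | null;
}

async function processLinks(
  links: string[],
  deps: {
    shouldSkip: (link: string) => boolean;
    scrape: (link: string) => Promise<{ followers: number; avatar: string | null }>;
    save: (outcome: LinkOutcome) => void;
  }
): Promise<{ results: LinkOutcome[]; bestAvatar: string | null; hasRateLimitError: boolean }> {
  const results: LinkOutcome[] = [];
  let hasRateLimitError = false;
  for (const link of links) {
    if (deps.shouldSkip(link)) continue; // step 2: skip logic
    let outcome: LinkOutcome;
    try {
      const data = await deps.scrape(link); // step 5: platform-specific scrape
      outcome = { link, followers: data.followers, avatar: data.avatar, error: null };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (/rate limit/i.test(message)) hasRateLimitError = true;
      outcome = { link, followers: null, avatar: null, error: message };
    }
    deps.save(outcome); // step 6: persist success or error
    results.push(outcome);
  }
  // Step 7: "best avatar" selection simplified to first non-null for this sketch.
  const bestAvatar = results.find((r) => r.avatar !== null)?.avatar ?? null;
  return { results, bestAvatar, hasRateLimitError };
}
```

Note that a failure on one link does not abort the loop: the error is recorded and processing continues with the remaining links.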

URL Parsing

Function: parseSocialUrl(url: string)
Location: packages/social-scraper/src/parseSocialUrl.ts

Normalizes various URL formats to standard URLs:

  • Handles URLs with/without protocols (https://, http://)
  • Handles URLs with/without www. subdomain
  • Handles handle formats (@username) vs full URLs
  • Returns normalized URL object for parsing
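
A minimal sketch of the normalization step is shown below. It is illustrative only — the real parseSocialUrl() also extracts platform and identifier, and the default host used for bare handles is a made-up assumption:

```typescript
// Illustrative sketch only; the default host for bare handles is an assumption.
function normalizeSocialUrl(input: string, handleHost = "www.instagram.com"): URL | null {
  let raw = input.trim();
  if (raw.startsWith("@")) {
    // Bare handle: prefix with the platform host supplied by the caller.
    raw = `https://${handleHost}/${raw.slice(1)}`;
  } else if (!/^https?:\/\//i.test(raw)) {
    // Missing protocol: assume https.
    raw = `https://${raw}`;
  }
  try {
    return new URL(raw);
  } catch {
    return null; // not parseable as a URL
  }
}
```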

Scraper Registry

Function: getScraperForPlatform(platform: SocialPlatform)
Location: packages/social-scraper/src/scrapers/index.ts

Maps platform enum values to their respective scraper functions:

  • SocialPlatform.INSTAGRAM → scrapeInstagram
  • SocialPlatform.TIKTOK → scrapeTiktok
  • SocialPlatform.TWITTER → scrapeTwitter
  • SocialPlatform.YOUTUBE → scrapeYoutube
  • SocialPlatform.LINKEDIN → scrapeLinkedIn
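
The registry pattern can be sketched with stub scrapers standing in for the real functions (which call Scrapfly). The enum string values and scraper shape are assumptions:

```typescript
// Sketch of the registry pattern with stubs; enum values and the scraper
// signature are assumptions, the real functions call the Scrapfly API.
enum SocialPlatform {
  INSTAGRAM = "INSTAGRAM",
  TIKTOK = "TIKTOK",
  TWITTER = "TWITTER",
  YOUTUBE = "YOUTUBE",
  LINKEDIN = "LINKEDIN",
}

type Scraper = (url: string) => Promise<{ name: string | null; followers: number | null }>;

// Stub factory standing in for the platform scrapers.
const stubScraper = (label: string): Scraper =>
  async () => ({ name: label, followers: null });

const scrapers: Record<SocialPlatform, Scraper> = {
  [SocialPlatform.INSTAGRAM]: stubScraper("instagram"),
  [SocialPlatform.TIKTOK]: stubScraper("tiktok"),
  [SocialPlatform.TWITTER]: stubScraper("twitter"),
  [SocialPlatform.YOUTUBE]: stubScraper("youtube"),
  [SocialPlatform.LINKEDIN]: stubScraper("linkedin"),
};

function getScraperForPlatform(platform: SocialPlatform): Scraper {
  return scrapers[platform];
}
```

Using a Record keyed by the enum means the compiler rejects a registry that is missing a platform.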

Retry Logic

Exponential Backoff

Function: calculateScrapeBackoff(attemptCount: number)
Location: packages/social-scraper/src/processSocialScraping.ts

Calculates backoff delay based on attempt count:

  • Formula: Math.min(60 * 1000 * Math.pow(2, attemptCount - 1), 24 * 60 * 60 * 1000)
  • Base delay: 60 seconds
  • Max delay: 24 hours
  • Example delays: 60s (attempt 1), 120s (attempt 2), 240s (attempt 3), 480s (attempt 4), 960s (attempt 5)
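
Transcribed directly into a standalone function (milliseconds in, milliseconds out):

```typescript
// The backoff formula above: 60s base, doubling per attempt, capped at 24 hours.
function calculateScrapeBackoff(attemptCount: number): number {
  const base = 60 * 1000;          // 60-second base delay
  const cap = 24 * 60 * 60 * 1000; // 24-hour ceiling
  return Math.min(base * Math.pow(2, attemptCount - 1), cap);
}
```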

Skip Conditions

Function: shouldSkipScrape(scrape: Scrape)
Location: packages/social-scraper/src/processSocialScraping.ts

A scrape is skipped if:

  1. Recent Success - error is null and lastAttemptAt is within the last 24 hours
  2. Max Attempts - attemptCount >= 5
  3. Backoff Active - lastAttemptAt + calculateScrapeBackoff(attemptCount) > now
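
The three conditions can be sketched as follows. The Scrape shape is reduced to the fields the checks need, and `now` is injectable so the logic is testable — both assumptions for this sketch:

```typescript
// Sketch of the three skip conditions; the row shape is reduced to the fields
// the checks need, and `now` is injectable for testability.
interface ScrapeRow {
  error: string | null;
  attemptCount: number;
  lastAttemptAt: Date | null;
}

function shouldSkipScrape(scrape: ScrapeRow, now: Date = new Date()): boolean {
  if (scrape.lastAttemptAt === null) return false; // never attempted: always scrape
  const ageMs = now.getTime() - scrape.lastAttemptAt.getTime();
  // 1. Recent success: scraped cleanly within the last 24 hours.
  if (scrape.error === null && ageMs < 24 * 60 * 60 * 1000) return true;
  // 2. Max attempts: parked after 5 failures.
  if (scrape.attemptCount >= 5) return true;
  // 3. Backoff active: same formula as calculateScrapeBackoff(), inlined here.
  const backoffMs = Math.min(60 * 1000 * Math.pow(2, scrape.attemptCount - 1), 24 * 60 * 60 * 1000);
  if (scrape.attemptCount > 0 && ageMs < backoffMs) return true;
  return false;
}
```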

Attempt Tracking

Fields: scrapes.attemptCount, scrapes.lastAttemptAt

  • On Success: attemptCount reset to 0 via resetScrapeAttemptCount()
  • On Error: attemptCount incremented via incrementScrapeAttemptCount()
  • Initial Error: New scrape record created with attemptCount: 1

Implementation: packages/db/src/access/scrapes.ts

API Route Handler

Location: apps/zooly-app/app/api/indexing/scrape-social/route.ts

The API route handler:

  1. Authorization - Verifies CRON_SECRET bearer token
  2. Event Retrieval - Gets event from likeness_need_indexing_queue table
  3. Scraping Execution - Calls processSocialScraping(accountId)
  4. Post-Processing:
    • Follower Count Updates - Iterates scrapingResult.results and calls upsertSocialLink() for each platform's follower count
    • Avatar Upload - If scrapingResult.bestAvatar exists and account.imageUrl is missing:
      • Calls uploadImageFromUrl() to upload avatar to S3
      • Calls createAsset() to insert likeness_assets record (type: IMAGE)
      • Updates account.imageUrl with S3 URL
  5. Rate Limit Handling - If scrapingResult.hasRateLimitError, calls handleApiError() to mark event AWAITING_RETRY
  6. Event Management:
    • Success: markEventCompleted() + addToQueue() for re-indexing
    • Rate limit: markEventAwaitingRetry() with exponential backoff
    • Failure: markEventFailed()

Database Schema

scrapes Table

Location: packages/db/src/schema/scrapes.ts

Stores scraping results per account and social link:

  • id - Unique identifier (nanoid)
  • accountId - Foreign key to account table
  • link - Social media URL (used for unique constraint)
  • platform - Platform enum (INSTAGRAM, TIKTOK, TWITTER, YOUTUBE, LINKEDIN)
  • name - Scraped profile name (nullable)
  • followers - Follower count (nullable)
  • avatar - Avatar URL (nullable)
  • rawData - JSONB with full scraped data (nullable)
  • error - Error message if scraping failed (nullable)
  • attemptCount - Number of failed attempts (default: 0)
  • lastAttemptAt - Timestamp of last attempt (nullable)
  • createdAt - Record creation timestamp
  • updatedAt - Record update timestamp

Unique Constraint: (accountId, link) - One scrape record per account-link combination

Indexes:

  • scrapes_account_link_idx - Unique index on (accountId, link)
  • scrapes_account_idx - Index on accountId for fast account lookups

account_social_links Table

Location: packages/db/src/schema/accountSocialLinks.ts

Stores social media links and follower counts per platform:

  • followersCount - Follower count for this platform (updated by scraper)
  • Other fields: id, accountId, platform, link, username, etc.

Note: Follower counts are stored per-platform, but when indexing, the sum of all platform follower counts is used.
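
A worked example of the summing rule, with the per-platform figures invented for illustration:

```typescript
// Per-platform counts collapse into one indexed total; null means not yet scraped.
const followersByPlatform: Record<string, number | null> = {
  INSTAGRAM: 12_000,
  TIKTOK: 45_000,
  YOUTUBE: null, // not yet scraped — contributes nothing
};

const totalFollowers = Object.values(followersByPlatform).reduce<number>(
  (sum, count) => sum + (count ?? 0),
  0
); // 57_000 is what gets indexed for popularity filtering
```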

Scrapfly Integration

API Client

Package: scrapfly-sdk (version ^0.7.0)

All scrapers use the Scrapfly API for web scraping:

  • Client Initialization: new ScrapflyClient({ key: process.env.SCRAPFLY_API_KEY })
  • Scraping Configuration: new ScrapeConfig({ url, asp: true, country: "US" })
  • Error Handling: Checks apiResult.result.error for API errors
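
A minimal sketch of the error-check step is shown below. The response shape mirrors only what this document mentions (apiResult.result.error) and is not the full scrapfly-sdk type:

```typescript
// Reduced response shape for illustration; not the full scrapfly-sdk type.
interface ScrapflyResultShape {
  result: {
    content: string;
    error?: { message: string } | null;
  };
}

function contentOrThrow(apiResult: ScrapflyResultShape): string {
  if (apiResult.result.error) {
    // Surfacing the API error lets the orchestrator record it in the scrapes table.
    throw new Error(`Scrapfly API error: ${apiResult.result.error.message}`);
  }
  return apiResult.result.content;
}
```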

Environment Variable

Required: SCRAPFLY_API_KEY

Must be set in .env.local for the scraper to function.

Rate Limit Handling

Location: packages/likeness-search/src/apiRetryHandler.ts

The general API retry handler recognizes Scrapfly rate limit errors:

  • Pattern: Error messages containing "scrapfly" and rate limit indicators
  • Behavior: Marks events AWAITING_RETRY with exponential backoff
  • Function: isRateLimitError() includes Scrapfly in its detection logic
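
The detection can be illustrated with a simple pattern match. This is illustrative only — the real isRateLimitError() may check different or additional signals:

```typescript
// Illustrative pattern match; the real isRateLimitError() may use other signals.
function looksLikeScrapflyRateLimit(message: string): boolean {
  const m = message.toLowerCase();
  return (
    m.includes("scrapfly") &&
    (m.includes("rate limit") || m.includes("too many requests") || m.includes("429"))
  );
}
```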

Error Handling

Scraper-Level Errors

Each platform scraper:

  1. Validates API Key - Throws error if SCRAPFLY_API_KEY is missing
  2. Validates URL - Throws error if URL format is invalid
  3. Checks API Errors - Throws error if apiResult.result.error exists
  4. Validates Data - Throws error if expected profile data is missing
  5. Wraps Errors - Re-throws with context ("Failed to scrape platform")

Orchestration-Level Errors

The processSocialScraping() function:

  • Catches Scraper Errors - Wraps in ScrapeResult with error field
  • Saves Error Results - Calls upsertScrape() with error message
  • Increments Attempts - Calls incrementScrapeAttemptCount() for failed scrapes
  • Continues Processing - Processes remaining links even if one fails
  • Returns Summary - Includes hasRateLimitError flag for event-level retry handling

Event-Level Retries

Location: packages/likeness-search/src/apiRetryHandler.ts

When scrapingResult.hasRateLimitError is true:

  • Event is marked AWAITING_RETRY
  • retryAt timestamp is set with exponential backoff
  • Daemon moves event back to PENDING when retryAt passes

Workflow Integration

Triggering Social Scraping

Social scraping is triggered by the indexing daemon when:

  1. Data Sufficiency Check - checkDataSufficiency() returns needsSocialScraping: true

    • Location: packages/db/src/access/dataSufficiencyChecker.ts
    • Condition: Social links exist that have not been scraped within the last 24 hours or whose last scrape failed
  2. Base Requirements Check - checkBaseRequirements() detects missing image

    • Location: packages/db/src/access/baseRequirementsChecker.ts
    • Condition: account.imageUrl is missing AND account has social links

Fire-and-Forget Pattern

Function: triggerSocialScraping(accountId, eventId)
Location: packages/likeness-search/src/triggerSubProcess.ts

The indexing daemon triggers social scraping using a fire-and-forget pattern:

  • Makes non-blocking fetch() call to /api/indexing/scrape-social
  • Event stays IN_PROGRESS while scraping happens
  • Scraping completes asynchronously and updates event status
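
The pattern can be sketched as follows. The endpoint comes from this document; the payload shape and the injectable fetchImpl parameter are assumptions made so the sketch is testable:

```typescript
// Fire-and-forget sketch; payload shape and fetchImpl injection are assumptions.
function triggerSocialScraping(
  accountId: string,
  eventId: string,
  fetchImpl: typeof fetch = fetch
): void {
  // Deliberately not awaited: the daemon keeps the event IN_PROGRESS and moves on.
  fetchImpl("/api/indexing/scrape-social", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ accountId, eventId }),
  }).catch((err) => {
    // Swallow transport errors; event-level retry handling covers recovery.
    console.error("scrape-social trigger failed", err);
  });
}
```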

Post-Scraping Flow

After scraping completes:

  1. Follower Counts Updated - account_social_links.followersCount updated per platform
  2. Avatar Uploaded - If account missing image, avatar uploaded to S3
  3. Asset Created - likeness_assets record created for scraped avatar
  4. Event Completed - Event marked COMPLETED and new queue event created
  5. Re-Indexing Triggered - New event triggers re-check of data sufficiency and indexing

Configuration

Constants

Location: packages/social-scraper/src/processSocialScraping.ts

  • MAX_SCRAPE_ATTEMPTS = 5 - Maximum retry attempts per link
  • SCRAPE_RECENCY_HOURS = 24 - Hours before re-scraping successful links

Backoff Parameters

  • Base Delay: 60 seconds
  • Exponential Factor: 2x per attempt
  • Maximum Delay: 24 hours

Monitoring

Key Metrics

Track these metrics for monitoring:

  • Scraping Success Rate - Percentage of successful scrapes vs total attempts
  • Per-Platform Success Rates - Success rate by platform (Instagram, TikTok, etc.)
  • Average Attempts - Average attemptCount before success or permanent failure
  • Rate Limit Frequency - How often hasRateLimitError is true
  • Avatar Recovery Rate - Percentage of missing images recovered via scraping

Logging Points

The system logs at:

  1. Scraping Start - Account ID and link count
  2. Per-Link Results - Success/failure for each link
  3. Post-Processing - Avatar uploads and follower count updates
  4. Error Details - Scrapfly API errors and parsing failures

Dependencies

Package Dependencies

Location: packages/social-scraper/package.json

  • @zooly/app-db - Database access functions
  • @zooly/types - TypeScript types
  • @zooly/util-srv - S3 service utilities
  • scrapfly-sdk - Web scraping API client