Social Scraper

Social media profile scraping for likeness search indexing

Overview

The Social Scraper is a component of the Likeness Search Indexing system that automatically scrapes social media profiles to collect profile data, follower counts, and avatar images. This data enhances search indexing and populates missing account information.

Purpose

The Social Scraper collects:

  • Profile names - Display names from social media profiles
  • Avatar images - Profile pictures for accounts missing images
  • Follower counts - Per-platform follower metrics for popularity filtering
  • Raw profile data - Structured data for future use

This data is used to:

  1. Populate missing account images - Upload scraped avatars to S3 and set account.imageUrl
  2. Enhance search indexing - Follower counts are summed across platforms and indexed
  3. Create likeness assets - Scraped avatars are inserted as likeness_assets records for AI tag extraction

Architecture

Package Structure

The Social Scraper is implemented as a dedicated package:

Location: packages/social-scraper/

Key Files:

  • src/index.ts - Package exports
  • src/types.ts - TypeScript interfaces (ScrapeResult, ScraperFunction, SocialScrapingResult)
  • src/parseSocialUrl.ts - URL normalization utility
  • src/processSocialScraping.ts - Main orchestration function
  • src/scrapers/index.ts - Scraper registry
  • src/scrapers/instagram.ts - Instagram scraper
  • src/scrapers/tiktok.ts - TikTok scraper
  • src/scrapers/twitter.ts - Twitter/X scraper
  • src/scrapers/youtube.ts - YouTube scraper
  • src/scrapers/linkedin.ts - LinkedIn scraper
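
The core types in src/types.ts can be sketched from the fields described throughout this document. This is a hedged reconstruction — field names are inferred from the scrapes table and the SocialScrapingResult summary, so treat them as approximations:

```typescript
// Hedged reconstruction of the interfaces in src/types.ts; field names are inferred
// from the scrapes table and result summaries described in this document.
interface ScrapeResult {
  platform: string;
  link: string;
  name: string | null;
  followers: number | null;
  avatar: string | null;
  rawData: unknown;
  error: string | null;
}

// A platform scraper takes a profile URL and resolves to a scrape result.
type ScraperFunction = (url: string) => Promise<ScrapeResult>;

interface SocialScrapingResult {
  results: ScrapeResult[];
  bestAvatar: string | null;  // highest-quality avatar URL found across platforms
  hasRateLimitError: boolean; // true if any scrape hit a rate limit
}
```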

Integration Points

The Social Scraper integrates with:

  1. Indexing Daemon - Detects when social scraping is needed

    • Location: packages/likeness-search/src/processEvent.ts
    • Function: checkDataSufficiency() identifies un-scraped social links
  2. API Route - Handles scraping requests

    • Location: apps/zooly-app/app/api/indexing/scrape-social/route.ts
    • Endpoint: POST /api/indexing/scrape-social
  3. Database Access - Stores scraping results

    • Location: packages/db/src/access/scrapes.ts
    • Functions: upsertScrape(), incrementScrapeAttemptCount(), resetScrapeAttemptCount()
  4. S3 Service - Uploads avatars

    • Location: packages/util-srv/src/s3/s3-service.ts
    • Function: uploadImageFromUrl()

Supported Platforms

The scraper supports five social media platforms:

Instagram

Scraper: scrapeInstagram()
Location: packages/social-scraper/src/scrapers/instagram.ts

Data Extracted:

  • Name: user.full_name
  • Avatar: user.profile_pic_url_hd or user.profile_pic_url
  • Followers: user.edge_followed_by.count

URL Format: https://www.instagram.com/{username}/

TikTok

Scraper: scrapeTiktok()
Location: packages/social-scraper/src/scrapers/tiktok.ts

Data Extracted:

  • Name: userInfo.user.nickname or userInfo.user.uniqueId
  • Avatar: userInfo.user.avatarLarger or userInfo.user.avatarMedium
  • Followers: userInfo.stats.followerCount

URL Format: https://www.tiktok.com/@{username}

Twitter/X

Scraper: scrapeTwitter()
Location: packages/social-scraper/src/scrapers/twitter.ts

Data Extracted:

  • Name: userData.name or userData.legacy.name
  • Avatar: userData.profile_image_url_https or userData.legacy.profile_image_url_https
  • Followers: userData.followers_count or userData.legacy.followers_count

URL Format: https://x.com/{username} or https://twitter.com/{username}

YouTube

Scraper: scrapeYoutube()
Location: packages/social-scraper/src/scrapers/youtube.ts

Data Extracted:

  • Name: Channel title from ytInitialData
  • Avatar: Channel avatar from thumbnails
  • Followers: Subscriber count (parsed from text like "1.2M subscribers")

URL Format: https://www.youtube.com/@{username}, /channel/{id}, /c/{name}, or /user/{username}
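
Since the subscriber count is parsed from display text like "1.2M subscribers", a conversion step is needed. The helper below is hypothetical (not the actual implementation) and only illustrates the technique:

```typescript
// Hypothetical helper showing how a display string like "1.2M subscribers"
// can be converted to a numeric count. Not the actual implementation.
function parseSubscriberCount(text: string): number | null {
  const match = text.trim().match(/^([\d.,]+)\s*([KMB])?/i);
  if (!match) return null;
  const base = parseFloat(match[1].replace(/,/g, ""));
  if (Number.isNaN(base)) return null;
  const multipliers: Record<string, number> = { K: 1e3, M: 1e6, B: 1e9 };
  const suffix = (match[2] ?? "").toUpperCase();
  return Math.round(base * (multipliers[suffix] ?? 1));
}
```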

LinkedIn

Scraper: scrapeLinkedIn()
Location: packages/social-scraper/src/scrapers/linkedin.ts

Data Extracted:

  • Name: firstName + lastName or name or headline
  • Avatar: profilePicture.displayImage or profilePictureUrl
  • Followers: Connection count (LinkedIn reports "connections" rather than "followers")

URL Format: https://www.linkedin.com/in/{profileId}

Scraping Process

Main Orchestration

Function: processSocialScraping(accountId: string)
Location: packages/social-scraper/src/processSocialScraping.ts

The orchestration function:

  1. Fetches Social Links - Gets all social links for the account from account_social_links table
  2. Applies Skip Logic - For each link, checks:
    • Recently scraped (< 24 hours) and successful → Skip
    • Max attempts reached (≥ 5) → Skip
    • Still in exponential backoff period → Skip
  3. Parses URLs - Normalizes URLs using parseSocialUrl() to extract platform and identifier
  4. Selects Scraper - Maps platform to appropriate scraper function via getScraperForPlatform()
  5. Executes Scraping - Calls platform-specific scraper with Scrapfly API
  6. Saves Results - Stores results in scrapes table:
    • Success: upsertScrape() with error: null, attemptCount: 0
    • Error: upsertScrape() with error message, then incrementScrapeAttemptCount()
  7. Returns Summary - Returns SocialScrapingResult with:
    • results - Array of per-platform results
    • bestAvatar - Highest quality avatar URL found
    • hasRateLimitError - Whether any scrape hit rate limits
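
The loop in steps 1–7 can be condensed into a dependency-injected sketch. The collaborator signatures (shouldSkip, scrape, save) are assumptions made for illustration, and "best avatar" selection is simplified:

```typescript
// Condensed sketch of the orchestration loop; collaborator signatures are assumptions.
interface LinkOutcome {
  link: string;
  followers: number | null;
  avatar: string | null;
  error: string | null;
}

async function processLinks(
  links: string[],
  deps: {
    shouldSkip: (link: string) => boolean;
    scrape: (link: string) => Promise<{ followers: number; avatar: string | null }>;
    save: (outcome: LinkOutcome) => void;
  }
): Promise<{ results: LinkOutcome[]; bestAvatar: string | null; hasRateLimitError: boolean }> {
  const results: LinkOutcome[] = [];
  let hasRateLimitError = false;
  for (const link of links) {
    if (deps.shouldSkip(link)) continue; // step 2: skip logic
    let outcome: LinkOutcome;
    try {
      const data = await deps.scrape(link); // step 5: platform-specific scrape
      outcome = { link, followers: data.followers, avatar: data.avatar, error: null };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (/rate limit/i.test(message)) hasRateLimitError = true;
      outcome = { link, followers: null, avatar: null, error: message };
    }
    deps.save(outcome); // step 6: persist success or error
    results.push(outcome);
  }
  // Step 7: "best avatar" selection simplified to first non-null for this sketch.
  const bestAvatar = results.find((r) => r.avatar !== null)?.avatar ?? null;
  return { results, bestAvatar, hasRateLimitError };
}
```

Note that a failure on one link does not abort the loop: the error is recorded and processing continues with the remaining links.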

URL Parsing

Function: parseSocialUrl(url: string)
Location: packages/social-scraper/src/parseSocialUrl.ts

Normalizes various URL formats to standard URLs:

  • Handles URLs with/without protocols (https://, http://)
  • Handles URLs with/without www. subdomain
  • Handles handle formats (@username) vs full URLs
  • Returns normalized URL object for parsing
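
A minimal sketch of the normalization step is shown below. It is illustrative only — the real parseSocialUrl() also extracts platform and identifier, and the default host used for bare handles is a made-up assumption:

```typescript
// Illustrative sketch only; the default host for bare handles is an assumption.
function normalizeSocialUrl(input: string, handleHost = "www.instagram.com"): URL | null {
  let raw = input.trim();
  if (raw.startsWith("@")) {
    // Bare handle: prefix with the platform host supplied by the caller.
    raw = `https://${handleHost}/${raw.slice(1)}`;
  } else if (!/^https?:\/\//i.test(raw)) {
    // Missing protocol: assume https.
    raw = `https://${raw}`;
  }
  try {
    return new URL(raw);
  } catch {
    return null; // not parseable as a URL
  }
}
```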

Scraper Registry

Function: getScraperForPlatform(platform: SocialPlatform)
Location: packages/social-scraper/src/scrapers/index.ts

Maps platform enum values to their respective scraper functions:

  • SocialPlatform.INSTAGRAM → scrapeInstagram
  • SocialPlatform.TIKTOK → scrapeTiktok
  • SocialPlatform.TWITTER → scrapeTwitter
  • SocialPlatform.YOUTUBE → scrapeYoutube
  • SocialPlatform.LINKEDIN → scrapeLinkedIn
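
The registry pattern can be sketched with stub scrapers standing in for the real functions (which call Scrapfly). The enum string values and scraper shape are assumptions:

```typescript
// Sketch of the registry pattern with stubs; enum values and the scraper
// signature are assumptions, the real functions call the Scrapfly API.
enum SocialPlatform {
  INSTAGRAM = "INSTAGRAM",
  TIKTOK = "TIKTOK",
  TWITTER = "TWITTER",
  YOUTUBE = "YOUTUBE",
  LINKEDIN = "LINKEDIN",
}

type Scraper = (url: string) => Promise<{ name: string | null; followers: number | null }>;

// Stub factory standing in for the platform scrapers.
const stubScraper = (label: string): Scraper =>
  async () => ({ name: label, followers: null });

const scrapers: Record<SocialPlatform, Scraper> = {
  [SocialPlatform.INSTAGRAM]: stubScraper("instagram"),
  [SocialPlatform.TIKTOK]: stubScraper("tiktok"),
  [SocialPlatform.TWITTER]: stubScraper("twitter"),
  [SocialPlatform.YOUTUBE]: stubScraper("youtube"),
  [SocialPlatform.LINKEDIN]: stubScraper("linkedin"),
};

function getScraperForPlatform(platform: SocialPlatform): Scraper {
  return scrapers[platform];
}
```

Using a Record keyed by the enum means the compiler rejects a registry that is missing a platform.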

Retry Logic

Exponential Backoff

Function: calculateScrapeBackoff(attemptCount: number)
Location: packages/social-scraper/src/processSocialScraping.ts

Calculates backoff delay based on attempt count:

  • Formula: Math.min(60 * 1000 * Math.pow(2, attemptCount - 1), 24 * 60 * 60 * 1000)
  • Base delay: 60 seconds
  • Max delay: 24 hours
  • Example delays: 60s (attempt 1), 120s (attempt 2), 240s (attempt 3), 480s (attempt 4), 960s (attempt 5)
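
Transcribed directly into a standalone function (milliseconds in, milliseconds out):

```typescript
// The backoff formula above: 60s base, doubling per attempt, capped at 24 hours.
function calculateScrapeBackoff(attemptCount: number): number {
  const base = 60 * 1000;          // 60-second base delay
  const cap = 24 * 60 * 60 * 1000; // 24-hour ceiling
  return Math.min(base * Math.pow(2, attemptCount - 1), cap);
}
```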

Skip Conditions

Function: shouldSkipScrape(scrape: Scrape)
Location: packages/social-scraper/src/processSocialScraping.ts

A scrape is skipped if:

  1. Recent Success - error is null and lastAttemptAt is within the last 24 hours
  2. Max Attempts - attemptCount >= 5
  3. Backoff Active - lastAttemptAt + calculateScrapeBackoff(attemptCount) > now
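
The three conditions can be sketched as follows. The Scrape shape is reduced to the fields the checks need, and `now` is injectable so the logic is testable — both assumptions for this sketch:

```typescript
// Sketch of the three skip conditions; the row shape is reduced to the fields
// the checks need, and `now` is injectable for testability.
interface ScrapeRow {
  error: string | null;
  attemptCount: number;
  lastAttemptAt: Date | null;
}

function shouldSkipScrape(scrape: ScrapeRow, now: Date = new Date()): boolean {
  if (scrape.lastAttemptAt === null) return false; // never attempted: always scrape
  const ageMs = now.getTime() - scrape.lastAttemptAt.getTime();
  // 1. Recent success: scraped cleanly within the last 24 hours.
  if (scrape.error === null && ageMs < 24 * 60 * 60 * 1000) return true;
  // 2. Max attempts: parked after 5 failures.
  if (scrape.attemptCount >= 5) return true;
  // 3. Backoff active: same formula as calculateScrapeBackoff(), inlined here.
  const backoffMs = Math.min(60 * 1000 * Math.pow(2, scrape.attemptCount - 1), 24 * 60 * 60 * 1000);
  if (scrape.attemptCount > 0 && ageMs < backoffMs) return true;
  return false;
}
```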

Attempt Tracking

Fields: scrapes.attemptCount, scrapes.lastAttemptAt

  • On Success: attemptCount reset to 0 via resetScrapeAttemptCount()
  • On Error: attemptCount incremented via incrementScrapeAttemptCount()
  • Initial Error: New scrape record created with attemptCount: 1

Implementation: packages/db/src/access/scrapes.ts

API Route Handler

Location: apps/zooly-app/app/api/indexing/scrape-social/route.ts

The API route handler:

  1. Authorization - Verifies CRON_SECRET bearer token
  2. Event Retrieval - Gets event from likeness_need_indexing_queue table
  3. Scraping Execution - Calls processSocialScraping(accountId)
  4. Post-Processing:
    • Follower Count Updates - Iterates scrapingResult.results and calls upsertSocialLink() for each platform's follower count
    • Avatar Upload - If scrapingResult.bestAvatar exists and account.imageUrl is missing:
      • Calls uploadImageFromUrl() to upload avatar to S3
      • Calls createAsset() to insert likeness_assets record (type: IMAGE)
      • Updates account.imageUrl with S3 URL
  5. Rate Limit Handling - If scrapingResult.hasRateLimitError, calls handleApiError() to mark event AWAITING_RETRY
  6. Event Management:
    • Success: markEventCompleted() + addToQueue() for re-indexing
    • Rate limit: markEventAwaitingRetry() with exponential backoff
    • Failure: markEventFailed()

Database Schema

scrapes Table

Location: packages/db/src/schema/scrapes.ts

Stores scraping results per account and social link:

  • id - Unique identifier (nanoid)
  • accountId - Foreign key to account table
  • link - Social media URL (used for unique constraint)
  • platform - Platform enum (INSTAGRAM, TIKTOK, TWITTER, YOUTUBE, LINKEDIN)
  • name - Scraped profile name (nullable)
  • followers - Follower count (nullable)
  • avatar - Avatar URL (nullable)
  • rawData - JSONB with full scraped data (nullable)
  • error - Error message if scraping failed (nullable)
  • attemptCount - Number of failed attempts (default: 0)
  • lastAttemptAt - Timestamp of last attempt (nullable)
  • createdAt - Record creation timestamp
  • updatedAt - Record update timestamp

Unique Constraint: (accountId, link) - One scrape record per account-link combination

Indexes:

  • scrapes_account_link_idx - Unique index on (accountId, link)
  • scrapes_account_idx - Index on accountId for fast account lookups

account_social_links Table

Location: packages/db/src/schema/accountSocialLinks.ts

Stores social media links and follower counts per platform:

  • followersCount - Follower count for this platform (updated by scraper)
  • Other fields: id, accountId, platform, link, username, etc.

Note: Follower counts are stored per-platform, but when indexing, the sum of all platform follower counts is used.
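
A worked example of the summing rule, with the per-platform figures invented for illustration:

```typescript
// Per-platform counts collapse into one indexed total; null means not yet scraped.
const followersByPlatform: Record<string, number | null> = {
  INSTAGRAM: 12_000,
  TIKTOK: 45_000,
  YOUTUBE: null, // not yet scraped — contributes nothing
};

const totalFollowers = Object.values(followersByPlatform).reduce<number>(
  (sum, count) => sum + (count ?? 0),
  0
); // 57_000 is what gets indexed for popularity filtering
```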

Scrapfly Integration

API Client

Package: scrapfly-sdk (version ^0.7.0)

All scrapers use the Scrapfly API for web scraping:

  • Client Initialization: new ScrapflyClient({ key: process.env.SCRAPFLY_API_KEY })
  • Scraping Configuration: new ScrapeConfig({ url, asp: true, country: "US" })
  • Error Handling: Checks apiResult.result.error for API errors
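
A minimal sketch of the error-check step is shown below. The response shape mirrors only what this document mentions (apiResult.result.error) and is not the full scrapfly-sdk type:

```typescript
// Reduced response shape for illustration; not the full scrapfly-sdk type.
interface ScrapflyResultShape {
  result: {
    content: string;
    error?: { message: string } | null;
  };
}

function contentOrThrow(apiResult: ScrapflyResultShape): string {
  if (apiResult.result.error) {
    // Surfacing the API error lets the orchestrator record it in the scrapes table.
    throw new Error(`Scrapfly API error: ${apiResult.result.error.message}`);
  }
  return apiResult.result.content;
}
```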

Environment Variable

Required: SCRAPFLY_API_KEY

Must be set in .env.local for the scraper to function.

Rate Limit Handling

Location: packages/likeness-search/src/apiRetryHandler.ts

The general API retry handler recognizes Scrapfly rate limit errors:

  • Pattern: Error messages containing "scrapfly" and rate limit indicators
  • Behavior: Marks events AWAITING_RETRY with exponential backoff
  • Function: isRateLimitError() includes Scrapfly in its detection logic
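
The detection can be illustrated with a simple pattern match. This is illustrative only — the real isRateLimitError() may check different or additional signals:

```typescript
// Illustrative pattern match; the real isRateLimitError() may use other signals.
function looksLikeScrapflyRateLimit(message: string): boolean {
  const m = message.toLowerCase();
  return (
    m.includes("scrapfly") &&
    (m.includes("rate limit") || m.includes("too many requests") || m.includes("429"))
  );
}
```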

Error Handling

Scraper-Level Errors

Each platform scraper:

  1. Validates API Key - Throws error if SCRAPFLY_API_KEY is missing
  2. Validates URL - Throws error if URL format is invalid
  3. Checks API Errors - Throws error if apiResult.result.error exists
  4. Validates Data - Throws error if expected profile data is missing
  5. Wraps Errors - Re-throws with context ("Failed to scrape platform")

Orchestration-Level Errors

The processSocialScraping() function:

  • Catches Scraper Errors - Wraps in ScrapeResult with error field
  • Saves Error Results - Calls upsertScrape() with error message
  • Increments Attempts - Calls incrementScrapeAttemptCount() for failed scrapes
  • Continues Processing - Processes remaining links even if one fails
  • Returns Summary - Includes hasRateLimitError flag for event-level retry handling

Event-Level Retries

Location: packages/likeness-search/src/apiRetryHandler.ts

When scrapingResult.hasRateLimitError is true:

  • Event is marked AWAITING_RETRY
  • retryAt timestamp is set with exponential backoff
  • Daemon moves event back to PENDING when retryAt passes

Workflow Integration

Triggering Social Scraping

Social scraping is triggered by the indexing daemon when:

  1. Data Sufficiency Check - checkDataSufficiency() returns needsSocialScraping: true

    • Location: packages/db/src/access/dataSufficiencyChecker.ts
    • Condition: Social links exist that have not been scraped within the last 24 hours or whose last scrape failed
  2. Base Requirements Check - checkBaseRequirements() detects missing image

    • Location: packages/db/src/access/baseRequirementsChecker.ts
    • Condition: account.imageUrl is missing AND account has social links

Fire-and-Forget Pattern

Function: triggerSocialScraping(accountId, eventId)
Location: packages/likeness-search/src/triggerSubProcess.ts

The indexing daemon triggers social scraping using a fire-and-forget pattern:

  • Makes non-blocking fetch() call to /api/indexing/scrape-social
  • Event stays IN_PROGRESS while scraping happens
  • Scraping completes asynchronously and updates event status
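
The pattern can be sketched as follows. The endpoint comes from this document; the payload shape and the injectable fetchImpl parameter are assumptions made so the sketch is testable:

```typescript
// Fire-and-forget sketch; payload shape and fetchImpl injection are assumptions.
function triggerSocialScraping(
  accountId: string,
  eventId: string,
  fetchImpl: typeof fetch = fetch
): void {
  // Deliberately not awaited: the daemon keeps the event IN_PROGRESS and moves on.
  fetchImpl("/api/indexing/scrape-social", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ accountId, eventId }),
  }).catch((err) => {
    // Swallow transport errors; event-level retry handling covers recovery.
    console.error("scrape-social trigger failed", err);
  });
}
```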

Post-Scraping Flow

After scraping completes:

  1. Follower Counts Updated - account_social_links.followersCount updated per platform
  2. Avatar Uploaded - If account missing image, avatar uploaded to S3
  3. Asset Created - likeness_assets record created for scraped avatar
  4. Event Completed - Event marked COMPLETED and new queue event created
  5. Re-Indexing Triggered - New event triggers re-check of data sufficiency and indexing

Configuration

Constants

Location: packages/social-scraper/src/processSocialScraping.ts

  • MAX_SCRAPE_ATTEMPTS = 5 - Maximum retry attempts per link
  • SCRAPE_RECENCY_HOURS = 24 - Hours before re-scraping successful links

Backoff Parameters

  • Base Delay: 60 seconds
  • Exponential Factor: 2x per attempt
  • Maximum Delay: 24 hours

Monitoring

Key Metrics

Track these metrics for monitoring:

  • Scraping Success Rate - Percentage of successful scrapes vs total attempts
  • Per-Platform Success Rates - Success rate by platform (Instagram, TikTok, etc.)
  • Average Attempts - Average attemptCount before success or permanent failure
  • Rate Limit Frequency - How often hasRateLimitError is true
  • Avatar Recovery Rate - Percentage of missing images recovered via scraping

Logging Points

The system logs at:

  1. Scraping Start - Account ID and link count
  2. Per-Link Results - Success/failure for each link
  3. Post-Processing - Avatar uploads and follower count updates
  4. Error Details - Scrapfly API errors and parsing failures

Dependencies

Package Dependencies

Location: packages/social-scraper/package.json

  • @zooly/app-db - Database access functions
  • @zooly/types - TypeScript types
  • @zooly/util-srv - S3 service utilities
  • scrapfly-sdk - Web scraping API client