The Social Scraper is a component of the Likeness Search Indexing system that automatically scrapes social media profiles to collect profile data, follower counts, and avatar images. This data enhances search indexing and populates missing account information.
The Social Scraper collects:

- Profile names
- Follower counts
- Avatar images
- Raw profile data

This data is used to:

- Enhance search indexing with follower counts
- Populate missing account.imageUrl values
- Create likeness_assets records for AI tag extraction

The Social Scraper is implemented as a dedicated package:
Location: packages/social-scraper/
Key Files:
- src/index.ts - Package exports
- src/types.ts - TypeScript interfaces (ScrapeResult, ScraperFunction, SocialScrapingResult)
- src/parseSocialUrl.ts - URL normalization utility
- src/processSocialScraping.ts - Main orchestration function
- src/scrapers/index.ts - Scraper registry
- src/scrapers/instagram.ts - Instagram scraper
- src/scrapers/tiktok.ts - TikTok scraper
- src/scrapers/twitter.ts - Twitter/X scraper
- src/scrapers/youtube.ts - YouTube scraper
- src/scrapers/linkedin.ts - LinkedIn scraper

The Social Scraper integrates with:
- Indexing Daemon - Detects when social scraping is needed
  - packages/likeness-search/src/processEvent.ts
  - checkDataSufficiency() identifies un-scraped social links
- API Route - Handles scraping requests
  - apps/zooly-app/app/api/indexing/scrape-social/route.ts
  - POST /api/indexing/scrape-social
- Database Access - Stores scraping results
  - packages/db/src/access/scrapes.ts
  - upsertScrape(), incrementScrapeAttemptCount(), resetScrapeAttemptCount()
- S3 Service - Uploads avatars
  - packages/util-srv/src/s3/s3-service.ts
  - uploadImageFromUrl()

The scraper supports five social media platforms:
Scraper: scrapeInstagram()
Location: packages/social-scraper/src/scrapers/instagram.ts
Data Extracted:
- Name: user.full_name
- Avatar: user.profile_pic_url_hd or user.profile_pic_url
- Followers: user.edge_followed_by.count

URL Format: https://www.instagram.com/{username}/
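The field mapping above can be sketched as a pure helper. This is an illustrative reconstruction, not the actual scraper: the function name and the reduced interfaces are assumptions, but the JSON paths and fallback order follow the list above.

```typescript
// Reduced shape of the Instagram user object (only the documented paths).
interface InstagramUser {
  full_name?: string;
  profile_pic_url?: string;
  profile_pic_url_hd?: string;
  edge_followed_by?: { count?: number };
}

interface ScrapeFields {
  name: string | null;
  avatar: string | null;
  followers: number | null;
}

// Hypothetical helper mapping the documented paths to flat scrape fields.
function extractInstagramFields(user: InstagramUser): ScrapeFields {
  return {
    name: user.full_name ?? null,
    // Prefer the HD avatar, fall back to the standard one.
    avatar: user.profile_pic_url_hd ?? user.profile_pic_url ?? null,
    followers: user.edge_followed_by?.count ?? null,
  };
}
```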
Scraper: scrapeTiktok()
Location: packages/social-scraper/src/scrapers/tiktok.ts
Data Extracted:
- Name: userInfo.user.nickname or userInfo.user.uniqueId
- Avatar: userInfo.user.avatarLarger or userInfo.user.avatarMedium
- Followers: userInfo.stats.followerCount

URL Format: https://www.tiktok.com/@{username}
Scraper: scrapeTwitter()
Location: packages/social-scraper/src/scrapers/twitter.ts
Data Extracted:
- Name: userData.name or userData.legacy.name
- Avatar: userData.profile_image_url_https or userData.legacy.profile_image_url_https
- Followers: userData.followers_count or userData.legacy.followers_count

URL Format: https://x.com/{username} or https://twitter.com/{username}
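The top-level-then-legacy fallback chain for Twitter/X can be sketched like this. The function name and reduced interface are hypothetical; only the field names and fallback order come from the list above.

```typescript
// Reduced shape of the Twitter/X user payload: newer responses expose fields
// at the top level, older ones nest them under `legacy`.
interface TwitterUserData {
  name?: string;
  profile_image_url_https?: string;
  followers_count?: number;
  legacy?: {
    name?: string;
    profile_image_url_https?: string;
    followers_count?: number;
  };
}

// Hypothetical helper: try the top-level field first, then the legacy one.
function extractTwitterFields(userData: TwitterUserData) {
  return {
    name: userData.name ?? userData.legacy?.name ?? null,
    avatar:
      userData.profile_image_url_https ??
      userData.legacy?.profile_image_url_https ??
      null,
    followers:
      userData.followers_count ?? userData.legacy?.followers_count ?? null,
  };
}
```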
Scraper: scrapeYoutube()
Location: packages/social-scraper/src/scrapers/youtube.ts
Data Extracted:
- Profile data parsed from the page's embedded ytInitialData object

URL Format: https://www.youtube.com/@{username}, /channel/{id}, /c/{name}, or /user/{username}
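Unlike the JSON-API platforms, the YouTube scraper works from the ytInitialData blob embedded in the page source. A minimal sketch of pulling that blob out of the HTML might look like the following; the regex and the `var ytInitialData = …;` boundary are assumptions about the page source, not guarantees.

```typescript
// Sketch: extract and parse the embedded ytInitialData JSON from a YouTube
// page. Returns null when the blob is absent or unparsable.
function extractYtInitialData(html: string): unknown {
  const match = html.match(/var ytInitialData\s*=\s*(\{[\s\S]*?\});/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    // Malformed or truncated blob: treat as not found.
    return null;
  }
}
```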
Scraper: scrapeLinkedIn()
Location: packages/social-scraper/src/scrapers/linkedin.ts
Data Extracted:
- Name: firstName + lastName, or name, or headline
- Avatar: profilePicture.displayImage or profilePictureUrl

URL Format: https://www.linkedin.com/in/{profileId}
Function: processSocialScraping(accountId: string)
Location: packages/social-scraper/src/processSocialScraping.ts
The orchestration function:
1. Fetches the account's social links from the account_social_links table
2. Calls parseSocialUrl() to extract the platform and identifier from each link
3. Looks up the matching scraper via getScraperForPlatform()
4. Records each outcome in the scrapes table:
   - On success: upsertScrape() with error: null, attemptCount: 0
   - On failure: upsertScrape() with the error message, then incrementScrapeAttemptCount()
5. Returns a SocialScrapingResult with:
   - results - Array of per-platform results
   - bestAvatar - Highest-quality avatar URL found
   - hasRateLimitError - Whether any scrape hit rate limits

Function: parseSocialUrl(url: string)
Location: packages/social-scraper/src/parseSocialUrl.ts
Normalizes various URL formats to standard URLs:
- Adds a missing protocol (https://, http://)
- Strips the www. subdomain
- Handles bare handles (@username) as well as full URLs
- Uses the URL object for parsing

Function: getScraperForPlatform(platform: SocialPlatform)
Location: packages/social-scraper/src/scrapers/index.ts
Maps platform enum values to their respective scraper functions:
- SocialPlatform.INSTAGRAM → scrapeInstagram
- SocialPlatform.TIKTOK → scrapeTiktok
- SocialPlatform.TWITTER → scrapeTwitter
- SocialPlatform.YOUTUBE → scrapeYoutube
- SocialPlatform.LINKEDIN → scrapeLinkedIn

Function: calculateScrapeBackoff(attemptCount: number)
Location: packages/social-scraper/src/processSocialScraping.ts
Calculates backoff delay based on attempt count:
Math.min(60 * 1000 * Math.pow(2, attemptCount - 1), 24 * 60 * 60 * 1000)

This yields a 1-minute delay after the first failure, doubling with each subsequent failure, capped at 24 hours.

Function: shouldSkipScrape(scrape: Scrape)
Location: packages/social-scraper/src/processSocialScraping.ts
A scrape is skipped if:
- Recently succeeded: error IS NULL AND lastAttemptAt is less than 24 hours ago
- Permanently failed: attemptCount >= 5
- Still in backoff: lastAttemptAt + calculateScrapeBackoff(attemptCount) > now

Fields: scrapes.attemptCount, scrapes.lastAttemptAt
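The backoff formula and skip conditions above can be combined into a small, testable sketch. This is a reconstruction, not the actual source: the reduced ScrapeState shape and the explicit `now` parameter are assumptions made so the logic can be exercised deterministically.

```typescript
// Only the fields the retry gate actually reads.
interface ScrapeState {
  error: string | null;
  attemptCount: number;
  lastAttemptAt: Date | null;
}

const MAX_SCRAPE_ATTEMPTS = 5;
const SCRAPE_RECENCY_HOURS = 24;

// 1 minute, doubled per failed attempt, capped at 24 hours (the documented formula).
function calculateScrapeBackoff(attemptCount: number): number {
  return Math.min(
    60 * 1000 * Math.pow(2, attemptCount - 1),
    24 * 60 * 60 * 1000
  );
}

function shouldSkipScrape(scrape: ScrapeState, now: Date = new Date()): boolean {
  // Permanently failed: max attempts reached.
  if (scrape.attemptCount >= MAX_SCRAPE_ATTEMPTS) return true;
  if (scrape.lastAttemptAt) {
    const ageMs = now.getTime() - scrape.lastAttemptAt.getTime();
    // Recently succeeded: no error and scraped within the last 24 hours.
    if (scrape.error === null && ageMs < SCRAPE_RECENCY_HOURS * 60 * 60 * 1000) {
      return true;
    }
    // Failed and still inside the exponential backoff window.
    if (scrape.error !== null && ageMs < calculateScrapeBackoff(scrape.attemptCount)) {
      return true;
    }
  }
  return false;
}
```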
- On success: attemptCount is reset to 0 via resetScrapeAttemptCount()
- On failure: attemptCount is incremented via incrementScrapeAttemptCount()
- A first failure therefore produces attemptCount: 1

Implementation: packages/db/src/access/scrapes.ts
Location: apps/zooly-app/app/api/indexing/scrape-social/route.ts
The API route handler:
1. Validates the CRON_SECRET bearer token
2. Looks up the event in the likeness_need_indexing_queue table
3. Calls processSocialScraping(accountId)
4. Iterates scrapingResult.results and calls upsertSocialLink() for each platform's follower count
5. If scrapingResult.bestAvatar exists and account.imageUrl is missing:
   - Calls uploadImageFromUrl() to upload the avatar to S3
   - Calls createAsset() to insert a likeness_assets record (type: IMAGE)
   - Updates account.imageUrl with the S3 URL
6. If scrapingResult.hasRateLimitError, calls handleApiError() to mark the event AWAITING_RETRY
7. On success: markEventCompleted() + addToQueue() for re-indexing
8. On rate limit: markEventAwaitingRetry() with exponential backoff
9. On permanent failure: markEventFailed()

scrapes Table

Location: packages/db/src/schema/scrapes.ts
Stores scraping results per account and social link:
- id - Unique identifier (nanoid)
- accountId - Foreign key to the account table
- link - Social media URL (used for the unique constraint)
- platform - Platform enum (INSTAGRAM, TIKTOK, TWITTER, YOUTUBE, LINKEDIN)
- name - Scraped profile name (nullable)
- followers - Follower count (nullable)
- avatar - Avatar URL (nullable)
- rawData - JSONB with the full scraped data (nullable)
- error - Error message if scraping failed (nullable)
- attemptCount - Number of failed attempts (default: 0)
- lastAttemptAt - Timestamp of the last attempt (nullable)
- createdAt - Record creation timestamp
- updatedAt - Record update timestamp

Unique Constraint: (accountId, link) - One scrape record per account-link combination
Indexes:
- scrapes_account_link_idx - Unique index on (accountId, link)
- scrapes_account_idx - Index on accountId for fast account lookups

account_social_links Table

Location: packages/db/src/schema/accountSocialLinks.ts
Stores social media links and follower counts per platform:
- followersCount - Follower count for this platform (updated by the scraper)
- Other fields: id, accountId, platform, link, username, etc.

Note: Follower counts are stored per platform, but indexing uses the sum of all platform follower counts.
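The per-platform sum described in the note is a one-liner. The row shape here is reduced to the one field involved (followersCount is nullable, so missing counts contribute zero); the function name is hypothetical.

```typescript
// Minimal row shape for the sum; real rows carry id, accountId, link, etc.
interface SocialLinkRow {
  platform: string;
  followersCount: number | null;
}

// Sum follower counts across platforms, treating null as 0.
function totalFollowers(links: SocialLinkRow[]): number {
  return links.reduce((sum, link) => sum + (link.followersCount ?? 0), 0);
}
```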
Package: scrapfly-sdk (version ^0.7.0)
All scrapers use the Scrapfly API for web scraping:
- Client: new ScrapflyClient({ key: process.env.SCRAPFLY_API_KEY })
- Config: new ScrapeConfig({ url, asp: true, country: "US" })
- Checks apiResult.result.error for API errors

Required: SCRAPFLY_API_KEY
Must be set in .env.local for the scraper to function.
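The shared call pattern can be sketched with the client hidden behind a minimal interface, so the flow is visible without the real scrapfly-sdk. This is an assumption-laden sketch: the actual scrapers construct new ScrapflyClient({ key: process.env.SCRAPFLY_API_KEY }) and pass new ScrapeConfig({ url, asp: true, country: "US" }); the interface, helper names, and result shape below are illustrative.

```typescript
// The documented request settings: anti-scraping protection on, US geolocation.
function buildScrapeConfig(url: string) {
  return { url, asp: true, country: "US" };
}

// Minimal stand-in for the Scrapfly client surface the scrapers rely on.
interface ScrapflyLike {
  scrape(config: { url: string; asp: boolean; country: string }): Promise<{
    result: { content: string; error?: { message: string } | null };
  }>;
}

// Hypothetical wrapper: run one scrape and surface Scrapfly-reported errors
// (including rate limits) to the caller as exceptions.
async function fetchProfilePage(client: ScrapflyLike, url: string): Promise<string> {
  const apiResult = await client.scrape(buildScrapeConfig(url));
  if (apiResult.result.error) {
    throw new Error(`Scrapfly error: ${apiResult.result.error.message}`);
  }
  return apiResult.result.content;
}
```

Injecting the client this way also makes each platform scraper unit-testable against a fake client.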
Location: packages/likeness-search/src/apiRetryHandler.ts
The general API retry handler recognizes Scrapfly rate limit errors:
- Rate-limited events are marked AWAITING_RETRY with exponential backoff
- isRateLimitError() includes Scrapfly in its detection logic

Each platform scraper:
- Throws if SCRAPFLY_API_KEY is missing
- Returns an error result when apiResult.result.error exists

The processSocialScraping() function:
- Catches per-platform errors and returns a ScrapeResult with an error field
- Persists failures via upsertScrape() with the error message
- Calls incrementScrapeAttemptCount() for failed scrapes
- Sets the hasRateLimitError flag for event-level retry handling

Location: packages/likeness-search/src/apiRetryHandler.ts
When scrapingResult.hasRateLimitError is true:
- The event is marked AWAITING_RETRY
- A retryAt timestamp is set with exponential backoff
- The event returns to PENDING when retryAt passes

Social scraping is triggered by the indexing daemon when:
- Data Sufficiency Check - checkDataSufficiency() returns needsSocialScraping: true
  - packages/db/src/access/dataSufficiencyChecker.ts
- Base Requirements Check - checkBaseRequirements() detects a missing image
  - packages/db/src/access/baseRequirementsChecker.ts
  - Triggered when account.imageUrl is missing AND the account has social links
Location: packages/likeness-search/src/triggerSubProcess.ts
The indexing daemon triggers social scraping using a fire-and-forget pattern:
- A non-blocking fetch() call to /api/indexing/scrape-social
- The event stays IN_PROGRESS while scraping happens

After scraping completes:
- account_social_links.followersCount updated per platform
- likeness_assets record created for the scraped avatar
- Event marked COMPLETED and a new queue event created

Location: packages/social-scraper/src/processSocialScraping.ts
- MAX_SCRAPE_ATTEMPTS = 5 - Maximum retry attempts per link
- SCRAPE_RECENCY_HOURS = 24 - Hours before re-scraping successful links

Track these metrics for monitoring:
- Average attemptCount before success or permanent failure
- How often hasRateLimitError is true

The system logs at key points in the scraping and retry flow.
Location: packages/social-scraper/package.json
- @zooly/app-db - Database access functions
- @zooly/types - TypeScript types
- @zooly/util-srv - S3 service utilities
- scrapfly-sdk - Web scraping API client