Backfill Scale Reality Check
contact_connections work_overlap backfill — measured sandbox5 throughput, prod extrapolation, and the realistic plan for the maintenance window.TL;DR — In plain English
Production has ~3.5M contacts to backfill, projected to ~570M contact_connections edges (range 300–800M). The optimized per-contact SQL is 40× faster than the original, but the real prod risk isn't query speed — it's Sidekiq stability and DB connection headroom during the maintenance window.
- 1Iteration count is 3.5M, not 27M. We iterate over contacts (not CWE rows). 8.2× the sandbox5 count.
- 2Throughput target: 12k edges/min sustained. Measured on sandbox5 with 8 active Sidekiq workers on optimized SQL.
- 3Realistic window: 8–30 hours depending on the Sidekiq fleet. ~100 threads on
:defaultlands at ~8h. - 4SQL speedup is real. Heaviest sandbox5 contact (104 CWEs): 619.3s → 15.2s per job (commit
bbf47c0e2d). - 5Gating issue: sandbox5 Sidekiq restarts ~every 5h and Maintenance Tasks doesn't resume cleanly. Must understand at the infra layer before prod.
One-line summary: Code path is ready; the prod risk is operational (Sidekiq stability, RDS connection pool) not algorithmic.
1. The numbers
1.1 Production universe — measured 2026-05-13 via prod reader DB
SELECT
(SELECT COUNT(DISTINCT contact_id)
FROM contact_work_experiences
WHERE organization_id IS NOT NULL) AS contacts_to_backfill,
(SELECT COUNT(*) FROM contact_work_experiences) AS total_cwes,
(SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NOT NULL) AS cwes_with_org,
(SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NULL) AS cwes_without_org,
(SELECT COUNT(DISTINCT organization_id)
FROM contact_work_experiences
WHERE organization_id IS NOT NULL) AS distinct_orgs_in_cwe;
| Metric | Prod | Sandbox5 | Prod ÷ Sandbox5 |
|---|---|---|---|
| Contacts to backfill (iteration count) | 3,508,418 | 429,904 | 8.2× |
| Total CWE rows | 27,783,005 | ~3.6M | 7.7× |
CWEs with organization_id | 24,772,984 | ~3.2M | 7.7× |
CWEs without organization_id | 3,010,021 | ~371k | 8.1× |
| Distinct orgs referenced in CWE | 3,236,986 | 759,245 | 4.3× |
organization_id values in CWE but only 73k exist in the organizations table; the rest reference deleted/orphan orgs. The backfill correctly filters those out via Organization.where(id: ...). Same check on prod is pending; running it tells us the effective org count for edge generation.-- Run on prod to find the effective org count:
SELECT COUNT(DISTINCT cwe.organization_id) AS in_cwe,
COUNT(o.id) AS exist_in_orgs_table
FROM contact_work_experiences cwe
LEFT JOIN organizations o ON o.id = cwe.organization_id
WHERE cwe.organization_id IS NOT NULL;
~11% of CWEs have no organization_id. Those contacts' employment is recorded as company_name strings only and will not produce edges until an upstream resolver links them to an organizations row. Separate ticket; not blocking GET-109.
1.2 Sandbox5 empirical throughput (2026-05-12 → 2026-05-13)
| Phase | Contacts processed | Edges added (net new) | Edges per contact |
|---|---|---|---|
| Pre-task baseline (steady-state callbacks) | n/a | ~3.18M | — |
| Run #5 (per-contact, interrupted at 4%) | 17,101 | ~2.6M | ~152 |
| Run #7 (per-contact, same iteration order as #5) | 16,801 | ~130k | (mostly re-upserts) |
| Final sandbox5 state | ~17k of 430k (4%) | ~5.9M | — |
Observed edge-write throughput with optimized SQL on healthy Sidekiq (8 active workers):
- ~12,000 edges/min sustained
- ~10–60k edges per 5-minute window depending on which contacts are being processed (mega-org contacts spike higher)
1.3 Per-contact job latency — sandbox5, contact 372722 (104 CWEs)
| SQL | Plan | Intermediate rows | Execution time |
|---|---|---|---|
| Old (filter post-join) | Parallel Hash Join | 1,003,463,935 | 619.3s |
| New (CTE-anchored on target) | Nested Loop + index | ~6,000 | 15.2s |
40× speedup. Most prod contacts will be faster than this — 372722 was picked deliberately as one of the heaviest cases (104 CWEs across multiple large orgs).
2. Production projection
2.1 Expected total edges
Multiplying sandbox5's ~152 edges-per-contact (measured on the first 17k contacts) by 3.5M prod contacts:
3,508,418 contacts × 152 edges/contact ≈ 533M edges
| Estimate | Edges | Why |
|---|---|---|
| Lower bound | ~300M | Long-tail contacts may have sparser networks |
| Mid estimate | ~570M | Direct extrapolation from sandbox5 sample |
| Upper bound | ~800M+ | Prod skews toward mega-org workers more than sandbox5 |
2.2 Expected runtime
At sandbox5's measured 12k edges/min throughput on a healthy Sidekiq fleet (8 workers):
570M edges ÷ 12,000 edges/min ÷ 60 min/hr ≈ 790 hours per worker thread
That's ~33 days on one worker. With parallel workers (each independent), runtime divides linearly until the DB itself becomes the bottleneck.
| Sidekiq config | Threads on :default | Wall-clock estimate |
|---|---|---|
| Current sandbox5 (1 process, ~8 threads) | 8 | ~4 days |
| Typical prod (1 process, 25 threads) | 25 | ~30 hours |
| Boosted prod (4 processes × 25 threads) | 100 | ~8 hours |
| Aggressive (10× boost) | 250 | ~3 hours (likely DB-limited) |
2.3 Sidekiq concurrency budget
Each WorkOverlapEdgeWorker invocation:
- Holds 1 Postgres connection for the duration of the cursor scan.
- Uses bounded Ruby memory (batch_size of 1,000 rows × a few hundred bytes each ≈ <10 MB).
- Runs for 1–15 seconds on a typical contact, can spike to 60+ seconds on contacts at multiple mega-orgs.
100 concurrent workers = 100 long-held PG connections. Confirm RDS connection pool headroom before the maintenance window.
3. Read-path readiness — DeepFinder covering indexes
The backfill ships data; DeepFinder reads it. Without the covering indexes below, the depth-3 recursive walk hits 489 ms on real data — past the 250 ms SLO. With them, the same query lands at 9 ms in EXPLAIN ANALYZE (~130 ms warm-cache on real sandbox5 with 12.4M edges). The indexes must be in place before backfill starts: building them after 570M rows have landed would take hours on its own, and any DeepFinder traffic during that gap would time out.
kind='work_overlap'. Pre-backfill there are zero matching rows, so CREATE INDEX CONCURRENTLY finishes in seconds against an empty subset. The indexes then grow row-by-row as the backfill writes. This is why the migration runs first.3.1 Index design + UNION ALL rewrite
Two symmetric covering partial indexes — one per endpoint of the undirected edge — let the planner pull pre-sorted rows directly from the index, no heap fetch, no in-loop sort:
-- db/migrate/20260513213400_add_deep_finder_covering_indexes.rb
add_index :contact_connections, [:contact_a_id, :date_to],
order: { date_to: :desc },
include: %i[contact_b_id bridge_entity_id date_from],
where: "kind = 'work_overlap'",
algorithm: :concurrently,
name: 'idx_cc_walk_a'
add_index :contact_connections, [:contact_b_id, :date_to],
order: { date_to: :desc },
include: %i[contact_a_id bridge_entity_id date_from],
where: "kind = 'work_overlap'",
algorithm: :concurrently,
name: 'idx_cc_walk_b'
Why two indexes. contact_connections is undirected — a walk from contact X needs to find rows where X is either contact_a_id OR contact_b_id. A single composite OR-scan can't preserve sort order through the OR clause, so the planner falls back to a sort step. Splitting into two single-key indexes lets each branch read pre-sorted rows.
Why include (covering). The recursive walk needs bridge_entity_id, date_from, date_to on every hop. Putting them in the INCLUDE clause makes the index a covering index — Postgres serves the query from the index leaf pages without touching the table heap.
Why partial on kind='work_overlap'. The table will hold other connection kinds eventually (linkedin, etc.). The partial predicate keeps these indexes small and dense — only the rows DeepFinder cares about. Smaller index → fits in shared_buffers → cache-friendly.
The matching SQL rewrite. Without the UNION ALL split in DeepFinder#fetch_path_rows, the planner sees a single OR'd predicate and can't tell which index to use — it picks a worse plan. The recursive case rewrite forces it:
-- inside the recursive CTE (see deep_finder.rb fetch_path_rows)
INNER JOIN LATERAL (
SELECT next_contact_id, bridge_entity_id, date_from, date_to
FROM (
(SELECT contact_b_id AS next_contact_id, bridge_entity_id, date_from, date_to
FROM contact_connections
WHERE contact_a_id = b.current_contact_id AND kind = 'work_overlap'
ORDER BY date_to DESC
LIMIT 25) -- ← reads idx_cc_walk_a, already sorted
UNION ALL
(SELECT contact_a_id AS next_contact_id, bridge_entity_id, date_from, date_to
FROM contact_connections
WHERE contact_b_id = b.current_contact_id AND kind = 'work_overlap'
ORDER BY date_to DESC
LIMIT 25) -- ← reads idx_cc_walk_b, already sorted
) merged
WHERE next_contact_id <> ALL(b.contact_path)
AND EXISTS (SELECT 1 FROM user_contact_collections ucc ...)
ORDER BY date_to DESC
LIMIT 25
) cc ON TRUE
3.2 EXPLAIN ANALYZE — before vs after
Measured on sandbox5 (12.4M contact_connections rows, depth-3 walk on contact 3083 — Sebastian Stadil, ~15k edges):
| Setup | Plan | Execution time | Verdict |
|---|---|---|---|
| Before (single composite index, OR-scan) | Bitmap Heap Scan → Sort → Nested Loop | 489 ms | Over SLO |
After (idx_cc_walk_a/b + UNION ALL) | Index Only Scan ×2 → Append → Limit | 9 ms | 54× faster |
kind='work_overlap'Pre-flight checklist — index hygiene
- Run the migration first.
db/migrate/20260513213400_add_deep_finder_covering_indexes.rbusesCONCURRENTLY+if_not_exists— safe on prod, idempotent. Finishes in seconds against an emptywork_overlappartition. - Verify presence before kickoff:
SELECT indexname FROM pg_indexes WHERE tablename='contact_connections' AND indexname IN ('idx_cc_walk_a','idx_cc_walk_b');— must return both. - After backfill, run
ANALYZE contact_connections. Autoanalyze threshold is 10% of table size; bulk inserts outpace it. Without manual ANALYZE the planner picks bad plans against stale stats. - Warm shared_buffers with sample DeepFinder queries (extended timeouts) before re-enabling user traffic. Cold-cache first reads dominate latency.
4. Known blockers
4.1 Sandbox5 Sidekiq stability — gating issue Open
Symptom: Every ~5 hours of running the backfill task, the MaintenanceTasks::TaskJob (the enqueuer) gets killed by Sidekiq shutdown and the framework's interrupted status doesn't resume. Workers continue draining whatever was already enqueued, but no new contacts get added to the queue until a human cancels and restarts.
Observed twice: Run #5 (2026-05-12 18:47, interrupted at 17,101/429,904), Run #7 (2026-05-12 22:57, interrupted at 16,801/429,904).
Likely root causes (not yet diagnosed):
- Sidekiq container restarted by orchestrator (ECS task replacement, k8s eviction)
- OOM kill (less likely now that streaming is in)
- Scheduled deploy or maintenance hook hitting nightly
Mitigations (in order of preference):
- Investigate why Sidekiq restarts on sandbox5 and fix at the infra layer.
- Add a Maintenance Tasks resume detector — a small cron worker that finds
interruptedruns older than N minutes and re-enqueues them. Not in PR #6724; would be a follow-up. - Live with manual cancel-and-restart cycles. Loses progress between cycles but is correct (idempotent upserts).
4.2 Unique-jobs lock leakage on purge Documented
When we purge queue:default to start clean, the sidekiq-unique-jobs locks (created by lock: :until_executed) remain in Redis with infinite TTL (lock_ttl: null). Sandbox5 cleanup required deleting ~360k orphan locks separately. Document this in the prod runbook — purging only the queue without the locks will silently swallow re-enqueues for those contacts.
Cleanup pattern (run on prod redis if a purge ever happens):
# In Rails console with Sidekiq client connected to prod Redis:
deleted = 0
cursor = 0
loop do
cursor, keys = Sidekiq.redis { |r| r.scan(cursor, match: 'uniquejobs:*', count: 1000) }
Sidekiq.redis { |r| r.unlink(*keys) } if keys.any?
deleted += keys.size
break if cursor.to_i == 0
end
puts "Deleted #{deleted} uniquejobs locks"
4.3 11% of CWEs have no organization_id Out of scope
3 million CWE rows on prod have organization_id IS NULL. The contacts those rows belong to won't appear in contact_connections for that employer until an upstream company-resolution process links them. This is a coverage limitation customers may notice in DeepFinder. Not in scope for GET-109, but worth flagging on the GTM side.
5. Recommended prod backfill plan
Pre-flight — week before
- Run the diagnostic queries (§1.1 plus the orphan-org check) and update this doc with the effective edge-count target.
- Identify the 10–20 dominant mega-orgs. Spot-check
EXPLAIN ANALYZEfor the per-contact SQL on one contact at each. - Apply the DeepFinder index migration (
20260513213400_add_deep_finder_covering_indexes.rb). Idempotent; finishes in seconds against an emptywork_overlappartition. See §3. - Sandbox5 Sidekiq stability: investigate or accept. If accepting, build a small cron that auto-recovers
interruptedMaintenance Tasks runs (~30 LOC follow-up PR). - RDS capacity check: confirm connection pool can absorb ~100 long-held connections concurrently.
Day of
- Window: weekend or low-traffic block, ≥12 hours.
- Scale Sidekiq: temporarily boost
:defaultqueue concurrency. Coordinate with ops. - Run:
Maintenance::BackfillContactConnectionsByContactTaskfrom/maintenance_taskson prod backoffice. - Monitor: Sidekiq Busy/Queues tabs +
SELECT COUNT(*) FROM contact_connections WHERE kind='work_overlap' AND computed_at > NOW() - interval '5 min'for live throughput. - Throttle if necessary: if customer query latency spikes, lower Sidekiq concurrency on
:defaultmid-run.
Acceptance criteria
contact_connectionsrow count forkind='work_overlap'stabilizes (no significant deltas for 30+ minutes).- Maintenance Tasks run status:
Succeeded(notInterruptedorCancelled). ANALYZE contact_connectionsrun; planner stats refreshed against post-backfill row count.- Spot-check 5 random contacts: DeepFinder returns paths within the 500 ms SLO (warm cache).
6. What's done vs what's left
Shipped on PR #6724 (008-contact-connections-table)
| Commit | Change |
|---|---|
6222dafbb2 | WorkOverlapBackfillService streaming cursor + bounded memory |
daed2d2c6d | New per-contact BackfillContactConnectionsByContactTask (Sidekiq fan-out) |
bbf47c0e2d | Per-contact SQL CTE rewrite — 40× speedup (write path) |
d30a391435 | Removed org-iterating sibling task — single backfill path |
b55749b99d | DeepFinder covering indexes (idx_cc_walk_a/b) + UNION ALL rewrite — 54× speedup (read path) |
ce336eaf71 | DeepFinder org-name fallback from CWE company_name when canonical organizations row is missing |
d59994e8f8 | DeepFinder: surface via_user_is_importer in signals + raise DEFAULT_TIMEOUT_MS 500 → 1500 ms |
Follow-ups (not in #6724)
- Auto-recovery worker for stuck
interruptedMaintenance Tasks runs (~40 LOC + spec). - Investigation of why sandbox5 Sidekiq restarts every ~5 hours.
- Company-name-to-Organization resolver for the 11% NULL-org CWEs (separate ticket).
- Prod runbook for the maintenance window (cancel/restart, purge with lock cleanup, throttle).
7. Cross-references
- DeepFinder query latency: performance.html (different concern — read-path SLO at 250 ms)
- Synthetic seed task:
Maintenance::SeedSyntheticConnectionGraphTask - Per-contact worker:
Contacts::ConnectionPaths::WorkOverlapEdgeWorker - Per-contact service entry:
Contacts::ConnectionPaths::WorkOverlapBackfillService.call(contact_id: X)