Backfill Scale Reality Check

contact_connections work_overlap backfill — measured sandbox5 throughput, prod extrapolation, and the realistic plan for the maintenance window.
Ticket · GET-109 Status · Planning · 2026-05-13 Owner · Enaho Companion PR · #6724 Source markdown

TL;DR — In plain English

Production has ~3.5M contacts to backfill, projected to ~570M contact_connections edges (range 300–800M). The optimized per-contact SQL is 40× faster than the original, but the real prod risk isn't query speed — it's Sidekiq stability and DB connection headroom during the maintenance window.

  1. 1Iteration count is 3.5M, not 27M. We iterate over contacts (not CWE rows). 8.2× the sandbox5 count.
  2. 2Throughput target: 12k edges/min sustained. Measured on sandbox5 with 8 active Sidekiq workers on optimized SQL.
  3. 3Realistic window: 8–30 hours depending on the Sidekiq fleet. ~100 threads on :default lands at ~8h.
  4. 4SQL speedup is real. Heaviest sandbox5 contact (104 CWEs): 619.3s → 15.2s per job (commit bbf47c0e2d).
  5. 5Gating issue: sandbox5 Sidekiq restarts ~every 5h and Maintenance Tasks doesn't resume cleanly. Must understand at the infra layer before prod.

One-line summary: Code path is ready; the prod risk is operational (Sidekiq stability, RDS connection pool) not algorithmic.

Contacts to backfill (prod)
3,508,418
8.2× sandbox5
Expected edges (mid)
~570M
range 300M–800M
SQL speedup
40×
619s → 15s per contact
Throughput (sandbox5)
~12k/min
sustained, 8 workers

1. The numbers

1.1 Production universe — measured 2026-05-13 via prod reader DB

SELECT
  (SELECT COUNT(DISTINCT contact_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS contacts_to_backfill,
  (SELECT COUNT(*) FROM contact_work_experiences) AS total_cwes,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NOT NULL) AS cwes_with_org,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NULL) AS cwes_without_org,
  (SELECT COUNT(DISTINCT organization_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS distinct_orgs_in_cwe;
MetricProdSandbox5Prod ÷ Sandbox5
Contacts to backfill (iteration count)3,508,418429,9048.2×
Total CWE rows27,783,005~3.6M7.7×
CWEs with organization_id24,772,984~3.2M7.7×
CWEs without organization_id3,010,021~371k8.1×
Distinct orgs referenced in CWE3,236,986759,2454.3×
Orphan org IDs. Sandbox5 had 759k distinct organization_id values in CWE but only 73k exist in the organizations table; the rest reference deleted/orphan orgs. The backfill correctly filters those out via Organization.where(id: ...). Same check on prod is pending; running it tells us the effective org count for edge generation.
-- Run on prod to find the effective org count:
SELECT COUNT(DISTINCT cwe.organization_id) AS in_cwe,
       COUNT(o.id)                          AS exist_in_orgs_table
FROM contact_work_experiences cwe
LEFT JOIN organizations o ON o.id = cwe.organization_id
WHERE cwe.organization_id IS NOT NULL;

~11% of CWEs have no organization_id. Those contacts' employment is recorded as company_name strings only and will not produce edges until an upstream resolver links them to an organizations row. Separate ticket; not blocking GET-109.

1.2 Sandbox5 empirical throughput (2026-05-12 → 2026-05-13)

PhaseContacts processedEdges added (net new)Edges per contact
Pre-task baseline (steady-state callbacks)n/a~3.18M
Run #5 (per-contact, interrupted at 4%)17,101~2.6M~152
Run #7 (per-contact, same iteration order as #5)16,801~130k(mostly re-upserts)
Final sandbox5 state~17k of 430k (4%)~5.9M

Observed edge-write throughput with optimized SQL on healthy Sidekiq (8 active workers):

1.3 Per-contact job latency — sandbox5, contact 372722 (104 CWEs)

SQLPlanIntermediate rowsExecution time
Old (filter post-join)Parallel Hash Join1,003,463,935619.3s
New (CTE-anchored on target)Nested Loop + index~6,00015.2s

40× speedup. Most prod contacts will be faster than this — 372722 was picked deliberately as one of the heaviest cases (104 CWEs across multiple large orgs).

2. Production projection

2.1 Expected total edges

Multiplying sandbox5's ~152 edges-per-contact (measured on the first 17k contacts) by 3.5M prod contacts:

3,508,418 contacts × 152 edges/contact ≈ 533M edges
EstimateEdgesWhy
Lower bound~300MLong-tail contacts may have sparser networks
Mid estimate~570MDirect extrapolation from sandbox5 sample
Upper bound~800M+Prod skews toward mega-org workers more than sandbox5
Sample-size caveats. The 17k sandbox5 sample (early IDs) may skew high (older accounts → more CWEs) or low (haven't reached mega-org workers yet). Net direction is uncertain; the spread is real.

2.2 Expected runtime

At sandbox5's measured 12k edges/min throughput on a healthy Sidekiq fleet (8 workers):

570M edges ÷ 12,000 edges/min ÷ 60 min/hr ≈ 790 hours per worker thread

That's ~33 days on one worker. With parallel workers (each independent), runtime divides linearly until the DB itself becomes the bottleneck.

Sidekiq configThreads on :defaultWall-clock estimate
Current sandbox5 (1 process, ~8 threads)8~4 days
Typical prod (1 process, 25 threads)25~30 hours
Boosted prod (4 processes × 25 threads)100~8 hours
Aggressive (10× boost)250~3 hours (likely DB-limited)
The DB ceiling matters. Each running worker holds a Postgres connection and runs a heavy query. We measured 9 concurrent cursor queries on sandbox5 with no obvious contention, but prod's main RDS handles real customer traffic concurrently — running 100+ heavy CTEs alongside user-facing reads is the real risk.

2.3 Sidekiq concurrency budget

Each WorkOverlapEdgeWorker invocation:

100 concurrent workers = 100 long-held PG connections. Confirm RDS connection pool headroom before the maintenance window.

3. Read-path readiness — DeepFinder covering indexes

The backfill ships data; DeepFinder reads it. Without the covering indexes below, the depth-3 recursive walk hits 489 ms on real data — past the 250 ms SLO. With them, the same query lands at 9 ms in EXPLAIN ANALYZE (~130 ms warm-cache on real sandbox5 with 12.4M edges). The indexes must be in place before backfill starts: building them after 570M rows have landed would take hours on its own, and any DeepFinder traffic during that gap would time out.

Order of operations on prod. Indexes are partial on kind='work_overlap'. Pre-backfill there are zero matching rows, so CREATE INDEX CONCURRENTLY finishes in seconds against an empty subset. The indexes then grow row-by-row as the backfill writes. This is why the migration runs first.

3.1 Index design + UNION ALL rewrite

Two symmetric covering partial indexes — one per endpoint of the undirected edge — let the planner pull pre-sorted rows directly from the index, no heap fetch, no in-loop sort:

-- db/migrate/20260513213400_add_deep_finder_covering_indexes.rb
add_index :contact_connections, [:contact_a_id, :date_to],
          order: { date_to: :desc },
          include: %i[contact_b_id bridge_entity_id date_from],
          where: "kind = 'work_overlap'",
          algorithm: :concurrently,
          name: 'idx_cc_walk_a'

add_index :contact_connections, [:contact_b_id, :date_to],
          order: { date_to: :desc },
          include: %i[contact_a_id bridge_entity_id date_from],
          where: "kind = 'work_overlap'",
          algorithm: :concurrently,
          name: 'idx_cc_walk_b'

Why two indexes. contact_connections is undirected — a walk from contact X needs to find rows where X is either contact_a_id OR contact_b_id. A single composite OR-scan can't preserve sort order through the OR clause, so the planner falls back to a sort step. Splitting into two single-key indexes lets each branch read pre-sorted rows.

Why include (covering). The recursive walk needs bridge_entity_id, date_from, date_to on every hop. Putting them in the INCLUDE clause makes the index a covering index — Postgres serves the query from the index leaf pages without touching the table heap.

Why partial on kind='work_overlap'. The table will hold other connection kinds eventually (linkedin, etc.). The partial predicate keeps these indexes small and dense — only the rows DeepFinder cares about. Smaller index → fits in shared_buffers → cache-friendly.

The matching SQL rewrite. Without the UNION ALL split in DeepFinder#fetch_path_rows, the planner sees a single OR'd predicate and can't tell which index to use — it picks a worse plan. The recursive case rewrite forces it:

-- inside the recursive CTE (see deep_finder.rb fetch_path_rows)
INNER JOIN LATERAL (
  SELECT next_contact_id, bridge_entity_id, date_from, date_to
  FROM (
    (SELECT contact_b_id AS next_contact_id, bridge_entity_id, date_from, date_to
     FROM contact_connections
     WHERE contact_a_id = b.current_contact_id AND kind = 'work_overlap'
     ORDER BY date_to DESC
     LIMIT 25)              -- ← reads idx_cc_walk_a, already sorted
    UNION ALL
    (SELECT contact_a_id AS next_contact_id, bridge_entity_id, date_from, date_to
     FROM contact_connections
     WHERE contact_b_id = b.current_contact_id AND kind = 'work_overlap'
     ORDER BY date_to DESC
     LIMIT 25)              -- ← reads idx_cc_walk_b, already sorted
  ) merged
  WHERE next_contact_id <> ALL(b.contact_path)
    AND EXISTS (SELECT 1 FROM user_contact_collections ucc ...)
  ORDER BY date_to DESC
  LIMIT 25
) cc ON TRUE

3.2 EXPLAIN ANALYZE — before vs after

Measured on sandbox5 (12.4M contact_connections rows, depth-3 walk on contact 3083 — Sebastian Stadil, ~15k edges):

SetupPlanExecution timeVerdict
Before (single composite index, OR-scan)Bitmap Heap Scan → Sort → Nested Loop489 msOver SLO
After (idx_cc_walk_a/b + UNION ALL)Index Only Scan ×2 → Append → Limit9 ms54× faster
Speedup
54×
EXPLAIN ANALYZE, depth-3
EXPLAIN time
9 ms
cold optimizer, warm cache
Real-data latency
~130 ms
warm cache, real depth-3 walk
Index footprint
partial
only kind='work_overlap'
EXPLAIN vs real-data latency gap. 9 ms in EXPLAIN ANALYZE reflects optimizer + execution against a warm buffer cache. Real-world cold-cache depth-3 walks on the heaviest contacts (15k+ edges) land in the 130–250 ms range — still under the 500 ms timeout, but the headroom is real. Post-backfill the buffer cache is cold; expect first queries to time out at the old 250 ms ceiling until shared_buffers warms (see follow-up "Post-backfill warm-up" below).

Pre-flight checklist — index hygiene

  1. Run the migration first. db/migrate/20260513213400_add_deep_finder_covering_indexes.rb uses CONCURRENTLY + if_not_exists — safe on prod, idempotent. Finishes in seconds against an empty work_overlap partition.
  2. Verify presence before kickoff: SELECT indexname FROM pg_indexes WHERE tablename='contact_connections' AND indexname IN ('idx_cc_walk_a','idx_cc_walk_b'); — must return both.
  3. After backfill, run ANALYZE contact_connections. Autoanalyze threshold is 10% of table size; bulk inserts outpace it. Without manual ANALYZE the planner picks bad plans against stale stats.
  4. Warm shared_buffers with sample DeepFinder queries (extended timeouts) before re-enabling user traffic. Cold-cache first reads dominate latency.

4. Known blockers

4.1 Sandbox5 Sidekiq stability — gating issue Open

Symptom: Every ~5 hours of running the backfill task, the MaintenanceTasks::TaskJob (the enqueuer) gets killed by Sidekiq shutdown and the framework's interrupted status doesn't resume. Workers continue draining whatever was already enqueued, but no new contacts get added to the queue until a human cancels and restarts.

Observed twice: Run #5 (2026-05-12 18:47, interrupted at 17,101/429,904), Run #7 (2026-05-12 22:57, interrupted at 16,801/429,904).

Likely root causes (not yet diagnosed):

Impact on prod. A prod backfill that gets interrupted at ~4% completion and won't resume turns a planned 8-hour window into multi-day manual babysitting. This needs to be understood before launching prod.

Mitigations (in order of preference):

  1. Investigate why Sidekiq restarts on sandbox5 and fix at the infra layer.
  2. Add a Maintenance Tasks resume detector — a small cron worker that finds interrupted runs older than N minutes and re-enqueues them. Not in PR #6724; would be a follow-up.
  3. Live with manual cancel-and-restart cycles. Loses progress between cycles but is correct (idempotent upserts).

4.2 Unique-jobs lock leakage on purge Documented

When we purge queue:default to start clean, the sidekiq-unique-jobs locks (created by lock: :until_executed) remain in Redis with infinite TTL (lock_ttl: null). Sandbox5 cleanup required deleting ~360k orphan locks separately. Document this in the prod runbook — purging only the queue without the locks will silently swallow re-enqueues for those contacts.

Cleanup pattern (run on prod redis if a purge ever happens):

# In Rails console with Sidekiq client connected to prod Redis:
deleted = 0
cursor = 0
loop do
  cursor, keys = Sidekiq.redis { |r| r.scan(cursor, match: 'uniquejobs:*', count: 1000) }
  Sidekiq.redis { |r| r.unlink(*keys) } if keys.any?
  deleted += keys.size
  break if cursor.to_i == 0
end
puts "Deleted #{deleted} uniquejobs locks"

4.3 11% of CWEs have no organization_id Out of scope

3 million CWE rows on prod have organization_id IS NULL. The contacts those rows belong to won't appear in contact_connections for that employer until an upstream company-resolution process links them. This is a coverage limitation customers may notice in DeepFinder. Not in scope for GET-109, but worth flagging on the GTM side.

5. Recommended prod backfill plan

Pre-flight — week before

  1. Run the diagnostic queries (§1.1 plus the orphan-org check) and update this doc with the effective edge-count target.
  2. Identify the 10–20 dominant mega-orgs. Spot-check EXPLAIN ANALYZE for the per-contact SQL on one contact at each.
  3. Apply the DeepFinder index migration (20260513213400_add_deep_finder_covering_indexes.rb). Idempotent; finishes in seconds against an empty work_overlap partition. See §3.
  4. Sandbox5 Sidekiq stability: investigate or accept. If accepting, build a small cron that auto-recovers interrupted Maintenance Tasks runs (~30 LOC follow-up PR).
  5. RDS capacity check: confirm connection pool can absorb ~100 long-held connections concurrently.

Day of

  1. Window: weekend or low-traffic block, ≥12 hours.
  2. Scale Sidekiq: temporarily boost :default queue concurrency. Coordinate with ops.
  3. Run: Maintenance::BackfillContactConnectionsByContactTask from /maintenance_tasks on prod backoffice.
  4. Monitor: Sidekiq Busy/Queues tabs + SELECT COUNT(*) FROM contact_connections WHERE kind='work_overlap' AND computed_at > NOW() - interval '5 min' for live throughput.
  5. Throttle if necessary: if customer query latency spikes, lower Sidekiq concurrency on :default mid-run.

Acceptance criteria

6. What's done vs what's left

Shipped on PR #6724 (008-contact-connections-table)

CommitChange
6222dafbb2WorkOverlapBackfillService streaming cursor + bounded memory
daed2d2c6dNew per-contact BackfillContactConnectionsByContactTask (Sidekiq fan-out)
bbf47c0e2dPer-contact SQL CTE rewrite — 40× speedup (write path)
d30a391435Removed org-iterating sibling task — single backfill path
b55749b99dDeepFinder covering indexes (idx_cc_walk_a/b) + UNION ALL rewrite — 54× speedup (read path)
ce336eaf71DeepFinder org-name fallback from CWE company_name when canonical organizations row is missing
d59994e8f8DeepFinder: surface via_user_is_importer in signals + raise DEFAULT_TIMEOUT_MS 500 → 1500 ms

Follow-ups (not in #6724)

7. Cross-references