Backfill Scale Reality Check

contact_connections work_overlap backfill — measured sandbox5 throughput, prod extrapolation, and the realistic plan for the maintenance window.

Ticket · GET-109 Status · Planning · 2026-05-13 Owner · Enaho Companion PR · #6724 Source markdown

TL;DR — In plain English

Production has ~3.5M contacts to backfill, projected to ~570M contact_connections edges (range 300–800M). The optimized per-contact SQL is 40× faster than the original, but the real prod risk isn't query speed — it's Sidekiq stability and DB connection headroom during the maintenance window.

1Iteration count is 3.5M, not 27M. We iterate over contacts (not CWE rows). 8.2× the sandbox5 count.
2Throughput target: 12k edges/min sustained. Measured on sandbox5 with 8 active Sidekiq workers on optimized SQL.
3Realistic window: 8–30 hours depending on the Sidekiq fleet. ~100 threads on :default lands at ~8h.
4SQL speedup is real. Heaviest sandbox5 contact (104 CWEs): 619.3s → 15.2s per job (commit bbf47c0e2d).
5Gating issue: sandbox5 Sidekiq restarts ~every 5h and Maintenance Tasks doesn't resume cleanly. Must understand at the infra layer before prod.

One-line summary: Code path is ready; the prod risk is operational (Sidekiq stability, RDS connection pool) not algorithmic.

Contacts to backfill (prod)

3,508,418

8.2× sandbox5

Expected edges (mid)

~570M

range 300M–800M

SQL speedup

40×

619s → 15s per contact

Throughput (sandbox5)

~12k/min

sustained, 8 workers

1. The numbers

1.1 Production universe — measured 2026-05-13 via prod reader DB

SELECT
  (SELECT COUNT(DISTINCT contact_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS contacts_to_backfill,
  (SELECT COUNT(*) FROM contact_work_experiences) AS total_cwes,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NOT NULL) AS cwes_with_org,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NULL) AS cwes_without_org,
  (SELECT COUNT(DISTINCT organization_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS distinct_orgs_in_cwe;

Metric	Prod	Sandbox5	Prod ÷ Sandbox5
Contacts to backfill (iteration count)	3,508,418	429,904	8.2×
Total CWE rows	27,783,005	~3.6M	7.7×
CWEs with `organization_id`	24,772,984	~3.2M	7.7×
CWEs without `organization_id`	3,010,021	~371k	8.1×
Distinct orgs referenced in CWE	3,236,986	759,245	4.3×

Orphan org IDs. Sandbox5 had 759k distinct organization_id values in CWE but only 73k exist in the organizations table; the rest reference deleted/orphan orgs. The backfill correctly filters those out via Organization.where(id: ...). Same check on prod is pending; running it tells us the effective org count for edge generation.

-- Run on prod to find the effective org count:
SELECT COUNT(DISTINCT cwe.organization_id) AS in_cwe,
       COUNT(o.id)                          AS exist_in_orgs_table
FROM contact_work_experiences cwe
LEFT JOIN organizations o ON o.id = cwe.organization_id
WHERE cwe.organization_id IS NOT NULL;

~11% of CWEs have no organization_id. Those contacts' employment is recorded as company_name strings only and will not produce edges until an upstream resolver links them to an organizations row. Separate ticket; not blocking GET-109.

1.2 Sandbox5 empirical throughput (2026-05-12 → 2026-05-13)

Phase	Contacts processed	Edges added (net new)	Edges per contact
Pre-task baseline (steady-state callbacks)	n/a	~3.18M	—
Run #5 (per-contact, interrupted at 4%)	17,101	~2.6M	~152
Run #7 (per-contact, same iteration order as #5)	16,801	~130k	(mostly re-upserts)
Final sandbox5 state	~17k of 430k (4%)	~5.9M	—

Observed edge-write throughput with optimized SQL on healthy Sidekiq (8 active workers):

~12,000 edges/min sustained
~10–60k edges per 5-minute window depending on which contacts are being processed (mega-org contacts spike higher)

1.3 Per-contact job latency — sandbox5, contact 372722 (104 CWEs)

SQL	Plan	Intermediate rows	Execution time
Old (filter post-join)	Parallel Hash Join	1,003,463,935	619.3s
New (CTE-anchored on target)	Nested Loop + index	~6,000	15.2s

40× speedup. Most prod contacts will be faster than this — 372722 was picked deliberately as one of the heaviest cases (104 CWEs across multiple large orgs).

2. Production projection

2.1 Expected total edges

Multiplying sandbox5's ~152 edges-per-contact (measured on the first 17k contacts) by 3.5M prod contacts:

3,508,418 contacts × 152 edges/contact ≈ 533M edges

Estimate	Edges	Why
Lower bound	~300M	Long-tail contacts may have sparser networks
Mid estimate	~570M	Direct extrapolation from sandbox5 sample
Upper bound	~800M+	Prod skews toward mega-org workers more than sandbox5

Sample-size caveats. The 17k sandbox5 sample (early IDs) may skew high (older accounts → more CWEs) or low (haven't reached mega-org workers yet). Net direction is uncertain; the spread is real.

2.2 Expected runtime

At sandbox5's measured 12k edges/min throughput on a healthy Sidekiq fleet (8 workers):

570M edges ÷ 12,000 edges/min ÷ 60 min/hr ≈ 790 hours per worker thread

That's ~33 days on one worker. With parallel workers (each independent), runtime divides linearly until the DB itself becomes the bottleneck.

Sidekiq config	Threads on `:default`	Wall-clock estimate
Current sandbox5 (1 process, ~8 threads)	8	~4 days
Typical prod (1 process, 25 threads)	25	~30 hours
Boosted prod (4 processes × 25 threads)	100	~8 hours
Aggressive (10× boost)	250	~3 hours (likely DB-limited)

The DB ceiling matters. Each running worker holds a Postgres connection and runs a heavy query. We measured 9 concurrent cursor queries on sandbox5 with no obvious contention, but prod's main RDS handles real customer traffic concurrently — running 100+ heavy CTEs alongside user-facing reads is the real risk.

2.3 Sidekiq concurrency budget

Each WorkOverlapEdgeWorker invocation:

Holds 1 Postgres connection for the duration of the cursor scan.
Uses bounded Ruby memory (batch_size of 1,000 rows × a few hundred bytes each ≈ <10 MB).
Runs for 1–15 seconds on a typical contact, can spike to 60+ seconds on contacts at multiple mega-orgs.

100 concurrent workers = 100 long-held PG connections. Confirm RDS connection pool headroom before the maintenance window.

3. Read-path readiness — DeepFinder covering indexes

The backfill ships data; DeepFinder reads it. Without the covering indexes below, the depth-3 recursive walk hits 489 ms on real data — past the 250 ms SLO. With them, the same query lands at 9 ms in EXPLAIN ANALYZE (~130 ms warm-cache on real sandbox5 with 12.4M edges). The indexes must be in place before backfill starts: building them after 570M rows have landed would take hours on its own, and any DeepFinder traffic during that gap would time out.

Order of operations on prod. Indexes are partial on kind='work_overlap'. Pre-backfill there are zero matching rows, so CREATE INDEX CONCURRENTLY finishes in seconds against an empty subset. The indexes then grow row-by-row as the backfill writes. This is why the migration runs first.

3.1 Index design + UNION ALL rewrite

Two symmetric covering partial indexes — one per endpoint of the undirected edge — let the planner pull pre-sorted rows directly from the index, no heap fetch, no in-loop sort:

-- db/migrate/20260513213400_add_deep_finder_covering_indexes.rb
add_index :contact_connections, [:contact_a_id, :date_to],
          order: { date_to: :desc },
          include: %i[contact_b_id bridge_entity_id date_from],
          where: "kind = 'work_overlap'",
          algorithm: :concurrently,
          name: 'idx_cc_walk_a'

add_index :contact_connections, [:contact_b_id, :date_to],
          order: { date_to: :desc },
          include: %i[contact_a_id bridge_entity_id date_from],
          where: "kind = 'work_overlap'",
          algorithm: :concurrently,
          name: 'idx_cc_walk_b'

Why two indexes. contact_connections is undirected — a walk from contact X needs to find rows where X is either contact_a_id OR contact_b_id. A single composite OR-scan can't preserve sort order through the OR clause, so the planner falls back to a sort step. Splitting into two single-key indexes lets each branch read pre-sorted rows.

Why include (covering). The recursive walk needs bridge_entity_id, date_from, date_to on every hop. Putting them in the INCLUDE clause makes the index a covering index — Postgres serves the query from the index leaf pages without touching the table heap.

Why partial on kind='work_overlap'. The table will hold other connection kinds eventually (linkedin, etc.). The partial predicate keeps these indexes small and dense — only the rows DeepFinder cares about. Smaller index → fits in shared_buffers → cache-friendly.

The matching SQL rewrite. Without the UNION ALL split in DeepFinder#fetch_path_rows, the planner sees a single OR'd predicate and can't tell which index to use — it picks a worse plan. The recursive case rewrite forces it:

-- inside the recursive CTE (see deep_finder.rb fetch_path_rows)
INNER JOIN LATERAL (
  SELECT next_contact_id, bridge_entity_id, date_from, date_to
  FROM (
    (SELECT contact_b_id AS next_contact_id, bridge_entity_id, date_from, date_to
     FROM contact_connections
     WHERE contact_a_id = b.current_contact_id AND kind = 'work_overlap'
     ORDER BY date_to DESC
     LIMIT 25)              -- ← reads idx_cc_walk_a, already sorted
    UNION ALL
    (SELECT contact_a_id AS next_contact_id, bridge_entity_id, date_from, date_to
     FROM contact_connections
     WHERE contact_b_id = b.current_contact_id AND kind = 'work_overlap'
     ORDER BY date_to DESC
     LIMIT 25)              -- ← reads idx_cc_walk_b, already sorted
  ) merged
  WHERE next_contact_id <> ALL(b.contact_path)
    AND EXISTS (SELECT 1 FROM user_contact_collections ucc ...)
  ORDER BY date_to DESC
  LIMIT 25
) cc ON TRUE

3.2 EXPLAIN ANALYZE — before vs after

Measured on sandbox5 (12.4M contact_connections rows, depth-3 walk on contact 3083 — Sebastian Stadil, ~15k edges):

Setup	Plan	Execution time	Verdict
Before (single composite index, OR-scan)	Bitmap Heap Scan → Sort → Nested Loop	489 ms	Over SLO
After (`idx_cc_walk_a/b` + UNION ALL)	Index Only Scan ×2 → Append → Limit	9 ms	54× faster

Speedup

54×

EXPLAIN ANALYZE, depth-3

EXPLAIN time

9 ms

cold optimizer, warm cache

Real-data latency

~130 ms

warm cache, real depth-3 walk

Index footprint

partial

only kind='work_overlap'

EXPLAIN vs real-data latency gap. 9 ms in EXPLAIN ANALYZE reflects optimizer + execution against a warm buffer cache. Real-world cold-cache depth-3 walks on the heaviest contacts (15k+ edges) land in the 130–250 ms range — still under the 500 ms timeout, but the headroom is real. Post-backfill the buffer cache is cold; expect first queries to time out at the old 250 ms ceiling until shared_buffers warms (see follow-up "Post-backfill warm-up" below).

Pre-flight checklist — index hygiene

Run the migration first. db/migrate/20260513213400_add_deep_finder_covering_indexes.rb uses CONCURRENTLY + if_not_exists — safe on prod, idempotent. Finishes in seconds against an empty work_overlap partition.
Verify presence before kickoff: SELECT indexname FROM pg_indexes WHERE tablename='contact_connections' AND indexname IN ('idx_cc_walk_a','idx_cc_walk_b'); — must return both.
After backfill, run ANALYZE contact_connections. Autoanalyze threshold is 10% of table size; bulk inserts outpace it. Without manual ANALYZE the planner picks bad plans against stale stats.
Warm shared_buffers with sample DeepFinder queries (extended timeouts) before re-enabling user traffic. Cold-cache first reads dominate latency.

4. Known blockers

4.1 Sandbox5 Sidekiq stability — gating issue Open

Symptom: Every ~5 hours of running the backfill task, the MaintenanceTasks::TaskJob (the enqueuer) gets killed by Sidekiq shutdown and the framework's interrupted status doesn't resume. Workers continue draining whatever was already enqueued, but no new contacts get added to the queue until a human cancels and restarts.

Observed twice: Run #5 (2026-05-12 18:47, interrupted at 17,101/429,904), Run #7 (2026-05-12 22:57, interrupted at 16,801/429,904).

Likely root causes (not yet diagnosed):

Sidekiq container restarted by orchestrator (ECS task replacement, k8s eviction)
OOM kill (less likely now that streaming is in)
Scheduled deploy or maintenance hook hitting nightly

Impact on prod. A prod backfill that gets interrupted at ~4% completion and won't resume turns a planned 8-hour window into multi-day manual babysitting. This needs to be understood before launching prod.

Mitigations (in order of preference):

Investigate why Sidekiq restarts on sandbox5 and fix at the infra layer.
Add a Maintenance Tasks resume detector — a small cron worker that finds interrupted runs older than N minutes and re-enqueues them. Not in PR #6724; would be a follow-up.
Live with manual cancel-and-restart cycles. Loses progress between cycles but is correct (idempotent upserts).

4.2 Unique-jobs lock leakage on purge Documented

When we purge queue:default to start clean, the sidekiq-unique-jobs locks (created by lock: :until_executed) remain in Redis with infinite TTL (lock_ttl: null). Sandbox5 cleanup required deleting ~360k orphan locks separately. Document this in the prod runbook — purging only the queue without the locks will silently swallow re-enqueues for those contacts.

Cleanup pattern (run on prod redis if a purge ever happens):

# In Rails console with Sidekiq client connected to prod Redis:
deleted = 0
cursor = 0
loop do
  cursor, keys = Sidekiq.redis { |r| r.scan(cursor, match: 'uniquejobs:*', count: 1000) }
  Sidekiq.redis { |r| r.unlink(*keys) } if keys.any?
  deleted += keys.size
  break if cursor.to_i == 0
end
puts "Deleted #{deleted} uniquejobs locks"

4.3 11% of CWEs have no `organization_id` Out of scope

3 million CWE rows on prod have organization_id IS NULL. The contacts those rows belong to won't appear in contact_connections for that employer until an upstream company-resolution process links them. This is a coverage limitation customers may notice in DeepFinder. Not in scope for GET-109, but worth flagging on the GTM side.

5. Recommended prod backfill plan

Pre-flight — week before

Run the diagnostic queries (§1.1 plus the orphan-org check) and update this doc with the effective edge-count target.
Identify the 10–20 dominant mega-orgs. Spot-check EXPLAIN ANALYZE for the per-contact SQL on one contact at each.
Apply the DeepFinder index migration (20260513213400_add_deep_finder_covering_indexes.rb). Idempotent; finishes in seconds against an empty work_overlap partition. See §3.
Sandbox5 Sidekiq stability: investigate or accept. If accepting, build a small cron that auto-recovers interrupted Maintenance Tasks runs (~30 LOC follow-up PR).
RDS capacity check: confirm connection pool can absorb ~100 long-held connections concurrently.

Day of

Window: weekend or low-traffic block, ≥12 hours.
Scale Sidekiq: temporarily boost :default queue concurrency. Coordinate with ops.
Run: Maintenance::BackfillContactConnectionsByContactTask from /maintenance_tasks on prod backoffice.
Monitor: Sidekiq Busy/Queues tabs + SELECT COUNT(*) FROM contact_connections WHERE kind='work_overlap' AND computed_at > NOW() - interval '5 min' for live throughput.
Throttle if necessary: if customer query latency spikes, lower Sidekiq concurrency on :default mid-run.

Acceptance criteria

contact_connections row count for kind='work_overlap' stabilizes (no significant deltas for 30+ minutes).
Maintenance Tasks run status: Succeeded (not Interrupted or Cancelled).
ANALYZE contact_connections run; planner stats refreshed against post-backfill row count.
Spot-check 5 random contacts: DeepFinder returns paths within the 500 ms SLO (warm cache).

6. What's done vs what's left

Shipped on PR #6724 (`008-contact-connections-table`)

Commit	Change
`6222dafbb2`	`WorkOverlapBackfillService` streaming cursor + bounded memory
`daed2d2c6d`	New per-contact `BackfillContactConnectionsByContactTask` (Sidekiq fan-out)
`bbf47c0e2d`	Per-contact SQL CTE rewrite — 40× speedup (write path)
`d30a391435`	Removed org-iterating sibling task — single backfill path
`b55749b99d`	DeepFinder covering indexes (`idx_cc_walk_a/b`) + UNION ALL rewrite — 54× speedup (read path)
`ce336eaf71`	DeepFinder org-name fallback from CWE `company_name` when canonical `organizations` row is missing
`d59994e8f8`	DeepFinder: surface `via_user_is_importer` in signals + raise `DEFAULT_TIMEOUT_MS` 500 → 1500 ms

Follow-ups (not in #6724)

Auto-recovery worker for stuck interrupted Maintenance Tasks runs (~40 LOC + spec).
Investigation of why sandbox5 Sidekiq restarts every ~5 hours.
Company-name-to-Organization resolver for the 11% NULL-org CWEs (separate ticket).
Prod runbook for the maintenance window (cancel/restart, purge with lock cleanup, throttle).

7. Cross-references

DeepFinder query latency: performance.html (different concern — read-path SLO at 250 ms)
Synthetic seed task: Maintenance::SeedSyntheticConnectionGraphTask
Per-contact worker: Contacts::ConnectionPaths::WorkOverlapEdgeWorker
Per-contact service entry: Contacts::ConnectionPaths::WorkOverlapBackfillService.call(contact_id: X)

TL;DR — In plain English

1. The numbers

1.1 Production universe — measured 2026-05-13 via prod reader DB

1.2 Sandbox5 empirical throughput (2026-05-12 → 2026-05-13)

1.3 Per-contact job latency — sandbox5, contact 372722 (104 CWEs)

2. Production projection

2.1 Expected total edges

2.2 Expected runtime

2.3 Sidekiq concurrency budget

3. Read-path readiness — DeepFinder covering indexes

3.1 Index design + UNION ALL rewrite

3.2 EXPLAIN ANALYZE — before vs after

Pre-flight checklist — index hygiene

4. Known blockers

4.1 Sandbox5 Sidekiq stability — gating issue Open

4.2 Unique-jobs lock leakage on purge Documented

4.3 11% of CWEs have no organization_id Out of scope

5. Recommended prod backfill plan

Pre-flight — week before

Day of

Acceptance criteria

6. What's done vs what's left

Shipped on PR #6724 (008-contact-connections-table)

Follow-ups (not in #6724)

7. Cross-references

4.3 11% of CWEs have no `organization_id` Out of scope

Shipped on PR #6724 (`008-contact-connections-table`)