# Backfill Scale Reality Check: `contact_connections` work_overlap

**Ticket:** GET-109
**Status:** Planning · 2026-05-13
**Owner:** Enaho
**Companion code:** PR [#6724](https://github.com/getro/backend/pull/6724) — `backend/app/services/contacts/connection_paths/work_overlap_backfill_service.rb`, `backend/app/tasks/maintenance/backfill_contact_connections_by_contact_task.rb`

This doc captures what we actually measured on sandbox5 and what those numbers extrapolate to on production, so we walk into the prod backfill window with realistic expectations and a plan — not a guess.

---

## 1. TL;DR

- **Production has ~3.5M contacts with org-linked work experiences.** That's the iteration count for the per-contact backfill task.
- **Expected edge count on prod: 300–800 million `contact_connections` rows** (mid estimate ~570M), extrapolated from sandbox5's measured ~152 edges-per-processed-contact.
- **Realistic prod backfill runtime: 12–24 hours** with adequate Sidekiq concurrency (≥100 threads on `:default`). Plan a weekend maintenance window.
- **Known blocker for now: sandbox5 Sidekiq stability.** Every ~5 hours the maintenance task gets `interrupted` and the framework doesn't resume cleanly. This will bite on prod too if the infra reason isn't understood first.
- **The optimized SQL (commit `bbf47c0e2d`) is 40x faster than the original.** Per-contact jobs went from minutes to seconds on the same data. Streaming cursor (`6222dafbb2`) keeps memory bounded regardless of org size.

---

## 2. The numbers

### 2.1 Production universe (measured 2026-05-13 via prod reader DB)

```sql
SELECT
  (SELECT COUNT(DISTINCT contact_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS contacts_to_backfill,
  (SELECT COUNT(*) FROM contact_work_experiences) AS total_cwes,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NOT NULL) AS cwes_with_org,
  (SELECT COUNT(*) FROM contact_work_experiences WHERE organization_id IS NULL) AS cwes_without_org,
  (SELECT COUNT(DISTINCT organization_id)
     FROM contact_work_experiences
     WHERE organization_id IS NOT NULL) AS distinct_orgs_in_cwe;
```

| Metric | Prod | Sandbox5 | Prod ÷ Sandbox5 |
|---|---:|---:|---:|
| Contacts to backfill (iteration count) | **3,508,418** | 429,904 | 8.2× |
| Total CWE rows | 27,783,005 | ~3.6M | 7.7× |
| CWEs with `organization_id` | 24,772,984 | ~3.2M | 7.7× |
| CWEs **without** `organization_id` | 3,010,021 | ~371k | 8.1× |
| Distinct orgs referenced in CWE | 3,236,986 | 759,245 | 4.3× |

**Important caveat — orphan org IDs.** Sandbox5 had 759k distinct `organization_id` values in CWE but only 73k of those exist in the `organizations` table; the rest reference deleted/orphan orgs. The backfill correctly filters those out via `Organization.where(id: ...)`. Same check on prod is pending; running it tells us the *effective* org count for edge generation.

```sql
-- Run on prod to find the effective org count:
SELECT COUNT(DISTINCT cwe.organization_id) AS in_cwe,
       COUNT(o.id)                          AS exist_in_orgs_table
FROM contact_work_experiences cwe
LEFT JOIN organizations o ON o.id = cwe.organization_id
WHERE cwe.organization_id IS NOT NULL;
```

**~11% of CWEs have no org_id.** Those contacts' employment is recorded as company_name strings only and **will not produce edges** until an upstream resolver links them to an `organizations` row. Separate ticket; not blocking GET-109.

### 2.2 Sandbox5 empirical throughput (measured 2026-05-12 through 2026-05-13)

| Phase | Contacts processed | Edges added (net new) | Edges per contact |
|---|---:|---:|---:|
| Pre-task baseline (steady-state callbacks) | n/a | ~3.18M | — |
| Run #5 (per-contact, interrupted at 4%) | 17,101 | ~2.6M | ~152 |
| Run #7 (per-contact, same iteration order as #5) | 16,801 | ~130k | (mostly re-upserts of #5) |
| **Final sandbox5 state** | ~17k of 430k (4%) | **~5.9M** | — |

Observed edge-write throughput with optimized SQL on healthy Sidekiq (8 active workers):
- **~12,000 edges/min** sustained
- **~10–60k edges per 5-minute window** depending on which contacts are being processed (mega-org contacts spike higher)

### 2.3 Per-contact job latency (sandbox5, contact 372722 = 104 CWEs)

| SQL | Plan | Intermediate rows | Execution time |
|---|---|---:|---:|
| Old (filter post-join) | Parallel Hash Join | **1,003,463,935** | **619.3s** |
| New (CTE-anchored on target) | Nested Loop + index | ~6,000 | **15.2s** |

**40× speedup.** Most prod contacts will be faster than this — contact 372722 was deliberately picked as one of the heaviest cases (104 CWEs across multiple large orgs).

---

## 3. Production projection

### 3.1 Expected total edges

Multiplying sandbox5's ~152 edges-per-contact (measured on the first 17k contacts) by 3.5M prod contacts:

```
3,508,418 contacts × 152 edges/contact ≈ 533M edges
```

Range:
- **Lower bound:** ~300M (if long-tail contacts have sparser networks)
- **Mid estimate:** ~570M
- **Upper bound:** ~800M+ (if prod skews toward mega-org workers more than sandbox5)

Caveats:
- Sample of 17k early-ID contacts may skew high (older accounts → more CWEs).
- Sample of 17k may skew low (haven't reached the mega-org workers in iteration order yet).
- Net direction is uncertain; the spread is real.

### 3.2 Expected runtime

At sandbox5's measured 12k edges/min throughput on a healthy Sidekiq fleet (8 workers):

```
570M edges ÷ 12,000 edges/min ÷ 60 min/hr ≈ 790 hours per worker thread
```

That's ~33 days on one worker. With parallel workers (each independent), runtime divides linearly until the DB itself becomes the bottleneck.

| Sidekiq config | Threads on `:default` | Wall-clock estimate |
|---|---:|---:|
| Current sandbox5 (1 process, ~8 threads) | 8 | ~4 days |
| Typical prod (1 process, 25 threads) | 25 | **~30 hours** |
| Boosted prod (4 processes × 25 threads) | 100 | **~8 hours** |
| Aggressive (10× boost) | 250 | ~3 hours (likely DB-limited) |

**The DB ceiling matters.** Each running worker holds a Postgres connection and runs a heavy query. We measured 9 concurrent cursor queries on sandbox5 with no obvious contention, but prod's main RDS handles real customer traffic concurrently — running 100+ heavy CTEs alongside user-facing reads is the real risk.

### 3.3 Sidekiq concurrency budget

Each `WorkOverlapEdgeWorker` invocation:
- Holds 1 Postgres connection for the duration of the cursor scan.
- Uses bounded Ruby memory (batch_size of 1,000 rows × a few hundred bytes each ≈ <10 MB).
- Runs for 1–15 seconds on a typical contact, can spike to 60+ seconds on contacts at multiple mega-orgs.

100 concurrent workers = 100 long-held PG connections. Confirm RDS connection pool headroom **before** the maintenance window.

---

## 4. Known issues blocking a clean prod run

### 4.1 Sandbox5 Sidekiq stability — gating issue

**Symptom:** Every ~5 hours of running the backfill task, the `MaintenanceTasks::TaskJob` (the enqueuer) gets killed by Sidekiq shutdown and the framework's `interrupted` status doesn't resume. Workers continue draining whatever was already enqueued, but no new contacts get added to the queue until a human cancels and restarts.

**Observed twice:** Run #5 (2026-05-12 18:47, interrupted at 17,101/429,904), Run #7 (2026-05-12 22:57, interrupted at 16,801/429,904).

**Likely root causes (not yet diagnosed):**
- Sidekiq container restarted by orchestrator (ECS task replacement, k8s eviction)
- OOM kill (less likely now that streaming is in)
- Scheduled deploy or maintenance hook hitting nightly

**Impact on prod:** A prod backfill that gets interrupted at ~4% completion and won't resume turns a planned 8-hour window into multi-day manual babysitting. This needs to be understood before launching prod.

**Mitigations (in order of preference):**
1. Investigate **why** Sidekiq restarts on sandbox5 and fix at the infra layer.
2. Add a Maintenance Tasks **resume detector** — a small cron worker that finds `interrupted` runs older than N minutes and re-enqueues them. Not in PR #6724, would be a follow-up.
3. Live with manual cancel-and-restart cycles. Loses progress between cycles but is correct (idempotent upserts).

### 4.2 Unique-jobs lock leakage on purge

When we purge `queue:default` to start clean, the `sidekiq-unique-jobs` locks (created by `lock: :until_executed`) remain in Redis with infinite TTL (`lock_ttl: null`). Sandbox5 cleanup required deleting ~360k orphan locks separately. **Document this in the prod runbook** — purging only the queue without the locks will silently swallow re-enqueues for those contacts.

Cleanup pattern (run on prod redis if a purge ever happens):

```ruby
# In Rails console with Sidekiq client connected to prod Redis:
deleted = 0
cursor = 0
loop do
  cursor, keys = Sidekiq.redis { |r| r.scan(cursor, match: 'uniquejobs:*', count: 1000) }
  Sidekiq.redis { |r| r.unlink(*keys) } if keys.any?
  deleted += keys.size
  break if cursor.to_i == 0
end
puts "Deleted #{deleted} uniquejobs locks"
```

### 4.3 11% of CWEs have no `organization_id`

3 million CWE rows on prod have `organization_id IS NULL`. The contacts those rows belong to won't appear in `contact_connections` for that employer until an upstream company-resolution process links them. **This is a coverage limitation customers may notice in DeepFinder.** Not in scope for GET-109, but worth flagging on the GTM side.

---

## 5. Recommended prod backfill plan

### Pre-flight (week before)

1. **Run the diagnostic queries** (section 2.1 plus the orphan-org check) and update this doc with the effective edge-count target.
2. **Run query #4 from the prod plan** (top mega-orgs by contact count) to identify the 10–20 employers that will dominate runtime. Spot-check `EXPLAIN ANALYZE` for the per-contact SQL on one contact at each.
3. **Sandbox5 Sidekiq stability:** investigate or accept. If accepting, build a small cron that auto-recovers `interrupted` Maintenance Tasks runs (~30 LOC follow-up PR).
4. **RDS capacity check:** confirm connection pool can absorb ~100 long-held connections concurrently.

### Day of

1. **Window:** weekend or low-traffic block, ≥12 hours.
2. **Scale Sidekiq:** temporarily boost `:default` queue concurrency. Coordinate with ops.
3. **Run:** `Maintenance::BackfillContactConnectionsByContactTask` from `/maintenance_tasks` on prod backoffice.
4. **Monitor:** Sidekiq Busy/Queues tabs + `SELECT COUNT(*) FROM contact_connections WHERE kind='work_overlap' AND computed_at > NOW() - interval '5 min'` for live throughput.
5. **Throttle if necessary:** if customer query latency spikes, lower Sidekiq concurrency on `:default` mid-run.

### Acceptance criteria

- `contact_connections` row count for `kind='work_overlap'` stabilizes (no significant deltas for 30+ minutes).
- Maintenance Tasks run status: `Succeeded` (not `Interrupted` or `Cancelled`).
- Spot-check 5 random contacts: DeepFinder returns paths within the 250ms SLO.

---

## 6. What's done vs what's left

### Shipped on PR #6724 (`008-contact-connections-table`)

- `WorkOverlapBackfillService` streaming cursor + bounded memory (`6222dafbb2`)
- Per-org log line emitted from maintenance task — `Rails.logger.info` per iteration
- New `BackfillContactConnectionsByContactTask` (Sidekiq fan-out) (`daed2d2c6d`)
- Per-contact SQL CTE rewrite — 40× speedup (`bbf47c0e2d`)
- Removed the org-iterating sibling task — single backfill path (`d30a391435`)

### Follow-ups (not in #6724)

- Auto-recovery worker for stuck `interrupted` Maintenance Tasks runs (40 LOC, +spec).
- Investigation of why sandbox5 Sidekiq restarts every ~5 hours.
- Company-name-to-Organization resolver for the 11% NULL-org CWEs (separate ticket entirely).
- Prod runbook for the maintenance window (cancel/restart, purge with lock cleanup, throttle).

---

## 7. Cross-references

- DeepFinder query latency: [`performance.md`](performance.md) (different concern — read-path SLO at 250ms)
- Synthetic seed task: `Maintenance::SeedSyntheticConnectionGraphTask` in backend
- Per-contact worker: `Contacts::ConnectionPaths::WorkOverlapEdgeWorker`
- Per-contact service entry: `Contacts::ConnectionPaths::WorkOverlapBackfillService.call(contact_id: X)`