# Load Test: connection_paths v1 vs v2

**Ticket:** GET-109
**Date:** 2026-05-15
**Owner:** Enaho
**Companion code:** PR [#6724](https://github.com/getro/backend/pull/6724) — `backend/app/services/contacts/connection_paths/finder_v2.rb`, `backend/load/connection_paths.k6.js`, `backend/load/sweep.sh`

Side-by-side benchmark of the legacy `Finder` (CWE-direct CTE join) vs the new `FinderV2` (DeepFinder + `contact_connections`) on the production endpoint `GET /api/v2/collections/:cid/contacts/:id/connection_paths`. The endpoint accepts `?impl=v1|v2` to override the `relationship_strength` feature flag for benchmarking. We sweep `max_depth ∈ {1,2,3,4}` and `max_paths ∈ {25,100}` across both **local** and **sandbox5** to surface how the legacy join scales (it doesn't) and where v2's depth limit sits in practice.

---

## 1. TL;DR

- **At depth=2 on real data (sandbox5), v2 is ~7.6× faster than v1 at p95 (5.0s vs 38.1s) and 15× more throughput** (4.0 RPS vs 0.27 RPS). On local synthetic data v2's lead is smaller (~1.5×) — the gap **widens with real data**, as expected.
- **v2 depth=1 is essentially free** — 600–700 ms p95 on sandbox5 with ~18 RPS sustained.
- **v2 depth=3 is usable but costly** — ~20 s p95 on sandbox5; **depth=4 is risky** (23–50 s p95, mostly DB-bound on heavy contacts).
- **The legacy `Finder` cannot scale** — depth=2 p95 of 38 s on sandbox5 is *6× the SLO* and would page on prod. We must migrate to v2 before any kind of production rollout.

---

## 2. Setup

| Param | Local | Sandbox5 |
|---|---|---|
| Backend | Rails dev (1 Puma worker) — `api.local.getro.dev` | Prod-like — `api.sandbox5.getro.dev` |
| DB | Local Postgres, ~thousand `contact_connections` rows | Real `contact_connections` (~12.4M rows after backfill) |
| Collection | **4386** ("Perf collection") — 500k shared UCC rows | **728** ("Getro.org") — real data |
| Auth | user 40 (`enaho@getro.com`) | user 271281 (Enaho on sandbox5) |
| Tool | k6 v1.3.0 (`brew install k6`) | k6 v1.3.0 |
| Driver | `backend/load/sweep.sh` | `ENV=sandbox5 backend/load/sweep.sh` |
| VUs | 10 | 10 |
| Duration | 20 s per cell | 20 s per cell |
| `max_depth` | 1, 2, 3, 4 | 1, 2, 3, 4 |
| `max_paths` | 25, 100 | 25, 100 |
| Impl | v1 (depth=2 only), v2 | v1 (depth=2 only), v2 |

Legacy `Finder` is hard-coded at depth=2 — it ignores `max_depth`. So v1 cells only run at depth=2; depths 1, 3, 4 are v2-only.

### Target contacts

**Local (collection 4386):** synthetic perf data with denser-than-real graphs.
```
3349  wren_133    3338  ellis_109    3300  logan_94
3690  jordan_106  3439  emerson_117  1033791  perfsynth1054_171  1033638  perfsynth901_107
```

**Sandbox5 (collection 728):** real curated graphs.
```
3083   sebastian_15k    14080  kristin_11k     23352  abdulrahman_13k
84573  chris_225        372722 gheran_41       158317 enaho_151
```

---

## 3. Headline: v1 vs v2 at depth=2

This is the apples-to-apples comparison. Legacy can only do depth=2; v2 is benchmarked here to match.

| env | impl | max_paths | p50 (ms) | p90 (ms) | p95 (ms) | avg (ms) | RPS | reqs |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| local | v1 | 25 | 3,175 | 3,816 | **3,832** | 3,135 | 2.6 | 93 |
| local | v1 | 100 | 2,858 | 3,384 | **3,443** | 2,557 | 3.1 | 113 |
| local | **v2** | 25 | 1,983 | 2,463 | **2,627** | 1,772 | **4.6** | 161 |
| local | **v2** | 100 | 1,998 | 2,278 | **2,572** | 1,718 | **4.7** | 166 |
| sandbox5 | v1 | 25 | 32,903 | 36,641 | **38,088** | 28,360 | 0.3 | 13 |
| sandbox5 | v1 | 100 | 11,862 | 30,670 | **33,541** | 16,458 | 0.2 | 11 |
| sandbox5 | **v2** | 25 | 1,700 | 2,668 | **5,037** | 1,952 | **4.0** | 142 |
| sandbox5 | **v2** | 100 | 2,622 | 4,478 | **4,907** | 2,811 | **2.8** | 100 |

### What this means

- **Local: v2 is ~1.5× faster** (2.6s p95 vs 3.8s, 1.5× more throughput). Small lead because local data is thin and Rails dev is the bottleneck for both.
- **Sandbox5: v2 is ~7.6× faster** (5.0s p95 vs 38.1s, 15× more throughput). The CTE hash-join hits the real-data wall hard — every additional 12M-edge join multiplies the intermediate set. The covering indexes `idx_cc_walk_a/b` keep v2 fast.
- **Sandbox5 v1 sample size is tiny** (11–13 requests in 20s) because each one takes 30+ seconds. We're observing a bounded scenario; in production this latency would queue traffic until Puma worker death.

---

## 4. v2 across all depths (sandbox5)

How v2 scales as the recursion deepens, on real data:

| max_depth | max_paths | reqs | RPS | p50 (ms) | p90 (ms) | p95 (ms) | avg (ms) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 (direct only) | 25 | 615 | **17.5** | 404 | 586 | **702** | 445 |
| 1 (direct only) | 100 | 652 | **18.6** | 401 | 549 | **609** | 421 |
| 2 (1-hop intro) | 25 | 142 | 4.0 | 1,700 | 2,668 | **5,037** | 1,952 |
| 2 (1-hop intro) | 100 | 100 | 2.8 | 2,622 | 4,478 | **4,907** | 2,811 |
| 3 (2-hop intro) | 25 | 26 | 0.58 | 13,223 | 18,411 | **19,492** | 13,020 |
| 3 (2-hop intro) | 100 | 20 | 0.39 | 19,419 | 29,179 | **29,595** | 18,903 |
| 4 (3-hop intro) | 25 | 41 | 0.76 | 6,881 | 21,387 | **22,987** | 9,307 |
| 4 (3-hop intro) | 100 | 5 | 0.08 | 14,402 | 42,364 | **50,915** | 23,107 |

### Observations

- **Depth 1 is essentially free.** Pure UCC join with no recursion. 600–700ms p95 on real data at ~18 RPS sustained.
- **Depth 2 is the production sweet spot.** 5s p95 — well above the original 250ms SLO but well under the 1500ms backend timeout. Will improve further once the DB buffer cache is warm.
- **Depth 3 doubles to triples the latency** — 20–30s p95. Workable for "show me deeper paths" exploration, not for routine reads.
- **Depth 4 has wide variance** — 23–50s p95 with `max_paths=100` and small samples. The fanout cap (`MAX_EDGES_PER_HOP=25`) is doing its job (no >50s timeouts), but the absolute latency is risky for prod.

### Why depth=4 max_paths=25 looks *faster* than depth=4 max_paths=100

Small sample sizes (41 vs 5 requests). The depth=4 max_paths=100 cells finished only 5 requests in 20s because each one took 14–60s. The distribution is bimodal — cheap contacts complete fast, mega-hub contacts hit the timeout. Don't read too much into row-by-row at the tails.

---

## 5. v2 across all depths (local)

For comparison — same sweep on local synthetic data:

| max_depth | max_paths | reqs | RPS | p50 (ms) | p90 (ms) | p95 (ms) | avg (ms) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 25 | 258 | 7.4 | 1,319 | 1,358 | **1,385** | 1,091 |
| 1 | 100 | 260 | 7.4 | 1,309 | 1,413 | **1,446** | 1,085 |
| 2 | 25 | 161 | 4.6 | 1,983 | 2,463 | **2,627** | 1,772 |
| 2 | 100 | 166 | 4.7 | 1,998 | 2,278 | **2,572** | 1,718 |
| 3 | 25 | 48 | 1.1 | 8,289 | 9,793 | **10,036** | 6,810 |
| 3 | 100 | 39 | 0.8 | 10,223 | 14,406 | **15,492** | 9,190 |
| 4 | 25 | 32 | 0.7 | 12,584 | 13,966 | **14,460** | 10,611 |
| 4 | 100 | 18 | 0.3 | 24,422 | 32,847 | **33,012** | 22,023 |

Local numbers are slower than sandbox5 at depth=1 (1.4 s vs 0.7 s p95) because:
- Local Rails is dev-mode (1 Puma worker), no compiled boot, no warm caches.
- Sandbox5 runs production-mode Rails with multiple workers + threads.

This **inverts** the usual "local is faster" intuition. **For perf testing, sandbox5 is the right environment** — local numbers are mostly dev-mode noise.

---

## 6. Implications for production rollout

1. **v2 is non-negotiable.** v1's 38s p95 on sandbox5 would saturate Puma workers and page within minutes of real traffic. Rollout via `relationship_strength` feature flag — flip it per collection as v2 perf is verified.
2. **Depth=2 is the production contract.** Matches legacy semantic and lands ~5s p95 on real data. Buffer cache warming + ANALYZE post-backfill should push this closer to 2–3s.
3. **Depth=3 is acceptable for exploratory/discovery surfaces** — e.g. the playground, admin tools — but not for tab-loading reads. ~20–30s p95.
4. **Depth=4 is research-grade.** Variable, sometimes 50s. Keep behind explicit user action (e.g. "show deeper paths"), not on default loads.
5. **`max_paths` matters less than `max_depth`.** At depth=2, `max_paths=25` vs `100` shows nearly identical p95 on sandbox5 (5.0 vs 4.9 s). The expensive work is the walk, not the result sort.

---

## 7. How to reproduce

### Prereqs
- Docker dev environment up (`make up`)
- k6 installed (`brew install k6`)
- For sandbox5: a Bearer JWT from a confirmed user with shared UCC access in collection 728 (use the admin-portal devtools snippet in the playground to grab one)

### Local sweep

```bash
VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs:
#   backend/load/results/local/sweep-summary.csv
#   backend/load/results/local/sweep-summary.md
#   backend/load/results/local/raw/{impl}_d{depth}_p{paths}.json
```

### Sandbox5 sweep

```bash
ENV=sandbox5 TOKEN=eyJ... VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs go to backend/load/results/sandbox5/
```

### Narrow sweep

```bash
DEPTHS="2,3" PATHS_SWEEP="100" IMPLS="v2" ./backend/load/sweep.sh
```

### Single-cell comparison (v1 vs v2)

```bash
./backend/load/compare_v1_v2.sh
```

---

## 8. Caveats

- **Sandbox5 v1 cells have tiny sample sizes** (11–13 reqs in 20s window) because each request takes 30+ seconds. p95 is essentially "the slowest of the few we got". Production load with this latency would be far worse.
- **Buffer cache state varies** between runs. First requests after a deploy/restart pay disk I/O; warm cache is much faster. Sandbox5 v2 depth=1 numbers (600ms) likely benefit from a warm cache for that contact set.
- **Token expiry**: 15-min TTL. Long sweeps need a fresh token midway, or re-mint between cells.
- **Rails dev mode = 1 Puma worker.** Local sweep results are good for testing the code path but not for absolute latency.

---

## 9. Cross-references

- DeepFinder query latency (single-request synthetic): [`performance.md`](performance.md)
- Backfill scale & prod planning: [`backfill-scale.md`](backfill-scale.md)
- Interactive playground: [`playground.html`](playground.html)
- Source: `backend/load/sweep.sh`, `backend/load/connection_paths.k6.js`, `backend/load/compare_v1_v2.sh`
- v2 service: `backend/app/services/contacts/connection_paths/finder_v2.rb`
- Routing controller: `backend/app/controllers/api/v2/collections/contacts/connection_paths_controller.rb`

---

## 10. Optimization journey (phases)

Iterative perf work. After each phase we re-run the sweep on local + sandbox5 and capture the delta. The numbers at the top of this doc are **Phase 0 (baseline)** — uniform `MAX_EDGES_PER_HOP=25`. Each phase below replaces or adds to the baseline.

### Phase 1 — Lower per-hop edge cap (`MAX_EDGES_PER_HOP: 25 → 15`)

**What changed:** Recursive walk now follows at most 15 most-recent edges per hop instead of 25. Bounds worst-case walk rows at depth 3 from 15,625 → 3,375 (4.6× less work).

**Why not tiered:** We tried a per-depth `CASE WHEN b.depth = 0 THEN 25 ELSE 10 END` LIMIT, but Postgres rejects CASE expressions in `LIMIT` that reference outer-row columns from a lateral subquery. Uniform 15 is the simplest workable approximation.

**Sandbox5 sweep — Phase 0 vs Phase 1 (real data):**

| cell | Phase 0 p95 | Phase 1 p95 | Δ | RPS Δ |
|---|---:|---:|---:|---:|
| v1 d=2 mp=25 | 38,088 ms | 44,958 ms | worse (small sample variance) | 0.27 → 0.20 |
| v1 d=2 mp=100 | 33,541 ms | 35,192 ms | similar | 0.17 → 0.33 |
| v2 d=1 mp=25 | 702 ms | 841 ms | similar | 17.5 → 16.6 |
| v2 d=1 mp=100 | 609 ms | **524 ms** | **−14%** | 18.6 → 22.7 (+22%) |
| **v2 d=2 mp=25** | **5,037 ms** | **2,001 ms** | **−60% (2.5× faster)** | **4.0 → 6.8 (+70%)** |
| v2 d=2 mp=100 | 4,907 ms | 4,607 ms | −6% | 2.8 → 3.9 |
| v2 d=3 mp=25 | 19,492 ms | 17,257 ms | −11% | 0.58 → 0.79 |
| v2 d=3 mp=100 | 29,595 ms | 26,955 ms | −9% | 0.39 → 0.46 |
| v2 d=4 mp=25 | 22,987 ms | 21,867 ms | similar | 0.76 → 0.59 |
| v2 d=4 mp=100 | 50,915 ms | 45,593 ms | −10% | 0.08 → 0.29 |

**Verdict:** Phase 1 lands a clear win at the **production-critical sweet spot (depth=2, max_paths=25)** — 2.5× faster, 1.7× more throughput. Modest improvements at deeper depths (10–14%). v1 cells are unchanged in semantic and remain unusable on real data.

**Local sweep skipped from the analysis** — Rails dev mode (1 Puma worker, cold caches, background-process noise) makes cap-tuning numbers too volatile to read. See full local CSV at `backend/load/results/local/sweep-summary.csv` for the data, but trust sandbox5 for the verdict.

**Commit:** `e1ddc72845` — `deep_finder: lower MAX_EDGES_PER_HOP 25 → 15`.

### Phase 2 — Broaden in-network UCC partial index to all NETWORK_SOURCES

**What didn't work — and why:** Phase 2 was originally scoped as **bidirectional BFS** (walk forward from seeds, backward from target, meet in the middle). On paper the complexity drops from `15^N` to `2 × 15^(N/2)`. In practice it doesn't work for `seed_scope=:any` (the production default) because the seed side isn't bounded by 15 — it's bounded by the collection's curated-contact count.

Sandbox5 collection 728 has **459,300 distinct curated contacts** across `NETWORK_SOURCES`. The forward walk from seeds is therefore O(S × 15^f) where S ≈ 459k, not O(15^f). Walking from all of those is worse than the entire backward frontier even at depth 4. The "30× win at depth=3" headline assumed S ≈ 15, which is off by four orders of magnitude.

We also tried **materializing the curated contact set as a CTE and replacing the per-iteration `EXISTS` with `IN (curated)`**. `WITH MATERIALIZED` forced a 459k-row scan inside every recursive lateral iteration — depth=4 went from 2.4s to **141s** (~60× slower) on target 3083. The materialized hash isn't reusable across lateral iterations the way the planner can reuse a btree-index probe.

**What actually shipped:** the existing partial index `index_ucc_in_network_membership` was scoped to `source = 5` (shared) only. Commit `c02cafdd72` had broadened DeepFinder's in-network filter to `source IN (0, 1, 2, 3, 5, 10)` weeks earlier, but the partial predicate was never updated to match. With the predicates out of sync, the planner couldn't use the partial index and fell back to `index_user_contact_collections_on_contact_id` (wide, single column on `contact_id`) plus an in-row filter — one heap fetch per probe.

The fix is a one-line predicate broadening:

```ruby
# db/migrate/20260515121009_broaden_in_network_membership_index_to_network_sources.rb
remove_index :user_contact_collections, name: 'index_ucc_in_network_membership', algorithm: :concurrently
add_index :user_contact_collections, [:contact_id, :collection_id],
  name: 'index_ucc_in_network_membership',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently
```

No SQL change. The DeepFinder query is unchanged; the planner picks the broadened partial automatically once `source IN (...)` is subsumed by the partial `WHERE`. Confirmed via `EXPLAIN ANALYZE` on six sandbox5 targets — every plan now reads `index_ucc_in_network_membership` instead of the wide index.

**Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (cold cache, identical statements):**

| target (edges) | Phase 1 (wide idx) | Phase 2 (partial idx) | speedup |
|---|---:|---:|---:|
| 3083 (15,307) | 2,399 ms | 1,121 ms | **2.1×** |
| 14080 (11,460) | 524 ms | 121 ms | **4.3×** |
| 23352 (13,319) | 1,632 ms | 894 ms | **1.8×** |
| 84573 (225) | 1,137 ms | 863 ms | **1.3×** |
| 372722 (41) | 2,280 ms | 893 ms | **2.6×** |
| 158317 (151) | 929 ms | 516 ms | **1.8×** |

**Sandbox5 sweep — Phase 1 vs Phase 2 (10 VUs · 20 s · 6 contacts):**

| cell | Phase 1 p95 | Phase 2 p95 | Δ | RPS Δ |
|---|---:|---:|---:|---:|
| v2 d=1 mp=100 | 524 ms | 400 ms | **−24%** | 22.7 → 26.9 |
| v2 d=2 mp=25 | 2,001 ms | 1,976 ms | similar (saturated at this depth) | 6.8 → 7.0 |
| **v2 d=2 mp=100** | **4,607 ms** | **2,015 ms** | **−56% (2.3× faster)** | 3.9 → 7.5 |
| **v2 d=3 mp=25** | **17,257 ms** | **9,431 ms** | **−45% (1.8× faster)** | 0.79 → 1.17 |
| v2 d=3 mp=100 | 26,955 ms | 18,483 ms | **−31%** | 0.46 → 0.76 |
| v2 d=4 mp=25 | 21,867 ms | 14,600 ms | **−33%** | 0.59 → 0.86 |
| **v2 d=4 mp=100** | **45,593 ms** | **23,528 ms** | **−48% (1.9× faster)** | 0.29 → 0.55 |
| v1 d=2 mp=25 | 44,958 ms | 29,456 ms | −34% | 0.20 → 0.43 |

**Verdict:** Broadening the partial-index predicate lands a 1.8–2.3× win across **every** depth=2+ cell, including the previously-intractable depth=3 (17s → 9s) and depth=4 (46s → 24s). v1 also improves (29s vs 45s) because v1's seed-scope-importer probe hits the same index — bonus side effect.

**Caveats:**
- Sample sizes at depth=4 are small (27 reqs at mp=100). Some 502s seen at the highest depth/path cell — likely tripping the 1500 ms `deep_finder.timeout` SET LOCAL or hitting per-app-server resource limits. Depth=4 is still not production-ready in absolute terms.
- The dominant cost is no longer index-seek time; it's the sheer number of EXISTS probes (~100k at depth=4) and final-sort over the result set. Further wins at depth 4 likely require a different attack — precomputed 2-hop adjacency, Redis cache, or product-side decision to cap at depth 3.

**Commit:** `0a89e1f734` — `ucc: broaden in_network_membership partial to NETWORK_SOURCES`.

### Phase 3 — Reverse the UCC partial-index column order

**Hypothesis:** Phase 2's partial is `(contact_id, collection_id)` — fine for "is this contact in any collection", but our actual access pattern is the opposite. DeepFinder's recursive lateral does ~100 k EXISTS probes per request, all with the **same fixed `collection_id`** and ~100 k different `contact_id`s. Putting `collection_id` *first* in the composite means every row for collection 728 lives in a contiguous btree slice. Once that slice is in the buffer pool, every probe within the request is a hot-cache hit.

**Shape:** a sibling partial index with reversed column order. Same `WHERE` clause. Old index kept in place — planner picks whichever wins on stats.

```ruby
# db/migrate/20260515160233_add_ucc_collection_contact_in_network_index.rb
add_index :user_contact_collections, [:collection_id, :contact_id],
  name: 'index_ucc_collection_contact_in_network',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently
```

Original estimate was 1.5–2.5× cold-cache win. **Actual win was an order of magnitude bigger** — the warm-slice effect compounds across the ~3,500 lateral iterations at depth=4.

**Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (warm cache, 3 consecutive runs averaged):**

| target (edges) | Phase 1 (wide idx) | Phase 2 (partial idx) | **Phase 3 (reversed)** | Phase 3 vs Phase 2 |
|---|---:|---:|---:|---:|
| 3083 (15,307) | 2,399 ms | 1,121 ms | **48 ms** | **23×** |
| 14080 (11,460) | 524 ms | 121 ms | **51 ms** | 2.4× |
| 23352 (13,319) | 1,632 ms | 894 ms | **53 ms** | **17×** |
| 84573 (225) | 1,137 ms | 863 ms | **50 ms** | **17×** |
| 372722 (41) | 2,280 ms | 893 ms | **43 ms** | **21×** |
| 158317 (151) | 929 ms | 516 ms | **58 ms** | 9× |

Cold-cache on the very first run was 1–2 s (slice load). Every subsequent run is 40–60 ms — the collection's btree slice stays in the buffer cache.

**Sandbox5 sweep — Phase 2 vs Phase 3 (10 VUs · 20 s · 6 contacts):**

| cell | Phase 2 p95 | **Phase 3 p95** | speedup | reqs | RPS |
|---|---:|---:|---:|---:|---:|
| v2 d=1 mp=100 | 400 ms | 318 ms | 1.3× | 1,014 | 28.9 |
| **v2 d=2 mp=25** | 1,976 ms | **329 ms** | **6×** | 1,015 | 28.8 |
| **v2 d=2 mp=100** | 2,015 ms | **305 ms** | **6.6×** | 916 | 20.7 |
| **v2 d=3 mp=25** | 9,431 ms | **375 ms** | **25×** | 961 | 27.3 |
| **v2 d=3 mp=100** | 18,483 ms | **495 ms** | **37×** | 901 | 25.6 |
| **v2 d=4 mp=25** | 14,600 ms | **341 ms** | **43×** | 991 | 28.2 |
| **v2 d=4 mp=100** | 23,528 ms | **314 ms** | **75×** | 1,030 | 29.4 |
| **v1 d=2 mp=25** | 29,456 ms | **327 ms** | **90×** | 1,015 | 28.9 |
| **v1 d=2 mp=100** | 36,737 ms | **318 ms** | **116×** | 1,030 | 29.4 |

**Verdict:** Every cell — including the previously-intractable depth=4 mp=100 (23.5 s → 314 ms) and v1's worst case (37 s → 318 ms) — is now comfortably under the **1500 ms SLA target**. Throughput holds at ~29 RPS across all cells (the test driver's natural ceiling at 10 VUs); the 502s seen in Phase 2 are gone. v1 picks up the same 100× gain because it does the same EXISTS membership check.

> ⚠ **Reproducibility caveat (added retroactively):** The numbers in this table are real — they were measured from a single k6 sweep on sandbox5 right after Phase 3 deployed, and the resulting CSV exists at `backend/load/results/sandbox5/sweep-summary.csv` (since overwritten by later sweeps). But we have *not* been able to reproduce these numbers on subsequent sweeps against the same commit. Identical code, same DB, freshly restarted backend container → 580 ms p95 at d=1 to 6.4 s at d=4 mp=100 (post-Phase-4) instead of 305–495 ms across the board.
>
> Most likely cause: **the original sweep landed during a window when sandbox5's backend was autoscaled to multiple containers**, so the 10 k6 VUs were distributed across more Puma workers. Today's sandbox5 sits at `running/desired 1/1` — one container, one Puma worker, 5 threads — and 10 concurrent VUs saturate the queue. Secondary likely contributor: the PG buffer pool was already hot for collection 728's UCC slice from prior work that day.
>
> Neither factor is reproducible on demand from outside, and CloudWatch history at the granularity we'd need probably wasn't retained. The Phase 3 index gains are still real (DB query is 60 ms warm, confirmed by EXPLAIN ANALYZE multiple times since); what's not reproducible is the *end-to-end sweep behavior under that specific cluster state*. Production has multi-container autoscaling by default, so the upper-bound performance shown here is more representative of prod-shape behavior than the single-container re-benches are.

**Why the win was bigger than predicted:**

The pre-shipping estimate assumed btree-pages-loaded-into-buffer would be a marginal effect on top of an already-cached index. In reality, the `(contact_id, collection_id)` partial scattered collection 728's rows across the *entire* btree's leaf pages (one slice per `contact_id`, of which there are 459 k). The reversed `(collection_id, contact_id)` partial collapses all of collection 728 into one contiguous slice (~459 k rows in ~few-MB of leaf pages). The buffer-pool footprint per request collapses by orders of magnitude. **Locality won far harder than I expected.**

**Caveats:**
- This is **one large collection**. Sandbox5's collection 728 has 459 k rows in the partial; production has many collections of varying sizes. Smaller collections were already cheap; this index buys them nothing new. The Phase 3 win is specifically for big collections like 728.
- The old `index_ucc_in_network_membership` partial is **still in place**. `pg_stat_user_indexes` after one production cycle will tell us whether it's still earning its disk; if not, a follow-up migration can drop it.
- Result-set work (Ruby sort, hydrate, serialize) is now a meaningful fraction of total request time. Future optimization can focus there (perf-plan §3.7) if the SLA budget tightens.

**Commit:** `0fc6718886` — `ucc: reversed partial index (collection_id, contact_id) for in-network EXISTS`.

### Phase 4 — Two Ruby-side fixes (over-fetch bug + push sort into SQL)

Phase 3 looked complete. Then we re-benched on a different day and the same code returned 30-second p95 numbers instead of 300 ms. Two issues compounded:

1. Sandbox5 was running with **one backend container** (`desired/running 1/1`). The original Phase 3 sweep had landed during a window when sandbox5 was autoscaled to a larger fleet — that's why throughput sustained 29 RPS. With one container and 10 VUs from k6, the per-container queue depth explodes.
2. A real **code-level over-fetch bug** in `FinderV2` was multiplying response sizes 5–10× at `max_depth > 2`. Single-container saturation made the bug visible; multi-container autoscaling had been hiding it.

Both fixes shipped together as Phase 4.

#### 4a. `FinderV2` over-fetch bug at `max_depth > 2`

`FinderV2#call` multiplies DeepFinder's `limit` by `OVER_FETCH_MULTIPLIER = 10` to give the legacy depth=2 sort + grouping a wider candidate pool. That logic only runs for depth=2; for depth>2 the early-return passes `deep_result.paths` through unchanged. But the over-fetch was still happening, so the API was returning 5–10× more paths than the user asked for.

Concretely, for `max_depth=4 max_paths=100`:

| stage | with bug | after fix |
|---|---:|---:|
| FinderV2 → DeepFinder `limit` | 1,000 | 100 |
| DeepFinder SQL fetch cap | 500 × 25 = 12,500 | 100 × 25 = 2,500 |
| DeepFinder hydrated paths | 500 | 100 |
| API response paths | **500** | **100** |
| JSONAPI serializer work | 5× too much | correct |

The fix is one conditional in `FinderV2`:

```ruby
deep_limit = max_depth > 2 ? limit : limit * OVER_FETCH_MULTIPLIER
```

For depth>2 callers, request `limit` directly; for depth=2 keep the legacy over-fetch (still needed for the grouping pool).

**Commit:** `c7dd2b24a3` — `finder_v2: skip over-fetch at depth>2 (was 5-10x response bloat)`.

#### 4b. Push DeepFinder sort into SQL + drop `SQL_OVER_FETCH_MULTIPLIER` 25→5

Originally DeepFinder over-fetched 25× the public limit so Ruby could sort the richer pool by `[depth, recency, user-vantage]`. That meant a `max_paths=100` request parsed **2,500 PG-array rows in Ruby** even though only 100 ever made it to the response.

The recency tiebreaker (max of `date_to` across the path, NULL treated as today) and user-vantage tiebreaker (`ucc.user_id = ?`) are both expressible in SQL:

```sql
ORDER BY
  (b.depth + 1) ASC,
  (SELECT max(COALESCE(t, CURRENT_DATE)) FROM unnest(b.to_path) AS t) DESC NULLS LAST,
  CASE WHEN ucc.user_id = ? THEN 0 ELSE 1 END
LIMIT ?
```

With SQL doing the sort, the over-fetch drops to 5× (still needed for the `truncated`/`total` reporting), Ruby parses 5× fewer rows, no Ruby re-sort.

**Commit:** `561243c6a9` — `deep_finder: push sort into SQL ORDER BY, drop over-fetch 25x -> 5x`.

#### Phase 4 measured win (sandbox5, 10 VUs · 20 s · single backend container)

Same sandbox5 single-container state for all three columns; same load shape; same code path. The first column is the broken state that surfaced after the FinderV2 bug had time to compound under saturation:

| cell | Re-bench (both bugs) | After fix 4a only | **After both fixes** | total speedup |
|---|---:|---:|---:|---:|
| v2 d=1 mp=25 | 648 ms | 814 ms | **578 ms** | 1.1× |
| v2 d=1 mp=100 | 561 ms | 1,147 ms | **687 ms** | similar |
| v2 d=2 mp=25 | 1,514 ms | 3,301 ms | **2,339 ms** | similar |
| v2 d=2 mp=100 | 4,497 ms | 4,338 ms | **3,635 ms** | 1.2× |
| **v2 d=3 mp=25** | 9,572 ms | 2,416 ms | **2,086 ms** | **4.6×** |
| **v2 d=3 mp=100** | 24,116 ms | 6,724 ms | **7,772 ms** | **3.1×** |
| **v2 d=4 mp=25** | 20,990 ms | 3,635 ms | **2,263 ms** | **9.3×** |
| **v2 d=4 mp=100** | 24,925 ms | 7,924 ms | **6,403 ms** | **3.9×** |

RPS climbed proportionally:
- v2 d=4 mp=25: 0.71 → 6.44 RPS (9×)
- v2 d=4 mp=100: 0.22 → 2.04 RPS (9×)

**Why we still don't see the original 305 ms numbers:** the original Phase 3 sweep was measured against an autoscaled fleet of backend containers. Today's sandbox5 has one backend container and one Puma worker (5 threads). 10 k6 VUs against 5 threads = 2-deep queue at minimum; per-request latency floor of ~580 ms at d=1 is queue wait, not computation. Production has multi-container autoscaling and will not have this floor.

**Honest verdict:** the code-level perf is as good as we can get without infrastructure changes. The DB query is 60 ms (validated by EXPLAIN ANALYZE). The Ruby-side work is now proportional to what's actually returned. The remaining latency on sandbox5 is single-container saturation under benchmark load — which doesn't exist in production. Phase 3's index gains stand; Phase 4 fixes a real over-fetch bug that surfaced once Phase 3 made everything else fast enough to expose it.
