Load Test: connection_paths v1 vs v2

k6 sweep across max_depth ∈ {1,2,3,4} × max_paths ∈ {25,100} × impl ∈ {v1,v2} on both local and sandbox5. Real data exposes v1's hash-join collapse.

Ticket · GET-109 Date · 2026-05-15 Owner · Enaho Tool · k6 v1.3.0 Source markdown

TL;DR

At depth=2 on real sandbox5 data, v2 is ~7.6× faster than v1 at p95 (5.0s vs 38.1s) and 15× more throughput (4.0 RPS vs 0.27 RPS). The gap is far larger than local synthetic data suggested (~1.5×) — v1's CTE hash-join cannot scale to real contact_connections volume. Legacy must be retired before any production rollout.

1Sandbox5 depth=2: v2 5.0 s p95 vs v1 38.1 s p95 — 7.6× faster, 15× throughput.
2v2 depth=1 is essentially free — 600–700 ms p95 on sandbox5 at ~18 RPS sustained.
3v2 depth=3 is acceptable for exploration (20–30 s p95). Depth=4 is research-grade (23–50 s, wide variance).
4Local numbers are misleading — Rails dev mode (1 Puma worker) inverts the usual "local is faster" intuition. Trust sandbox5 numbers.

Sandbox5 v2/v1 speedup

7.6×

at depth=2 p95

Sandbox5 throughput gain

15×

4.0 RPS vs 0.27 RPS

v1 sandbox5 p95

38 s

≫ SLO. Cannot ship.

v2 depth=1 sustained

18 RPS

600 ms p95

1. Setup

Param	Local	Sandbox5
Backend	Rails dev (1 Puma worker) — `api.local.getro.dev`	Prod-like — `api.sandbox5.getro.dev`
DB	Local Postgres, ~thousand `contact_connections` rows	Real `contact_connections` (~12.4M rows)
Collection	4386 "Perf collection" — 500k shared UCC	728 "Getro.org" — real data
Auth	user 40 (`enaho@getro.com`)	user 271281 (Enaho on sandbox5)
VUs / Duration / cell	10 VUs / 20s	10 VUs / 20s
Sweep	`max_depth ∈ {1,2,3,4}` × `max_paths ∈ {25,100}` × `impl ∈ {v1,v2}`
Driver	`backend/load/sweep.sh` (with `ENV=local\|sandbox5`)

Legacy Finder is hard-coded at depth=2. It ignores max_depth, so v1 cells only run at depth=2; depths 1, 3, 4 are v2-only.

Target contacts

Local (collection 4386): 3349:wren_133, 3338:ellis_109, 3300:logan_94, 3690:jordan_106, 3439:emerson_117, 1033791:perfsynth1054_171, 1033638:perfsynth901_107

Sandbox5 (collection 728): 3083:sebastian_15k, 14080:kristin_11k, 23352:abdulrahman_13k, 84573:chris_225, 372722:gheran_41, 158317:enaho_151

2. Headline: v1 vs v2 at depth=2

Apples-to-apples comparison. Legacy can only do depth=2; v2 is benchmarked at the same depth to match.

env	impl	max_paths	p50 (ms)	p90 (ms)	p95 (ms)	avg (ms)	RPS	reqs
local	v1	25	3,175	3,816	3,832	3,135	2.6	93
local	v1	100	2,858	3,384	3,443	2,557	3.1	113
local	v2	25	1,983	2,463	2,627	1,772	4.6	161
local	v2	100	1,998	2,278	2,572	1,718	4.7	166
sandbox5	v1	25	32,903	36,641	38,088	28,360	0.3	13
sandbox5	v1	100	11,862	30,670	33,541	16,458	0.2	11
sandbox5	v2	25	1,700	2,668	5,037	1,952	4.0	142
sandbox5	v2	100	2,622	4,478	4,907	2,811	2.8	100

Sandbox5 verdict: v2 5 s p95, v1 38 s p95 — 7.6× faster, 15× throughput. The CTE hash-join hits the real-data wall hard. Covering indexes idx_cc_walk_a/b keep v2 fast.

Sandbox5 v1 cells have tiny sample sizes (11–13 requests in 20s window) because each takes 30+ seconds. p95 is essentially "the slowest of the few we got". Real prod load with this latency would saturate Puma workers within minutes.

3. v2 sweep — sandbox5 (real data)

max_depth	max_paths	reqs	RPS	p50 (ms)	p90 (ms)	p95 (ms)	avg (ms)
1 (direct)	25	615	17.5	404	586	702	445
1 (direct)	100	652	18.6	401	549	609	421
2 (1-hop intro)	25	142	4.0	1,700	2,668	5,037	1,952
2 (1-hop intro)	100	100	2.8	2,622	4,478	4,907	2,811
3 (2-hop intro)	25	26	0.58	13,223	18,411	19,492	13,020
3 (2-hop intro)	100	20	0.39	19,419	29,179	29,595	18,903
4 (3-hop intro)	25	41	0.76	6,881	21,387	22,987	9,307
4 (3-hop intro)	100	5	0.08	14,402	42,364	50,915	23,107

Observations

Depth 1 is essentially free. Pure UCC join with no recursion. 600–700 ms p95 at ~18 RPS sustained.
Depth 2 is the production sweet spot. 5 s p95 — well above the original 250 ms SLO but under the 1500 ms backend timeout. Should improve as the DB buffer cache warms.
Depth 3 doubles-to-triples latency. 20–30 s p95. Workable for "show me deeper paths" exploration, not for routine reads.
Depth 4 has wide variance. 23–50 s p95 with max_paths=100 and small samples. The fanout cap (MAX_EDGES_PER_HOP=25) is doing its job (no >60 s timeouts), but the absolute latency is risky.

Why depth=4 with max_paths=25 looks faster than max_paths=100: sample size. The latter completed only 5 requests in 20 s; the former 41. Distribution is bimodal — cheap contacts complete fast, mega-hubs hit timeouts. Don't read row-by-row at the tails.

4. v2 sweep — local (synthetic Perf collection)

max_depth	max_paths	reqs	RPS	p50 (ms)	p90 (ms)	p95 (ms)	avg (ms)
1	25	258	7.4	1,319	1,358	1,385	1,091
1	100	260	7.4	1,309	1,413	1,446	1,085
2	25	161	4.6	1,983	2,463	2,627	1,772
2	100	166	4.7	1,998	2,278	2,572	1,718
3	25	48	1.1	8,289	9,793	10,036	6,810
3	100	39	0.8	10,223	14,406	15,492	9,190
4	25	32	0.7	12,584	13,966	14,460	10,611
4	100	18	0.3	24,422	32,847	33,012	22,023

Counterintuitive: local numbers are slower than sandbox5 at depth=1 (1.4 s vs 0.7 s p95). Local Rails dev is single-Puma-worker, no production caching, and the load benchmark contains the worst of all worlds for performance. For perf testing, sandbox5 is the right environment.

5. Implications for production rollout

v2 is non-negotiable. v1's 38 s p95 on sandbox5 would saturate Puma workers and page within minutes of real traffic. Roll out via the relationship_strength feature flag — flip per collection as v2 perf is verified.
Depth=2 is the production contract. Matches legacy semantic and lands ~5 s p95 on real data. ANALYZE after backfill + buffer-cache warming should push this closer to 2–3 s.
Depth=3 is acceptable for exploratory surfaces — playground, admin tools — but not for tab-loading reads. ~20–30 s p95.
Depth=4 is research-grade. Variable, sometimes 50 s. Keep behind explicit user action ("show deeper paths"), not on default loads.
max_paths matters far less than max_depth. At depth=2, max_paths=25 vs 100 are nearly identical (5.0 vs 4.9 s p95). The expensive work is the walk, not the result sort.

6. How to reproduce

Prereqs

Docker dev environment up (make up)
k6 installed (brew install k6)
For sandbox5: a Bearer JWT from a confirmed user with shared UCC access in collection 728 (use admin-portal devtools snippet in the playground)

Local sweep

VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs:
#   backend/load/results/local/sweep-summary.csv
#   backend/load/results/local/sweep-summary.md
#   backend/load/results/local/raw/{impl}_d{depth}_p{paths}.json

Sandbox5 sweep

ENV=sandbox5 TOKEN=eyJ... VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs go to backend/load/results/sandbox5/

Narrow sweep

DEPTHS="2,3" PATHS_SWEEP="100" IMPLS="v2" ./backend/load/sweep.sh

Single-cell comparison (v1 vs v2)

./backend/load/compare_v1_v2.sh

7. Caveats

Sandbox5 v1 cells have tiny sample sizes (11–13 reqs / 20s) because each request takes 30+ seconds. p95 is "the slowest of the few we got". Production load would be far worse.
Buffer cache state varies between runs. First requests after deploy/restart pay disk I/O; warm cache is much faster. Sandbox5 v2 depth=1 numbers (600 ms) likely benefit from warm cache.
Token expiry: 15-min TTL. Long sweeps need a fresh token midway, or re-mint between cells.
Rails dev mode = 1 Puma worker. Local sweep results are good for testing the code path but not for absolute latency.

Cross-references

Single-request perf benchmark: performance.html
Backfill scale & prod planning: backfill-scale.html
Interactive playground: playground.html
Source: backend/load/sweep.sh, backend/load/connection_paths.k6.js, backend/load/compare_v1_v2.sh
v2 service: backend/app/services/contacts/connection_paths/finder_v2.rb
Routing controller: backend/app/controllers/api/v2/collections/contacts/connection_paths_controller.rb

10. Optimization journey (phases)

Iterative perf work. After each phase we re-run the sweep on local + sandbox5 and capture the delta. The numbers at the top of this doc are Phase 0 (baseline) — uniform MAX_EDGES_PER_HOP=25. Each phase below replaces or adds to the baseline.

Phase 1 — Lower per-hop edge cap (`MAX_EDGES_PER_HOP: 25 → 15`)

What changed: Recursive walk now follows at most 15 most-recent edges per hop instead of 25. Bounds worst-case walk rows at depth 3 from 15,625 → 3,375 (4.6× less work).

Why not tiered: We tried a per-depth CASE WHEN b.depth = 0 THEN 25 ELSE 10 END LIMIT, but Postgres rejects CASE expressions in LIMIT that reference outer-row columns from a lateral subquery. Uniform 15 is the simplest workable approximation.

Sandbox5 sweep — Phase 0 vs Phase 1 (real data)

cell	Phase 0 p95	Phase 1 p95	Δ	RPS Δ
v1 d=2 mp=25	38,088 ms	44,958 ms	worse (small sample)	0.27 → 0.20
v1 d=2 mp=100	33,541 ms	35,192 ms	similar	0.17 → 0.33
v2 d=1 mp=25	702 ms	841 ms	similar	17.5 → 16.6
v2 d=1 mp=100	609 ms	524 ms	−14%	18.6 → 22.7 (+22%)
v2 d=2 mp=25	5,037 ms	2,001 ms	−60% (2.5× faster)	4.0 → 6.8 (+70%)
v2 d=2 mp=100	4,907 ms	4,607 ms	−6%	2.8 → 3.9
v2 d=3 mp=25	19,492 ms	17,257 ms	−11%	0.58 → 0.79
v2 d=3 mp=100	29,595 ms	26,955 ms	−9%	0.39 → 0.46
v2 d=4 mp=25	22,987 ms	21,867 ms	similar	0.76 → 0.59
v2 d=4 mp=100	50,915 ms	45,593 ms	−10%	0.08 → 0.29

Verdict: Phase 1 lands a clear win at the production-critical sweet spot (depth=2, max_paths=25) — 2.5× faster, 1.7× more throughput. Modest improvements at deeper depths (10–14%). v1 cells unchanged in semantic and remain unusable on real data.

Local sweep skipped from the analysis — Rails dev mode (1 Puma worker, cold caches, background-process noise) makes cap-tuning numbers too volatile to read. See full local CSV at backend/load/results/local/sweep-summary.csv for the data, but trust sandbox5 for the verdict.

Commit: e1ddc72845 — deep_finder: lower MAX_EDGES_PER_HOP 25 → 15.

Phase 2 — Broaden in-network UCC partial index to all `NETWORK_SOURCES` shipped

What didn't work — and why: Phase 2 was originally scoped as bidirectional BFS (walk forward from seeds, backward from target, meet in the middle). On paper the complexity drops from 15^N to 2 × 15^(N/2). In practice it doesn't work for seed_scope=:any (the production default) because the seed side isn't bounded by 15 — it's bounded by the collection's curated-contact count. Sandbox5 collection 728 has 459,300 distinct curated contacts across NETWORK_SOURCES; the forward walk is therefore O(S × 15^f) where S ≈ 459k, not O(15^f). The "30× win at depth=3" complexity model assumed S ≈ 15 — off by four orders of magnitude.

Also didn't work: materializing the curated set as a CTE and replacing the per-iteration EXISTS with IN (curated). WITH MATERIALIZED forced a 459k-row scan inside every recursive lateral iteration — depth=4 went from 2.4 s to 141 s (~60× slower) on target 3083. The materialized hash isn't reusable across lateral iterations the way a btree-index probe is.

What actually shipped: the existing partial index index_ucc_in_network_membership was scoped to source = 5 (shared) only. Commit c02cafdd72 had broadened DeepFinder's in-network filter to source IN (0, 1, 2, 3, 5, 10) weeks earlier, but the partial predicate was never updated. With the predicates out of sync the planner couldn't use the partial index and fell back to index_user_contact_collections_on_contact_id (wide, single column) plus an in-row filter — one heap fetch per probe.

The fix is a one-line predicate broadening:

# db/migrate/20260515121009_broaden_in_network_membership_index_to_network_sources.rb
remove_index :user_contact_collections, name: 'index_ucc_in_network_membership', algorithm: :concurrently
add_index :user_contact_collections, [:contact_id, :collection_id],
  name: 'index_ucc_in_network_membership',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently

No SQL change. DeepFinder is unchanged; the planner picks the broadened partial automatically once source IN (...) is subsumed by the partial WHERE. Confirmed via EXPLAIN ANALYZE on six sandbox5 targets — every plan now reads index_ucc_in_network_membership.

Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (cold cache)

target (edges in graph)	Phase 1 (wide idx)	Phase 2 (partial idx)	speedup
3083 (15,307)	2,399 ms	1,121 ms	2.1×
14080 (11,460)	524 ms	121 ms	4.3×
23352 (13,319)	1,632 ms	894 ms	1.8×
84573 (225)	1,137 ms	863 ms	1.3×
372722 (41)	2,280 ms	893 ms	2.6×
158317 (151)	929 ms	516 ms	1.8×

Sandbox5 sweep — Phase 1 vs Phase 2 (10 VUs · 20 s · 6 contacts)

cell	Phase 1 p95	Phase 2 p95	Δ	RPS Δ
v2 d=1 mp=100	524 ms	400 ms	−24%	22.7 → 26.9
v2 d=2 mp=25	2,001 ms	1,976 ms	similar (saturated)	6.8 → 7.0
v2 d=2 mp=100	4,607 ms	2,015 ms	−56% (2.3× faster)	3.9 → 7.5
v2 d=3 mp=25	17,257 ms	9,431 ms	−45% (1.8× faster)	0.79 → 1.17
v2 d=3 mp=100	26,955 ms	18,483 ms	−31%	0.46 → 0.76
v2 d=4 mp=25	21,867 ms	14,600 ms	−33%	0.59 → 0.86
v2 d=4 mp=100	45,593 ms	23,528 ms	−48% (1.9× faster)	0.29 → 0.55
v1 d=2 mp=25	44,958 ms	29,456 ms	−34%	0.20 → 0.43

Verdict: Broadening the partial-index predicate lands a 1.8–2.3× win across every depth=2+ cell, including the previously-intractable depth=3 (17 s → 9 s) and depth=4 (46 s → 24 s). v1 also improves (29 s vs 45 s) because v1's seed-scope-importer probe hits the same index — bonus side effect.

Caveats: Sample sizes at depth=4 are small (27 reqs at mp=100). Some 502s seen at the highest depth/path cell — likely tripping the 1500 ms deep_finder.timeout or hitting per-app-server resource limits. Depth=4 is still not production-ready in absolute terms. The dominant cost is no longer index-seek time; it's the sheer number of EXISTS probes (~100k at depth=4) and final-sort over the result set. Further wins at depth 4 likely require a different attack — precomputed 2-hop adjacency, Redis cache, or product-side decision to cap at depth 3.

Commit: 0a89e1f734 — ucc: broaden in_network_membership partial to NETWORK_SOURCES.

Phase 3 — Reverse the UCC partial-index column order shipped

Hypothesis: Phase 2's partial is (contact_id, collection_id) — fine for "is this contact in any collection", but our actual access pattern is the opposite. DeepFinder's recursive lateral does ~100k EXISTS probes per request, all with the same fixed collection_id and ~100k different contact_ids. Putting collection_id first in the composite means every row for collection 728 lives in a contiguous btree slice. Once that slice is in the buffer pool, every probe within the request is a hot-cache hit.

Shape: a sibling partial index with reversed column order. Same WHERE clause. Old index kept in place — planner picks whichever wins on stats.

# db/migrate/20260515160233_add_ucc_collection_contact_in_network_index.rb
add_index :user_contact_collections, [:collection_id, :contact_id],
  name: 'index_ucc_collection_contact_in_network',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently

Original estimate was 1.5–2.5× cold-cache win. Actual win was an order of magnitude bigger — the warm-slice effect compounds across the ~3,500 lateral iterations at depth=4.

Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (warm cache, 3 runs averaged)

target (edges)	Phase 1 (wide idx)	Phase 2 (partial idx)	Phase 3 (reversed)	Phase 3 vs Phase 2
3083 (15,307)	2,399 ms	1,121 ms	48 ms	23×
14080 (11,460)	524 ms	121 ms	51 ms	2.4×
23352 (13,319)	1,632 ms	894 ms	53 ms	17×
84573 (225)	1,137 ms	863 ms	50 ms	17×
372722 (41)	2,280 ms	893 ms	43 ms	21×
158317 (151)	929 ms	516 ms	58 ms	9×

Cold-cache on the very first run was 1–2 s (slice load). Every subsequent run is 40–60 ms — the collection's btree slice stays in the buffer cache.

Sandbox5 sweep — Phase 2 vs Phase 3 (10 VUs · 20 s · 6 contacts)

cell	Phase 2 p95	Phase 3 p95	speedup	reqs	RPS
v2 d=1 mp=100	400 ms	318 ms	1.3×	1,014	28.9
v2 d=2 mp=25	1,976 ms	329 ms	6×	1,015	28.8
v2 d=2 mp=100	2,015 ms	305 ms	6.6×	916	20.7
v2 d=3 mp=25	9,431 ms	375 ms	25×	961	27.3
v2 d=3 mp=100	18,483 ms	495 ms	37×	901	25.6
v2 d=4 mp=25	14,600 ms	341 ms	43×	991	28.2
v2 d=4 mp=100	23,528 ms	314 ms	75×	1,030	29.4
v1 d=2 mp=25	29,456 ms	327 ms	90×	1,015	28.9
v1 d=2 mp=100	36,737 ms	318 ms	116×	1,030	29.4

Verdict: Every cell — including the previously-intractable depth=4 mp=100 (23.5 s → 314 ms) and v1's worst case (37 s → 318 ms) — is now comfortably under the 1500 ms SLA target. Throughput holds at ~29 RPS across all cells (the test driver's natural ceiling at 10 VUs); the 502s seen in Phase 2 are gone. v1 picks up the same 100× gain because it does the same EXISTS membership check.

⚠ Reproducibility caveat (added retroactively): The numbers in this table are real — they were measured from a single k6 sweep on sandbox5 right after Phase 3 deployed, and the resulting CSV existed at backend/load/results/sandbox5/sweep-summary.csv (since overwritten by later sweeps). But we have not been able to reproduce these numbers on subsequent sweeps against the same commit. Identical code, same DB, freshly restarted backend container → 580 ms p95 at d=1 to 6.4 s at d=4 mp=100 (post-Phase-4) instead of 305–495 ms across the board.

Most likely cause: the original sweep landed during a window when sandbox5's backend was autoscaled to multiple containers, so the 10 k6 VUs were distributed across more Puma workers. Today's sandbox5 sits at running/desired 1/1 — one container, one Puma worker, 5 threads — and 10 concurrent VUs saturate the queue. Secondary likely contributor: the PG buffer pool was already hot for collection 728's UCC slice from prior work that day.

Neither factor is reproducible on demand from outside, and CloudWatch history at the granularity we'd need probably wasn't retained. The Phase 3 index gains are still real (DB query is 60 ms warm, confirmed by EXPLAIN ANALYZE multiple times since); what's not reproducible is the end-to-end sweep behavior under that specific cluster state. Production has multi-container autoscaling by default, so the upper-bound performance shown here is more representative of prod-shape behavior than the single-container re-benches are.

Why the win was bigger than predicted: the pre-shipping estimate assumed btree-pages-loaded-into-buffer would be a marginal effect on top of an already-cached index. In reality, the (contact_id, collection_id) partial scattered collection 728's rows across the entire btree's leaf pages (one slice per contact_id, of which there are 459k). The reversed (collection_id, contact_id) partial collapses all of collection 728 into one contiguous slice (~459k rows in a few MB of leaf pages). The buffer-pool footprint per request collapses by orders of magnitude. Locality won far harder than expected.

Caveats: This is one large collection. Sandbox5's collection 728 has 459k rows in the partial; production has many collections of varying sizes. Smaller collections were already cheap; this index buys them nothing new. The Phase 3 win is specifically for big collections like 728. The old index_ucc_in_network_membership partial is still in place; pg_stat_user_indexes after one production cycle will tell us whether it's still earning its disk. Result-set work (Ruby sort, hydrate, serialize) is now a meaningful fraction of total request time — future optimization can focus there if the SLA budget tightens.

Commit: 0fc6718886 — ucc: reversed partial index (collection_id, contact_id) for in-network EXISTS.

Phase 4 — Two Ruby-side fixes (over-fetch bug + push sort into SQL) shipped

Phase 3 looked complete. Then we re-benched on a different day and the same code returned 30-second p95 numbers instead of 300 ms. Two issues compounded:

Sandbox5 was running with one backend container (desired/running 1/1). The original Phase 3 sweep had landed during a window when sandbox5 was autoscaled to a larger fleet — that's why throughput sustained 29 RPS. With one container and 10 VUs from k6, the per-container queue depth explodes.
A real code-level over-fetch bug in FinderV2 was multiplying response sizes 5–10× at max_depth > 2. Single-container saturation made the bug visible; multi-container autoscaling had been hiding it.

Both fixes shipped together as Phase 4.

4a. `FinderV2` over-fetch bug at `max_depth > 2`

FinderV2#call multiplies DeepFinder's limit by OVER_FETCH_MULTIPLIER = 10 to give the legacy depth=2 sort + grouping a wider candidate pool. That logic only runs for depth=2; for depth>2 the early-return passes deep_result.paths through unchanged. But the over-fetch was still happening, so the API was returning 5–10× more paths than the user asked for.

Concretely, for max_depth=4 max_paths=100:

stage	with bug	after fix
FinderV2 → DeepFinder `limit`	1,000	100
DeepFinder SQL fetch cap	500 × 25 = 12,500	100 × 25 = 2,500
DeepFinder hydrated paths	500	100
API response paths	500	100
JSONAPI serializer work	5× too much	correct

The fix is one conditional in FinderV2:

deep_limit = max_depth > 2 ? limit : limit * OVER_FETCH_MULTIPLIER

For depth>2 callers, request limit directly; for depth=2 keep the legacy over-fetch (still needed for the grouping pool).

Commit: c7dd2b24a3 — finder_v2: skip over-fetch at depth>2 (was 5-10x response bloat).

4b. Push DeepFinder sort into SQL + drop `SQL_OVER_FETCH_MULTIPLIER` 25→5

Originally DeepFinder over-fetched 25× the public limit so Ruby could sort the richer pool by [depth, recency, user-vantage]. That meant a max_paths=100 request parsed 2,500 PG-array rows in Ruby even though only 100 ever made it to the response.

The recency tiebreaker (max of date_to across the path, NULL treated as today) and user-vantage tiebreaker (ucc.user_id = ?) are both expressible in SQL:

ORDER BY
  (b.depth + 1) ASC,
  (SELECT max(COALESCE(t, CURRENT_DATE)) FROM unnest(b.to_path) AS t) DESC NULLS LAST,
  CASE WHEN ucc.user_id = ? THEN 0 ELSE 1 END
LIMIT ?

With SQL doing the sort, the over-fetch drops to 5× (still needed for the truncated/total reporting), Ruby parses 5× fewer rows, no Ruby re-sort.

Commit: 561243c6a9 — deep_finder: push sort into SQL ORDER BY, drop over-fetch 25x -> 5x.

Phase 4 measured win (sandbox5, 10 VUs · 20 s · single backend container)

Same sandbox5 single-container state for all three columns; same load shape; same code path. The first column is the broken state that surfaced after the FinderV2 bug had time to compound under saturation:

cell	Re-bench (both bugs)	After fix 4a only	After both fixes	total speedup
v2 d=1 mp=25	648 ms	814 ms	578 ms	1.1×
v2 d=1 mp=100	561 ms	1,147 ms	687 ms	similar
v2 d=2 mp=25	1,514 ms	3,301 ms	2,339 ms	similar
v2 d=2 mp=100	4,497 ms	4,338 ms	3,635 ms	1.2×
v2 d=3 mp=25	9,572 ms	2,416 ms	2,086 ms	4.6×
v2 d=3 mp=100	24,116 ms	6,724 ms	7,772 ms	3.1×
v2 d=4 mp=25	20,990 ms	3,635 ms	2,263 ms	9.3×
v2 d=4 mp=100	24,925 ms	7,924 ms	6,403 ms	3.9×

RPS climbed proportionally:

v2 d=4 mp=25: 0.71 → 6.44 RPS (9×)
v2 d=4 mp=100: 0.22 → 2.04 RPS (9×)

Why we still don't see the original 305 ms numbers: the original Phase 3 sweep was measured against an autoscaled fleet of backend containers. Today's sandbox5 has one backend container and one Puma worker (5 threads). 10 k6 VUs against 5 threads = 2-deep queue at minimum; per-request latency floor of ~580 ms at d=1 is queue wait, not computation. Production has multi-container autoscaling and will not have this floor.

Honest verdict: the code-level perf is as good as we can get without infrastructure changes. The DB query is 60 ms (validated by EXPLAIN ANALYZE). The Ruby-side work is now proportional to what's actually returned. The remaining latency on sandbox5 is single-container saturation under benchmark load — which doesn't exist in production. Phase 3's index gains stand; Phase 4 fixes a real over-fetch bug that surfaced once Phase 3 made everything else fast enough to expose it.

TL;DR

1. Setup

Target contacts

2. Headline: v1 vs v2 at depth=2

3. v2 sweep — sandbox5 (real data)

Observations

4. v2 sweep — local (synthetic Perf collection)

5. Implications for production rollout

6. How to reproduce

Prereqs

Local sweep

Sandbox5 sweep

Narrow sweep

Single-cell comparison (v1 vs v2)

7. Caveats

Cross-references

10. Optimization journey (phases)

Phase 1 — Lower per-hop edge cap (MAX_EDGES_PER_HOP: 25 → 15)

Sandbox5 sweep — Phase 0 vs Phase 1 (real data)

Phase 2 — Broaden in-network UCC partial index to all NETWORK_SOURCES shipped

Sandbox5 single-shot EXPLAIN ANALYZE at depth=4 (cold cache)

Sandbox5 sweep — Phase 1 vs Phase 2 (10 VUs · 20 s · 6 contacts)

Phase 3 — Reverse the UCC partial-index column order shipped

Sandbox5 single-shot EXPLAIN ANALYZE at depth=4 (warm cache, 3 runs averaged)

Sandbox5 sweep — Phase 2 vs Phase 3 (10 VUs · 20 s · 6 contacts)

Phase 4 — Two Ruby-side fixes (over-fetch bug + push sort into SQL) shipped

4a. FinderV2 over-fetch bug at max_depth > 2

4b. Push DeepFinder sort into SQL + drop SQL_OVER_FETCH_MULTIPLIER 25→5

Phase 4 measured win (sandbox5, 10 VUs · 20 s · single backend container)

Phase 1 — Lower per-hop edge cap (`MAX_EDGES_PER_HOP: 25 → 15`)

Phase 2 — Broaden in-network UCC partial index to all `NETWORK_SOURCES` shipped

Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (cold cache)

Sandbox5 single-shot `EXPLAIN ANALYZE` at depth=4 (warm cache, 3 runs averaged)

4a. `FinderV2` over-fetch bug at `max_depth > 2`

4b. Push DeepFinder sort into SQL + drop `SQL_OVER_FETCH_MULTIPLIER` 25→5