Load Test: connection_paths v1 vs v2

k6 sweep across max_depth ∈ {1,2,3,4} × max_paths ∈ {25,100} × impl ∈ {v1,v2} on both local and sandbox5. Real data exposes v1's hash-join collapse.
Ticket · GET-109 Date · 2026-05-15 Owner · Enaho Tool · k6 v1.3.0 Source markdown

TL;DR

At depth=2 on real sandbox5 data, v2 is ~7.6× faster than v1 at p95 (5.0s vs 38.1s) and 15× more throughput (4.0 RPS vs 0.27 RPS). The gap is far larger than local synthetic data suggested (~1.5×) — v1's CTE hash-join cannot scale to real contact_connections volume. Legacy must be retired before any production rollout.

  1. 1Sandbox5 depth=2: v2 5.0 s p95 vs v1 38.1 s p95 — 7.6× faster, 15× throughput.
  2. 2v2 depth=1 is essentially free — 600–700 ms p95 on sandbox5 at ~18 RPS sustained.
  3. 3v2 depth=3 is acceptable for exploration (20–30 s p95). Depth=4 is research-grade (23–50 s, wide variance).
  4. 4Local numbers are misleading — Rails dev mode (1 Puma worker) inverts the usual "local is faster" intuition. Trust sandbox5 numbers.
Sandbox5 v2/v1 speedup
7.6×
at depth=2 p95
Sandbox5 throughput gain
15×
4.0 RPS vs 0.27 RPS
v1 sandbox5 p95
38 s
≫ SLO. Cannot ship.
v2 depth=1 sustained
18 RPS
600 ms p95

1. Setup

ParamLocalSandbox5
BackendRails dev (1 Puma worker) — api.local.getro.devProd-like — api.sandbox5.getro.dev
DBLocal Postgres, ~thousand contact_connections rowsReal contact_connections (~12.4M rows)
Collection4386 "Perf collection" — 500k shared UCC728 "Getro.org" — real data
Authuser 40 (enaho@getro.com)user 271281 (Enaho on sandbox5)
VUs / Duration / cell10 VUs / 20s10 VUs / 20s
Sweepmax_depth ∈ {1,2,3,4} × max_paths ∈ {25,100} × impl ∈ {v1,v2}
Driverbackend/load/sweep.sh (with ENV=local|sandbox5)
Legacy Finder is hard-coded at depth=2. It ignores max_depth, so v1 cells only run at depth=2; depths 1, 3, 4 are v2-only.

Target contacts

Local (collection 4386): 3349:wren_133, 3338:ellis_109, 3300:logan_94, 3690:jordan_106, 3439:emerson_117, 1033791:perfsynth1054_171, 1033638:perfsynth901_107

Sandbox5 (collection 728): 3083:sebastian_15k, 14080:kristin_11k, 23352:abdulrahman_13k, 84573:chris_225, 372722:gheran_41, 158317:enaho_151

2. Headline: v1 vs v2 at depth=2

Apples-to-apples comparison. Legacy can only do depth=2; v2 is benchmarked at the same depth to match.

envimplmax_paths p50 (ms)p90 (ms)p95 (ms)avg (ms) RPSreqs
localv1253,1753,8163,8323,1352.693
localv11002,8583,3843,4432,5573.1113
localv2251,9832,4632,6271,7724.6161
localv21001,9982,2782,5721,7184.7166
sandbox5v12532,90336,64138,08828,3600.313
sandbox5v110011,86230,67033,54116,4580.211
sandbox5v2251,7002,6685,0371,9524.0142
sandbox5v21002,6224,4784,9072,8112.8100
Sandbox5 verdict: v2 5 s p95, v1 38 s p95 — 7.6× faster, 15× throughput. The CTE hash-join hits the real-data wall hard. Covering indexes idx_cc_walk_a/b keep v2 fast.
Sandbox5 v1 cells have tiny sample sizes (11–13 requests in 20s window) because each takes 30+ seconds. p95 is essentially "the slowest of the few we got". Real prod load with this latency would saturate Puma workers within minutes.

3. v2 sweep — sandbox5 (real data)

max_depthmax_paths reqsRPS p50 (ms)p90 (ms)p95 (ms)avg (ms)
1 (direct)2561517.5404586702445
1 (direct)10065218.6401549609421
2 (1-hop intro)251424.01,7002,6685,0371,952
2 (1-hop intro)1001002.82,6224,4784,9072,811
3 (2-hop intro)25260.5813,22318,41119,49213,020
3 (2-hop intro)100200.3919,41929,17929,59518,903
4 (3-hop intro)25410.766,88121,38722,9879,307
4 (3-hop intro)10050.0814,40242,36450,91523,107

Observations

Why depth=4 with max_paths=25 looks faster than max_paths=100: sample size. The latter completed only 5 requests in 20 s; the former 41. Distribution is bimodal — cheap contacts complete fast, mega-hubs hit timeouts. Don't read row-by-row at the tails.

4. v2 sweep — local (synthetic Perf collection)

max_depthmax_paths reqsRPS p50 (ms)p90 (ms)p95 (ms)avg (ms)
1252587.41,3191,3581,3851,091
11002607.41,3091,4131,4461,085
2251614.61,9832,4632,6271,772
21001664.71,9982,2782,5721,718
325481.18,2899,79310,0366,810
3100390.810,22314,40615,4929,190
425320.712,58413,96614,46010,611
4100180.324,42232,84733,01222,023
Counterintuitive: local numbers are slower than sandbox5 at depth=1 (1.4 s vs 0.7 s p95). Local Rails dev is single-Puma-worker, no production caching, and the load benchmark contains the worst of all worlds for performance. For perf testing, sandbox5 is the right environment.

5. Implications for production rollout

  1. v2 is non-negotiable. v1's 38 s p95 on sandbox5 would saturate Puma workers and page within minutes of real traffic. Roll out via the relationship_strength feature flag — flip per collection as v2 perf is verified.
  2. Depth=2 is the production contract. Matches legacy semantic and lands ~5 s p95 on real data. ANALYZE after backfill + buffer-cache warming should push this closer to 2–3 s.
  3. Depth=3 is acceptable for exploratory surfaces — playground, admin tools — but not for tab-loading reads. ~20–30 s p95.
  4. Depth=4 is research-grade. Variable, sometimes 50 s. Keep behind explicit user action ("show deeper paths"), not on default loads.
  5. max_paths matters far less than max_depth. At depth=2, max_paths=25 vs 100 are nearly identical (5.0 vs 4.9 s p95). The expensive work is the walk, not the result sort.

6. How to reproduce

Prereqs

Local sweep

VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs:
#   backend/load/results/local/sweep-summary.csv
#   backend/load/results/local/sweep-summary.md
#   backend/load/results/local/raw/{impl}_d{depth}_p{paths}.json

Sandbox5 sweep

ENV=sandbox5 TOKEN=eyJ... VUS=10 DURATION=20s ./backend/load/sweep.sh
# outputs go to backend/load/results/sandbox5/

Narrow sweep

DEPTHS="2,3" PATHS_SWEEP="100" IMPLS="v2" ./backend/load/sweep.sh

Single-cell comparison (v1 vs v2)

./backend/load/compare_v1_v2.sh

7. Caveats


Cross-references


10. Optimization journey (phases)

Iterative perf work. After each phase we re-run the sweep on local + sandbox5 and capture the delta. The numbers at the top of this doc are Phase 0 (baseline) — uniform MAX_EDGES_PER_HOP=25. Each phase below replaces or adds to the baseline.

Phase 1 — Lower per-hop edge cap (MAX_EDGES_PER_HOP: 25 → 15)

What changed: Recursive walk now follows at most 15 most-recent edges per hop instead of 25. Bounds worst-case walk rows at depth 3 from 15,625 → 3,375 (4.6× less work).

Why not tiered: We tried a per-depth CASE WHEN b.depth = 0 THEN 25 ELSE 10 END LIMIT, but Postgres rejects CASE expressions in LIMIT that reference outer-row columns from a lateral subquery. Uniform 15 is the simplest workable approximation.

Sandbox5 sweep — Phase 0 vs Phase 1 (real data)

cell Phase 0 p95 Phase 1 p95 Δ RPS Δ
v1 d=2 mp=2538,088 ms44,958 msworse (small sample)0.27 → 0.20
v1 d=2 mp=10033,541 ms35,192 mssimilar0.17 → 0.33
v2 d=1 mp=25702 ms841 mssimilar17.5 → 16.6
v2 d=1 mp=100609 ms524 ms−14%18.6 → 22.7 (+22%)
v2 d=2 mp=255,037 ms2,001 ms−60% (2.5× faster)4.0 → 6.8 (+70%)
v2 d=2 mp=1004,907 ms4,607 ms−6%2.8 → 3.9
v2 d=3 mp=2519,492 ms17,257 ms−11%0.58 → 0.79
v2 d=3 mp=10029,595 ms26,955 ms−9%0.39 → 0.46
v2 d=4 mp=2522,987 ms21,867 mssimilar0.76 → 0.59
v2 d=4 mp=10050,915 ms45,593 ms−10%0.08 → 0.29
Verdict: Phase 1 lands a clear win at the production-critical sweet spot (depth=2, max_paths=25) — 2.5× faster, 1.7× more throughput. Modest improvements at deeper depths (10–14%). v1 cells unchanged in semantic and remain unusable on real data.

Local sweep skipped from the analysis — Rails dev mode (1 Puma worker, cold caches, background-process noise) makes cap-tuning numbers too volatile to read. See full local CSV at backend/load/results/local/sweep-summary.csv for the data, but trust sandbox5 for the verdict.

Commit: e1ddc72845deep_finder: lower MAX_EDGES_PER_HOP 25 → 15.

Phase 2 — Broaden in-network UCC partial index to all NETWORK_SOURCES shipped

What didn't work — and why: Phase 2 was originally scoped as bidirectional BFS (walk forward from seeds, backward from target, meet in the middle). On paper the complexity drops from 15^N to 2 × 15^(N/2). In practice it doesn't work for seed_scope=:any (the production default) because the seed side isn't bounded by 15 — it's bounded by the collection's curated-contact count. Sandbox5 collection 728 has 459,300 distinct curated contacts across NETWORK_SOURCES; the forward walk is therefore O(S × 15f) where S ≈ 459k, not O(15f). The "30× win at depth=3" complexity model assumed S ≈ 15 — off by four orders of magnitude.
Also didn't work: materializing the curated set as a CTE and replacing the per-iteration EXISTS with IN (curated). WITH MATERIALIZED forced a 459k-row scan inside every recursive lateral iteration — depth=4 went from 2.4 s to 141 s (~60× slower) on target 3083. The materialized hash isn't reusable across lateral iterations the way a btree-index probe is.

What actually shipped: the existing partial index index_ucc_in_network_membership was scoped to source = 5 (shared) only. Commit c02cafdd72 had broadened DeepFinder's in-network filter to source IN (0, 1, 2, 3, 5, 10) weeks earlier, but the partial predicate was never updated. With the predicates out of sync the planner couldn't use the partial index and fell back to index_user_contact_collections_on_contact_id (wide, single column) plus an in-row filter — one heap fetch per probe.

The fix is a one-line predicate broadening:

# db/migrate/20260515121009_broaden_in_network_membership_index_to_network_sources.rb
remove_index :user_contact_collections, name: 'index_ucc_in_network_membership', algorithm: :concurrently
add_index :user_contact_collections, [:contact_id, :collection_id],
  name: 'index_ucc_in_network_membership',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently

No SQL change. DeepFinder is unchanged; the planner picks the broadened partial automatically once source IN (...) is subsumed by the partial WHERE. Confirmed via EXPLAIN ANALYZE on six sandbox5 targets — every plan now reads index_ucc_in_network_membership.

Sandbox5 single-shot EXPLAIN ANALYZE at depth=4 (cold cache)

target (edges in graph) Phase 1 (wide idx) Phase 2 (partial idx) speedup
3083 (15,307)2,399 ms1,121 ms2.1×
14080 (11,460)524 ms121 ms4.3×
23352 (13,319)1,632 ms894 ms1.8×
84573 (225)1,137 ms863 ms1.3×
372722 (41)2,280 ms893 ms2.6×
158317 (151)929 ms516 ms1.8×

Sandbox5 sweep — Phase 1 vs Phase 2 (10 VUs · 20 s · 6 contacts)

cell Phase 1 p95 Phase 2 p95 Δ RPS Δ
v2 d=1 mp=100524 ms400 ms−24%22.7 → 26.9
v2 d=2 mp=252,001 ms1,976 mssimilar (saturated)6.8 → 7.0
v2 d=2 mp=1004,607 ms2,015 ms−56% (2.3× faster)3.9 → 7.5
v2 d=3 mp=2517,257 ms9,431 ms−45% (1.8× faster)0.79 → 1.17
v2 d=3 mp=10026,955 ms18,483 ms−31%0.46 → 0.76
v2 d=4 mp=2521,867 ms14,600 ms−33%0.59 → 0.86
v2 d=4 mp=10045,593 ms23,528 ms−48% (1.9× faster)0.29 → 0.55
v1 d=2 mp=2544,958 ms29,456 ms−34%0.20 → 0.43
Verdict: Broadening the partial-index predicate lands a 1.8–2.3× win across every depth=2+ cell, including the previously-intractable depth=3 (17 s → 9 s) and depth=4 (46 s → 24 s). v1 also improves (29 s vs 45 s) because v1's seed-scope-importer probe hits the same index — bonus side effect.
Caveats: Sample sizes at depth=4 are small (27 reqs at mp=100). Some 502s seen at the highest depth/path cell — likely tripping the 1500 ms deep_finder.timeout or hitting per-app-server resource limits. Depth=4 is still not production-ready in absolute terms. The dominant cost is no longer index-seek time; it's the sheer number of EXISTS probes (~100k at depth=4) and final-sort over the result set. Further wins at depth 4 likely require a different attack — precomputed 2-hop adjacency, Redis cache, or product-side decision to cap at depth 3.

Commit: 0a89e1f734ucc: broaden in_network_membership partial to NETWORK_SOURCES.

Phase 3 — Reverse the UCC partial-index column order shipped

Hypothesis: Phase 2's partial is (contact_id, collection_id) — fine for "is this contact in any collection", but our actual access pattern is the opposite. DeepFinder's recursive lateral does ~100k EXISTS probes per request, all with the same fixed collection_id and ~100k different contact_ids. Putting collection_id first in the composite means every row for collection 728 lives in a contiguous btree slice. Once that slice is in the buffer pool, every probe within the request is a hot-cache hit.

Shape: a sibling partial index with reversed column order. Same WHERE clause. Old index kept in place — planner picks whichever wins on stats.

# db/migrate/20260515160233_add_ucc_collection_contact_in_network_index.rb
add_index :user_contact_collections, [:collection_id, :contact_id],
  name: 'index_ucc_collection_contact_in_network',
  where: 'source IN (0, 1, 2, 3, 5, 10) AND user_id IS NOT NULL',
  algorithm: :concurrently

Original estimate was 1.5–2.5× cold-cache win. Actual win was an order of magnitude bigger — the warm-slice effect compounds across the ~3,500 lateral iterations at depth=4.

Sandbox5 single-shot EXPLAIN ANALYZE at depth=4 (warm cache, 3 runs averaged)

target (edges) Phase 1 (wide idx) Phase 2 (partial idx) Phase 3 (reversed) Phase 3 vs Phase 2
3083 (15,307)2,399 ms1,121 ms48 ms23×
14080 (11,460)524 ms121 ms51 ms2.4×
23352 (13,319)1,632 ms894 ms53 ms17×
84573 (225)1,137 ms863 ms50 ms17×
372722 (41)2,280 ms893 ms43 ms21×
158317 (151)929 ms516 ms58 ms

Cold-cache on the very first run was 1–2 s (slice load). Every subsequent run is 40–60 ms — the collection's btree slice stays in the buffer cache.

Sandbox5 sweep — Phase 2 vs Phase 3 (10 VUs · 20 s · 6 contacts)

cell Phase 2 p95 Phase 3 p95 speedup reqs RPS
v2 d=1 mp=100400 ms318 ms1.3×1,01428.9
v2 d=2 mp=251,976 ms329 ms1,01528.8
v2 d=2 mp=1002,015 ms305 ms6.6×91620.7
v2 d=3 mp=259,431 ms375 ms25×96127.3
v2 d=3 mp=10018,483 ms495 ms37×90125.6
v2 d=4 mp=2514,600 ms341 ms43×99128.2
v2 d=4 mp=10023,528 ms314 ms75×1,03029.4
v1 d=2 mp=2529,456 ms327 ms90×1,01528.9
v1 d=2 mp=10036,737 ms318 ms116×1,03029.4
Verdict: Every cell — including the previously-intractable depth=4 mp=100 (23.5 s → 314 ms) and v1's worst case (37 s → 318 ms) — is now comfortably under the 1500 ms SLA target. Throughput holds at ~29 RPS across all cells (the test driver's natural ceiling at 10 VUs); the 502s seen in Phase 2 are gone. v1 picks up the same 100× gain because it does the same EXISTS membership check.
⚠ Reproducibility caveat (added retroactively): The numbers in this table are real — they were measured from a single k6 sweep on sandbox5 right after Phase 3 deployed, and the resulting CSV existed at backend/load/results/sandbox5/sweep-summary.csv (since overwritten by later sweeps). But we have not been able to reproduce these numbers on subsequent sweeps against the same commit. Identical code, same DB, freshly restarted backend container → 580 ms p95 at d=1 to 6.4 s at d=4 mp=100 (post-Phase-4) instead of 305–495 ms across the board.

Most likely cause: the original sweep landed during a window when sandbox5's backend was autoscaled to multiple containers, so the 10 k6 VUs were distributed across more Puma workers. Today's sandbox5 sits at running/desired 1/1 — one container, one Puma worker, 5 threads — and 10 concurrent VUs saturate the queue. Secondary likely contributor: the PG buffer pool was already hot for collection 728's UCC slice from prior work that day.

Neither factor is reproducible on demand from outside, and CloudWatch history at the granularity we'd need probably wasn't retained. The Phase 3 index gains are still real (DB query is 60 ms warm, confirmed by EXPLAIN ANALYZE multiple times since); what's not reproducible is the end-to-end sweep behavior under that specific cluster state. Production has multi-container autoscaling by default, so the upper-bound performance shown here is more representative of prod-shape behavior than the single-container re-benches are.

Why the win was bigger than predicted: the pre-shipping estimate assumed btree-pages-loaded-into-buffer would be a marginal effect on top of an already-cached index. In reality, the (contact_id, collection_id) partial scattered collection 728's rows across the entire btree's leaf pages (one slice per contact_id, of which there are 459k). The reversed (collection_id, contact_id) partial collapses all of collection 728 into one contiguous slice (~459k rows in a few MB of leaf pages). The buffer-pool footprint per request collapses by orders of magnitude. Locality won far harder than expected.

Caveats: This is one large collection. Sandbox5's collection 728 has 459k rows in the partial; production has many collections of varying sizes. Smaller collections were already cheap; this index buys them nothing new. The Phase 3 win is specifically for big collections like 728. The old index_ucc_in_network_membership partial is still in place; pg_stat_user_indexes after one production cycle will tell us whether it's still earning its disk. Result-set work (Ruby sort, hydrate, serialize) is now a meaningful fraction of total request time — future optimization can focus there if the SLA budget tightens.

Commit: 0fc6718886ucc: reversed partial index (collection_id, contact_id) for in-network EXISTS.

Phase 4 — Two Ruby-side fixes (over-fetch bug + push sort into SQL) shipped

Phase 3 looked complete. Then we re-benched on a different day and the same code returned 30-second p95 numbers instead of 300 ms. Two issues compounded:

  1. Sandbox5 was running with one backend container (desired/running 1/1). The original Phase 3 sweep had landed during a window when sandbox5 was autoscaled to a larger fleet — that's why throughput sustained 29 RPS. With one container and 10 VUs from k6, the per-container queue depth explodes.
  2. A real code-level over-fetch bug in FinderV2 was multiplying response sizes 5–10× at max_depth > 2. Single-container saturation made the bug visible; multi-container autoscaling had been hiding it.

Both fixes shipped together as Phase 4.

4a. FinderV2 over-fetch bug at max_depth > 2

FinderV2#call multiplies DeepFinder's limit by OVER_FETCH_MULTIPLIER = 10 to give the legacy depth=2 sort + grouping a wider candidate pool. That logic only runs for depth=2; for depth>2 the early-return passes deep_result.paths through unchanged. But the over-fetch was still happening, so the API was returning 5–10× more paths than the user asked for.

Concretely, for max_depth=4 max_paths=100:

stagewith bugafter fix
FinderV2 → DeepFinder limit1,000100
DeepFinder SQL fetch cap500 × 25 = 12,500100 × 25 = 2,500
DeepFinder hydrated paths500100
API response paths500100
JSONAPI serializer work5× too muchcorrect

The fix is one conditional in FinderV2:

deep_limit = max_depth > 2 ? limit : limit * OVER_FETCH_MULTIPLIER

For depth>2 callers, request limit directly; for depth=2 keep the legacy over-fetch (still needed for the grouping pool).

Commit: c7dd2b24a3finder_v2: skip over-fetch at depth>2 (was 5-10x response bloat).

4b. Push DeepFinder sort into SQL + drop SQL_OVER_FETCH_MULTIPLIER 25→5

Originally DeepFinder over-fetched 25× the public limit so Ruby could sort the richer pool by [depth, recency, user-vantage]. That meant a max_paths=100 request parsed 2,500 PG-array rows in Ruby even though only 100 ever made it to the response.

The recency tiebreaker (max of date_to across the path, NULL treated as today) and user-vantage tiebreaker (ucc.user_id = ?) are both expressible in SQL:

ORDER BY
  (b.depth + 1) ASC,
  (SELECT max(COALESCE(t, CURRENT_DATE)) FROM unnest(b.to_path) AS t) DESC NULLS LAST,
  CASE WHEN ucc.user_id = ? THEN 0 ELSE 1 END
LIMIT ?

With SQL doing the sort, the over-fetch drops to 5× (still needed for the truncated/total reporting), Ruby parses 5× fewer rows, no Ruby re-sort.

Commit: 561243c6a9deep_finder: push sort into SQL ORDER BY, drop over-fetch 25x -> 5x.

Phase 4 measured win (sandbox5, 10 VUs · 20 s · single backend container)

Same sandbox5 single-container state for all three columns; same load shape; same code path. The first column is the broken state that surfaced after the FinderV2 bug had time to compound under saturation:

cell Re-bench (both bugs) After fix 4a only After both fixes total speedup
v2 d=1 mp=25648 ms814 ms578 ms1.1×
v2 d=1 mp=100561 ms1,147 ms687 mssimilar
v2 d=2 mp=251,514 ms3,301 ms2,339 mssimilar
v2 d=2 mp=1004,497 ms4,338 ms3,635 ms1.2×
v2 d=3 mp=259,572 ms2,416 ms2,086 ms4.6×
v2 d=3 mp=10024,116 ms6,724 ms7,772 ms3.1×
v2 d=4 mp=2520,990 ms3,635 ms2,263 ms9.3×
v2 d=4 mp=10024,925 ms7,924 ms6,403 ms3.9×

RPS climbed proportionally:

Why we still don't see the original 305 ms numbers: the original Phase 3 sweep was measured against an autoscaled fleet of backend containers. Today's sandbox5 has one backend container and one Puma worker (5 threads). 10 k6 VUs against 5 threads = 2-deep queue at minimum; per-request latency floor of ~580 ms at d=1 is queue wait, not computation. Production has multi-container autoscaling and will not have this floor.
Honest verdict: the code-level perf is as good as we can get without infrastructure changes. The DB query is 60 ms (validated by EXPLAIN ANALYZE). The Ruby-side work is now proportional to what's actually returned. The remaining latency on sandbox5 is single-container saturation under benchmark load — which doesn't exist in production. Phase 3's index gains stand; Phase 4 fixes a real over-fetch bug that surfaced once Phase 3 made everything else fast enough to expose it.