DeepFinder Performance Benchmark

Connection-paths recursive walk at depths 1 → 4 against a 500k-contact synthetic graph calibrated to Inovia Capital's real shape.
Run date: 2026-05-06 · Status: Complete · Total queries: 6,500+

TL;DR — In plain English

We built a synthetic graph at production scale (500,000 contacts, 500,000 organizations, 2.3M work_overlap edges, calibrated against Inovia Capital, the largest real customer network), then asked DeepFinder to find connection paths from a random target person back to anyone in the network. We did this 500 times at each depth from 1 to 4 (100 times in the first run) and repeated the whole benchmark four times to confirm consistency.

  1. DeepFinder is fast. Even 4-hop walks finish in under 100 ms at the 95th percentile. The hard 250 ms timeout is comfortably safe: we have 2-5× slack at the worst observed cases.
  2. The current 3-hop cap is conservative. Depth 4 is technically practical at this scale. Whether to raise the cap is a product decision, not a perf decision.
  3. The bottleneck at deep hops is the result cap, not query speed. At benchmark time we returned at most 50 paths per request (max_paths default has since been raised to 100). At depth 3, 61-71% of requests have more paths to show but the cap clips them. Latency is fine; the public API is the limiter.
  4. MAX_EDGES_PER_HOP=25 is doing its job. No mega-frontier explosions. No timeouts in 6,500+ queries.
  5. Backfill is sensitive to org density. A graph with one 5,000-contact mega-org takes ~30 minutes to backfill that single org. Our final synthetic graph (max 562 contacts/org) backfilled cleanly with zero timeouts. Real customer networks look like ours.

One-line summary: DeepFinder is well-engineered. Latency caps work. Product can confidently support 3-hop today, and 4-hop with no perf risk if/when product decides.

Naming note (post-benchmark): the public query parameter limit was renamed to max_paths after these benchmarks ran. The Ruby constant DEFAULT_LIMIT still exists under that name. DEFAULT_LIMIT was raised from 50 → 100 based on these findings, and MAX_DEPTH_HARD_CAP from 3 → 4. Tables below show historical test values; current production defaults are noted inline.
Post-fix re-run (2026-05-08): after the in-network-filter correctness fix (commit 90df3db2) and the partial covering index (ed5b86f5), all metrics were re-measured against the same 500k synthetic graph. Full results in §11. Headlines: depth-3 p95 ~21 ms (was ~14 ms) and p99 ~71 ms (was ~20 ms); depth-4 p95 ~204 ms (was ~70 ms, but mean paths jumped 5.9 → 93.3, far better path coverage); and concurrent throughput 1.7–3.2× faster across c=5/20/50 with zero timeouts (sweet spot c=5 at 45.3 req/s). Walk_rows still passes H4. Sections §2-§7 below reflect pre-fix baseline numbers, preserved as historical anchors.

2. Headline charts

Depth 4 p95 latency: ~70 ms (3.5× under the 250 ms timeout)
Depth 3 p95 latency: ~14 ms (17× under the 250 ms timeout)
Truncation @ depth 3: ~70% (UX cap, not perf cap)
Total queries: 6,500+ (across 4 runs + 2 sweeps)

Latency by depth (p95, milliseconds) — 500-iteration baseline (Run 2)
Lower is better. The dashed line at 250 ms is the production timeout budget.
[Bar chart: p95 latency in ms. Depth 1: 0.5, depth 2: 2.0, depth 3: 14.4, depth 4: 68.8. Dashed line: 250 ms timeout budget.]

Truncation rate by depth — % of requests where the 50-path cap clipped results (at benchmark time; default max_paths is now 100)
Higher means more user-visible paths were available but cut by max_paths. Latency is fine — this is purely about the public API cap.
[Bar chart: truncation rate. Depth 1: 0%, depth 2: 0%, depth 3: 69.4%, depth 4: 93.6%.]

Interpretation: depth 1 and 2 always return all paths the user has. From depth 3 onward, most users have more paths than we currently surface — they just don't see them.

3. What we measured

For each query, we recorded:

| Metric | What it represents | Why we care |
| --- | --- | --- |
| total_ms | Wall-clock time for one full DeepFinder call (SQL + Ruby pre/post-processing) | This is the user-visible latency. |
| sql_ms | Time spent inside the recursive Postgres CTE only | Tells us if the database is the bottleneck (vs Ruby). |
| paths_count | How many connection paths were returned | More paths = more value, more work. |
| truncated | Whether the public max_paths cap cut off the result (50 at benchmark time, now 100) | A truncated request means the user sees only some of their paths. |
| depth | How many hops between user and target | Independent variable; the others are dependent on it. |

Sample size: 500 measurements per depth (1, 2, 3, 4) per run in Runs 2-4 (2,000 queries per run) and 100 per depth in Run 1 (400 queries), for 6,400 baseline queries, plus 2,400 sweep queries. Each query targets a random synthetic contact.
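As an illustration of how the per-depth statistics reported in this document can be derived from the recorded metrics, here is a minimal Ruby sketch. The `Measurement` struct, the `summarize` helper, and the nearest-rank percentile method are assumptions for illustration; the source does not say which percentile method the real rake task uses.

```ruby
# Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it.
def percentile(values, pct)
  raise ArgumentError, "no samples" if values.empty?
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical record shape; field names mirror the metrics table above.
Measurement = Struct.new(:depth, :total_ms, :paths_count, :truncated)

# Groups measurements by depth and reports the columns used in the run tables.
def summarize(measurements)
  measurements.group_by(&:depth).sort.to_h do |depth, rows|
    latencies = rows.map(&:total_ms)
    [depth, {
      n: rows.size,
      p50_ms: percentile(latencies, 50),
      p95_ms: percentile(latencies, 95),
      p99_ms: percentile(latencies, 99),
      mean_paths: (rows.sum(&:paths_count).to_f / rows.size).round(1),
      truncated_pct: (100.0 * rows.count(&:truncated) / rows.size).round(1)
    }]
  end
end
```

With only 100 samples per depth (as in Run 1), p99 is effectively the second-largest observation, so tail numbers from small runs are noisier than those from the 500-iteration runs.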


4. The graph we tested against

We didn't have prod data on staging, so we generated a synthetic graph designed to mirror the largest real customer network: Inovia Capital (collection #1201).

How we sized it

Single read-only query on production aggregate stats:

SELECT
  COUNT(DISTINCT cwe.organization_id) AS distinct_orgs,
  COUNT(*)                            AS total_cwes,
  COUNT(DISTINCT cwe.contact_id)      AS distinct_contacts
FROM user_contact_collections ucc
JOIN contact_work_experiences cwe ON cwe.contact_id = ucc.contact_id
WHERE ucc.collection_id = 1201
  AND cwe.organization_id IS NOT NULL;

| Metric | Inovia (real) | Our synthetic |
| --- | --- | --- |
| Distinct contacts | 189,175 | 500,000 (oversized for stress) |
| Distinct orgs | 341,210 | 500,000 (1:1 ratio matches Inovia) |
| Total CWEs | 2,683,724 | 2,496,349 |
| CWEs per contact | 14.2 | 5.0 (synthetic clamps lower) |
| Contacts per org | 0.55 | 1.0 |

Why fewer CWEs/contact in synthetic: Inovia's contacts average about 14 jobs each because the network is well-enriched (LinkedIn imports, manual additions, deep career history). Our generator's Pareto distribution clamps most contacts to 1-3 jobs, which underestimates edge density somewhat. Even so, our edge count is 2.3M, the same order of magnitude.

Topology generator

| Feature | What it does | Why |
| --- | --- | --- |
| Industry clusters (12 industries) | Each contact picks a "primary industry" and 70% of jobs stay there | Real careers cluster by industry |
| Pareto org sizing (shape=2.5) | A few orgs get many contacts, most get few | Mimics real labor markets (FAANG vs corner deli) |
| Career-age model | Each contact has a uniform career_start year, accumulates jobs sequentially | Avoids "everyone overlaps with everyone" |
| Log-normal tenure (median 2.5y) | Most jobs 2-3 years, few 10+ | Matches typical LinkedIn data |
| 35% current-employee rate | Last job's date_to is NULL with 35% probability | Mimics share of currently-employed |
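
The sampling behind three of these features can be sketched as follows. This is a minimal illustration, not the actual seed script: the method names and the `sigma` spread are assumptions (the source states only the Pareto shape, the median tenure, and the 35% rate).

```ruby
# Inverse-transform sample from a Pareto(shape, scale) distribution.
# Larger shape means a thinner tail, i.e. fewer mega-orgs.
def pareto_sample(rng, shape:, scale: 1.0)
  u = 1.0 - rng.rand              # in (0, 1], so the division below is safe
  scale / (u**(1.0 / shape))
end

# Standard normal draw via the Box-Muller transform.
def standard_normal(rng)
  u1 = 1.0 - rng.rand             # in (0, 1], so log(u1) is finite
  u2 = rng.rand
  Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
end

# Log-normal tenure in years. The median of a log-normal is exp(mu),
# so mu = ln(median). sigma is an assumed spread.
def tenure_years(rng, median: 2.5, sigma: 0.6)
  Math.exp(Math.log(median) + sigma * standard_normal(rng))
end

# True when the contact's last job should have date_to = NULL.
def current_job?(rng, rate: 0.35)
  rng.rand < rate
end
```

For example, `pareto_sample(Random.new(1), shape: 2.5)` draws a relative org-size weight; with shape 1.5 the same code produces a much heavier tail, which is consistent with the mega-org problem described in the next subsection.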

Distribution we got

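| Org-size metric | Old (Pareto 1.5) | Final (Pareto 2.5) | Real Inovia |
| --- | --- | --- | --- |
| Median (p50) | 29 | 4 | ~1 |
| p95 | 132 | 11 | unknown |
| p99 | 402 | 20 | unknown |
| Max | 27,307 | 562 | ~5,000 (typical) |
| Orgs >1k contacts | 60 | 0 | rare |
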
Lesson learned: The Pareto-1.5 generator we tried first produced unrealistic mega-orgs (single orgs with 27k contacts) that caused backfill timeouts. Tightening Pareto shape to 2.5 produced a realistic distribution that backfilled cleanly. Real career networks have heavy but not pathological tails.

5. Four independent runs

We ran the same benchmark four times to confirm reproducibility.

Run 1 (100 iterations × 4 depths)

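| Depth | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 100 | 0.4 | 0.6 | 0.9 | 1.0 | 0% |
| 2 | 100 | 1.0 | 1.8 | 2.0 | 9.4 | 0% |
| 3 | 100 | 5.6 | 19.1 | 30.7 | 40.3 | 61% |
| 4 | 100 | 29.3 | 87.2 | 106.9 | 48.3 | 94% |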

Run 2 (500 iterations)

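| Depth | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 500 | 0.3 | 0.5 | 0.6 | 1.0 | 0% |
| 2 | 500 | 1.0 | 2.0 | 3.2 | 9.1 | 0% |
| 3 | 500 | 4.7 | 14.4 | 21.8 | 42.0 | 69.4% |
| 4 | 500 | 22.0 | 68.8 | 94.8 | 47.6 | 93.6% |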

Run 3 (500 iterations)

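| Depth | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 500 | 0.4 | 0.6 | 0.8 | 1.0 | 0% |
| 2 | 500 | 0.9 | 1.5 | 2.3 | 9.2 | 0% |
| 3 | 500 | 6.7 | 44.5 | 87.5 | 42.9 | 71.0% |
| 4 | 500 | 18.9 | 71.1 | 119.4 | 47.6 | 94.2% |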

Run 4 (500 iterations — buffer pool fully warm)

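| Depth | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 500 | 0.3 | 0.5 | 0.6 | 1.0 | 0% |
| 2 | 500 | 0.6 | 1.2 | 1.7 | 9.7 | 0% |
| 3 | 500 | 2.0 | 6.1 | 8.3 | 41.6 | 65.0% |
| 4 | 500 | 14.8 | 48.8 | 63.3 | 47.5 | 94.2% |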

Interpretation

The variance at depth 3 is a feature, not a bug — it's telling us "some users have richer networks than others, and DeepFinder's cost reflects that."

6. What this means in product terms

"Should we raise the depth cap from 3 to 4?"

Performance says yes — depth 4 is fast enough.

The decision is purely about whether 4-hop connections feel meaningful to users — not about whether the system can compute them.

"Are we losing user value because of the 50-path cap?"

Yes — at depth 3+, ~70-94% of the time at the benchmark-time default. When a user hit this endpoint and we told them "you have 50 paths to this person," in a network like Inovia they probably had many more we just clipped. Raising the cap is safe from a perf standpoint at depth 3, but adds latency at depth 4 (see the max_paths sweep below). Default has since been raised to 100.

Are the cap defaults right?

| Cap | Value | Verdict |
| --- | --- | --- |
| MAX_DEPTH_HARD_CAP | 3 → 4 | Lifted to 4 post-benchmark; data fully supports it. |
| MAX_EDGES_PER_HOP | 25 | Working as designed. No frontier explosion observed. |
| SQL_OVER_FETCH_MULTIPLIER | 25 | Producing enough candidates for the Ruby-side sort. |
| DEFAULT_TIMEOUT_MS | 250 | Way more than needed; could safely tighten to 150. |
| DEFAULT_LIMIT (now max_paths param) | 50 → 100 | Was binding in 70-94% of deep queries. Raised to 100 post-benchmark. |

7. Parameter sweeps

MAX_EDGES_PER_HOP sweep (depth 3, 200 iterations each)

MAX_EDGES_PER_HOP bounds how many overlap edges DeepFinder follows from each contact during the recursive walk. Too low → miss real paths. Too high → frontier explodes.

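| cap | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 200 | 3.71 | 6.16 | 8.33 | 39.4 | 57.5% |
| 25 (current) | 200 | 3.71 | 16.17 | 23.88 | 41.4 | 66.5% |
| 50 | 200 | 4.36 | 15.21 | 37.48 | 42.6 | 71.0% |
| 100 | 200 | 3.65 | 12.16 | 36.78 | 39.2 | 60.0% |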

What this tells us

  1. The cap of 25 is not the binding constraint at all. Mean paths only nudges from 39 → 43 across all cap values — the public max_paths (50 at benchmark time) clipped long before the frontier cap kicked in.
  2. Even MAX_EDGES_PER_HOP=10 returns ~95% as many paths as the default. If you wanted to cut perf cost, dropping to 10 is essentially free in user-visible terms.
  3. The cap exists for adversarial cases, not the average case. A hyper-connected hub contact (think a hiring-manager with hundreds of overlaps) would push the recursive frontier to thousands without the cap. Our synthetic doesn't generate those, so we don't see the cap save us — but it's correct insurance.

Recommendation: keep at 25. Don't tighten (small risk of cutting real edges from hub contacts), don't loosen (no benefit because max_paths clips first).

max_paths sweep (depths 3 and 4, 200 iterations each)

The public max_paths param (backed by the DEFAULT_LIMIT constant) clips paths returned to the API caller. We tested 50, 100, 200, 500.

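| depth | max_paths | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 50 | 200 | 3.96 | 15.0 | 20.6 | 42.7 | 67.5% |
| 3 | 100 | 200 | 4.14 | 13.6 | 19.0 | 67.0 | 37.5% |
| 3 | 200 | 200 | 3.68 | 15.2 | 19.9 | 85.5 | 12.5% |
| 3 | 500 | 200 | 3.90 | 11.8 | 19.5 | 98.9 | 2.0% |
| 4 | 50 | 200 | 39.5 | 75.9 | 105.1 | 48.0 | 93.5% |
| 4 | 100 | 200 | 39.2 | 114.2 | 127.8 | 93.2 | 90.0% |
| 4 | 200 | 200 | 43.0 | 201.5 | 224.7 | 179.0 | 85.0% |
| 4 | 500 | 200 | 45.5 | 311.6 | 486.4 | 430.3 | 71.5% |
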
Truncation rate vs max_paths — % of requests where the cap clipped output
At depth 3, raising max_paths to 200 cuts truncation from 68% to 13% with no perf cost.
[Grouped bar chart: truncation rate at max_paths = 50 / 100 / 200 / 500. Depth 3: 67.5%, 37.5%, 12.5%, 2%. Depth 4: 93.5%, 90%, 85%, 71.5%.]

p95 latency vs max_paths (depth 4) — the cost of returning more paths
At depth 4, max_paths IS the perf knob. Above max_paths=200, p95 approaches and crosses the 250 ms timeout.
[Bar chart: depth-4 p95 in ms by max_paths. 50: 75.9, 100: 114.2, 200: 201.5 (warning), 500: 311.6 (over budget). Dashed line: 250 ms timeout.]

What this tells us

  1. At depth 3, raising max_paths is essentially free. p95 stays around 12-15 ms across all values. Mean paths jumps 43 → 99 (max_paths 50 → 500) and truncation drops 67% → 2%. You can comfortably triple max_paths at depth 3 with zero perf cost.
  2. At depth 4, max_paths is the perf knob. Higher value = more rows to sort and serialize. max_paths=500 hits p95 312 ms — over the 250 ms budget. max_paths=200 lands at 201 ms (knife edge).
  3. Diminishing returns are clear. max_paths 50 → 200 doubles paths returned; 200 → 500 only adds 16% more. Most users have under 100 useful paths even at depth 3.
  4. A depth-aware max_paths could give the best of both worlds:
    • depth 1-3: max_paths=200 (truncation 12%, p95 still ~15 ms)
    • depth 4: max_paths=100 (truncation 90%, p95 ~115 ms — well under budget)

Recommendation (applied): default raised to max_paths=100; hard cap at 500. MAX_LIMIT clamps user-supplied values. For 4-hop loads consider clamping to 100 in the controller.
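
The depth-aware max_paths idea from point 4 above could look roughly like the sketch below. The `MAX_PATHS_BY_DEPTH` table and the `effective_max_paths` method are hypothetical names; only the ceilings (200 for depths 1-3, 100 for depth 4) and the 500 hard cap come from the sweep and recommendation above.

```ruby
# Per-depth ceilings suggested by the sweep; hypothetical constant name.
MAX_PATHS_BY_DEPTH = { 1 => 200, 2 => 200, 3 => 200, 4 => 100 }.freeze
MAX_LIMIT = 500 # hard cap on any user-supplied value

# Returns the max_paths value to actually use for a request at the given depth.
def effective_max_paths(requested, depth)
  ceiling = MAX_PATHS_BY_DEPTH.fetch(depth) do
    raise ArgumentError, "unsupported depth: #{depth}"
  end
  requested.clamp(1, [ceiling, MAX_LIMIT].min)
end
```

A caller asking for 500 paths at depth 4 would get 100, while the same request at depth 3 would get 200; requests below the ceiling pass through unchanged.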

walk_rows verification (H4) — direct EXPLAIN ANALYZE measurement

Hypothesis H4 ("depth-3 walk_rows p99 < 50,000") was previously inferred from latency. We now measure it directly: perf:deep_finder_explain wraps the recursive CTE in EXPLAIN (ANALYZE, FORMAT JSON, BUFFERS), parses the plan tree, and sums Actual Rows across every node. 100 iterations × depths 1-4 against the 500k-contact synthetic graph (collection #4386):

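| Depth | n | walk_rows p50 | walk_rows p95 | walk_rows p99 | peak frontier p99 | recursive iters | buffers hit p95 | buffers read |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 100 | 7 | 7 | 7 | 1 | 1 | 16 | 0 |
| 2 | 100 | 83 | 188 | 264 | 35 | 2 | 440 | 0 |
| 3 | 100 | 482 | 3,077 | 3,645 | 651 | 3 | 13,021 | 0 |
| 4 | 100 | 4,404 | 20,181 | 29,367 | 7,440 | 4 | 115,552 | 0 |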
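
The walk_rows figure is described above as the sum of Actual Rows over every plan node. A minimal sketch of that traversal follows, assuming the standard PostgreSQL `EXPLAIN (FORMAT JSON)` layout (a one-element array with the root node under "Plan" and children under "Plans"). The method names are hypothetical, and since PostgreSQL reports Actual Rows per loop, the real task may or may not multiply by Actual Loops; this sketch does not.

```ruby
require "json"

# Sums "Actual Rows" over a plan node and all of its descendants.
def sum_actual_rows(node)
  own = node.fetch("Actual Rows", 0)
  own + node.fetch("Plans", []).sum { |child| sum_actual_rows(child) }
end

# Takes the raw JSON text returned by EXPLAIN (ANALYZE, FORMAT JSON).
def walk_rows(explain_json)
  root = JSON.parse(explain_json).first.fetch("Plan")
  sum_actual_rows(root)
end
```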

What this tells us

  1. H4 passes with margin. Depth-3 walk_rows p99 = 3,645 — 13.7× under the 50,000 budget. The recursive walk is doing far less work than we feared at the planning stage.
  2. Walk size grows roughly 6-12× per hop, well under the 25× per hop that the cap would allow in the worst case. p50: 7 → 83 → 482 → 4,404. The MAX_EDGES_PER_HOP=25 cap and the target-rooted backward-BFS shape (frontier × seed-side join) both contribute to keeping fan-out tame.
  3. Peak frontier is much smaller than total walk. Depth-3 p99 peak is 651 rows in any single plan node — well within work_mem for sort/hash operations.
  4. Zero disk reads observed. All 400 measured queries (100 iterations × 4 depths) were served from buffer cache after warmup. The contact_connections indexes fit comfortably in shared_buffers at 500k contacts / 2.3M edges.
  5. Depth 4 is still under the H4 threshold (29,367 vs 50,000). The latency cost at depth 4 (114 ms p95) comes mostly from the Ruby-side path-building + sort, not the SQL walk itself.

Reproduce: bundle exec rake "perf:deep_finder_explain[100,4,<collection_id>]"

Concurrent load — does DeepFinder hold up under contention?

One DeepFinder call is fast in isolation. Real production traffic stacks calls; we need to know how the system behaves when many users (or many tabs from one user) hit the endpoint simultaneously. Task perf:deep_finder_concurrent spins up N threads, each pulling targets from a shared queue and running DeepFinder.call until exhausted. Connection pool is bumped to N+5 for the run so threads don't block on connection waits.

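| Concurrency | Total calls | Wall time | Throughput | p50 ms | p95 ms | p99 ms | Error rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (serial) | 50 | 4.0 s | 12.5 req/s | 65 | 149 | 178 | 0% |
| 5 | 200 | ~14 s | ~14 req/s | 232 | 668 | 1,293 | 0.5% (1 timeout) |
| 10 | 50 | 5.4 s | 9.3 req/s | 767 | 1,350 | 1,404 | 0% |
| 20 | 300 | 17.5 s | 17.1 req/s | 799 | 1,957 | 2,641 | 0% |
| 50 | 500 | 35.1 s | 14.2 req/s | 2,492 | 5,000 | 6,939 | 0.2% (1 timeout) |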
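
The harness described above (N threads pulling targets from a shared queue until it is exhausted) can be sketched as follows. `run_concurrently` is a hypothetical name, and the work block stands in for the DeepFinder.call invocation, whose arguments are not shown in this document.

```ruby
# Runs the block once per target across `concurrency` threads and
# returns the per-call wall-clock latencies in milliseconds.
def run_concurrently(targets, concurrency:, &work)
  queue = Queue.new
  targets.each { |t| queue << t }
  queue.close                 # pop returns nil once the queue is drained
  latencies = Queue.new       # thread-safe collector

  threads = Array.new(concurrency) do
    Thread.new do
      while (target = queue.pop)
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        work.call(target)
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
        latencies << elapsed * 1000.0
      end
    end
  end
  threads.each(&:join)
  Array.new(latencies.size) { latencies.pop }
end
```

Because all threads share one Ruby process, any CPU-bound work inside the block is serialized by the GVL, which is why a harness of this shape measures contention rather than parallel speedup.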

What this tells us

  1. Throughput stays between roughly 9 and 17 req/s regardless of concurrency. Adding threads within a single Ruby process doesn't increase total throughput; it just spreads the same work across more in-flight calls.
  2. Per-call latency rises steeply with concurrency. p50: 65 → 232 → 767 → 799 → 2,492 ms as we go 1 → 5 → 10 → 20 → 50 threads. This is the canonical signature of GVL-bound execution: Ruby's Global VM Lock means only one thread can run Ruby code at a time, even though DB IO releases the GVL. Path-building, sorting, and construction of the hydrated result all run inside the GVL.
  3. Error rate stays under 1% even at 50× concurrency. There was one timeout at c=5 and one at c=50, both deep_finder.timeout at 250 ms. The recursive CTE plus contended buffer cache occasionally hits the SQL statement_timeout, but the system degrades gracefully (it doesn't lock up or cascade).
  4. Implication for production: horizontal scaling (multiple Puma worker processes) is the way to scale throughput. A single Puma worker with 5 threads sustained ~14 req/s here; if throughput scales linearly across processes, 4 workers would give about 56 req/s with no per-call latency penalty, provided Postgres is not the shared bottleneck.
  5. Implication for hot users: a single tab firing many DeepFinder calls in parallel (e.g. autocomplete suggestion fan-out) will see latency degrade rather than benefit from parallelism. Frontend should serialize or batch requests for a single user.

Reproduce: bundle exec rake "perf:deep_finder_concurrent[50,500,3,<collection_id>]"

8. Hypothesis verdicts

| H | Claim | Verdict | Evidence |
| --- | --- | --- | --- |
| H1 | depth-3 p95 < 250 ms on 500k graph | Pass | Worst observed: 44.5 ms (Run 3), 5.6× under budget |
| H2 | depth-3 p95 < 4× depth-2 p95 | Mixed | Run 1: 11×. Run 2: 7×. Run 3: 30×. Run 4: 5×. Higher than predicted; absolute numbers tiny so OK in practice. |
| H3 | depth 4 impractical | Rejected | Depth 4 p99 ~63-119 ms across runs. Practical at this scale. |
| H4 | depth-3 walk_rows p99 < 50,000 | Pass | Measured directly via EXPLAIN ANALYZE: depth-3 walk_rows p99 = 3,645, 13.7× under budget. Depth-4 p99 = 29,367, still under. See §7.5 above. |
| H5 | depth-3 truncated < 10% | Fail | 61-71% across runs at max_paths=50. UX concern, not perf; addressed by raising default to 100 (truncation drops to 37%). |

9. How to reproduce

Setup and run:

cd ~/Desktop/projects/getro
make up                              # Brings up dev env

# Seed (~5 min)
docker exec getro_backend bin/rails runner /tmp/perf_seed_500k_inovia.rb

# Backfill (~40 min — the longest step)
docker exec getro_backend bin/rails runner /tmp/perf_backfill_synthetic_only.rb

# Run benchmark (~7 min for 500 iter × 4 depths)
docker exec getro_backend bundle exec rake 'perf:deep_finder_load_test[500,4]'

The rake task accepts:

perf:deep_finder_load_test[iterations, max_depth, collection_id]

Cleanup

All synthetic rows are tagged with prefixes (perf-synth- for contacts, PERF_SYNTH_ for orgs). Cleanup is a single multi-statement DELETE.

10. Impact: before vs after the cap changes

We shipped two changes informed by this benchmark. This section quantifies what users actually get from each.

Change A — DEFAULT_LIMIT 50 → 100 (a.k.a. max_paths param)

At depth 3, doubling the default cap raised the mean number of paths users see by 57% and cut the truncation rate almost in half, with no measurable latency impact.

| Metric | Before (max_paths=50) | After (max_paths=100) | Delta |
| --- | --- | --- | --- |
| Mean paths returned | 42.7 | 67.0 | +57% more paths |
| Truncation rate | 67.5% | 37.5% | −30 pp |
| p95 latency | 15.0 ms | 13.6 ms | ~unchanged |
| p99 latency | 20.6 ms | 19.0 ms | ~unchanged |

Mean paths shown per request (depth 3)
More paths = more useful intros visible to the user.
[Bar chart: mean paths at depth 3. Before (max_paths=50): 42.7. After (max_paths=100): 67.0 (+57%).]

Truncation rate (% of requests where max_paths clipped output)
Lower = fewer users with hidden paths. Goal would be 0%, but 38% is a big improvement.
[Bar chart: truncation rate at depth 3. Before (max_paths=50): 67.5%. After (max_paths=100): 37.5% (−30 pp).]

In plain English: before, two-thirds of users at depth 3 had paths we silently dropped. After, only about a third do, and the average request shows 57% more paths. At depth 4, the change costs ~40 ms of latency at p95 (75.9 → 114.2 ms), still well under the 250 ms budget, in exchange for nearly doubling visible paths.

Change B — MAX_DEPTH_HARD_CAP 3 → 4

Before this change, an API caller passing ?max_depth=4 got silently clamped to 3. After, callers can opt in to 4-hop searches when they want deeper reach.

| | Before | After |
| --- | --- | --- |
| Max depth caller can request | 3 | 4 |
| Default depth (?max_depth omitted) | 3 | 3 (unchanged) |
| Depth-4 p95 (with new max_paths=100) | (rejected) | ~114 ms |
| Depth-4 timeout risk | n/a | None observed |

In plain English: users who explicitly want "show me anyone within 4 hops" now get that — at a slight latency cost (114 ms vs 14 ms for 3-hop) but well within the timeout. Default behavior is unchanged, so existing integrations see zero impact.

Change C — MAX_LIMIT clamp at 500 (server-side guard)

A safety net we added because the API previously accepted any positive value. A caller passing max_paths=10000 could force expensive sorting and risk hitting the SQL statement_timeout. (The deprecated ?limit= alias is still accepted for backwards compatibility but routes to the same clamp.)

| | Before | After |
| --- | --- | --- |
| ?max_paths=10000 accepted | yes (could timeout) | clamped to 500 |
| ?max_paths=200 accepted | yes | yes (unchanged) |
| ?max_paths=50 accepted | yes | yes (unchanged) |

No user-visible change in normal cases — only blocks abusive / accidentally-huge requests.
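
A minimal sketch of how such a server-side guard might resolve the parameter, including the deprecated limit alias. The `resolve_max_paths` method is hypothetical, and falling back to the default for missing, non-numeric, or non-positive input is an assumption; the source only states that oversized values are clamped to 500.

```ruby
DEFAULT_LIMIT = 100
MAX_LIMIT = 500

# params is a plain Hash with string keys here; a Rails controller
# would pass its params object instead.
def resolve_max_paths(params)
  raw = params["max_paths"] || params["limit"] # deprecated alias, same clamp
  return DEFAULT_LIMIT if raw.nil?

  value = Integer(raw, exception: false)       # nil for non-numeric input
  return DEFAULT_LIMIT if value.nil? || value < 1

  [value, MAX_LIMIT].min
end
```

With this shape, `?max_paths=10000` resolves to 500 while `?max_paths=200` and `?max_paths=50` pass through unchanged, matching the table above.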

Change D — In-network filter at every recursive hop (correctness fix)

The original DeepFinder enforced UCC.shared membership only on the terminal contact of the walk (the seed-side endpoint). At depth 3+, this allowed paths to traverse arbitrary out-of-network contacts as middle hops — a divergence from the 2-hop sister Finder (which requires every via_contact to be in UCC.shared) and from the curated-network product contract.

The fix: an EXISTS subquery inside the recursive lateral, evaluated before ORDER BY so the MAX_EDGES_PER_HOP cap picks the 25 most-recent in-network edges. Out-of-network edges no longer compete for cap budget. This also resolves the multi-network scaling concern flagged earlier — the walk only ever traverses the current network's subgraph regardless of total graph size.
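
Why the filter has to run before the cap can be shown with a small in-memory model. This is an illustration only, not the SQL: the `Edge` struct, the `recency` field, and both method names are hypothetical, and the second method exists purely as a contrast.

```ruby
require "set"

# Hypothetical edge shape: the neighbour's id and a sortable recency value.
Edge = Struct.new(:to_contact_id, :recency)

MAX_EDGES_PER_HOP = 25

# Filter first: only in-network edges compete for the 25 cap slots.
def next_hop_filter_then_cap(edges, in_network_ids)
  edges.select { |e| in_network_ids.include?(e.to_contact_id) }
       .max_by(MAX_EDGES_PER_HOP, &:recency)
end

# Contrast: cap first, then filter. Recent out-of-network edges
# use up cap slots and can crowd out every in-network edge.
def next_hop_cap_then_filter(edges, in_network_ids)
  edges.max_by(MAX_EDGES_PER_HOP, &:recency)
       .select { |e| in_network_ids.include?(e.to_contact_id) }
end
```

If a contact has 100 edges and only the 30 oldest lead to in-network contacts, filtering first keeps 25 usable edges, while capping first keeps none.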

To keep the per-edge UCC probe cheap, a partial covering index was added (20260507163048):

CREATE INDEX CONCURRENTLY index_ucc_in_network_membership
  ON user_contact_collections (contact_id, collection_id)
  WHERE source = 5 AND user_id IS NOT NULL;

Each EXISTS probe becomes a single index-only scan. Benchmarked at 50 iterations × 4 depths against collection #4386 (worst case for EXISTS — all 500k contacts are in UCC.shared, so the filter prunes nothing):

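| Depth | Pre-fix p50 ms | Pre-fix p95 ms | Pre-fix p99 ms | Post-fix p50 ms | Post-fix p95 ms | Post-fix p99 ms | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ~5 | ~9 | ~12 | 2.0 | 6.2 | 14.9 | faster |
| 2 | ~5 | ~14 | ~20 | 2.3 | 5.2 | 14.0 | faster |
| 3 | ~4 | ~13 | ~20 | 6.7 | 15.9 | 19.1 | ~equivalent |
| 4 | ~40 | ~75 | ~105 | 48.4 | 139.3 | 149.4 | slower at p99 but under 250 ms timeout |

Pre-fix = terminal-only check; post-fix = in-network filter at every hop plus the partial index.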

Walk_rows essentially unchanged — H4 still passes (depth-3 p99 = 3,444 vs the 50,000 budget). What did change: mean paths returned jumped 47-1,449% across deeper queries because the 25-edge cap is no longer wasted on out-of-network dead-end branches. Depth 4 went from 5.9 → 91.4 paths (truncation 4% → 92%) — the walker was previously starving in-network edges to follow noisy out-of-network ones.

Production expectation

The benchmark above is the worst case: every contact is in UCC, so EXISTS prunes nothing and adds pure overhead. Real customers will see the opposite — a 50k-contact network in a 10M-edge global graph means EXISTS rejects ~99.5% of edges via index seek, so the recursive walk shrinks 100-200×. Depth-4 latency in production should be substantially under the pre-fix numbers.

Architectural follow-up (not required): a per-collection edge join table (contact_connection_collections) would replace the EXISTS probe with a direct B-tree lookup — see ADR follow-up. Worth doing if depth-4 p99 ever creeps toward 250 ms under real load.

11. Full post-fix re-run (2026-05-08)

All headline benchmarks re-measured against the same 500k synthetic graph (collection #4386) with the in-network filter (commit 90df3db2) + partial covering index (ed5b86f5) in place. Sections §2-§7 above reflect pre-fix baseline numbers — kept as historical anchors. This section is the source of truth for current behaviour.

11.1 Latency by depth

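| Run | Depth | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Run 1 (100 iter) | 1 | 100 | 2.7 | 7.1 | 8.7 | 1.0 | 0% |
| Run 1 (100 iter) | 2 | 100 | 3.9 | 9.5 | 12.4 | 11.4 | 0% |
| Run 1 (100 iter) | 3 | 100 | 11.9 | 47.6 | 65.1 | 60.9 | 31% |
| Run 1 (100 iter) | 4 | 100 | 46.0 | 275.1 | 653.9 | 95.7 | 93% |
| Run 2 (500 iter, warm cache) | 1 | 500 | 1.1 | 2.1 | 4.3 | 1.0 | 0% |
| Run 2 (500 iter, warm cache) | 2 | 500 | 1.8 | 4.2 | 6.8 | 9.9 | 0% |
| Run 2 (500 iter, warm cache) | 3 | 500 | 5.9 | 20.9 | 71.4 | 64.9 | 37% |
| Run 2 (500 iter, warm cache) | 4 | 500 | 33.1 | 204.3 | 370.1 | 93.3 | 90% |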

Read: Run 2's p95 is the right number to plan against. Depth 1-3 sit comfortably under the 250 ms timeout. Depth 4 p95 (204 ms) is just under budget; p99 (370 ms) crosses it on the synthetic worst case (every contact in UCC — no filter benefit). Production sparse networks should land well below.

11.2 walk_rows verification (H4)

Direct EXPLAIN ANALYZE measurement — same methodology as §7.5, post-fix:

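| Depth | n | walk_rows p50 | walk_rows p95 | walk_rows p99 | peak frontier p99 | recursive iters | buffers hit p95 | buffers read |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 100 | 7 | 7 | 7 | 1 | 1 | 16 | 0 |
| 2 | 100 | 71 | 228 | 276 | 27 | 2 | 1,160 | 0 |
| 3 | 100 | 392 | 3,068 | 4,036 | 651 | 3 | 34,650 | 0 |
| 4 | 100 | 4,934 | 24,754 | 53,968 | 14,956 | 4 | 416,366 | 0 |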

H4 still passes at depth 3: 4,036 walk_rows p99 vs 50,000 budget — 12.4× under. Depth 4 p99 (53,968) marginally exceeds the H4 ceiling at the absolute tail, driven by occasional fan-out on hub contacts; p95 (24,754) sits well within. Zero disk reads — index still fits in shared_buffers.

11.3 MAX_EDGES_PER_HOP sweep (post-fix)

Same methodology as §7 (200 iter × depth 3, sweeping cap values). Post-fix the cap interacts differently with latency than pre-fix:

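| cap | n | p50 ms | p95 ms | p99 ms | mean paths | truncated rate |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 200 | 63.7 | 112.0 | 128.9 | 61.0 | 9% |
| 25 (current) | 200 | 85.2 | 256.9 | 745.6 | 67.8 | 39.5% |
| 50 | 200 | 94.9 | 383.4 | 598.0 | 65.2 | 37.5% |
| 100 | 200 | 80.9 | 252.1 | 573.8 | 65.2 | 32.5% |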

Surprising finding: post-fix, MAX_EDGES_PER_HOP=10 has the lowest tail latency (p99 129 ms vs 745 ms at the current 25). Mean paths only drops from 67.8 to 61.0 — modest. The cap-25 default was tuned pre-fix; post-fix the in-network filter already prunes most edges, so a tighter per-hop cap reduces variance without losing meaningful coverage. Recommendation: consider lowering to 10 — open question whether the path-coverage trade is worth the latency stability.

11.4 max_paths sweep (post-fix)

Same methodology as §7 (200 iter at depths 3 and 4):

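| depth | max_paths | n | p50 ms | p95 ms | p99 ms | mean paths | truncated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 50 | 200 | 55.0 | 171.1 | 271.7 | 42.0 | 68.5% |
| 3 | 100 | 200 | 75.5 | 233.1 | 418.0 | 71.2 | 43.5% |
| 3 | 200 | 200 | 39.7 | 131.9 | 435.0 | 88.3 | 13.5% |
| 3 | 500 | 200 | 37.0 | 118.0 | 186.5 | 110.0 | 2% |
| 4 | 50 | 200 | 141.3 | 380.5 | 732.1 | 47.4 | 94% |
| 4 | 100 | 200 | 138.6 | 489.7 | 1,143.3 | 93.3 | 88% |
| 4 | 200 | 200 | 145.1 | 786.3 | 1,510.0 | 180.0 | 85% |
| 4 | 500 | 200 | 199.9 | 1,238.6 | 2,105.6 | 413.5 | 68% |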

Read: Depth-3 p95 stays under the 250 ms budget at every cap (118-233 ms), although p99 exceeds it at max_paths 50, 100, and 200. Depth 4 latency scales with max_paths: the higher the cap, the more paths to sort/hydrate post-walk, the more p99 grows. max_paths=100 at depth 4 (current default) sits at p95 ~490 ms in the synthetic worst case, over the 250 ms timeout. Production should land lower, but if depth 4 becomes a hot path, consider a depth-aware controller clamp (depth 4 → max 50).

11.5 Concurrent load (post-fix)

Full curve, 500 calls × depth 3, same setup as §7.6 (connection pool bumped to N+5):

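| Concurrency | Total calls | Wall time | Throughput | p50 ms | p95 ms | p99 ms | Error rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (serial) | 500 | 28.6 s | 17.5 req/s | 35 | 156 | 351 | 0% |
| 5 | 500 | 11.0 s | 45.3 req/s | 86 | 166 | 279 | 0% |
| 20 | 500 | 17.5 s | 28.5 req/s | 440 | 1,708 | 2,583 | 0% |
| 50 | 500 | 14.0 s | 35.6 req/s | 1,004 | 1,870 | 2,511 | 0% |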

Head-to-head vs §7.6 pre-fix

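| Concurrency | Throughput pre/post (req/s) | p50 pre/post (ms) | p95 pre/post (ms) | p99 pre/post (ms) | Errors pre/post |
| --- | --- | --- | --- | --- | --- |
| 1 | 12.5 → 17.5 (+40%) | 65 → 35 (−46%) | 149 → 156 (+5%) | 178 → 351 (+97%) | 0% → 0% |
| 5 | ~14 → 45.3 (3.2×) | 232 → 86 (−63%) | 668 → 166 (−75%) | 1,293 → 279 (−78%) | 0.5% → 0% |
| 20 | 17.1 → 28.5 (+67%) | 799 → 440 (−45%) | 1,957 → 1,708 (−13%) | 2,641 → 2,583 (−2%) | 0% → 0% |
| 50 | 14.2 → 35.6 (2.5×) | 2,492 → 1,004 (−60%) | 5,000 → 1,870 (−63%) | 6,939 → 2,511 (−64%) | 0.2% → 0% |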
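
The throughput deltas in this table can be recomputed from the two concurrency tables. A small sketch, with the pre-fix c=5 figure taken as 14.0 since the source gives it only as "~14"; the constant and method names are hypothetical.

```ruby
# Throughput in req/s by concurrency level, copied from the tables.
PRE_FIX  = { 1 => 12.5, 5 => 14.0, 20 => 17.1, 50 => 14.2 }.freeze
POST_FIX = { 1 => 17.5, 5 => 45.3, 20 => 28.5, 50 => 35.6 }.freeze

# Post-fix throughput as a multiple of pre-fix throughput, per concurrency level.
def speedups(pre, post)
  pre.to_h { |concurrency, before| [concurrency, (post.fetch(concurrency) / before).round(2)] }
end

speedups(PRE_FIX, POST_FIX).each do |concurrency, ratio|
  puts "c=#{concurrency}: #{ratio}x"
end
```

Expressed as multiples, the +40% and +67% rows correspond to 1.4× and about 1.67×, which makes them directly comparable with the 3.2× and 2.5× rows.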

What this tells us

  1. Big win. The partial covering index pays its biggest dividend under contention: index-only scans on the EXISTS probe avoid heap fetches that thrash the buffer pool when many threads compete. Throughput jumps 1.7–3.2× across c=5/20/50 (3.2× at c=5, 1.7× at c=20, 2.5× at c=50) with zero timeouts (vs 2 in pre-fix).
  2. Sweet spot is c=5 post-fix (45.3 req/s, p99 279 ms) — the system has room to pipeline a few requests per Ruby worker, but contention dominates beyond that. Pre-fix the curve plateaued at ~14-17 req/s regardless of concurrency.
  3. Serial p99 regressed (178 → 351 ms) — the EXISTS subquery adds per-edge overhead in the synthetic worst case (every contact is in UCC). Production sparse networks should avoid this hit, since the EXISTS prunes 99%+ of edges before the recursion expands them. p50 actually improved at c=1.
  4. Still GVL-bound for serial throughput (single Ruby process plateaus 17-45 req/s). Horizontal scaling (multiple Puma workers) remains the production path.

Reproduce: bundle exec rake "perf:deep_finder_concurrent[N,500,3,<collection_id>]" for N ∈ {1, 5, 20, 50}.

11.6 Net changes vs pre-fix

| Metric | Pre-fix | Post-fix | Verdict |
| --- | --- | --- | --- |
| Depth 3 p95 (Run 2) | ~14 ms | 21 ms | ~50% slower |
| Depth 3 p99 (Run 2) | ~20 ms | 71 ms | ~3.5× slower |
| Depth 4 p95 (Run 2) | ~70 ms | 204 ms | ~3× slower (under 250 ms timeout) |
| Depth 4 p99 (Run 2) | ~95 ms | 370 ms | over 250 ms timeout |
| Walk_rows depth-3 p99 | 3,645 | 4,036 | essentially unchanged |
| Concurrent throughput (50 threads) | 14.2 req/s | 35.6 req/s | 2.5× faster |
| Mean paths returned (depth 3) | ~43 | 65 | +50% coverage |
| Mean paths returned (depth 4) | ~5.9 | 93 | +1,475% coverage |
| Out-of-network leaks (depth 3+) | middle hops could traverse arbitrary contacts | every hop in-network | correctness fix |
| Multi-network scaling | walk explores global edge graph | walk stays within network's subgraph | cross-network leak closed |

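Honest summary: single-query latency degraded modestly at depth 3 (still well under timeout) and notably at depth 4 (p99 over timeout in synthetic worst case). In exchange, every hop is now in-network, mean paths returned rose at both depth 3 and depth 4, and concurrent throughput at 50 threads is 2.5× higher.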

The synthetic is the worst case for latency (every contact in UCC means EXISTS adds overhead with no filtering benefit). Production sparse networks should land between pre-fix and post-fix synthetic numbers, leaning toward the better end as the EXISTS prunes 99%+ of edges via index seek.

12. Open follow-ups


Companion docs: spec.md · performance.md · index.html · raw CSVs in backend/out/perf/