Spike · GET-109

Network Intelligence System

Unified architecture for Getro's network intelligence: connection paths (spec 007), relationship strength + reachability + key connection (spec 008). Built on shared Sidekiq + Postgres caching infrastructure with Findem enrichment.

TL;DR

Two intelligence layers, one infrastructure. Spec 007 surfaces who can intro you to companies on your lists (direct + work-overlap intro paths). Spec 008 turns email + calendar metadata into Warm/Known/Cold strength tiers, reachability rollups, and key-connection picks. Both share the same per-pair caches, edge tables, and rollup architecture documented in the technical spec. 007 v1 ships independently of email/calendar ingestion; 008 thin V1 ships at ~50% heuristic coverage and grows to ~88% as Findem enrichment lands.

Technical spec

Unified architecture & execution plan

For engineers. The single source of truth for the system — services, data model, caching, integrations, phased plan covering both 007 and 008.

System architecture + ER diagrams
Data model: per-pair caches + edges + rollups
§6 Caching architecture (plain English + technical)
Rule engine & scoring service
Operational envelope (numbers)
Phased plan covering 007 v1/v2/v3 + 008 slices
Decision records — DR-01 through ADR-007-B

Open technical spec →

Data flow narrative

From data to strength

For PM, design, reviewers. What data we ingest and how it becomes a signal.

Raw data → signal primitives
Every heuristic in plain English + code
Combining clauses into tiers
Reachability rollup
Worked scenarios + scale walkthrough
Privacy guardrails

Open data-flow narrative →

Integration deep-dive

Google & Microsoft — per provider

For backend engineers. What exists, what extends, what's greenfield, with code skeletons and official doc links.

Verified current-state audit
OAuth flow + scope management
Gmail metadata + Calendar clients
MS Graph Mail + Calendar (greenfield)
Azure AD multi-tenant setup
Shared-mailbox decision matrix

Open integration deep-dive →

Architecture decision record

Graph DB vs Postgres for connection paths

For tech leads + reviewers. The long-form analysis behind ADR-007-A — why we chose Postgres tables over Neo4j, Apache AGE, and live JOINs.

The two query patterns (list view + drill-in)
Why this is not deeply graph-shaped
Options matrix & performance comparison
SQL vs Cypher side-by-side
Multi-edge-type schema strategy (5 path types)
Operational cost breakdown + revisit triggers

Open ADR companion →

Findem code audit

Findem code findings (firstcut + app-next)

For engineers + reviewers. What Findem already does, with clickable links to source files in firstcut and app-next.

The overlap kernel (formulas worth copying)
connection_svc batch path + 15k cap
LoadConnections runtime merge (backend + app-next consumer)
sandbox/matches — what it does and doesn't include
Profile data model (sparse connections, missing score)
Findem capability gaps F1–F9

Open code findings →

Architecture variant

Apache AGE — graph in Postgres

Variant: keep one database, add openCypher graph queries via the AGE extension. Smaller step toward graph capability than Neo4j; bigger step than recursive CTEs.

Graph schema (vlabels + elabels)
Cypher inside cypher() SQL functions
Same Postgres backups + connection pool
Hosting compatibility caveat (RDS doesn't support the AGE extension)
Operational envelope + migration cost
ADR-AGE — adopt only if conditions met

Open AGE variant →

Architecture variant

Neo4j — dedicated graph database

Variant: Postgres remains source of truth; Neo4j handles traversal and shortest-path. Best graph performance; significant operational overhead.

Two databases, sync layer (dual-write or CDC)
Native Cypher + shortestPath() + Graph Data Science
4-hop drill-in feasible (~600ms vs ~15s in PG)
~$200–800/month hosting (Aura managed)
6–10 week migration + 2-week canary
ADR-NEO4J — adopt only if conditions met

Open Neo4j variant →

Performance benchmark

DeepFinder load test — depths 1 → 4

Production-scale benchmark of the Postgres recursive-CTE walk against a 500k-contact synthetic graph calibrated to Inovia Capital's real shape. Plain-English breakdown of every metric with charts.

500k contacts, 500k orgs, 2.3M work_overlap edges
4 independent runs × 4 depths, 6,500+ queries in total
Parameter sweeps: MAX_EDGES_PER_HOP, limit
Verdict: depth 4 fast (p95 ~70ms), 250ms timeout safe
Bottleneck is the LIMIT, not the SQL
Recommendations for cap tuning
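
To make the two swept parameters concrete, here is a minimal sketch of a depth-bounded path walk with a per-hop edge cap and a result limit. It is an in-memory breadth-first search, not the benchmarked recursive CTE; the toy graph, the cap value, and the function name `bounded_paths` are assumptions made for illustration. Only the parameter names `MAX_EDGES_PER_HOP` and `limit` come from the list above.

```python
from collections import deque

# Hypothetical value; the benchmark sweeps this parameter.
MAX_EDGES_PER_HOP = 2


def bounded_paths(graph, start, max_depth, limit):
    """Return up to `limit` cycle-free paths of at most `max_depth` hops from `start`."""
    paths = []
    queue = deque([[start]])
    while queue and len(paths) < limit:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 == max_depth:
            continue  # depth budget spent, do not expand further
        # Skip nodes already on the path, then keep at most MAX_EDGES_PER_HOP edges.
        neighbours = [n for n in graph.get(path[-1], []) if n not in path]
        for nxt in neighbours[:MAX_EDGES_PER_HOP]:
            queue.append(path + [nxt])
    return paths


# Toy graph (made up): "a" has three outgoing edges, so the cap drops "d".
graph = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e", "f"],
}

all_paths = bounded_paths(graph, "a", max_depth=2, limit=10)
first_three = bounded_paths(graph, "a", max_depth=2, limit=3)
```

In this sketch the cap bounds the fan-out at each hop, while `limit` stops the walk as soon as enough paths have been collected, so the two knobs trade completeness for latency in different ways.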

Open performance benchmark →

Backfill scale reality check

Production backfill plan — what we're up against

Measured prod numbers, sandbox5 empirical throughput, and the realistic plan for running the work_overlap backfill at production scale.

Prod: 3.5M contacts to backfill, ~570M edges expected
Sandbox5: 12k edges/min sustained on optimized SQL
SQL benchmark: 619s → 15s per contact (40× speedup)
Estimated prod runtime: 8–30 hours depending on Sidekiq fleet
Known blocker: sandbox5 Sidekiq stability (interrupted jobs are not resuming)
Prod runbook + acceptance criteria
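
A rough back-of-envelope check shows how the figures above relate. It assumes that the 12k edges/min sandbox5 rate describes a single processing stream and that throughput scales linearly with the number of streams; the list above states neither, so treat the result as an order-of-magnitude sketch only. The function names are hypothetical.

```python
EXPECTED_EDGES = 570_000_000   # ~570M edges expected in prod (from the list above)
EDGES_PER_MIN = 12_000         # sandbox5 sustained rate (from the list above)


def hours_at(parallel_streams):
    """Hours to write all edges, assuming linear scaling across streams."""
    minutes = EXPECTED_EDGES / (EDGES_PER_MIN * parallel_streams)
    return minutes / 60


def streams_needed(target_hours):
    """Number of 12k-edges/min streams needed to finish within target_hours."""
    return hours_at(1) / target_hours


single_stream_hours = hours_at(1)      # about 792 hours, roughly 33 days
streams_for_30h = streams_needed(30)   # about 26
streams_for_8h = streams_needed(8)     # about 99
```

Under these assumptions, the 8–30 hour estimate corresponds to roughly 26–99 times the sandbox5 rate, which is consistent with the runtime depending heavily on the size of the Sidekiq fleet.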

Open backfill scale doc →

Load test (v1 vs v2 · local + sandbox5)

connection_paths: legacy vs FinderV2

k6 sweep across depths 1–4 × max_paths 25/100 × impl v1/v2 on both local and sandbox5 (real data). The real data shows that the legacy Finder can't scale.

Sandbox5 depth=2: v2 5s p95, v1 38s p95 (7.6× faster)
Sandbox5 throughput: v2 4 RPS, v1 0.3 RPS (15× more)
v2 depth=1 sustains 18 RPS at 700ms p95
v2 depth=3 usable (~20s); depth=4 research-grade (~23–50s)
Local Rails dev numbers are misleading — trust sandbox5

Open load-test report →

Spec 039 explainer

Reachability & Strongest Connection filter

For PM, design, eng. The team-level rollup that turns the contacts list into an action list — column, filter, sort, profile chip, auto-update rules.

What "rollup" means + why a Postgres cache table
Why an OpenSearch mirror is unavoidable for filter/sort
3 workers: 2 backfills (one-time) + 1 incremental
Onboarding sequence — strict 5-step order
What happens when a new admin joins + syncs
Alternatives considered + why rejected

Open reachability explainer →

Interactive playground

Try the API live

Hit the deep_connection_paths endpoint with a UI: pick a collection, contact, depth, and max_paths. See ranked paths visualized as chains. Auto-detects local vs sandbox5.

Form-driven request builder
Visualizes each path as a node chain
Shows latency, truncation, raw JSON
Auth token saved in localStorage
Cmd+Enter to fire query

Open playground →

Drafted 2026-04-22 · Status: review · Source markdown