Network Intelligence System
Unified architecture for Getro's network intelligence: connection paths (spec 007), relationship strength + reachability + key connection (spec 008). Built on shared Sidekiq + Postgres caching infrastructure with Findem enrichment.
TL;DR
Two intelligence layers, one infrastructure. Spec 007 surfaces who can intro you to companies on your lists (direct + work-overlap intro paths). Spec 008 turns email + calendar metadata into Warm/Known/Cold strength tiers, reachability rollups, and key-connection picks. Both share the same per-pair caches, edge tables, and rollup architecture documented in the technical spec. 007 v1 ships independently of email/calendar ingestion; 008 thin V1 ships at ~50% heuristic coverage and grows to ~88% as Findem enrichment lands.
Unified architecture & execution plan
For engineers. The single source of truth for the system — services, data model, caching, integrations, phased plan covering both 007 and 008.
- System architecture + ER diagrams
- Data model: per-pair caches + edges + rollups
- §6 Caching architecture (English + technical)
- Rule engine & scoring service
- Operational envelope (numbers)
- Phased plan covering 007 v1/v2/v3 + 008 slices
- Decision records — DR-01 through ADR-007-B
From data to strength
For PM, design, reviewers. What data we ingest and how it becomes a signal.
- Raw data → signal primitives
- Every heuristic in plain English + code
- Combining clauses into tiers
- Reachability rollup
- Worked scenarios + scale walkthrough
- Privacy guardrails
Google & Microsoft — per provider
For backend engineers. What exists, what extends, what's greenfield, with code skeletons and official doc links.
- Verified current-state audit
- OAuth flow + scope management
- Gmail metadata + Calendar clients
- MS Graph Mail + Calendar (greenfield)
- Azure AD multi-tenant setup
- Shared-mailbox decision matrix
Graph DB vs Postgres for connection paths
For tech leads + reviewers. The long-form analysis behind ADR-007-A — why we chose Postgres tables over Neo4j, Apache AGE, and live JOINs.
- The two query patterns (list view + drill-in)
- Why this is not deeply graph-shaped
- Options matrix & performance comparison
- SQL vs Cypher side-by-side
- Multi-edge-type schema strategy (5 path types)
- Operational cost breakdown + revisit triggers
Findem code findings (firstcut + app-next)
For engineers + reviewers. What Findem already does, with clickable links to source files in firstcut and app-next.
- The overlap kernel (formulas worth copying)
- connection_svc batch path + 15k cap
- LoadConnections runtime merge (backend + app-next consumer)
- sandbox/matches — what it does and doesn't include
- Profile data model (sparse connections, missing score)
- Findem capability gaps F1–F9
Apache AGE — graph in Postgres
Variant: keep one database, add openCypher graph queries via the AGE extension. Smaller step toward graph capability than Neo4j; bigger step than recursive CTEs.
- Graph schema (vlabels + elabels)
- Cypher inside
cypher()SQL functions - Same Postgres backups + connection pool
- Hosting compatibility caveat (RDS doesn't support)
- Operational envelope + migration cost
- ADR-AGE — adopt only if conditions met
Neo4j — dedicated graph database
Variant: Postgres remains source of truth; Neo4j handles traversal and shortest-path. Best graph performance; significant operational overhead.
- Two databases, sync layer (dual-write or CDC)
- Native Cypher + shortestPath() + Graph Data Science
- 4-hop drill-in feasible (~600ms vs ~15s in PG)
- ~$200–800/month hosting (Aura managed)
- 6–10 week migration + 2-week canary
- ADR-NEO4J — adopt only if conditions met
DeepFinder load test — depths 1 → 4
Production-scale benchmark of the Postgres recursive-CTE walk against a 500k-contact synthetic graph calibrated to Inovia Capital's real shape. Plain-English breakdown of every metric with charts.
- 500k contacts, 500k orgs, 2.3M work_overlap edges
- 4 independent runs × 4 depths = 6,500+ queries
- Parameter sweeps: MAX_EDGES_PER_HOP, limit
- Verdict: depth 4 fast (p95 ~70ms), 250ms timeout safe
- Bottleneck is the LIMIT, not the SQL
- Recommendations for cap tuning
Production backfill plan — what we're up against
Measured prod numbers, sandbox5 empirical throughput, and the realistic plan for running the work_overlap backfill at production scale.
- Prod: 3.5M contacts to backfill, ~570M edges expected
- Sandbox5: 12k edges/min sustained on optimized SQL
- SQL benchmark: 619s → 15s per contact (40× speedup)
- Estimated prod runtime: 8–30 hours depending on Sidekiq fleet
- Known blocker: sandbox5 Sidekiq stability (interrupted ≠ resuming)
- Prod runbook + acceptance criteria
connection_paths: legacy vs FinderV2
k6 sweep across depths 1–4 × max_paths 25/100 × impl v1/v2 on both local and sandbox5 (real data). Real data exposes legacy Finder can't scale.
- Sandbox5 depth=2: v2 5s p95, v1 38s p95 (7.6× faster)
- Sandbox5 throughput: v2 4 RPS, v1 0.3 RPS (15× more)
- v2 depth=1 sustains 18 RPS at 700ms p95
- v2 depth=3 usable (~20s); depth=4 research-grade (~23–50s)
- Local Rails dev numbers are misleading — trust sandbox5
Reachability & Strongest Connection filter
For PM, design, eng. The team-level rollup that turns the contacts list into an action list — column, filter, sort, profile chip, auto-update rules.
- What "rollup" means + why a Postgres cache table
- Why an OpenSearch mirror is unavoidable for filter/sort
- 3 workers: 2 backfills (one-time) + 1 incremental
- Onboarding sequence — strict 5-step order
- What happens when a new admin joins + syncs
- Alternatives considered + why rejected
Try the API live
Hit the deep_connection_paths endpoint with a UI: pick a collection, contact, depth, and max_paths. See ranked paths visualized as chains. Auto-detects local vs sandbox5.
- Form-driven request builder
- Visualizes each path as a node chain
- Shows latency, truncation, raw JSON
- Auth token saved in localStorage
- Cmd+Enter to fire query