Graph DB vs Postgres for connection paths
Drafted 2026-04-27 · Updated 2026-04-28 · Status: review · Audience: Getro engineering team
Scope: spec 007 (network connections on company lists) primarily; 008 implications noted.
How this doc fits
This document explains the storage substrate decision underpinning the unified architecture in spec-technical.html. The canonical home of ADR-007-A is spec-technical.html §12; the long-form analysis here drives that record. Table names below match the unified data model in spec-technical.html §3.1.
1. TL;DR
Recommendation
Stay with Postgres tables maintained by Sidekiq. The 007 problem is bounded (2-hop, fixed pattern, aggregation-heavy) and Postgres handles it natively at the scales we expect. A graph DB or graph extension would add operational complexity without solving a problem we actually have. Revisit only if 3+ hop traversal, dense first-class social edges, or community-detection features land on the roadmap.
The rest of this document explains why — with concrete numbers, side-by-side queries, and a decision record. Skim sections 6–9 if you only have five minutes.
2. The actual question
Spec 007 has two query patterns, and they are not equally hot:
| Pattern | What it asks | Trigger | Latency budget |
|---|---|---|---|
| List view hot path | For each org on this list, count direct connections + intro paths and surface a strength signal. | Page render — fires every list visit | < 50ms p99 (rendered with row) |
| Drill-in cold path | Show every direct contact and every intro path into one specific org. | User click on a row | < 2s p99 (modal-style) |
The list view is an aggregation problem keyed on (collection_id, organization_id). The drill-in is a fixed 2-hop traversal with time-windowed bridges. Neither is multi-hop, neither requires shortest-path, neither needs centrality or community detection.
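Concretely, the list-view hot path reduces to one indexed read per render. A minimal sketch, assuming the shared_list_network_summary rollup from section 11.3 is already populated:

```sql
-- List-view hot path (sketch): a single indexed read against the
-- precomputed rollup; no traversal happens at render time.
SELECT organization_id,
       direct_count,
       intro_count_total,
       strength_score
FROM shared_list_network_summary
WHERE shared_list_id = $1        -- the list being rendered
ORDER BY strength_score DESC;
```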
3. Query shape (visual)
The "graph" in 007 has fixed depth and fixed pattern. Two relationship types, one bridge:
```mermaid
flowchart LR
    ViaC["Via-contact<br/>shared in collection"]:::person
    TgtC["Target-contact<br/>currently at TargetOrg"]:::person
    TargetOrg["Target Org<br/>on the list"]:::org
    BridgeOrg["Bridge Org<br/>shared past employer"]:::org
    ViaUser["Team user<br/>who knows via-contact"]:::person
    ViaUser -- "UCC.shared" --> ViaC
    ViaC -- "WORKED_AT (date_from, date_to)" --> BridgeOrg
    TgtC -- "WORKED_AT (date_from, date_to)" --> BridgeOrg
    TgtC -- "CURRENT_AT" --> TargetOrg
    linkStyle 0 stroke:#0969da,stroke-width:2px
    linkStyle 1 stroke:#0969da,stroke-width:2px
    linkStyle 2 stroke:#0969da,stroke-width:2px
    linkStyle 3 stroke:#15803d,stroke-width:2px
```
What's actually being computed
- Direct connection: ViaC has CURRENT_AT TargetOrg directly (1 hop from contact set into org).
- Intro path: ViaC and TgtC share a BridgeOrg, with positive time overlap; TgtC has CURRENT_AT TargetOrg (2 hops).
- Bridge condition: same organization_id AND date ranges intersect. This is a property filter on the WORKED_AT edges, not a path-finding question (see the SQL sketch below).
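Expressed in SQL, the bridge condition is an ordinary join predicate. A minimal sketch over contact_work_experiences, assuming date_to is populated for past roles:

```sql
-- Coemployment bridge (sketch): two work-history rows share an employer
-- and their date ranges intersect (standard interval-overlap test).
SELECT a.contact_id                        AS via_contact_id,
       b.contact_id                        AS other_contact_id,
       a.organization_id                   AS bridge_org,
       GREATEST(a.date_from, b.date_from)  AS overlap_from,
       LEAST(a.date_to, b.date_to)         AS overlap_to
FROM contact_work_experiences a
JOIN contact_work_experiences b
  ON  b.organization_id = a.organization_id
  AND b.contact_id     <> a.contact_id
  AND a.date_from <= b.date_to             -- positive time overlap
  AND b.date_from <= a.date_to;
```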
4. Why it looks like a graph problem
- The spec uses the word "path" everywhere.
- The data has people, organizations, and relationships — the textbook graph definition.
- A whiteboard sketch of the answer is a literal graph: nodes for contacts/orgs/users, edges for UCC and work-history.
- Cypher / openCypher reads more naturally than SQL for "find paths."
All of these are real, but they don't establish that a graph engine is the right runtime. They establish that the conceptual model is graph-shaped. Conceptual model and storage substrate are independent decisions.
5. Why it isn't deeply graph-shaped
Three structural reasons.
5.1 Depth is fixed at 2
007 has exactly one query pattern with two hops; there is no recursion, no transitive closure, no shortest-path. Graph databases earn their keep at depth ≥ 3, where SQL self-joins or recursive CTEs start to degrade. At depth 2, a well-indexed Postgres join is actually faster than most graph engines because it doesn't pay the per-edge object-graph traversal cost.
5.2 Aggregation, not traversal, is the hot path
The list view is "count direct + count intro per org, sort, render." That's a GROUP BY. Graph databases are bad at GROUP BY across thousands of rows — they're optimized for "find pattern X in the graph," not "summarize all matches per key." Postgres planners are very mature at this.
5.3 The interesting edges don't physically exist
The coemployment edge ("Mike worked alongside Sarah at Stripe 2020-2022") is not stored anywhere as a record. It's inferred from a JOIN of two contact_work_experiences rows where organization_id matches and date ranges overlap. So no matter what storage we pick — Postgres tables, Neo4j, Apache AGE — we face the same compute problem:
- Precompute and materialize the inferred edges, OR
- Compute at query time via a JOIN.
A graph DB doesn't change that fundamental choice. It changes the syntax; the compute stays the same.
6. Options evaluated
| Option | Pros | Cons (for our shape) | Verdict |
|---|---|---|---|
| Postgres tables (current plan): ContactCoemploymentEdge + summary rollup, maintained by Sidekiq | No new infrastructure; fastest reads at depth 2 (pure index hits); existing backup, DR, and test tooling | Write amplification on work-history changes; precompute code to own and reconcile | Recommended |
| Neo4j (separate graph DB) | Natural Cypher path syntax; wins at 3+ hop traversal | New runtime plus an ETL sync layer; weaker at GROUP BY aggregation; the cross-DB network hop alone can consume the 50ms list-view budget; licensing cost | Not worth it |
| TigerGraph / JanusGraph | Built for massive distributed graphs | Same cons as Neo4j with heavier operations, for a workload far below their sweet spot | Not worth it |
| Apache AGE (Postgres extension, openCypher) | Cypher queries without a second database; same transactions and backups | No performance gain at depth 2; immature ecosystem; extension version risk; doubles the query-language surface | Risk without reward |
| Postgres recursive CTEs (no extension) | No precompute, no new infrastructure | Doesn't avoid the per-render fan-out cost; degrades beyond depth 2 | Drill-in fallback |
| pgRouting | — | Built for road networks (geographic distance). Wrong tool. | Not a fit |
7. Performance comparison
Estimated p99 read latency for the two query patterns. Numbers are order-of-magnitude based on workload shape, not benchmarks; treat as rough guidance.
7.1 List view (50 orgs on the list, 10k shared contacts in collection)
| Approach | Estimated p99 |
|---|---|
| Postgres, precomputed summary table | ~30 ms |
| Postgres, live JOIN at query time | ~250 ms |
| Neo4j or other separate graph DB | engine time plus a 30–100 ms network round trip |
7.2 Drill-in (one org, enumerate all paths)
| Approach | Estimated p99 |
|---|---|
| Postgres, precomputed edges | ~80 ms |
| Postgres, recursive CTE (no precompute) | comfortably inside the 2 s budget at depth 2 |
| Neo4j or other separate graph DB | comparable engine time plus the same network round trip |
Two takeaways:
- Precomputed Postgres wins both query patterns at this depth. The win comes from indexes, not from the query language.
- Network hop matters. Any separate-DB option pays a 30–100ms tax just for the round trip. At a 50ms budget, that's the entire budget.
8. Operational cost
What it costs to run, beyond the latency story.
| Cost dimension | Postgres tables | Apache AGE | Neo4j (separate DB) |
|---|---|---|---|
| New runtime in production | No | Extension | Yes |
| Backups / DR / HA | Existing | Existing PG | New plan needed |
| Sync layer (cross-store) | None | None (same DB) | ETL pipeline + lag monitoring |
| Query engines for engineers to learn | SQL only | SQL + Cypher | SQL + Cypher |
| Test fixtures + factories | Existing FactoryBot | Existing FactoryBot + AGE seed | FactoryBot + Neo4j seed + sync mocks |
| Migration rollback story | Standard Rails | Extension version pin | Cross-DB consistency |
| Estimated added engineering weeks (one-time) | 0 | 2–3 | 6–10 |
| Estimated added on-call burden | 0 | Low | Medium-high |
The hidden cost: doubled mental model
Even if a graph DB is "fine" operationally, every engineer touching connection paths now has to reason in two query languages, two transaction models, two failure modes. That cost compounds for every feature in the area, not just the first one.
9. SQL vs Cypher (the same query)
The drill-in question: "list every direct contact and every intro path into a single organization, scoped to one collection."
Postgres (precomputed, current plan)
```sql
-- Direct contacts
SELECT contact_id, current_title, strongest_source
FROM collection_org_current_shared_contacts
WHERE collection_id = $1
  AND organization_id = $2;

-- Intro paths
SELECT via_contact_id,
       other_contact_id,
       organization_id AS bridge_org,
       overlap_from,
       overlap_to,
       overlap_months
FROM contact_coemployment_edges
WHERE collection_id = $1
  AND other_contact_id IN (
    SELECT contact_id
    FROM collection_org_current_shared_contacts
    WHERE collection_id = $1
      AND organization_id = $2
  )
ORDER BY overlap_to DESC;
```
Cypher (Neo4j or AGE, hypothetical)
```cypher
// Direct contacts
MATCH (org:Org {id: $org_id})
      <-[:CURRENT_AT]-(direct:Contact)
      <-[:UCC_SHARED {collection: $cid}]-(u:User)
RETURN direct, u, 'direct' AS type;

// Intro paths
MATCH (org:Org {id: $org_id})
      <-[:CURRENT_AT]-(target:Contact)
      -[w1:WORKED_AT]->(bridge:Org)
      <-[w2:WORKED_AT]-(via:Contact)
      <-[:UCC_SHARED {collection: $cid}]-(u:User)
WHERE w1.date_from <= w2.date_to
  AND w2.date_from <= w1.date_to
RETURN via, target, bridge, w1, w2,
       'intro' AS type
ORDER BY w1.date_to DESC;
```
Observation: Cypher reads more like a sentence; SQL is more verbose. That's it. Cypher's edge-walking syntax doesn't reduce the work — both queries do the same JOINs, evaluate the same time-overlap predicate, and produce the same rows. At depth 2 with proper indexes, the planners run both in similar time. The Postgres version is the one we already know how to operate.
Where Cypher would actually pull ahead
If we ever needed (via)-[:WORKED_AT]->(:Org)<-[:WORKED_AT]-(:Contact)-[:WORKED_AT]->(:Org)<-[:WORKED_AT]-(target) — i.e., 4-hop "friend of friend at same company" — Cypher's syntax would compress drastically and the engine's traversal optimizer would matter. We don't need that today.
10. Scale walkthrough
Concrete sizing for a collection with 10,000 shared contacts (the upper end of what we expect for an active VC fund or accelerator network).
10.1 Edge cardinality
- 10,000 contacts × 5 past employers each = 50,000 work-history rows.
- Per past employer, average ~20 coemployees (people who happened to work there at any time). Most coemployees are not currently at any org we care about.
- After filtering to "the other side is currently at an org on at least one of our lists": ~5–20 useful edges per contact.
- Total edges per collection: ~100,000 rows in ContactCoemploymentEdge.
10.2 Storage
~50 bytes/row × 100,000 rows = ~5 MB per collection. Trivial. Indexes add ~3× — still under 20 MB.
10.3 Read at page-load
Index on (collection_id, organization_id); the list-view query returns 50 rows (one per org on the list). ~30ms p99 on standard Postgres hardware.
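For reference, the supporting indexes might look like this; index names are illustrative:

```sql
-- Covering indexes for the two hot reads (sketch; names illustrative).
CREATE INDEX idx_shared_contacts_by_collection_org
    ON collection_org_current_shared_contacts (collection_id, organization_id);
CREATE INDEX idx_coemployment_by_collection_contact
    ON contact_coemployment_edges (collection_id, other_contact_id);
```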
10.4 Write amplification
When a contact_work_experience row is added or updated:
- Find collections this contact is shared in (small, indexed lookup).
- For each collection, find candidate coemployees from contact_work_experiences by matching organization + time. Bounded query.
- Upsert ~10–100 edge rows. Cheap.
Per write event: tens to a few hundred milliseconds in a Sidekiq worker. Doesn't block user-facing requests.
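A sketch of the edge upsert the worker might issue for one (contact, collection) pair; it assumes a unique index on the edge identity columns:

```sql
-- Recompute edges for the contact whose work history changed (sketch).
INSERT INTO contact_coemployment_edges
    (collection_id, via_contact_id, other_contact_id,
     organization_id, overlap_from, overlap_to)
SELECT $1,                                   -- collection from step 1's lookup
       a.contact_id, b.contact_id, a.organization_id,
       GREATEST(a.date_from, b.date_from),
       LEAST(a.date_to, b.date_to)
FROM contact_work_experiences a
JOIN contact_work_experiences b
  ON  b.organization_id = a.organization_id
  AND b.contact_id     <> a.contact_id
  AND a.date_from <= b.date_to
  AND b.date_from <= a.date_to
WHERE a.contact_id = $2                      -- the changed contact
ON CONFLICT (collection_id, via_contact_id, other_contact_id, organization_id)
DO UPDATE SET overlap_from = EXCLUDED.overlap_from,
              overlap_to   = EXCLUDED.overlap_to;
```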
11. Multiple edge types (education, investors, board peers)
Spec 007 v1 ships coemployment only, but the broader product intent treats education, co-investment, investor → employee, and board peer as valid intro paths too. Adding these is the most likely near-term schema change. This section answers: does adding them change the storage decision?
Short answer
No. All five path types are still 2-hop with a fixed bridge entity, so the graph-vs-Postgres tradeoff is unchanged. What does change is the schema design — we need per-type tables and a unified summary, designed for extensibility from day one.
11.1 The five path types side by side
All five share the same shape: via_contact → bridge_entity ← target_contact. Only the bridge entity and the time-bound rule change.
| Path type | Bridge entity | Match rule | Source data | Status in Getro |
|---|---|---|---|---|
| Coemployment | Organization | same organization_id + positive date overlap (≥1 month, gap-tolerant) | contact_work_experiences | Existing — covered by 007 v1 |
| Coeducation | School | same school + positive year overlap | contact_education | Existing schema — kernel work needed |
| Co-investment | Company invested in | both invested in the same company (no time predicate, or simple "still active") | External: Pitchbook / Crunchbase / Findem cap-table | New ingest pipeline |
| Investor → Employee | Company | A invested in C; B currently works at C | Cross-reference of investor data + work history | Depends on co-investment |
| Board peer | Company | both on the same company's board (overlap optional) | Same source as investor data | Depends on co-investment |
11.2 Visual: same shape, different bridge
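All five types reduce to the section 3 diagram with a generic bridge node; only the bridge entity and the time rule from the table above change:

```mermaid
flowchart LR
    Via["Via-contact"]:::person -- "edge kind varies" --> Bridge["Bridge entity<br/>(Organization / School / Company)"]
    Tgt["Target-contact"]:::person -- "edge kind varies" --> Bridge
    Tgt -- "CURRENT_AT" --> TargetOrg["Target Org on the list"]:::org
```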
11.3 Schema strategy: per-type tables + unified summary
Two viable storage shapes. We recommend Option B.
Option A — Polymorphic edge table
```sql
contact_relationship_edges (
  collection_id      bigint,
  kind               enum,
  via_contact_id     bigint,
  other_contact_id   bigint,
  bridge_entity_type enum,
  bridge_entity_id   bigint,
  bridge_metadata    jsonb,  -- date ranges
  ...
)
```
Pros: single table, single query path.
Cons: coarse indexes; jsonb for time bounds is awkward; per-kind reconciliation harder.
Option B — Per-type tables, unified summary Recommended
```sql
contact_coemployment_edges
contact_coeducation_edges
contact_coinvestment_edges
contact_investor_employee_edges
contact_board_peer_edges

-- Plus the rollup the list view reads:
shared_list_network_summary (
  shared_list_id       bigint,
  organization_id      bigint,
  direct_count         int,
  intro_count_work     int,
  intro_count_edu      int,
  intro_count_investor int,
  intro_count_board    int,
  intro_count_total    int,
  strength_score       numeric,
  top_intro_preview    jsonb,
  ...
)
```
Pros: native types, narrow indexes, per-kind reconciliation jobs, isolated migrations.
Cons: drill-in needs UNION ALL across N tables (still cheap at depth 2).
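A minimal sketch of that UNION ALL, shown for three of the five tables; $2 stands for the set of contacts currently at the target org, taken from the shared-contacts lookup:

```sql
-- Drill-in across per-type edge tables (sketch): N small indexed reads,
-- each identical in shape to the coemployment query in section 9.
SELECT 'work' AS kind, via_contact_id, other_contact_id
FROM contact_coemployment_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2)
UNION ALL
SELECT 'edu', via_contact_id, other_contact_id
FROM contact_coeducation_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2)
UNION ALL
SELECT 'investor', via_contact_id, other_contact_id
FROM contact_investor_employee_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2);
-- ...and likewise for the co-investment and board-peer tables.
```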
11.4 How this changes the math
| Dimension | Coemployment only | All 5 edge types | Delta |
|---|---|---|---|
| Storage per collection | ~5 MB | ~15–25 MB | 3–5×, still trivial |
| List-view read latency (p99) | ~30 ms | ~30 ms | Unchanged (hits summary table) |
| Drill-in latency (p99) | ~80 ms | ~150 ms | +70 ms (5 small UNION ALL) |
| Write amplification triggers | work-exp change | work-exp / education / investment / board change | More triggers, each still cheap |
| Cypher syntactic appeal | Mild | Stronger (multi-edge pattern is cleaner) | Increases — but no runtime gain at depth 2 |
| Operational cost of a graph DB | High and unjustified | High and still unjustified | Unchanged |
11.5 The non-storage risk: investor data acquisition
Coemployment and coeducation derive from data Getro already owns. Investor and board edges require a data deal or vendor pipeline:
| Source option | Pros | Cons |
|---|---|---|
| Findem enrichment (existing investor_provider, pitchbook_provider in firstcut) | Already a sync partner; no new vendor; surfaces via existing Findem→Getro pipeline | Coverage and freshness depend on Findem's vendor relationships; not Getro-controlled |
| Direct Pitchbook / Crunchbase license | Direct control over coverage, refresh, schema | New vendor relationship; cost; ingest pipeline; ongoing licensing |
| Manual entry / customer-supplied | Zero external dependency | Doesn't scale; coverage gaps obvious to users |
This is the v1 vs v2 scoping question — it isn't a storage question. Whichever DB substrate we pick, the data has to come from somewhere.
11.6 Where multiple edge types could shift the storage decision
None of the five path types alone changes the answer. Combinations or extensions do:
| Trigger | Why it changes the math |
|---|---|
| Drill-in expands to 3-hop mixed-type chains ("A invested in C, B works at C, B went to school with target") | Recursive CTE complexity grows with branching factor across heterogeneous edges; Cypher's pattern syntax wins. |
| "Show me any path" feature with up to N hops across any of the 5 edge types | Postgres falls behind; this is a graph DB sweet spot. |
| Strength scoring across mixed edge types with weighted shortest-path | Graph DBs ship Dijkstra/A* as built-ins; Postgres requires custom CTE work. |
For 007 v1 (coemployment) and even 007 v2 (all 5 types, still 2-hop), Postgres remains the right substrate. Revisit only if a 3+ hop mixed-type product feature is on the explicit roadmap.
Schema-design recommendation
Even if 007 v1 ships with coemployment only, design SharedListNetworkSummary today with the intro_count_edu, intro_count_investor, intro_count_board columns nullable. Migration cost is paid once; v2 then ships per-type-table additions without touching the hot read path.
12. When to revisit
Reconsider a graph layer if any of these become true:
| Trigger | Why it changes the math |
|---|---|
| 3+ hop traversal becomes a real product feature ("warm intros via mutual coworker of a coworker") | Recursive CTEs degrade with branching factor; graph engines start to win. |
| First-class social edges (LinkedIn 1st-degree, email-thread participants) become dense and primary | Edge volume per collection grows 10×–100×; in-place edges become heterogeneous; Cypher's pattern match is genuinely simpler. |
| Centrality / community detection lands on the roadmap ("super-connectors", network clusters) | These are graph-algorithm problems. SQL can't express them succinctly; graph DBs ship them as built-ins. |
| Path enumeration becomes interactive ("show me the cheapest 3 intro paths to anyone at Org X") | Shortest-path with weighted edges is a graph DB sweet spot. |
| Cross-collection or cross-network traversal becomes a feature | Postgres scoping by collection_id falls apart; graph DB modeling is more natural. |
None of these are in 007 or 008 today. If any land in a future quarter, the right move is to keep Postgres as source of truth and introduce a graph layer for the specific feature — not retrofit the entire stack.
13. What about 008?
Spec 008 (relationship strength) is not a graph problem at all. The hot computation is per-pair signal aggregation:
- For each (team_member, contact) pair: counters for emails sent/received in 12mo, meeting count in 24mo, last-contact recency, reply rate.
- Compare counters against tier predicates (Warm / Known / Cold rules).
- Roll up tiers per (contact, collection) for Reachability.
That's event-stream ingestion + counter rollup + threshold check. A graph DB doesn't help. You'd have to either store the interaction stats outside the graph DB (now you have two systems) or contort the data into edges-with-counters that no native graph algorithm benefits from. Adding a graph DB makes 008 worse, not better.
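For contrast, the 008 hot computation in plain SQL. This is a sketch: interaction_events and its columns are illustrative, not an existing table.

```sql
-- Per-pair counter rollup (sketch): aggregation over an event stream,
-- then tier predicates over the counters. No traversal anywhere.
SELECT team_member_id,
       contact_id,
       COUNT(*) FILTER (WHERE kind = 'email'
                          AND occurred_at > now() - interval '12 months') AS emails_12mo,
       COUNT(*) FILTER (WHERE kind = 'meeting'
                          AND occurred_at > now() - interval '24 months') AS meetings_24mo,
       MAX(occurred_at) AS last_contact_at
FROM interaction_events
GROUP BY team_member_id, contact_id;
```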
14. Recommendation
- Build 007 v1 storage in Postgres tables, maintained by Sidekiq workers using Rails after_commit hooks. Start with ContactCoemploymentEdge, CollectionOrgCurrentSharedContact, and SharedListNetworkSummary.
- Design the schema for multiple edge types from day one. Keep one table per edge kind (coemployment, coeducation, coinvestment, investor-employee, board-peer) and one unified SharedListNetworkSummary with intro_count_* columns nullable per kind. v1 ships with only the coemployment column populated; v2+ lights up the rest without touching the hot read path.
- Use the existing kernel logic from Contacts::ConnectionPaths::Finder as the per-write computation core (same bridge + time-overlap rule, same signal weighting). Reuse it across edge types; only the bridge entity and time predicate change.
- Use Postgres recursive CTEs as a fallback for any drill-in question that needs traversal beyond what the precomputed edges cover. No extension needed (see the sketch after this list).
- Skip Apache AGE and dedicated graph DBs for now. Park them as "revisit if X happens"; see section 12, plus the multi-edge-type triggers in section 11.6.
- Resolve the investor-data source question (Findem vs direct Pitchbook/Crunchbase vs manual) before locking 007 v2 scope. This is a data-acquisition question, not a storage question, but it gates the v2 schema's investor and board-peer tables.
- Do not introduce graph storage for 008 under any scenario. It's the wrong tool for that workload.
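The recursive CTE fallback from the list above, sketched as a hypothetical 3-hop walk over the precomputed coemployment edges:

```sql
-- Depth-limited traversal without any extension (sketch).
WITH RECURSIVE hops AS (
    SELECT via_contact_id, other_contact_id, 1 AS depth
    FROM contact_coemployment_edges
    WHERE collection_id = $1 AND via_contact_id = $2
  UNION ALL
    SELECT h.via_contact_id, e.other_contact_id, h.depth + 1
    FROM hops h
    JOIN contact_coemployment_edges e
      ON  e.collection_id  = $1
      AND e.via_contact_id = h.other_contact_id
    WHERE h.depth < 3                 -- hard depth cap
)
SELECT * FROM hops;
```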
Bottom line
What we're really building isn't "a graph database" — it's a denormalized index optimized for the specific query patterns 007 needs. Whether we call it "materialized edges in Postgres," "a graph store," or "a precomputed cache" is implementation detail. For 2-hop with aggregation, Postgres tables with the right indexes win. That's the answer for this particular shape — not a general statement about graph databases.
15. Decision record
- Context
- Spec 007 surfaces network connections (direct + work-overlap intro) on company lists. Findem's live nested-loop approach does not scale to Getro's list-view trigger (50 orgs × 10k contacts × employees = ~1.25B cell comparisons per render). We need a storage strategy that supports <50ms p99 list reads and <2s drill-in.
- Decision
- Use Postgres tables maintained asynchronously by Sidekiq workers via Rails after_commit hooks, with nightly reconciliation jobs as a backstop. v1 ships ContactCoemploymentEdge, CollectionOrgCurrentSharedContact, and SharedListNetworkSummary. The schema is designed for extension to coeducation, coinvestment, investor-employee, and board-peer edges (per-type tables + nullable intro_count_* columns on the summary). Reuse the kernel logic from Contacts::ConnectionPaths::Finder for per-write edge generation, parameterized over bridge entity type.
- Alternatives considered
- Live JOIN at query time — rejected. p99 of ~250ms for typical inputs blows the 50ms list-view budget; degrades non-linearly with collection size.
- Apache AGE (Postgres openCypher extension) — rejected. No measurable performance gain at depth 2; immature ecosystem; adds extension version risk; doubles query-language surface area.
- Neo4j or other dedicated graph DB — rejected. Cross-DB ETL pipeline; weaker at GROUP BY aggregations; network hop alone exceeds the latency budget; significant operational and licensing cost.
- Postgres recursive CTEs only (no precompute) — partial accept. Useful for drill-in fallback queries; insufficient for list-view because it doesn't avoid the per-render fan-out cost.
- Consequences
- + Zero new infrastructure; ships within existing engineering capacity.
- + List-view p99 ≈ 30ms (v1 coemployment); ~30ms (v2 all 5 types — hits the same summary table).
- + Drill-in p99 ≈ 80ms (v1); ~150ms (v2 with UNION ALL across 5 edge tables).
- + Observable, recoverable, and migratable using existing Rails tooling.
- + Schema is forward-compatible with multi-edge-type expansion (v2): only nullable summary columns and per-type tables added; hot path unchanged.
- − Write amplification: every contact_work_experience / contact_education / investor-data change enqueues a recompute job. Bounded but real.
- − Storage growth: ~5 MB per collection (v1, coemployment only) → ~15–25 MB (v2, all 5 edge types). Trivial in absolute terms, linear in collections × contacts. Revisit if collections exceed ~200k contacts.
- − Recursive CTE drill-in degrades if traversal depth ever needs to exceed 2 — including 3+ hop chains across mixed edge types.
- − Investor and board edges depend on an external data source not yet committed. Storage is ready; data-acquisition isn't.
- Reversibility
- High. The precomputed tables are a cache layer over the existing schema. If a graph DB is later justified, we add it as a parallel store and migrate query paths feature-by-feature. The kernel logic and per-type table boundaries are both portable.
- Revisit triggers
- See section 12 (general triggers — 3+ hop traversal, dense first-class social edges, centrality features, interactive shortest-path) and section 11.6 (multi-edge-type triggers — mixed-kind 3-hop chains, "any path" with up to N hops, weighted shortest-path scoring).
16. Open questions
- Edge cardinality validation. The 100k edges/collection estimate is back-of-envelope. Run a query against a real top-quartile customer collection to confirm before locking the table size.
- Reconciliation cadence. Nightly is the proposed default for the catch-up reconciler. Is daily acceptable, or do some customer SLAs require hourly?
- v1 vs v2 scope for edge types. Spec 007 v1 ships coemployment only. When are coeducation, coinvestment, investor-employee, and board-peer edges expected? The summary schema needs intro_count_* columns from day one if v2 lands within ~6 months; otherwise we eat unnecessary migration churn.
- Investor data source. Findem (existing pipeline) vs direct Pitchbook/Crunchbase vs manual entry. This is a vendor / business question, not a technical one, but it gates the v2 schema's investor and board-peer tables. See section 11.5.
- Cross-collection drill-in. A future request might be "show me everyone in any of my collections who can intro me to Org X." Is this in scope for v2, or strictly out of scope? Affects whether collection_id belongs in the primary index.
- Strength-tuple weights across edge types. The 007 spec audit recommends using the existing SIGNAL_WEIGHT from finder.rb:14–19 for coemployment. v2 needs per-edge-type weights (e.g., is "co-investor" stronger than "coemployment"?). Confirm with product before locking the summary table's strength_* columns.
- Time-bound semantics for non-work edges. Coemployment requires positive date overlap. Co-investment can use "both currently invested" or "either historical or current"; different rules give different edge counts. Pick one per edge type before kernel code lands.