Graph DB vs Postgres for connection paths
Drafted 2026-04-27 · Updated 2026-04-28 · Status: review · Audience: Getro engineering team
Scope: spec 007 (network connections on company lists) primarily; 008 implications noted.
How this doc fits
This document explains the storage substrate decision underpinning the unified architecture in spec-technical.html. The canonical home of ADR-007-A is spec-technical.html §12; the long-form analysis here drives that record. Table names below match the unified data model in spec-technical.html §3.1.
1. TL;DR
Recommendation
Stay with Postgres tables maintained by Sidekiq. The 007 problem is bounded (2-hop, fixed pattern, aggregation-heavy) and Postgres handles it natively at the scales we expect. A graph DB or graph extension would add operational complexity without solving a problem we actually have. Revisit only if 3+ hop traversal, dense first-class social edges, or community-detection features land on the roadmap.
The rest of this document explains why — with concrete numbers, side-by-side queries, and a decision record. Skim sections 6–9 if you only have five minutes.
2. The actual question
Spec 007 has two query patterns, and they are not equally hot:
| Pattern | What it asks | Trigger | Latency budget |
|---|---|---|---|
| List view hot path | For each org on this list, count direct connections + intro paths and surface a strength signal. | Page render — fires every list visit | < 50ms p99 (rendered with row) |
| Drill-in cold path | Show every direct contact and every intro path into one specific org. | User click on a row | < 2s p99 (modal-style) |
The list view is an aggregation problem keyed on (collection_id, organization_id). The drill-in is a fixed 2-hop traversal with time-windowed bridges. Neither is multi-hop, neither requires shortest-path, neither needs centrality or community detection.
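Concretely, the list-view hot path reduces to one indexed read per render. A minimal sketch, assuming the shared_list_network_summary rollup from section 11.3 is already populated:

```sql
-- List-view hot path (sketch): a single indexed read against the
-- precomputed rollup; no traversal happens at render time.
SELECT organization_id,
       direct_count,
       intro_count_total,
       strength_score
FROM shared_list_network_summary
WHERE shared_list_id = $1        -- the list being rendered
ORDER BY strength_score DESC;
```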
3. Query shape (visual)
The "graph" in 007 has fixed depth and fixed pattern. Two relationship types, one bridge:
```mermaid
flowchart LR
    ViaC["Via-contact<br/>shared in collection"]:::person
    TgtC["Target-contact<br/>currently at TargetOrg"]:::person
    TargetOrg["Target Org<br/>on the list"]:::org
    BridgeOrg["Bridge Org<br/>shared past employer"]:::org
    ViaUser["Team user<br/>who knows via-contact"]:::person
    ViaUser -- "UCC.shared" --> ViaC
    ViaC -- "WORKED_AT (date_from, date_to)" --> BridgeOrg
    TgtC -- "WORKED_AT (date_from, date_to)" --> BridgeOrg
    TgtC -- "CURRENT_AT" --> TargetOrg
    linkStyle 0 stroke:#0969da,stroke-width:2px
    linkStyle 1 stroke:#0969da,stroke-width:2px
    linkStyle 2 stroke:#0969da,stroke-width:2px
    linkStyle 3 stroke:#15803d,stroke-width:2px
```
What's actually being computed
- Direct connection: ViaC has CURRENT_AT TargetOrg directly (1 hop from contact set into org).
- Intro path: ViaC and TgtC share a BridgeOrg, with positive time overlap; TgtC has CURRENT_AT TargetOrg (2 hops).
- Bridge condition: same organization_id AND date ranges intersect. This is a property filter on the WORKED_AT edges, not a path-finding question (see the SQL sketch below).
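Expressed in SQL, the bridge condition is an ordinary join predicate. A minimal sketch over contact_work_experiences, assuming date_to is populated for past roles:

```sql
-- Coemployment bridge (sketch): two work-history rows share an employer
-- and their date ranges intersect (standard interval-overlap test).
SELECT a.contact_id                        AS via_contact_id,
       b.contact_id                        AS other_contact_id,
       a.organization_id                   AS bridge_org,
       GREATEST(a.date_from, b.date_from)  AS overlap_from,
       LEAST(a.date_to, b.date_to)         AS overlap_to
FROM contact_work_experiences a
JOIN contact_work_experiences b
  ON  b.organization_id = a.organization_id
  AND b.contact_id     <> a.contact_id
  AND a.date_from <= b.date_to             -- positive time overlap
  AND b.date_from <= a.date_to;
```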
4. Why it looks like a graph problem
- The spec uses the word "path" everywhere.
- The data has people, organizations, and relationships — the textbook graph definition.
- A whiteboard sketch of the answer is a literal graph: nodes for contacts/orgs/users, edges for UCC and work-history.
- Cypher / openCypher reads more naturally than SQL for "find paths."
All of these are real, but they don't establish that a graph engine is the right runtime. They establish that the conceptual model is graph-shaped. Conceptual model and storage substrate are independent decisions.
5. Why it isn't deeply graph-shaped
Three structural reasons.
5.1 Depth is fixed at 2
007 has exactly one query pattern with two hops; there is no recursion, no transitive closure, no shortest-path. Graph databases earn their keep at depth ≥ 3, where SQL self-joins or recursive CTEs start to degrade. At depth 2, a well-indexed Postgres join is actually faster than most graph engines because it doesn't pay the per-edge object-graph traversal cost.
5.2 Aggregation, not traversal, is the hot path
The list view is "count direct + count intro per org, sort, render." That's a GROUP BY. Graph databases are bad at GROUP BY across thousands of rows — they're optimized for "find pattern X in the graph," not "summarize all matches per key." Postgres planners are very mature at this.
5.3 The interesting edges don't physically exist
The coemployment edge ("Mike worked alongside Sarah at Stripe 2020-2022") is not stored anywhere as a record. It's inferred from a JOIN of two contact_work_experiences rows where organization_id matches and date ranges overlap. So no matter what storage we pick — Postgres tables, Neo4j, Apache AGE — we face the same compute problem:
- Precompute and materialize the inferred edges, OR
- Compute at query time via a JOIN.
A graph DB doesn't change that fundamental choice. It changes the syntax; the compute stays the same.
6. Options evaluated
| Option | Pros | Cons (for our shape) | Verdict |
|---|---|---|---|
| Postgres tables (current plan): ContactCoemploymentEdge + summary rollup, maintained by Sidekiq | No new infrastructure; fastest reads at depth 2 (pure index hits); existing backup, DR, and test tooling | Write amplification on work-history changes; precompute code to own and reconcile | Recommended |
| Neo4j (separate graph DB) | Natural Cypher path syntax; wins at 3+ hop traversal | New runtime plus an ETL sync layer; weaker at GROUP BY aggregation; the cross-DB network hop alone can consume the 50ms list-view budget; licensing cost | Not worth it |
| TigerGraph / JanusGraph | Built for massive distributed graphs | Same cons as Neo4j with heavier operations, for a workload far below their sweet spot | Not worth it |
| Apache AGE (Postgres extension, openCypher) | Cypher queries without a second database; same transactions and backups | No performance gain at depth 2; immature ecosystem; extension version risk; doubles the query-language surface | Risk without reward |
| Postgres recursive CTEs (no extension) | No precompute, no new infrastructure | Doesn't avoid the per-render fan-out cost; degrades beyond depth 2 | Drill-in fallback |
| pgRouting | — | Built for road networks (geographic distance). Wrong tool. | Not a fit |
7. Performance comparison
Estimated p99 read latency for the two query patterns. Numbers are order-of-magnitude based on workload shape, not benchmarks; treat as rough guidance.
7.1 List view (50 orgs on the list, 10k shared contacts in collection)
| Approach | Estimated p99 |
|---|---|
| Postgres, precomputed summary table | ~30 ms |
| Postgres, live JOIN at query time | ~250 ms |
| Neo4j or other separate graph DB | engine time plus a 30–100 ms network round trip |
7.2 Drill-in (one org, enumerate all paths)
| Approach | Estimated p99 |
|---|---|
| Postgres, precomputed edges | ~80 ms |
| Postgres, recursive CTE (no precompute) | comfortably inside the 2 s budget at depth 2 |
| Neo4j or other separate graph DB | comparable engine time plus the same network round trip |
Two takeaways:
- Precomputed Postgres wins both query patterns at this depth. The win comes from indexes, not from the query language.
- Network hop matters. Any separate-DB option pays a 30–100ms tax just for the round trip. At a 50ms budget, that's the entire budget.
8. Operational cost
What it costs to run, beyond the latency story.
| Cost dimension | Postgres tables | Apache AGE | Neo4j (separate DB) |
|---|---|---|---|
| New runtime in production | No | Extension | Yes |
| Backups / DR / HA | Existing | Existing PG | New plan needed |
| Sync layer (cross-store) | None | None (same DB) | ETL pipeline + lag monitoring |
| Query engines for engineers to learn | SQL only | SQL + Cypher | SQL + Cypher |
| Test fixtures + factories | Existing FactoryBot | Existing FactoryBot + AGE seed | FactoryBot + Neo4j seed + sync mocks |
| Migration rollback story | Standard Rails | Extension version pin | Cross-DB consistency |
| Estimated added engineering weeks (one-time) | 0 | 2–3 | 6–10 |
| Estimated added on-call burden | 0 | Low | Medium-high |
The hidden cost: doubled mental model
Even if a graph DB is "fine" operationally, every engineer touching connection paths now has to reason in two query languages, two transaction models, two failure modes. That cost compounds for every feature in the area, not just the first one.
9. SQL vs Cypher (the same query)
The drill-in question: "list every direct contact and every intro path into a single organization, scoped to one collection."
Postgres (precomputed, current plan)
```sql
-- Direct contacts
SELECT contact_id, current_title, strongest_source
FROM collection_org_current_shared_contacts
WHERE collection_id = $1
  AND organization_id = $2;

-- Intro paths
SELECT via_contact_id,
       other_contact_id,
       organization_id AS bridge_org,
       overlap_from,
       overlap_to,
       overlap_months
FROM contact_coemployment_edges
WHERE collection_id = $1
  AND other_contact_id IN (
    SELECT contact_id
    FROM collection_org_current_shared_contacts
    WHERE collection_id = $1
      AND organization_id = $2
  )
ORDER BY overlap_to DESC;
```
Cypher (Neo4j or AGE, hypothetical)
```cypher
// Direct contacts
MATCH (org:Org {id: $org_id})
      <-[:CURRENT_AT]-(direct:Contact)
      <-[:UCC_SHARED {collection: $cid}]-(u:User)
RETURN direct, u, 'direct' AS type;

// Intro paths
MATCH (org:Org {id: $org_id})
      <-[:CURRENT_AT]-(target:Contact)
      -[w1:WORKED_AT]->(bridge:Org)
      <-[w2:WORKED_AT]-(via:Contact)
      <-[:UCC_SHARED {collection: $cid}]-(u:User)
WHERE w1.date_from <= w2.date_to
  AND w2.date_from <= w1.date_to
RETURN via, target, bridge, w1, w2,
       'intro' AS type
ORDER BY w1.date_to DESC;
```
Observation: Cypher reads more like a sentence; SQL is more verbose. That's it. Cypher's edge-walking syntax doesn't reduce the work — both queries do the same JOINs, evaluate the same time-overlap predicate, and produce the same rows. At depth 2 with proper indexes, the planners run both in similar time. The Postgres version is the one we already know how to operate.
Where Cypher would actually pull ahead
If we ever needed (via)-[:WORKED_AT]->(:Org)<-[:WORKED_AT]-(:Contact)-[:WORKED_AT]->(:Org)<-[:WORKED_AT]-(target) — i.e., 4-hop "friend of friend at same company" — Cypher's syntax would compress drastically and the engine's traversal optimizer would matter. We don't need that today.
10. Scale walkthrough
Concrete sizing for a collection with 10,000 shared contacts (the upper end of what we expect for an active VC fund or accelerator network).
10.1 Edge cardinality
- 10,000 contacts × 5 past employers each = 50,000 work-history rows.
- Per past employer, average ~20 coemployees (people who happened to work there at any time). Most coemployees are not currently at any org we care about.
- After filtering to "the other side is currently at an org on at least one of our lists": ~5–20 useful edges per contact.
- Total edges per collection: ~100,000 rows in ContactCoemploymentEdge.
10.2 Storage
~50 bytes/row × 100,000 rows = ~5 MB per collection. Trivial. Indexes add ~3× — still under 20 MB.
10.3 Read at page-load
Index on (collection_id, organization_id); the list-view query returns 50 rows (one per org on the list). ~30ms p99 on standard Postgres hardware.
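For reference, the supporting indexes might look like this; index names are illustrative:

```sql
-- Covering indexes for the two hot reads (sketch; names illustrative).
CREATE INDEX idx_shared_contacts_by_collection_org
    ON collection_org_current_shared_contacts (collection_id, organization_id);
CREATE INDEX idx_coemployment_by_collection_contact
    ON contact_coemployment_edges (collection_id, other_contact_id);
```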
10.4 Write amplification
When a contact_work_experience row is added or updated:
- Find collections this contact is shared in (small, indexed lookup).
- For each collection, find candidate coemployees from contact_work_experiences by matching organization + time. Bounded query.
- Upsert ~10–100 edge rows. Cheap.
Per write event: tens to a few hundred milliseconds in a Sidekiq worker. Doesn't block user-facing requests.
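A sketch of the edge upsert the worker might issue for one (contact, collection) pair; it assumes a unique index on the edge identity columns:

```sql
-- Recompute edges for the contact whose work history changed (sketch).
INSERT INTO contact_coemployment_edges
    (collection_id, via_contact_id, other_contact_id,
     organization_id, overlap_from, overlap_to)
SELECT $1,                                   -- collection from step 1's lookup
       a.contact_id, b.contact_id, a.organization_id,
       GREATEST(a.date_from, b.date_from),
       LEAST(a.date_to, b.date_to)
FROM contact_work_experiences a
JOIN contact_work_experiences b
  ON  b.organization_id = a.organization_id
  AND b.contact_id     <> a.contact_id
  AND a.date_from <= b.date_to
  AND b.date_from <= a.date_to
WHERE a.contact_id = $2                      -- the changed contact
ON CONFLICT (collection_id, via_contact_id, other_contact_id, organization_id)
DO UPDATE SET overlap_from = EXCLUDED.overlap_from,
              overlap_to   = EXCLUDED.overlap_to;
```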
11. Multiple edge types (education, investors, board peers)
Spec 007 v1 ships coemployment only, but the broader product intent treats education, co-investment, investor → employee, and board peer as valid intro paths too. Adding these is the most likely near-term schema change. This section answers: does adding them change the storage decision?
Short answer
No. All five path types are still 2-hop with a fixed bridge entity, so the graph-vs-Postgres tradeoff is unchanged. What does change is the schema design — we need per-type tables and a unified summary, designed for extensibility from day one.
11.1 The five path types side by side
All five share the same shape: via_contact → bridge_entity ← target_contact. Only the bridge entity and the time-bound rule change.
| Path type | Bridge entity | Match rule | Source data | Status in Getro |
|---|---|---|---|---|
| Coemployment | Organization | same organization_id + positive date overlap (≥1 month, gap-tolerant) | contact_work_experiences | Existing — covered by 007 v1 |
| Coeducation | School | same school + positive year overlap | contact_education | Existing schema — kernel work needed |
| Co-investment | Company invested in | both invested in the same company (no time predicate, or simple "still active") | External: Pitchbook / Crunchbase / Findem cap-table | New ingest pipeline |
| Investor → Employee | Company | A invested in C; B currently works at C | Cross-reference of investor data + work history | Depends on co-investment |
| Board peer | Company | both on the same company's board (overlap optional) | Same source as investor data | Depends on co-investment |
11.2 Visual: same shape, different bridge
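All five types reduce to the section 3 diagram with a generic bridge node; only the bridge entity and the time rule from the table above change:

```mermaid
flowchart LR
    Via["Via-contact"]:::person -- "edge kind varies" --> Bridge["Bridge entity<br/>(Organization / School / Company)"]
    Tgt["Target-contact"]:::person -- "edge kind varies" --> Bridge
    Tgt -- "CURRENT_AT" --> TargetOrg["Target Org on the list"]:::org
```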
11.3 Schema strategy: per-type tables + unified summary
Two viable storage shapes. We recommend Option B.
Option A — Polymorphic edge table
```sql
contact_relationship_edges (
  collection_id      bigint,
  kind               enum,
  via_contact_id     bigint,
  other_contact_id   bigint,
  bridge_entity_type enum,
  bridge_entity_id   bigint,
  bridge_metadata    jsonb,  -- date ranges
  ...
)
```
Pros: single table, single query path.
Cons: coarse indexes; jsonb for time bounds is awkward; per-kind reconciliation harder.
Option B — Per-type tables, unified summary Recommended
```sql
contact_coemployment_edges
contact_coeducation_edges
contact_coinvestment_edges
contact_investor_employee_edges
contact_board_peer_edges

-- Plus the rollup the list view reads:
shared_list_network_summary (
  shared_list_id       bigint,
  organization_id      bigint,
  direct_count         int,
  intro_count_work     int,
  intro_count_edu      int,
  intro_count_investor int,
  intro_count_board    int,
  intro_count_total    int,
  strength_score       numeric,
  top_intro_preview    jsonb,
  ...
)
```
Pros: native types, narrow indexes, per-kind reconciliation jobs, isolated migrations.
Cons: drill-in needs UNION ALL across N tables (still cheap at depth 2).
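A minimal sketch of that UNION ALL, shown for three of the five tables; $2 stands for the set of contacts currently at the target org, taken from the shared-contacts lookup:

```sql
-- Drill-in across per-type edge tables (sketch): N small indexed reads,
-- each identical in shape to the coemployment query in section 9.
SELECT 'work' AS kind, via_contact_id, other_contact_id
FROM contact_coemployment_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2)
UNION ALL
SELECT 'edu', via_contact_id, other_contact_id
FROM contact_coeducation_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2)
UNION ALL
SELECT 'investor', via_contact_id, other_contact_id
FROM contact_investor_employee_edges
WHERE collection_id = $1 AND other_contact_id = ANY($2);
-- ...and likewise for the co-investment and board-peer tables.
```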
11.4 How this changes the math
| Dimension | Coemployment only | All 5 edge types | Delta |
|---|---|---|---|
| Storage per collection | ~5 MB | ~15–25 MB | 3–5×, still trivial |
| List-view read latency (p99) | ~30 ms | ~30 ms | Unchanged (hits summary table) |
| Drill-in latency (p99) | ~80 ms | ~150 ms | +70 ms (5 small UNION ALL) |
| Write amplification triggers | work-exp change | work-exp / education / investment / board change | More triggers, each still cheap |
| Cypher syntactic appeal | Mild | Stronger (multi-edge pattern is cleaner) | Increases — but no runtime gain at depth 2 |
| Operational cost of a graph DB | High and unjustified | High and still unjustified | Unchanged |
11.5 The non-storage risk: investor data acquisition
Coemployment and coeducation derive from data Getro already owns. Investor and board edges require a data deal or vendor pipeline:
| Source option | Pros | Cons |
|---|---|---|
| Findem enrichment (existing investor_provider, pitchbook_provider in firstcut) | Already a sync partner; no new vendor; surfaces via existing Findem→Getro pipeline | Coverage and freshness depend on Findem's vendor relationships; not Getro-controlled |
| Direct Pitchbook / Crunchbase license | Direct control over coverage, refresh, schema | New vendor relationship; cost; ingest pipeline; ongoing licensing |
| Manual entry / customer-supplied | Zero external dependency | Doesn't scale; coverage gaps obvious to users |
This is the v1 vs v2 scoping question — it isn't a storage question. Whichever DB substrate we pick, the data has to come from somewhere.
11.6 Where multiple edge types could shift the storage decision
None of the five path types alone changes the answer. Combinations or extensions do:
| Trigger | Why it changes the math |
|---|---|
| Drill-in expands to 3-hop mixed-type chains ("A invested in C, B works at C, B went to school with target") | Recursive CTE complexity grows with branching factor across heterogeneous edges; Cypher's pattern syntax wins. |
| "Show me any path" feature with up to N hops across any of the 5 edge types | Postgres falls behind; this is a graph DB sweet spot. |
| Strength scoring across mixed edge types with weighted shortest-path | Graph DBs ship Dijkstra/A* as built-ins; Postgres requires custom CTE work. |
For 007 v1 (coemployment) and even 007 v2 (all 5 types, still 2-hop), Postgres remains the right substrate. Revisit only if a 3+ hop mixed-type product feature is on the explicit roadmap.
Schema-design recommendation
Even if 007 v1 ships with coemployment only, design SharedListNetworkSummary today with the intro_count_edu, intro_count_investor, intro_count_board columns nullable. Migration cost is paid once; v2 then ships per-type-table additions without touching the hot read path.
12. When to revisit
Reconsider a graph layer if any of these become true:
| Trigger | Why it changes the math |
|---|---|
| 3+ hop traversal becomes a real product feature ("warm intros via mutual coworker of a coworker") | Recursive CTEs degrade with branching factor; graph engines start to win. |
| First-class social edges (LinkedIn 1st-degree, email-thread participants) become dense and primary | Edge volume per collection grows 10×–100×; in-place edges become heterogeneous; Cypher's pattern match is genuinely simpler. |
| Centrality / community detection lands on the roadmap ("super-connectors", network clusters) | These are graph-algorithm problems. SQL can't express them succinctly; graph DBs ship them as built-ins. |
| Path enumeration becomes interactive ("show me the cheapest 3 intro paths to anyone at Org X") | Shortest-path with weighted edges is a graph DB sweet spot. |
| Cross-collection or cross-network traversal becomes a feature | Postgres scoping by collection_id falls apart; graph DB modeling is more natural. |
None of these are in 007 or 008 today. If any land in a future quarter, the right move is to keep Postgres as source of truth and introduce a graph layer for the specific feature — not retrofit the entire stack.
13. What about 008?
Spec 008 (relationship strength) is not a graph problem at all. The hot computation is per-pair signal aggregation:
- For each (team_member, contact) pair: counters for emails sent/received in 12mo, meeting count in 24mo, last-contact recency, reply rate.
- Compare counters against tier predicates (Warm / Known / Cold rules).
- Roll up tiers per (contact, collection) for Reachability.
That's event-stream ingestion + counter rollup + threshold check. A graph DB doesn't help. You'd have to either store the interaction stats outside the graph DB (now you have two systems) or contort the data into edges-with-counters that no native graph algorithm benefits from. Adding a graph DB makes 008 worse, not better.
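For contrast, the 008 hot computation in plain SQL. This is a sketch: interaction_events and its columns are illustrative, not an existing table.

```sql
-- Per-pair counter rollup (sketch): aggregation over an event stream,
-- then tier predicates over the counters. No traversal anywhere.
SELECT team_member_id,
       contact_id,
       COUNT(*) FILTER (WHERE kind = 'email'
                          AND occurred_at > now() - interval '12 months') AS emails_12mo,
       COUNT(*) FILTER (WHERE kind = 'meeting'
                          AND occurred_at > now() - interval '24 months') AS meetings_24mo,
       MAX(occurred_at) AS last_contact_at
FROM interaction_events
GROUP BY team_member_id, contact_id;
```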
14. Recommendation
- Build 007 v1 storage in Postgres tables, maintained by Sidekiq workers using Rails after_commit hooks. Start with ContactCoemploymentEdge, CollectionOrgCurrentSharedContact, and SharedListNetworkSummary.
- Design the schema for multiple edge types from day one. Keep one table per edge kind (coemployment, coeducation, coinvestment, investor-employee, board-peer) and one unified SharedListNetworkSummary with intro_count_* columns nullable per kind. v1 ships with only the coemployment column populated; v2+ lights up the rest without touching the hot read path.
- Use the existing kernel logic from Contacts::ConnectionPaths::Finder as the per-write computation core (same bridge + time-overlap rule, same signal weighting). Reuse it across edge types; only the bridge entity and time predicate change.
- Use Postgres recursive CTEs as a fallback for any drill-in question that needs traversal beyond what the precomputed edges cover. No extension needed (see the sketch after this list).
- Skip Apache AGE and dedicated graph DBs for now. Park them as "revisit if X happens"; see section 12, plus the multi-edge-type triggers in section 11.6.
- Resolve the investor-data source question (Findem vs direct Pitchbook/Crunchbase vs manual) before locking 007 v2 scope. This is a data-acquisition question, not a storage question, but it gates the v2 schema's investor and board-peer tables.
- Do not introduce graph storage for 008 under any scenario. It's the wrong tool for that workload.
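The recursive CTE fallback from the list above, sketched as a hypothetical 3-hop walk over the precomputed coemployment edges:

```sql
-- Depth-limited traversal without any extension (sketch).
WITH RECURSIVE hops AS (
    SELECT via_contact_id, other_contact_id, 1 AS depth
    FROM contact_coemployment_edges
    WHERE collection_id = $1 AND via_contact_id = $2
  UNION ALL
    SELECT h.via_contact_id, e.other_contact_id, h.depth + 1
    FROM hops h
    JOIN contact_coemployment_edges e
      ON  e.collection_id  = $1
      AND e.via_contact_id = h.other_contact_id
    WHERE h.depth < 3                 -- hard depth cap
)
SELECT * FROM hops;
```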
Bottom line
What we're really building isn't "a graph database" — it's a denormalized index optimized for the specific query patterns 007 needs. Whether we call it "materialized edges in Postgres," "a graph store," or "a precomputed cache" is implementation detail. For 2-hop with aggregation, Postgres tables with the right indexes win. That's the answer for this particular shape — not a general statement about graph databases.
15. Decision record
- Context
- Spec 007 surfaces network connections (direct + work-overlap intro) on company lists. Findem's live nested-loop approach does not scale to Getro's list-view trigger (50 orgs × 10k contacts × employees = ~1.25B cell comparisons per render). We need a storage strategy that supports <50ms p99 list reads and <2s drill-in.
- Decision
- Use Postgres tables maintained asynchronously by Sidekiq workers via Rails after_commit hooks, with nightly reconciliation jobs as a backstop. v1 ships ContactCoemploymentEdge, CollectionOrgCurrentSharedContact, and SharedListNetworkSummary. The schema is designed for extension to coeducation, coinvestment, investor-employee, and board-peer edges (per-type tables + nullable intro_count_* columns on the summary). Reuse the kernel logic from Contacts::ConnectionPaths::Finder for per-write edge generation, parameterized over bridge entity type.
- Alternatives considered
- Live JOIN at query time — rejected. p99 of ~250ms for typical inputs blows the 50ms list-view budget; degrades non-linearly with collection size.
- Apache AGE (Postgres openCypher extension) — rejected. No measurable performance gain at depth 2; immature ecosystem; adds extension version risk; doubles query-language surface area.
- Neo4j or other dedicated graph DB — rejected. Cross-DB ETL pipeline; weaker at GROUP BY aggregations; network hop alone exceeds the latency budget; significant operational and licensing cost.
- Postgres recursive CTEs only (no precompute) — partial accept. Useful for drill-in fallback queries; insufficient for list-view because it doesn't avoid the per-render fan-out cost.
- Consequences
- + Zero new infrastructure; ships within existing engineering capacity.
- + List-view p99 ≈ 30ms (v1 coemployment); ~30ms (v2 all 5 types — hits the same summary table).
- + Drill-in p99 ≈ 80ms (v1); ~150ms (v2 with UNION ALL across 5 edge tables).
- + Observable, recoverable, and migratable using existing Rails tooling.
- + Schema is forward-compatible with multi-edge-type expansion (v2): only nullable summary columns and per-type tables added; hot path unchanged.
- − Write amplification: every contact_work_experience / contact_education / investor-data change enqueues a recompute job. Bounded but real.
- − Storage growth: ~5 MB per collection (v1, coemployment only) → ~15–25 MB (v2, all 5 edge types). Trivial in absolute terms, linear in collections × contacts. Revisit if collections exceed ~200k contacts.
- − Recursive CTE drill-in degrades if traversal depth ever needs to exceed 2 — including 3+ hop chains across mixed edge types.
- − Investor and board edges depend on an external data source not yet committed. Storage is ready; data-acquisition isn't.
- Reversibility
- High. The precomputed tables are a cache layer over the existing schema. If a graph DB is later justified, we add it as a parallel store and migrate query paths feature-by-feature. The kernel logic and per-type table boundaries are both portable.
- Revisit triggers
- See section 12 (general triggers — 3+ hop traversal, dense first-class social edges, centrality features, interactive shortest-path) and section 11.6 (multi-edge-type triggers — mixed-kind 3-hop chains, "any path" with up to N hops, weighted shortest-path scoring).
16. Open questions
- Edge cardinality validation. The 100k edges/collection estimate is back-of-envelope. Run a query against a real top-quartile customer collection to confirm before locking the table size.
- Reconciliation cadence. Nightly is the proposed default for the catch-up reconciler. Is daily acceptable, or do some customer SLAs require hourly?
- v1 vs v2 scope for edge types. Spec 007 v1 ships coemployment only. When are coeducation, coinvestment, investor-employee, and board-peer edges expected? The summary schema needs intro_count_* columns from day one if v2 lands within ~6 months; otherwise we eat unnecessary migration churn.
- Investor data source. Findem (existing pipeline) vs direct Pitchbook/Crunchbase vs manual entry. This is a vendor / business question, not a technical one, but it gates the v2 schema's investor and board-peer tables. See section 11.5.
- Cross-collection drill-in. A future request might be "show me everyone in any of my collections who can intro me to Org X." Is this in scope for v2, or strictly out of scope? Affects whether collection_id belongs in the primary index.
- Strength-tuple weights across edge types. The 007 spec audit recommends using the existing SIGNAL_WEIGHT from finder.rb:14–19 for coemployment. v2 needs per-edge-type weights (e.g., is "co-investor" stronger than "coemployment"?). Confirm with product before locking the summary table's strength_* columns.
- Time-bound semantics for non-work edges. Coemployment requires positive date overlap. Co-investment can use "both currently invested" or "either historical or current"; different rules give different edge counts. Pick one per edge type before kernel code lands.