GET-109 · Findem code findings

What Findem already does, with file references

A consolidated audit of the relevant systems in the Findem codebase (firstcut + app-next), with direct links to source. Used to inform 007 and 008 architecture decisions.

Drafted 2026-04-28 · Status: review · Audience: Getro engineering team

0. About this document

Repo access required. firstcut lives at github.com/findemdev/firstcut (private); app-next at bitbucket.org/rimsaw/app-next (private). All file links below assume reviewer access. External readers will get 404s — file paths are still valid for anyone who pulls the repos locally.

This document captures the reverse-engineering done on Findem's existing implementation while planning Getro's Network Intelligence System. It is not a Findem doc; it is a Getro engineering reference. Each finding is tagged with implications for our architecture.

Conventions

All file links point to main in both repos (firstcut and app-next). Line numbers are accurate as of 2026-04-28; verify against current main if drift is suspected. Source references are prefixed fc for firstcut and an for app-next.
Each finding has an ID like F-K-1 (Kernel finding 1) or F-S-2 (Search finding 2) for cross-reference.
Reusable means we can copy the algorithm or pattern into Getro. Diverge means the implementation doesn't fit our scale or shape. Open means an unresolved question for Findem.

Cross-references

spec-technical.html — unified architecture for 007 + 008 (the consumer of these findings)
spec-graph-vs-postgres.html — storage substrate ADR (driven partly by these findings)
spec-integrations.html — the email/calendar provider integration plan (different scope)

1. The overlap kernel

The piece worth copying. Findem's match_engine already implements the "did A and B share an employer or school with positive time overlap" check, with scoring. Both 007 (intro paths) and 008 (relationship-strength W3/K6) need exactly this.

F-K-1 · computeOverlapScores: the entry point · Reusable algorithm

Takes two profile populations (profiles × target_profiles) and returns overlap scores per pair. Self-exclusion via LinkedIn ID; duplicate-profile guard via candidate_tags.

Source: fc backend/query_svc/engine/match_engine.ts:626
Implication: Port the loop shape into WorkOverlapCalculator + CoemploymentEdgeMaintainer. The two-loop structure (build maps, then probe) is portable to Postgres SQL: index contact_work_experiences on (organization_id) for the inner probe.
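
The two-loop shape (build maps, then probe) can be sketched as below. This is an illustration under stated assumptions, not Findem's code: the Stint type, buildCompanyIndex, probeOverlaps, and all field names are hypothetical, matching is reduced to a single company key, and time is tracked in whole years.

```typescript
// Hypothetical shapes for illustration; not Findem's actual types.
interface Stint {
  profileId: string;
  companyId: string;
  startYear: number;
  endYear: number;
}

// Loop 1: index the target population by company.
function buildCompanyIndex(targets: Stint[]): Record<string, Stint[]> {
  const index: Record<string, Stint[]> = Object.create(null);
  for (const stint of targets) {
    if (!index[stint.companyId]) index[stint.companyId] = [];
    index[stint.companyId].push(stint);
  }
  return index;
}

// Loop 2: probe the index and keep pairs with positive time overlap.
function probeOverlaps(
  profiles: Stint[],
  index: Record<string, Stint[]>
): Array<{ a: string; b: string; overlapYears: number }> {
  const pairs: Array<{ a: string; b: string; overlapYears: number }> = [];
  for (const stint of profiles) {
    const candidates = index[stint.companyId] || [];
    for (const other of candidates) {
      if (other.profileId === stint.profileId) continue; // self-exclusion
      const overlapYears =
        Math.min(stint.endYear, other.endYear) -
        Math.max(stint.startYear, other.startYear);
      if (overlapYears > 0) {
        pairs.push({ a: stint.profileId, b: other.profileId, overlapYears });
      }
    }
  }
  return pairs;
}

const sampleNetwork: Stint[] = [
  { profileId: "n1", companyId: "acme", startYear: 2015, endYear: 2019 },
  { profileId: "n2", companyId: "acme", startYear: 2020, endYear: 2022 },
];
const sampleIcp: Stint[] = [
  { profileId: "p1", companyId: "acme", startYear: 2017, endYear: 2021 },
];
// p1 overlaps n1 by 2 years and n2 by 1 year.
const overlapPairs = probeOverlaps(sampleIcp, buildCompanyIndex(sampleNetwork));
```

In Postgres the index built in loop 1 corresponds to the (organization_id) index mentioned above, and loop 2 becomes a join against it.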

F-K-2 · Work overlap scoring formula · Reusable formula

Scoring: 1 (company match) + (1 + overlap_fraction) (timeline) + dept/title fraction. Match requires comp_lnkd_id OR comp_id OR comp_name equality (three-way fallback).

Source: fc backend/query_svc/engine/match_engine.ts:759 (computeExpOverlapInfo)
Implication: Adopt this formula for 007's strength tuple (FR-006a). Same shape carries to SharedListNetworkSummary.strength_score.
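
A minimal sketch of the formula follows, to make the terms concrete. The field names comp_lnkd_id, comp_id, and comp_name come from the finding; everything else is an assumption. In particular, this sketch assumes overlap_fraction is shared months divided by the shorter stint's months, that a pair with no positive time overlap scores 0, and that the dept/title fraction arrives as a precomputed number in [0, 1]. Verify all three against computeExpOverlapInfo before porting.

```typescript
// comp_* field names are from the finding; the rest is hypothetical.
interface WorkStint {
  comp_lnkd_id?: string;
  comp_id?: string;
  comp_name?: string;
  startMonth: number; // months since an arbitrary epoch
  endMonth: number;
}

// Three-way fallback: any one equal, non-empty identifier counts as a match.
function sameCompany(a: WorkStint, b: WorkStint): boolean {
  const eq = (x?: string, y?: string) => x !== undefined && x !== "" && x === y;
  return (
    eq(a.comp_lnkd_id, b.comp_lnkd_id) ||
    eq(a.comp_id, b.comp_id) ||
    eq(a.comp_name, b.comp_name)
  );
}

// ASSUMPTION: overlap_fraction = shared months / shorter stint's months.
function overlapFraction(a: WorkStint, b: WorkStint): number {
  const shared =
    Math.min(a.endMonth, b.endMonth) - Math.max(a.startMonth, b.startMonth);
  const shorter = Math.min(a.endMonth - a.startMonth, b.endMonth - b.startMonth);
  return shared > 0 && shorter > 0 ? shared / shorter : 0;
}

// 1 (company match) + (1 + overlap_fraction) (timeline) + dept/title fraction.
function workOverlapScore(
  a: WorkStint,
  b: WorkStint,
  deptTitleFraction: number
): number {
  if (!sameCompany(a, b)) return 0;
  const fraction = overlapFraction(a, b);
  if (fraction === 0) return 0; // ASSUMPTION: no time overlap means no score
  return 1 + (1 + fraction) + deptTitleFraction;
}

const stintA: WorkStint = { comp_id: "c-42", startMonth: 0, endMonth: 40 };
const stintB: WorkStint = { comp_name: "Acme", comp_id: "c-42", startMonth: 20, endMonth: 60 };
// shared = 20 months, shorter stint = 40 months, fraction = 0.5
const sampleWorkScore = workOverlapScore(stintA, stintB, 0.5); // 1 + 1.5 + 0.5
```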

F-K-3 · Education overlap scoring formula · Reusable formula

Mirror of the work formula: 1 (school match) + (1 + overlap_fraction) (timeline) + degree fraction + major fraction. School matched by inst_name equality (no canonical school ID — known fragility).

Source: fc backend/query_svc/engine/match_engine.ts:820 (computeEduOverlapInfo)
Implication: Adopt for 007 v2 + 008 EducationOverlapCache. Plan: link contact_educations to a canonical schools table to avoid Findem's name-equality fragility.
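
The sketch below mirrors the education formula and shows why plain inst_name equality is fragile. Only inst_name is taken from the finding. The other field names, the year granularity, and the rule that degree and major fractions are 1 on exact match and 0 otherwise are assumptions made for illustration.

```typescript
interface EduStint {
  inst_name: string; // field named in the finding
  startYear: number;
  endYear: number;
  degree?: string;
  major?: string;
}

// 1 (school match) + (1 + overlap_fraction) + degree fraction + major fraction.
// ASSUMPTION: degree/major fractions are 1 on exact match, else 0.
function eduOverlapScore(a: EduStint, b: EduStint): number {
  if (a.inst_name !== b.inst_name) return 0; // plain name equality, no canonical ID
  const shared =
    Math.min(a.endYear, b.endYear) - Math.max(a.startYear, b.startYear);
  const shorter = Math.min(a.endYear - a.startYear, b.endYear - b.startYear);
  if (shared <= 0 || shorter <= 0) return 0;
  const degreeFraction = a.degree !== undefined && a.degree === b.degree ? 1 : 0;
  const majorFraction = a.major !== undefined && a.major === b.major ? 1 : 0;
  return 1 + (1 + shared / shorter) + degreeFraction + majorFraction;
}

const eduAlice: EduStint = { inst_name: "MIT", startYear: 2010, endYear: 2014, degree: "BS", major: "CS" };
const eduBob: EduStint = { inst_name: "MIT", startYear: 2012, endYear: 2016, degree: "BS", major: "Math" };
const eduCarol: EduStint = {
  inst_name: "Massachusetts Institute of Technology",
  startYear: 2012,
  endYear: 2016,
  degree: "BS",
  major: "CS",
};

const aliceBobScore = eduOverlapScore(eduAlice, eduBob); // 1 + 1.5 + 1 + 0
const aliceCarolScore = eduOverlapScore(eduAlice, eduCarol); // same school, different spelling
```

Alice and Carol attended the same school in overlapping years, yet the pair scores 0 because the names differ. Keying on a canonical schools table removes that failure mode.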

F-K-4 · Self-exclusion + duplicate guard · Reusable

Lines 657–664: refuses to compute overlap between a profile and itself (via LinkedIn ID); skips overlap for known-duplicate profiles tagged possibleDuplicate.

Source: fc backend/query_svc/engine/match_engine.ts:657-674
Implication: 007 needs the same: a contact shouldn't be its own intro path. Implement at edge-write time in CoemploymentEdgeMaintainer.
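
As a sketch, the guard reduces to one predicate evaluated before any edge is written. The possibleDuplicate tag and candidate_tags come from the finding; the ProfileRef shape, the linkedinId field name, and the rule that a tag on either side skips the pair are assumptions.

```typescript
interface ProfileRef {
  linkedinId?: string; // hypothetical field name
  candidate_tags: string[];
}

function hasDuplicateTag(p: ProfileRef): boolean {
  return p.candidate_tags.indexOf("possibleDuplicate") !== -1;
}

// True when the pair must not produce an overlap edge.
function shouldSkipPair(a: ProfileRef, b: ProfileRef): boolean {
  // Self-exclusion: same non-empty LinkedIn ID on both sides.
  if (a.linkedinId !== undefined && a.linkedinId !== "" && a.linkedinId === b.linkedinId) {
    return true;
  }
  // ASSUMPTION: a possibleDuplicate tag on either profile skips the pair.
  return hasDuplicateTag(a) || hasDuplicateTag(b);
}

const refSelfA: ProfileRef = { linkedinId: "li-1", candidate_tags: [] };
const refSelfB: ProfileRef = { linkedinId: "li-1", candidate_tags: [] };
const refDup: ProfileRef = { linkedinId: "li-2", candidate_tags: ["possibleDuplicate"] };
const refOther: ProfileRef = { linkedinId: "li-3", candidate_tags: [] };
```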

2. Connection precompute (batch path)

Findem's async batch system that precomputes overlap results per saved-search macro. The architecture has both useful patterns and known scale failures. We borrow the patterns and avoid the failures.

F-C-1 · Master/agent split with mutex-serialized dispatch · Pattern only

Master polls every 5 min, holds a mutex, hands work units to agents. Per-task state in Mongo connection_task. Threshold job recovers stuck tasks.

Sources: fc connection_svc_master.ts:189 (mutex + dispatch)
fc connection_svc_master.ts:224 (resync + create tasks)
fc connection_svc_mongomodels.ts:1 (task ledger)
Implication: The master/agent + task-ledger pattern maps cleanly onto Sidekiq + sidekiq-unique-jobs. Pattern reusable; no need to copy the polling loop.
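
The pattern itself is small. The sketch below is an in-memory stand-in for the task ledger with serialized dispatch and threshold-based recovery. All names (TaskLedger, dispatch, recoverStuck) are hypothetical, the boolean flag only stands in for a real mutex, and in Getro this role would be filled by Sidekiq plus a uniqueness lock, not by this class.

```typescript
type TaskState = "pending" | "running" | "done";

interface LedgerTask {
  id: string;
  state: TaskState;
  startedAt?: number; // ms timestamp set when an agent picks the task up
}

class TaskLedger {
  private tasks: LedgerTask[] = [];
  private dispatching = false; // stands in for the master's mutex

  add(id: string): void {
    this.tasks.push({ id, state: "pending" });
  }

  // Hand every pending task to an agent. A call made while another
  // dispatch is in progress returns 0 instead of double-dispatching.
  dispatch(now: number, runAgent: (taskId: string) => void): number {
    if (this.dispatching) return 0;
    this.dispatching = true;
    let handed = 0;
    try {
      for (const task of this.tasks) {
        if (task.state !== "pending") continue;
        task.state = "running";
        task.startedAt = now;
        runAgent(task.id);
        handed++;
      }
    } finally {
      this.dispatching = false;
    }
    return handed;
  }

  complete(id: string): void {
    for (const task of this.tasks) {
      if (task.id === id) task.state = "done";
    }
  }

  // Threshold job: tasks running longer than maxAgeMs return to pending.
  recoverStuck(now: number, maxAgeMs: number): number {
    let recovered = 0;
    for (const task of this.tasks) {
      const age = task.startedAt === undefined ? 0 : now - task.startedAt;
      if (task.state === "running" && age > maxAgeMs) {
        task.state = "pending";
        recovered++;
      }
    }
    return recovered;
  }

  stateOf(id: string): TaskState | undefined {
    for (const task of this.tasks) {
      if (task.id === id) return task.state;
    }
    return undefined;
  }
}

const ledger = new TaskLedger();
ledger.add("t1");
ledger.add("t2");
const dispatchedIds: string[] = [];
const handedOut = ledger.dispatch(0, (id) => { dispatchedIds.push(id); });
ledger.complete("t1");
// Ten minutes later, with a five-minute threshold, t2 counts as stuck.
const recoveredCount = ledger.recoverStuck(10 * 60 * 1000, 5 * 60 * 1000);
```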

F-C-2 · 15k profile cap with silent truncation · Anti-pattern at our scale

MAX_IMPORT_PROFILES = 15000. The network side is hard-truncated to the first 15k results from /profile-matches, and the rest are silently dropped. The truncation is logged but raises no error. No paging, no master-level splitting.

Sources: fc connection_svc_agent.ts:149 (the constant)
fc connection_svc_agent.ts:605 (loadProfiles — single non-paged request)
fc connection_svc_agent.ts:643 (the truncation log line)
Implication: Getro cannot adopt this cap — SC-003 requires complete coverage on customer collections. Use Postgres-driven precompute (no in-memory map) so storage is the only bound.
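
The difference between the capped load and a complete one can be sketched as below. FetchPage is a hypothetical synchronous stand-in for a paged data source (a real client would be asynchronous, and in Getro the source would be a Postgres cursor or keyset query); the cap of 15 and the 37 fake IDs are scaled-down illustration values.

```typescript
// Hypothetical paged source: returns up to `limit` IDs starting at `offset`.
type FetchPage = (offset: number, limit: number) => string[];

// Anti-pattern: one request, capped, everything past the cap is lost.
function loadTruncated(fetchPage: FetchPage, cap: number): string[] {
  return fetchPage(0, cap);
}

// Complete coverage: keep requesting until a short page signals the end.
// Each page is handed to onPage so nothing has to be held in memory.
function loadAllPaged(
  fetchPage: FetchPage,
  pageSize: number,
  onPage: (ids: string[]) => void
): number {
  let offset = 0;
  let total = 0;
  while (true) {
    const page = fetchPage(offset, pageSize);
    if (page.length > 0) onPage(page);
    total += page.length;
    if (page.length < pageSize) break;
    offset += pageSize;
  }
  return total;
}

// Fake backend with 37 IDs.
const allIds: string[] = [];
for (let i = 0; i < 37; i++) allIds.push("prid-" + i);
const fakeBackend: FetchPage = (offset, limit) => allIds.slice(offset, offset + limit);

const truncatedIds = loadTruncated(fakeBackend, 15); // 22 IDs silently missing
let pageCount = 0;
const pagedTotal = loadAllPaged(fakeBackend, 10, () => { pageCount++; });
```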

F-C-3 · Two-population asymmetry: small index in memory + large stream · Pattern reusable

Network side (B) is loaded once into work_info_map + education_info_map keyed on company/school. ICP side (A) streams in batches via fetchProfiles(batch_info), probing the maps. Different storage decisions per side.

Sources: fc connection_svc_agent.ts:505 (runTask batch loop)
fc connection_svc_agent.ts:644-705 (build maps from network)
fc connection_svc_agent.ts:1141 (storeOverlappingProfiles — write IDs only)
Implication: Same shape applies in Getro: CollectionOrgCurrentSharedContact is the small in-memory-equivalent index (Postgres lookup table); the large side (network contacts) streams via Sidekiq batches.

F-C-4 · Compact result format: IDs only · Reusable

Output is newline-delimited PRIDs to a fileserver. No hydrated profiles in the result; consumers re-fetch on demand. Avoids storing huge payloads.

Source: fc connection_svc_agent.ts:1141-1169
Implication: Mirror in Getro: ContactCoemploymentEdge stores contact IDs and bridge IDs only. UI fetches contact details on drill-in.
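
F-C-3 and F-C-4 fit together, as the sketch below shows: the small side is indexed once, the large side streams through in batches, and only IDs survive into the result. Names and shapes here are hypothetical. Time-overlap checks are omitted for brevity, and the choice to emit the streamed side's PRIDs is an assumption about which IDs storeOverlappingProfiles writes.

```typescript
interface KeyedStint {
  prid: string;
  companyKey: string;
}

// Small side: loaded once, kept in memory, keyed on company.
function buildWorkInfoIndex(networkSide: KeyedStint[]): Record<string, string[]> {
  const index: Record<string, string[]> = Object.create(null);
  for (const stint of networkSide) {
    if (!index[stint.companyKey]) index[stint.companyKey] = [];
    index[stint.companyKey].push(stint.prid);
  }
  return index;
}

// Large side: arrives in batches. Each batch probes the index and only
// the matching PRIDs are retained, de-duplicated across batches.
function collectOverlappingPrids(
  batches: KeyedStint[][],
  index: Record<string, string[]>
): string[] {
  const seen: Record<string, boolean> = Object.create(null);
  const result: string[] = [];
  for (const batch of batches) {
    for (const stint of batch) {
      if (index[stint.companyKey] && !seen[stint.prid]) {
        seen[stint.prid] = true;
        result.push(stint.prid);
      }
    }
  }
  return result;
}

const smallSide: KeyedStint[] = [
  { prid: "n1", companyKey: "acme" },
  { prid: "n2", companyKey: "globex" },
];
const streamedBatches: KeyedStint[][] = [
  [{ prid: "p1", companyKey: "acme" }, { prid: "p2", companyKey: "initech" }],
  [{ prid: "p1", companyKey: "globex" }, { prid: "p3", companyKey: "globex" }],
];

// Compact result: newline-delimited IDs, no hydrated profiles.
const serializedPrids = collectOverlappingPrids(
  streamedBatches,
  buildWorkInfoIndex(smallSide)
).join("\n");
```

Memory is bounded by the small side plus one batch, which is why the two sides warrant different storage decisions.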

3. Live connection path (per-profile)

The user-facing endpoint Findem uses to populate the "Connections" panel on a candidate-detail page in app-next. Combines stored social edges + reverse-upload lookup + computed overlap. Useful as a model for 007's drill-in.

F-L-1 · handleLoadConnections: the runtime merge · Pattern reusable

POST /hm/api/profile with type: 'LoadConnections'. Merges (a) explicit candidate.connections[] + reverse upload lookup, (b) evalOptimalReach overlap results, (c) optional LinkedIn nests. Sorted by score; "Social" tier hardcoded at score 2000 to top-rank explicit edges.

Backend: fc profile_api_handler.ts:1251 (dispatch)
fc profile_api_handler.ts:1876 (handler entry)
fc profile_api_handler.ts:2000 (Java path)
fc profile_api_handler.ts:2032 (TS path fallback)
fc candidate_upload_manager.ts:777 (reverse upload lookup)
Frontend: an src/components/MegaEnrichedProfile/Panels/Connections.tsx:305 (useGetConnectionsQuery)
an src/services/matches.ts:474 (RTK Query definition)
an src/components/MegaEnrichedProfile/AboutTab/Connection.tsx (row renderer)
Implication: 007 drill-in mirrors this shape (read precomputed edges + per-pair caches, merge at request time, return a ranked list). Adopt the score-2000 trick for direct connections so they always rank above intro paths.
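
A minimal sketch of the merge and the fixed top score follows. The value 2000 and the "Social" tier name come from the finding; mergeConnections, the row shape, the "Overlap" tier label, and the rule that an explicit edge wins when the same PRID also has an overlap score are assumptions.

```typescript
const SOCIAL_SCORE = 2000; // value stated in the finding

interface ConnectionRow {
  prid: string;
  score: number;
  tier: "Social" | "Overlap";
}

function mergeConnections(
  explicitPrids: string[],
  overlaps: Array<{ prid: string; score: number }>
): ConnectionRow[] {
  const byPrid: Record<string, ConnectionRow> = Object.create(null);
  for (const prid of explicitPrids) {
    byPrid[prid] = { prid, score: SOCIAL_SCORE, tier: "Social" };
  }
  for (const o of overlaps) {
    // ASSUMPTION: an explicit edge wins over a computed overlap for the same PRID.
    if (!byPrid[o.prid]) {
      byPrid[o.prid] = { prid: o.prid, score: o.score, tier: "Overlap" };
    }
  }
  return Object.keys(byPrid)
    .map((key) => byPrid[key])
    .sort((x, y) => y.score - x.score); // highest score first
}

const mergedRows = mergeConnections(
  ["e1"],
  [
    { prid: "a", score: 3.5 },
    { prid: "e1", score: 4 },
    { prid: "b", score: 2.2 },
  ]
);
```

Because overlap scores from the kernel formulas stay in single digits, any fixed score far above that range keeps explicit edges at the top without a separate sort key.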

F-L-2 · Connection macro is the network definition · Diverge

Findem uses macros (saved searches) tagged categories.includes('Connection') + is_private to define "your network." Auto-picks "Connections - Employee Connection" or "linkedin connections" as defaults. Fully customer-configurable.

Source: fc profile_api_handler.ts:1962-1996
Implication: Getro does not need a macro engine. The network is defined statically by UserContactCollection with source = 'shared', scoped per collection_id. Simpler model; ship without rebuilding macros.

4. Search endpoint — does NOT include connections

The workhorse search API (POST /pub/api/sandbox/matches). Worth understanding because 007 reviewers will ask "can we just use this?" — the answer is no, and this section documents why.

F-S-1 · /pub/api/sandbox/matches: the search workhorse · Not for connections

Takes inline ICP requirements + ~30 filter knobs. Returns matched profiles with hydration, logo population, CRM source decoration. Does not call any connection function. No evalOptimalReach, no computeOverlapScores. Connections are computed downstream when the user drills in.

Source: fc pub_svc_matches.ts:105 (route)
fc pub_svc_matches.ts:715 (handleSandboxMatches)
fc sandbox_matches.ts:57 (fetchSandboxMatchResultsInternal)
Implication: 007 cannot piggy-back on this endpoint. Connection paths require a separate computation (precomputed edges + summary rollup, per spec-technical §6).

F-S-2 · CRM-context decoration pattern · Pattern reusable

After search results are returned, a context-aware decoration pass annotates each profile with CRM-specific fields (SandboxProfileUtils.populateCrmProfileSources). Lean search + late decoration.

Source: fc pub_svc_matches.ts:859-862
Implication: Apply same idea to 007: the org-list endpoint returns bare summaries; a decoration pass joins SharedListNetworkSummary for the connection counts. Decouples search from connection enrichment.
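
Lean search plus late decoration can be sketched as two independent steps. The types, function names, and the connectionCount field below are hypothetical; countsByOrg stands in for a lookup against SharedListNetworkSummary.

```typescript
interface OrgSummary {
  orgId: string;
  orgLabel: string;
}

interface DecoratedOrg extends OrgSummary {
  connectionCount: number;
}

// Step 1: lean search. Knows nothing about connections.
function searchOrgs(all: OrgSummary[], query: string): OrgSummary[] {
  const needle = query.toLowerCase();
  return all.filter((org) => org.orgLabel.toLowerCase().indexOf(needle) !== -1);
}

// Step 2: late decoration. Joins counts onto whatever the search returned.
function decorateWithConnections(
  rows: OrgSummary[],
  countsByOrg: Record<string, number>
): DecoratedOrg[] {
  return rows.map((row) => ({
    orgId: row.orgId,
    orgLabel: row.orgLabel,
    connectionCount: countsByOrg[row.orgId] || 0, // no summary row means zero
  }));
}

const orgCatalog: OrgSummary[] = [
  { orgId: "o1", orgLabel: "Acme Robotics" },
  { orgId: "o2", orgLabel: "Globex" },
  { orgId: "o3", orgLabel: "Acme Labs" },
];
const summaryCounts: Record<string, number> = { o1: 7 };

const decoratedOrgs = decorateWithConnections(
  searchOrgs(orgCatalog, "acme"),
  summaryCounts
);
```

Either step can change without touching the other: the search stays fast, and the decoration pass can later pull from a different rollup.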

5. Profile data model — what's stored vs what's computed

Findem's ICandidateProfile has connection-related fields that look richer than they are. This section documents the actual data so we don't over-rely on what Findem nominally has.

F-D-1 · candidate.connections[]: sparse, per-uploader, frozen · Lower coverage than expected

Field is declared on the profile type but is populated only when a user manually uploads their LinkedIn export via LiConnectionUploader. Coverage is likely <10% of profiles. Entries are tagged per-uploader ('linkedin connection <upload_id>'), never refreshed, and not exposed via pub_api.

Sources: fc datamodels.ts:269 (ICandidateProfile.connections — optional field)
fc datamodels.ts:280-289 (IConnectionInfo schema)
fc candidate_upload_ingestor.ts:1853 (the only writer)
Implication: Don't rely on candidate.connections[] as a primary signal. Confirm production coverage via the Mongo aggregate query before any decision; if <20%, deprioritize entirely.

F-D-2 · OverlapSummaryInfo.score: declared but never populated · Misleading field

Schema has a score field on the summary type, but it's not written anywhere in the codebase. The actual score is computed at query time inside match_engine, not stored on the profile.

Source: fc datamodels.ts:3146-3157
Implication: Findem does not have stored relationship-strength scores. 008 cannot fetch this from Findem; we compute it ourselves. (See spec-technical §5 rule engine.)

F-D-3 · ConnectionType enum: broader than work/edu · Future expansion hint

Enum includes Education = 0, WorkExperience = 1, Publication = 2, Patent = 3. Findem already models four bridge types even if the kernel only uses two.

Source: fc connection_datamodels.ts
Implication: Getro's edge-type list (coemployment, coeducation, coinvestment, board-peer) doesn't need to map 1:1 with Findem's. Our list comes from Getro product needs; Findem's enum is reference, not constraint.
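
To show the two lists side by side, the sketch below restates the enum values from the finding and Getro's edge types from the implication. The enum is written as a const object so the snippet runs under type-stripping runtimes. toGetroEdgeType and its mapping are hypothetical, included only to show that the correspondence is partial.

```typescript
// Values as listed in the finding.
const ConnectionType = {
  Education: 0,
  WorkExperience: 1,
  Publication: 2,
  Patent: 3,
} as const;
type ConnectionTypeValue = (typeof ConnectionType)[keyof typeof ConnectionType];

// Getro's edge types, as named in the implication above.
type GetroEdgeType = "coemployment" | "coeducation" | "coinvestment" | "board-peer";

// Hypothetical partial mapping: only the two bridge types the kernel
// uses have an obvious Getro counterpart.
function toGetroEdgeType(t: ConnectionTypeValue): GetroEdgeType | undefined {
  switch (t) {
    case ConnectionType.WorkExperience:
      return "coemployment";
    case ConnectionType.Education:
      return "coeducation";
    default:
      return undefined; // Publication and Patent have no Getro edge type
  }
}
```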

6. Findem capability gaps (F1–F9)

Capabilities Getro needs from Findem to fill out 008 and 007 v2/v3. Status reflects current understanding; some answers may have changed since the last sync with Findem.

ID	Capability	Used by	Status
F1	Person lookup by email — canonical identity resolution	008 dedup oracle (DR-02)	Pending Findem
F3	Enrich-by-email-only (no LinkedIn handle required)	008 thin V1 fallback	Pending Findem
F4	User enrichment — same APIs as contact enrichment, applied to team members	008 W3/K6 work overlap; 007 v2 path generation	Pending Findem
F6	Investor / cap-table data on companies and people	007 v3 investor edges; 008 investor-overlap signal	Pending Findem
F7	Profile-update webhooks for cache invalidation	008 EnrichedEmailLookup invalidation (DR-03)	Pending Findem
F8	Expose `candidate.connections[]` via pub_api with `ext_src` filter	Was: potential Cold/Known signal source	Downgrade — coverage too sparse (see F-D-1)
F9	Expose `match_engine` overlap scoring as a service for arbitrary PRID pairs	Could replace Getro reimplementing the kernel	Open question

Reframe of F8

Original F8 asked Findem to expose LinkedIn-source connections[] via pub_api. Investigation (see F-D-1) suggests coverage is likely <10% of profiles (still to be confirmed in production) and scoped per uploader, not global. Reframe to: "Confirm production coverage of candidate.connections[]; if Findem starts ambient enrichment with global LinkedIn data, revisit."

7. Implications summary

Bringing it all together — what Findem gives us, what we build ourselves, and where the lines are.

Area	What Findem provides	What Getro builds
Overlap algorithm	The kernel formula (F-K-1, F-K-2, F-K-3)	Port into `WorkOverlapCalculator` + `CoemploymentEdgeMaintainer`
Network definition	Macro-based (F-L-2)	UCC-based (simpler, no macro engine)
Connection batch precompute	connection_svc with 15k cap (F-C-1, F-C-2)	Sidekiq workers with no cap; per-cache reconciliation
Connection drill-in (live)	LoadConnections runtime merge (F-L-1)	Mirror the merge shape; read precomputed Postgres edges
Search endpoint	sandbox/matches — does NOT include connections (F-S-1)	Separate per-list endpoint; connection counts via summary rollup
Profile data	Sparse `candidate.connections[]` (F-D-1); no stored strength score (F-D-2)	Use Findem profile data for enrichment (work/edu history); do not depend on connections[]
Investor + board data	Pending F6	007 v3 + 008 Phase 10 — gated on F6 answer

One-line takeaway

Findem provides the algorithm; Getro provides the storage and scale. Reuse the overlap-scoring formulas verbatim. Reuse architectural patterns (master/agent split, two-population asymmetry, late decoration). Do not depend on Findem's storage shapes (in-memory maps, 15k caps, sparse upload tables) — those don't fit Getro's list-view trigger or coverage requirements.