GET-109 · Findem code findings

What Findem already does, with file references

A consolidated audit of the relevant systems in the Findem codebase (firstcut + app-next), with direct links to source. Used to inform 007 and 008 architecture decisions.

Drafted 2026-04-28 · Status: review · Audience: Getro engineering team

0. About this document

Repo access required. firstcut lives at github.com/findemdev/firstcut (private); app-next at bitbucket.org/rimsaw/app-next (private). All file links below assume reviewer access. External readers will get 404s — file paths are still valid for anyone who pulls the repos locally.

This document captures the reverse-engineering done on Findem's existing implementation while planning Getro's Network Intelligence System. It is not a Findem doc; it is a Getro engineering reference. Each finding is tagged with implications for our architecture.

Conventions

  • All file links point to the main branch of firstcut or app-next. Line numbers are accurate as of 2026-04-28; verify against current main if drift is suspected.
  • Source paths prefixed fc are in firstcut; paths prefixed an are in app-next.
  • Each finding has an ID like F-K-1 (Kernel finding 1) or F-S-2 (Search finding 2) for cross-reference.
  • Reusable means we can copy the algorithm or pattern into Getro. Diverge means the implementation doesn't fit our scale or shape. Open means an unresolved question for Findem.

1. The overlap kernel

The piece worth copying. Findem's match_engine already implements the "did A and B share an employer or school with positive time overlap" check, with scoring. Both 007 (intro paths) and 008 (relationship-strength W3/K6) need exactly this.

F-K-1 computeOverlapScores: the entry point · Reusable algorithm

Takes two profile populations (profiles × target_profiles) and returns overlap scores per pair. Self-exclusion via LinkedIn ID; duplicate-profile guard via candidate_tags.

Source
fc backend/query_svc/engine/match_engine.ts:626
Implication
Port the loop shape into WorkOverlapCalculator + CoemploymentEdgeMaintainer. The two-loop structure (build maps, then probe) is portable to Postgres SQL: index contact_work_experiences on (organization_id) for the inner probe.
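
A minimal sketch of the build-then-probe shape described above, not Findem's code. The Stint type, its field names, and findSharedEmployerPairs are hypothetical; only the two-loop structure and the "positive time overlap" condition come from the finding.

```typescript
// Hypothetical work-history row; not Findem's or Getro's schema.
interface Stint {
  personId: string;
  companyId: string;
  start: number; // any comparable number, e.g. a year or epoch ms
  end: number;
}

interface OverlapPair {
  a: string;
  b: string;
  companyId: string;
}

function findSharedEmployerPairs(network: Stint[], targets: Stint[]): OverlapPair[] {
  // Loop 1: index the network side by company.
  const byCompany = new Map<string, Stint[]>();
  for (const s of network) {
    const bucket = byCompany.get(s.companyId);
    if (bucket) bucket.push(s);
    else byCompany.set(s.companyId, [s]);
  }

  // Loop 2: probe the index with each target stint.
  const found: OverlapPair[] = [];
  for (const t of targets) {
    for (const n of byCompany.get(t.companyId) ?? []) {
      const shared = Math.min(t.end, n.end) - Math.max(t.start, n.start);
      // Only strictly positive overlap counts; touching endpoints do not.
      if (shared > 0) found.push({ a: t.personId, b: n.personId, companyId: t.companyId });
    }
  }
  return found;
}

const examplePairs = findSharedEmployerPairs(
  [
    { personId: "n1", companyId: "acme", start: 2015, end: 2019 },
    { personId: "n2", companyId: "acme", start: 2021, end: 2023 },
  ],
  [{ personId: "t1", companyId: "acme", start: 2018, end: 2021 }],
);
// examplePairs holds one pair (t1, n1); n2 only touches t1 at 2021, so it is excluded.
```

In the Postgres port, the Map plays the role of the index on contact_work_experiences (organization_id) and the inner loop becomes the join probe.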

F-K-2 Work overlap scoring formula · Reusable formula

Scoring: 1 (company match) + (1 + overlap_fraction) (timeline) + dept/title fraction. Match requires comp_lnkd_id OR comp_id OR comp_name equality (three-way fallback).

Source
fc backend/query_svc/engine/match_engine.ts:759 (computeExpOverlapInfo)
Implication
Adopt this formula for 007's strength tuple (FR-006a). Same shape carries to SharedListNetworkSummary.strength_score.
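
A worked sketch of the formula as stated in this finding, under assumptions the audit does not settle. The field names comp_lnkd_id, comp_id, and comp_name come from the finding. The definition of overlap_fraction (shared time divided by the shorter stint), the zero score for pairs without positive overlap, and the deptTitleFraction parameter are assumptions to check against computeExpOverlapInfo before porting.

```typescript
interface WorkExp {
  comp_lnkd_id?: string;
  comp_id?: string;
  comp_name?: string;
  start: number;
  end: number;
}

// Three-way fallback: any one identifier matching counts as the same company.
function sameCompany(a: WorkExp, b: WorkExp): boolean {
  const eq = (x?: string, y?: string) => x !== undefined && x === y;
  return eq(a.comp_lnkd_id, b.comp_lnkd_id) || eq(a.comp_id, b.comp_id) || eq(a.comp_name, b.comp_name);
}

// ASSUMPTION: overlap_fraction = shared time / shorter of the two stints.
function overlapFraction(a: WorkExp, b: WorkExp): number {
  const shared = Math.min(a.end, b.end) - Math.max(a.start, b.start);
  const shorter = Math.min(a.end - a.start, b.end - b.start);
  return shared > 0 && shorter > 0 ? shared / shorter : 0;
}

// ASSUMPTION: deptTitleFraction is a value in [0, 1] computed elsewhere,
// and a pair with no positive overlap scores 0.
function workOverlapScore(a: WorkExp, b: WorkExp, deptTitleFraction: number): number {
  if (!sameCompany(a, b)) return 0;
  const fraction = overlapFraction(a, b);
  if (fraction === 0) return 0;
  // 1 (company match) + (1 + overlap_fraction) (timeline) + dept/title fraction
  return 1 + (1 + fraction) + deptTitleFraction;
}

const exampleScore = workOverlapScore(
  { comp_id: "c-42", start: 2016, end: 2020 },
  { comp_id: "c-42", comp_name: "Acme", start: 2018, end: 2020 },
  0.5,
);
// exampleScore is 3.5: 1 + (1 + 2/2) + 0.5
```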

F-K-3 Education overlap scoring formula · Reusable formula

Mirror of the work formula: 1 (school match) + (1 + overlap_fraction) (timeline) + degree fraction + major fraction. School matched by inst_name equality (no canonical school ID — known fragility).

Source
fc backend/query_svc/engine/match_engine.ts:820 (computeEduOverlapInfo)
Implication
Adopt for 007 v2 + 008 EducationOverlapCache. Plan: link contact_educations to a canonical schools table to avoid Findem's name-equality fragility.
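
To illustrate the name-equality fragility and the planned fix, a small sketch. The alias map stands in for a canonical schools table; its contents, the EduExp shape, and all function names are hypothetical. The score line follows the formula in this finding, with the three fractions assumed to be in [0, 1] and supplied by the caller.

```typescript
// Hypothetical stand-in for a canonical schools table (alias -> school ID).
const SCHOOL_ALIASES = new Map<string, string>([
  ["mit", "school-001"],
  ["massachusetts institute of technology", "school-001"],
]);

// Resolve a free-text school name to a canonical ID, or undefined if unknown.
function canonicalSchoolId(instName: string): string | undefined {
  return SCHOOL_ALIASES.get(instName.trim().toLowerCase());
}

interface EduExp {
  inst_name: string;
  start: number;
  end: number;
}

// Findem-style match: raw inst_name equality.
function sameSchoolByName(a: EduExp, b: EduExp): boolean {
  return a.inst_name === b.inst_name;
}

// Planned Getro-style match: compare canonical IDs instead.
function sameSchoolById(a: EduExp, b: EduExp): boolean {
  const id = canonicalSchoolId(a.inst_name);
  return id !== undefined && id === canonicalSchoolId(b.inst_name);
}

// 1 (school match) + (1 + overlap_fraction) (timeline) + degree fraction + major fraction
function eduOverlapScore(overlapFraction: number, degreeFraction: number, majorFraction: number): number {
  return 1 + (1 + overlapFraction) + degreeFraction + majorFraction;
}

const alum1: EduExp = { inst_name: "MIT", start: 2010, end: 2014 };
const alum2: EduExp = { inst_name: "Massachusetts Institute of Technology", start: 2012, end: 2016 };
const matchedByName = sameSchoolByName(alum1, alum2); // false: the strings differ
const matchedById = sameSchoolById(alum1, alum2); // true: both resolve to school-001
const eduScore = matchedById ? eduOverlapScore(0.5, 1, 0) : 0; // 3.5
```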

F-K-4 Self-exclusion + duplicate guard · Reusable

Refuses to compute overlap between a profile and itself (matched via LinkedIn ID) and skips known-duplicate profiles tagged possibleDuplicate.

Source
fc backend/query_svc/engine/match_engine.ts:657-674
Implication
007 needs the same: a contact shouldn't be its own intro path. Implement at edge-write time in CoemploymentEdgeMaintainer.
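
The guard could look roughly like this at edge-write time. The ProfileRef shape and shouldSkipPair are hypothetical; the LinkedIn-ID comparison and the possibleDuplicate tag are from the finding. Whether one tagged side is enough to skip the pair is an assumption.

```typescript
// Hypothetical profile reference; only linkedinId and the tag name mirror the finding.
interface ProfileRef {
  id: string;
  linkedinId?: string;
  candidateTags: string[];
}

function shouldSkipPair(a: ProfileRef, b: ProfileRef): boolean {
  // Self-exclusion: the same LinkedIn ID means the same person.
  if (a.linkedinId !== undefined && a.linkedinId === b.linkedinId) return true;
  // Duplicate guard. ASSUMPTION: either side carrying the tag is enough to skip.
  const isDup = (p: ProfileRef) => p.candidateTags.includes("possibleDuplicate");
  return isDup(a) || isDup(b);
}

const alice: ProfileRef = { id: "p1", linkedinId: "li-alice", candidateTags: [] };
const aliceCopy: ProfileRef = { id: "p2", linkedinId: "li-alice", candidateTags: [] };
const bob: ProfileRef = { id: "p3", linkedinId: "li-bob", candidateTags: [] };
const bobDup: ProfileRef = { id: "p4", candidateTags: ["possibleDuplicate"] };

const skipSelf = shouldSkipPair(alice, aliceCopy); // true
const skipDup = shouldSkipPair(alice, bobDup); // true
const keepPair = shouldSkipPair(alice, bob); // false
```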

2. Connection precompute (batch path)

Findem's async batch system that precomputes overlap results per saved-search macro. The architecture has both useful patterns and known scale failures. We borrow the patterns and avoid the failures.

F-C-1 Master/agent split with mutex-serialized dispatch · Pattern only

Master polls every 5 min, holds a mutex, hands work units to agents. Per-task state in Mongo connection_task. Threshold job recovers stuck tasks.

Sources
fc connection_svc_master.ts:189 (mutex + dispatch)
fc connection_svc_master.ts:224 (resync + create tasks)
fc connection_svc_mongomodels.ts:1 (task ledger)
Implication
The master/agent + task-ledger pattern maps cleanly onto Sidekiq + sidekiq-unique-jobs. Pattern reusable; no need to copy the polling loop.
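
For readers unfamiliar with the pattern, a single-process sketch of a task ledger with threshold recovery. Everything here (TaskRow, the state names, both functions) is hypothetical and does not reproduce connection_task; the mutex is out of scope because nothing runs concurrently in this sketch.

```typescript
type TaskState = "pending" | "running" | "done";

// Hypothetical ledger row.
interface TaskRow {
  id: string;
  state: TaskState;
  startedAt?: number; // epoch ms, set when an agent picks the task up
}

// Threshold recovery: a task running longer than thresholdMs goes back to pending.
function recoverStuckTasks(ledger: TaskRow[], now: number, thresholdMs: number): TaskRow[] {
  return ledger.map((t) =>
    t.state === "running" && t.startedAt !== undefined && now - t.startedAt > thresholdMs
      ? { id: t.id, state: "pending" as const }
      : t,
  );
}

// Dispatch: hand out at most `slots` pending tasks, marking them running.
function dispatchPending(ledger: TaskRow[], now: number, slots: number): TaskRow[] {
  let remaining = slots;
  return ledger.map((t) => {
    if (t.state !== "pending" || remaining <= 0) return t;
    remaining -= 1;
    return { ...t, state: "running" as const, startedAt: now };
  });
}

const ledgerBefore: TaskRow[] = [
  { id: "a", state: "running", startedAt: 0 }, // stuck since t = 0
  { id: "b", state: "pending" },
];
const tenMinutes = 10 * 60 * 1000;
const recovered = recoverStuckTasks(ledgerBefore, tenMinutes, 5 * 60 * 1000);
const dispatched = dispatchPending(recovered, tenMinutes, 1);
// recovered: a and b are both pending; dispatched: a is running again, b still pending.
```

With Sidekiq + sidekiq-unique-jobs, job state and the uniqueness lock would cover the ledger and the mutex, so none of this would be hand-written.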

F-C-2 15k profile cap with silent truncation · Anti-pattern at our scale

MAX_IMPORT_PROFILES = 15000. The network side is hard-truncated to the first 15k results from /profile-matches and the rest are silently dropped: the truncation is logged but never raised as an error. There is no paging and no master-level splitting.

Sources
fc connection_svc_agent.ts:149 (the constant)
fc connection_svc_agent.ts:605 (loadProfiles — single non-paged request)
fc connection_svc_agent.ts:643 (the truncation log line)
Implication
Getro cannot adopt this cap — SC-003 requires complete coverage on customer collections. Use Postgres-driven precompute (no in-memory map) so storage is the only bound.
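
A toy illustration of the failure mode versus a paged load. FetchPage, both loaders, and the 40-item fake network are hypothetical and synchronous for brevity; this is not how /profile-matches is called, and Getro's plan above is Postgres-driven rather than a paging client.

```typescript
// Hypothetical page fetcher: returns up to `limit` IDs starting at `offset`.
type FetchPage = (offset: number, limit: number) => string[];

// Anti-pattern from the finding: one request, keep the first `cap`, drop the rest.
function loadTruncated(fetchPage: FetchPage, cap: number): string[] {
  return fetchPage(0, cap);
}

// Alternative: keep paging until a short page signals the end.
function loadAll(fetchPage: FetchPage, pageSize: number): string[] {
  const all: string[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = fetchPage(offset, pageSize);
    all.push(...page);
    if (page.length < pageSize) return all;
  }
}

// In-memory stand-in for a 40-profile network.
const fakeNetwork = Array.from({ length: 40 }, (_, i) => `prid-${i}`);
const fakeFetch: FetchPage = (offset, limit) => fakeNetwork.slice(offset, offset + limit);

const truncatedIds = loadTruncated(fakeFetch, 15); // 15 IDs; the other 25 are silently missing
const completeIds = loadAll(fakeFetch, 15); // all 40 IDs
```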

F-C-3 Two-population asymmetry: small index in memory + large stream · Pattern reusable

Network side (B) is loaded once into work_info_map + education_info_map keyed on company/school. ICP side (A) streams in batches via fetchProfiles(batch_info), probing the maps. Different storage decisions per side.

Sources
fc connection_svc_agent.ts:505 (runTask batch loop)
fc connection_svc_agent.ts:644-705 (build maps from network)
fc connection_svc_agent.ts:1141 (storeOverlappingProfiles — write IDs only)
Implication
Same shape applies in Getro: CollectionOrgCurrentSharedContact is the small in-memory-equivalent index (Postgres lookup table); the large side (network contacts) streams via Sidekiq batches.

F-C-4 Compact result format: IDs only · Reusable

Output is newline-delimited PRIDs to a fileserver. No hydrated profiles in the result; consumers re-fetch on demand. Avoids storing huge payloads.

Source
fc connection_svc_agent.ts:1141-1169
Implication
Mirror in Getro: ContactCoemploymentEdge stores contact IDs and bridge IDs only. UI fetches contact details on drill-in.
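
The IDs-only format is simple enough to show in a few lines. Function names are hypothetical, and the de-duplication on write is an addition of this sketch, not something the finding claims Findem does.

```typescript
// Serialize profile IDs as newline-delimited text (de-duplicated, first occurrence wins).
function serializeIds(ids: Iterable<string>): string {
  return Array.from(new Set(ids)).join("\n");
}

// Parse the text back into IDs, ignoring blank lines and stray whitespace.
function parseIds(text: string): string[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

const idFile = serializeIds(["prid-1", "prid-2", "prid-1"]);
const idsRead = parseIds(idFile + "\n");
// idFile is "prid-1\nprid-2"; idsRead is ["prid-1", "prid-2"].
```

Consumers hydrate each ID on demand, which is the same contract proposed for ContactCoemploymentEdge.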

3. Live connection path (per-profile)

The user-facing endpoint Findem uses to populate the "Connections" panel on a candidate-detail page in app-next. Combines stored social edges + reverse-upload lookup + computed overlap. Useful as a model for 007's drill-in.

F-L-1 handleLoadConnections: the runtime merge · Pattern reusable

POST /hm/api/profile with type: 'LoadConnections'. Merges (a) explicit candidate.connections[] + reverse upload lookup, (b) evalOptimalReach overlap results, (c) optional LinkedIn nests. Sorted by score; "Social" tier hardcoded at score 2000 to top-rank explicit edges.

Backend
fc profile_api_handler.ts:1251 (dispatch)
fc profile_api_handler.ts:1876 (handler entry)
fc profile_api_handler.ts:2000 (Java path)
fc profile_api_handler.ts:2032 (TS path fallback)
fc candidate_upload_manager.ts:777 (reverse upload lookup)
Frontend
an src/components/MegaEnrichedProfile/Panels/Connections.tsx:305 (useGetConnectionsQuery)
an src/services/matches.ts:474 (RTK Query definition)
an src/components/MegaEnrichedProfile/AboutTab/Connection.tsx (row renderer)
Implication
007 drill-in mirrors this shape: read precomputed edges + per-pair caches, merge at request time, return a ranked list. Adopt the score-2000 trick for direct connections so they always rank above intro paths.
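
A sketch of the request-time merge with the pinned score. The value 2000 and the "Social" tier are from the finding; the row shape, the "Overlap" tier name, and the rule that an explicit edge replaces a computed one for the same contact are assumptions.

```typescript
const SOCIAL_SCORE = 2000; // from the finding: explicit edges are pinned to this score

interface ConnectionRow {
  contactId: string;
  tier: "Social" | "Overlap";
  score: number;
}

// Merge explicit edges with computed overlap results and rank by score, highest first.
// ASSUMPTION: when a contact appears in both sources, the explicit edge wins.
function mergeConnections(
  explicitIds: string[],
  overlaps: { contactId: string; score: number }[],
): ConnectionRow[] {
  const merged = new Map<string, ConnectionRow>();
  for (const o of overlaps) {
    merged.set(o.contactId, { contactId: o.contactId, tier: "Overlap", score: o.score });
  }
  for (const id of explicitIds) {
    merged.set(id, { contactId: id, tier: "Social", score: SOCIAL_SCORE });
  }
  return Array.from(merged.values()).sort((x, y) => y.score - x.score);
}

const ranked = mergeConnections(
  ["c2"],
  [
    { contactId: "c1", score: 3.5 },
    { contactId: "c2", score: 2.2 },
  ],
);
// ranked: c2 (Social, 2000) first, then c1 (Overlap, 3.5).
```

Pinning works only while computed scores stay far below 2000, which holds for the single-digit scores the kernel formulas produce.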

F-L-2 Connection macro is the network definition · Diverge

Findem uses macros (saved searches) tagged categories.includes('Connection') + is_private to define "your network." Auto-picks "Connections - Employee Connection" or "linkedin connections" as defaults. Fully customer-configurable.

Source
fc profile_api_handler.ts:1962-1996
Implication
Getro does not need a macro engine. The network is defined statically by UserContactCollection with source = 'shared', scoped per collection_id. Simpler model; ship without rebuilding macros.

4. Search endpoint — does NOT include connections

The workhorse search API (POST /pub/api/sandbox/matches). Worth understanding because 007 reviewers will ask "can we just use this?" — the answer is no, and this section documents why.

F-S-1 /pub/api/sandbox/matches: the search workhorse · Not for connections

Takes inline ICP requirements + ~30 filter knobs. Returns matched profiles with hydration, logo population, CRM source decoration. Does not call any connection function. No evalOptimalReach, no computeOverlapScores. Connections are computed downstream when the user drills in.

Source
fc pub_svc_matches.ts:105 (route)
fc pub_svc_matches.ts:715 (handleSandboxMatches)
fc sandbox_matches.ts:57 (fetchSandboxMatchResultsInternal)
Implication
007 cannot piggy-back on this endpoint. Connection paths require a separate computation (precomputed edges + summary rollup, per spec-technical §6).

F-S-2 CRM-context decoration pattern · Pattern reusable

After search results are returned, a context-aware decoration pass annotates each profile with CRM-specific fields (SandboxProfileUtils.populateCrmProfileSources). Lean search + late decoration.

Source
fc pub_svc_matches.ts:859-862
Implication
Apply same idea to 007: the org-list endpoint returns bare summaries; a decoration pass joins SharedListNetworkSummary for the connection counts. Decouples search from connection enrichment.
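
Lean search plus late decoration, sketched for the 007 case. SharedListNetworkSummary is named above, but the fields used here (orgId, connectionCount) and the function name are assumptions for illustration.

```typescript
// Hypothetical shapes for the bare search result and the summary rollup.
interface OrgSummary {
  orgId: string;
  orgName: string;
}
interface NetworkSummary {
  orgId: string;
  connectionCount: number;
}
interface DecoratedOrg extends OrgSummary {
  connectionCount: number;
}

// Decoration pass: the search result stays lean; counts are joined afterwards.
function decorateWithConnections(orgs: OrgSummary[], summaries: NetworkSummary[]): DecoratedOrg[] {
  const counts = new Map<string, number>();
  for (const s of summaries) counts.set(s.orgId, s.connectionCount);
  // Orgs with no summary row get a count of 0 rather than being dropped.
  return orgs.map((o) => ({ ...o, connectionCount: counts.get(o.orgId) ?? 0 }));
}

const decoratedOrgs = decorateWithConnections(
  [
    { orgId: "o1", orgName: "Acme" },
    { orgId: "o2", orgName: "Globex" },
  ],
  [{ orgId: "o1", connectionCount: 3 }],
);
// decoratedOrgs: Acme with 3 connections, Globex with 0.
```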

5. Profile data model — what's stored vs what's computed

Findem's ICandidateProfile has connection-related fields that look richer than they are. This section documents the actual data so we don't over-rely on what Findem nominally has.

F-D-1 candidate.connections[]: sparse, per-uploader, frozen · Lower coverage than expected

Field exists on every profile but populated only when a user manually uploads their LinkedIn export via LiConnectionUploader. Coverage likely <10% of profiles. Tagged per-uploader ('linkedin connection <upload_id>'); never refreshed; not exposed via pub_api.

Sources
fc datamodels.ts:269 (ICandidateProfile.connections — optional field)
fc datamodels.ts:280-289 (IConnectionInfo schema)
fc candidate_upload_ingestor.ts:1853 (the only writer)
Implication
Don't rely on candidate.connections[] as a primary signal. Confirm production coverage via the Mongo aggregate query before any decision; if <20%, deprioritize entirely.
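
One possible shape for the coverage check, since the query itself is not reproduced in this audit. The pipeline uses standard MongoDB aggregation operators ($group, $sum, $cond, $gt, $size, $ifNull); which collection to run it against, and the assumption that connections is an array field on each profile document, need confirming with Findem. The in-memory function mirrors the same logic for checking a sample export.

```typescript
// A possible pipeline, as a plain object to pass to a collection's aggregate().
// ASSUMPTION: one document per profile, with `connections` as an optional array.
const coveragePipeline = [
  {
    $group: {
      _id: null,
      total: { $sum: 1 },
      withConnections: {
        $sum: {
          $cond: [{ $gt: [{ $size: { $ifNull: ["$connections", []] } }, 0] }, 1, 0],
        },
      },
    },
  },
];

// In-memory equivalent: fraction of profiles with a non-empty connections array.
function connectionCoverage(profiles: { connections?: unknown[] }[]): number {
  if (profiles.length === 0) return 0;
  const populated = profiles.filter((p) => (p.connections ?? []).length > 0).length;
  return populated / profiles.length;
}

const sampleCoverage = connectionCoverage([
  { connections: [{}] },
  { connections: [] },
  {},
  {},
]);
// Decision rule from the implication above: below 20% coverage, deprioritize.
const deprioritize = sampleCoverage < 0.2;
// sampleCoverage is 0.25, so deprioritize is false for this sample.
```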

F-D-2 OverlapSummaryInfo.score: declared but never populated · Misleading field

Schema has a score field on the summary type, but it's not written anywhere in the codebase. The actual score is computed at query time inside match_engine, not stored on the profile.

Source
fc datamodels.ts:3146-3157
Implication
Findem does not have stored relationship-strength scores. 008 cannot fetch this from Findem; we compute it ourselves. (See spec-technical §5 rule engine.)

F-D-3 ConnectionType enum: broader than work/edu · Future expansion hint

Enum includes Education = 0, WorkExperience = 1, Publication = 2, Patent = 3. Findem already models four bridge types even if the kernel only uses two.

Source
fc connection_datamodels.ts
Implication
Getro's edge-type list (coemployment, coeducation, coinvestment, board-peer) doesn't need to map 1:1 with Findem's. Our list comes from Getro product needs; Findem's enum is reference, not constraint.

6. Findem capability gaps (F1–F9)

Capabilities Getro needs from Findem to fill out 008/007 v2/v3. Status reflects current understanding; some answers may have changed since last sync with Findem.

ID | Capability | Used by | Status
F1 | Person lookup by email (canonical identity resolution) | 008 dedup oracle (DR-02) | Pending Findem
F3 | Enrich-by-email-only (no LinkedIn handle required) | 008 thin V1 fallback | Pending Findem
F4 | User enrichment: same APIs as contact enrichment, applied to team members | 008 W3/K6 work overlap; 007 v2 path generation | Pending Findem
F6 | Investor / cap-table data on companies and people | 007 v3 investor edges; 008 investor-overlap signal | Pending Findem
F7 | Profile-update webhooks for cache invalidation | 008 EnrichedEmailLookup invalidation (DR-03) | Pending Findem
F8 | Expose candidate.connections[] via pub_api with ext_src filter | Was: potential Cold/Known signal source | Downgrade: coverage too sparse (see F-D-1)
F9 | Expose match_engine overlap scoring as a service for arbitrary PRID pairs | Could replace Getro reimplementing the kernel | Open question

Reframe of F8

Original F8 asked Findem to expose LinkedIn-source connections[] via pub_api. Investigation (see F-D-1) indicates coverage is likely <10% of profiles (not yet confirmed in production) and scoped per uploader, not global. Reframe to: "Confirm production coverage of candidate.connections[]; if Findem starts ambient enrichment with global LinkedIn data, revisit."

7. Implications summary

Bringing it all together — what Findem gives us, what we build ourselves, and where the lines are.

Area | What Findem provides | What Getro builds
Overlap algorithm | The kernel formula (F-K-1, F-K-2, F-K-3) | Port into WorkOverlapCalculator + CoemploymentEdgeMaintainer
Network definition | Macro-based (F-L-2) | UserContactCollection-based (simpler, no macro engine)
Connection batch precompute | connection_svc with 15k cap (F-C-1, F-C-2) | Sidekiq workers with no cap; per-cache reconciliation
Connection drill-in (live) | LoadConnections runtime merge (F-L-1) | Mirror the merge shape; read precomputed Postgres edges
Search endpoint | sandbox/matches, which does NOT include connections (F-S-1) | Separate per-list endpoint; connection counts via summary rollup
Profile data | Sparse candidate.connections[] (F-D-1); no stored strength score (F-D-2) | Use Findem profile data for enrichment (work/edu history); do not depend on connections[]
Investor + board data | Pending F6 | 007 v3 + 008 Phase 10, gated on F6 answer

One-line takeaway

Findem provides the algorithm; Getro provides the storage and scale. Reuse the overlap-scoring formulas verbatim. Reuse architectural patterns (master/agent split, two-population asymmetry, late decoration). Do not depend on Findem's storage shapes (in-memory maps, 15k caps, sparse upload tables) — those don't fit Getro's list-view trigger or coverage requirements.