What Findem already does, with file references
A consolidated audit of the relevant systems in the Findem codebase (firstcut + app-next), with direct links to source. Used to inform 007 and 008 architecture decisions.
Drafted 2026-04-28 · Status: review · Audience: Getro engineering team
0. About this document
Repositories: firstcut at github.com/findemdev/firstcut (private); app-next at bitbucket.org/rimsaw/app-next (private). All file links below assume reviewer access. External readers will get 404s, but the file paths remain valid for anyone who pulls the repos locally.
This document captures the reverse-engineering done on Findem's existing implementation while planning Getro's Network Intelligence System. It is not a Findem doc; it is a Getro engineering reference. Each finding is tagged with implications for our architecture.
Conventions
- All file links point to main (firstcut) or main (app-next). Lines are accurate as of 2026-04-28; verify with current main if drift is suspected.
- Each finding has an ID like F-K-1 (Kernel finding 1) or F-S-2 (Search finding 2) for cross-reference.
- Reusable means we can copy the algorithm or pattern into Getro. Diverge means the implementation doesn't fit our scale or shape. Open means an unresolved question for Findem.
Cross-references
- spec-technical.html — unified architecture for 007 + 008 (the consumer of these findings)
- spec-graph-vs-postgres.html — storage substrate ADR (driven partly by these findings)
- spec-integrations.html — the email/calendar provider integration plan (different scope)
1. The overlap kernel
The piece worth copying. Findem's match_engine already implements the "did A and B share an employer or school with positive time overlap" check, with scoring. Both 007 (intro paths) and 008 (relationship-strength W3/K6) need exactly this.
Takes two profile populations (profiles × target_profiles) and returns overlap scores per pair. Self-exclusion via LinkedIn ID; duplicate-profile guard via candidate_tags.
- Source
- fc backend/query_svc/engine/match_engine.ts:626
- Implication
- Port the loop shape into WorkOverlapCalculator + CoemploymentEdgeMaintainer. The two-loop structure (build maps, then probe) is portable to Postgres SQL: index contact_work_experiences on (organization_id) for the inner probe.
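The build-then-probe shape can be sketched as below. This is a minimal illustration, not Findem's code: the Stint record, its field names, and find_overlaps are hypothetical, and company identity is reduced to a single company_id key. Self-exclusion and duplicate guards are left out here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Stint:
    """One work-experience row (hypothetical shape, not Findem's schema)."""
    person_id: str
    company_id: str
    start: date
    end: date


def overlap_days(a: Stint, b: Stint) -> int:
    """Days both stints were active at once; 0 if the ranges do not intersect."""
    latest_start = max(a.start, b.start)
    earliest_end = min(a.end, b.end)
    return max(0, (earliest_end - latest_start).days)


def find_overlaps(network, targets):
    """Return (network_person, target_person, overlap_days) for overlapping pairs."""
    # Loop 1: index the network side by company.
    by_company = defaultdict(list)
    for stint in network:
        by_company[stint.company_id].append(stint)

    # Loop 2: probe the index with each target stint.
    pairs = []
    for t in targets:
        for n in by_company.get(t.company_id, []):
            days = overlap_days(n, t)
            if days > 0:
                pairs.append((n.person_id, t.person_id, days))
    return pairs
```

In Postgres the first loop becomes the index on organization_id and the second becomes the join that probes it.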
Scoring: 1 (company match) + (1 + overlap_fraction) (timeline) + dept/title fraction. Match requires comp_lnkd_id OR comp_id OR comp_name equality (three-way fallback).
- Source
- fc backend/query_svc/engine/match_engine.ts:759 (computeExpOverlapInfo)
- Implication
- Adopt this formula for 007's strength tuple (FR-006a). Same shape carries to SharedListNetworkSummary.strength_score.
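A worked version of the work-overlap formula, under stated assumptions. The source gives the three terms but not how overlap_fraction or the dept/title fraction are computed; here overlap_fraction is assumed to be overlap days divided by the shorter stint's length, and the dept/title term is assumed to be the share of those two fields that match. Verify both against computeExpOverlapInfo before porting.

```python
def work_overlap_score(overlap_days: int, shorter_stint_days: int,
                       same_dept: bool, same_title: bool) -> float:
    """Score for a pair already matched on company.

    ASSUMPTIONS (not confirmed by the source): how overlap_fraction and the
    dept/title fraction are derived, and that zero overlap scores 0.
    """
    if overlap_days <= 0 or shorter_stint_days <= 0:
        return 0.0  # no positive time overlap: treated as no match
    overlap_fraction = min(1.0, overlap_days / shorter_stint_days)
    company_term = 1.0                       # 1 (company match)
    timeline_term = 1.0 + overlap_fraction   # (1 + overlap_fraction)
    dept_title_term = (int(same_dept) + int(same_title)) / 2
    return company_term + timeline_term + dept_title_term
```

Under these assumptions a matched pair scores between 2 and 4: one year of overlap on a two-year shorter stint with a department match gives 1 + 1.5 + 0.5 = 3.0.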
Mirror of the work formula: 1 (school match) + (1 + overlap_fraction) (timeline) + degree fraction + major fraction. School matched by inst_name equality (no canonical school ID — known fragility).
- Source
- fc backend/query_svc/engine/match_engine.ts:820 (computeEduOverlapInfo)
- Implication
- Adopt for 007 v2 + 008 EducationOverlapCache. Plan: link contact_educations to a canonical schools table to avoid Findem's name-equality fragility.
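To show why name equality is fragile and what a canonical schools table buys, here is a small sketch. The alias dictionary stands in for the planned schools table; its entries, the ID format, and the function names are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical alias table standing in for a canonical `schools` table.
SCHOOL_ALIASES = {
    "mit": "school:mit",
    "massachusetts institute of technology": "school:mit",
    "stanford": "school:stanford",
    "stanford university": "school:stanford",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def canonical_school_id(inst_name: str) -> Optional[str]:
    """Resolve a free-text school name to a canonical ID, if known."""
    return SCHOOL_ALIASES.get(normalize(inst_name))


def same_school(a: str, b: str) -> bool:
    """Compare by canonical ID when both resolve; else by normalized name."""
    id_a, id_b = canonical_school_id(a), canonical_school_id(b)
    if id_a is not None and id_b is not None:
        return id_a == id_b
    return normalize(a) == normalize(b)
```

Raw equality treats "MIT" and "Massachusetts Institute of Technology" as different schools; the canonical lookup resolves both to one ID. Unresolved names fall back to normalized-name comparison, which is a design choice of this sketch.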
Lines 657–664: refuses to compute overlap between a profile and itself (via LinkedIn ID); skips overlap for known-duplicate profiles tagged possibleDuplicate.
- Source
- fc backend/query_svc/engine/match_engine.ts:657-674
- Implication
- 007 needs the same: a contact shouldn't be its own intro path. Implement at edge-write time in CoemploymentEdgeMaintainer.
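The two guards could look like this at edge-write time. The dictionary shape and the should_write_edge name are hypothetical; whether Findem skips a pair when either side or both sides carry the possibleDuplicate tag is not stated in the source, so this sketch assumes either side is enough.

```python
def should_write_edge(a: dict, b: dict) -> bool:
    """Return False for pairs that must never become an intro-path edge."""
    # Guard 1: a contact is never its own intro path (same LinkedIn ID).
    a_id, b_id = a.get("linkedin_id"), b.get("linkedin_id")
    if a_id and a_id == b_id:
        return False
    # Guard 2 (assumption): skip if either side is tagged as a likely duplicate.
    if "possibleDuplicate" in a.get("tags", ()):
        return False
    if "possibleDuplicate" in b.get("tags", ()):
        return False
    return True
```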
2. Connection precompute (batch path)
Findem's async batch system that precomputes overlap results per saved-search macro. The architecture has both useful patterns and known scale failures. We borrow the patterns and avoid the failures.
Master polls every 5 min, holds a mutex, hands work units to agents. Per-task state in Mongo connection_task. Threshold job recovers stuck tasks.
- Sources
- fc connection_svc_master.ts:189 (mutex + dispatch)
- fc connection_svc_master.ts:224 (resync + create tasks)
- fc connection_svc_mongomodels.ts:1 (task ledger)
- Implication
- The master/agent + task-ledger pattern maps cleanly onto Sidekiq + sidekiq-unique-jobs. Pattern reusable; no need to copy the polling loop.
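For reviewers unfamiliar with the pattern, a task ledger with threshold-based recovery reduces to roughly the following. This is a language-neutral sketch, not the Findem implementation and not Sidekiq code: the ledger shape, the function names, and the 30-minute threshold are all hypothetical, and the master's mutex is omitted.

```python
from datetime import datetime, timedelta
from typing import Optional

STUCK_AFTER = timedelta(minutes=30)  # hypothetical threshold, not Findem's value


def claim_next(ledger: dict, agent: str, now: datetime) -> Optional[str]:
    """Hand the first pending task to an agent and record when it started."""
    for task_id, task in ledger.items():
        if task["status"] == "pending":
            task.update(status="running", agent=agent, started_at=now)
            return task_id
    return None


def recover_stuck(ledger: dict, now: datetime) -> list:
    """Return running tasks that exceeded the threshold to the pending pool."""
    recovered = []
    for task_id, task in ledger.items():
        if task["status"] == "running" and now - task["started_at"] > STUCK_AFTER:
            task.update(status="pending", agent=None, started_at=None)
            recovered.append(task_id)
    return recovered
```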
MAX_IMPORT_PROFILES = 15000. Network side hard-truncated to first 15k results from /profile-matches; the rest silently dropped. Logged but not enforced. No paging, no master-level splitting.
- Sources
- fc connection_svc_agent.ts:149 (the constant)
- fc connection_svc_agent.ts:605 (loadProfiles — single non-paged request)
- fc connection_svc_agent.ts:643 (the truncation log line)
- Implication
- Getro cannot adopt this cap: SC-003 requires complete coverage on customer collections. Use Postgres-driven precompute (no in-memory map) so storage is the only bound.
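The difference between a single capped request and complete coverage is easiest to see with a paging loop. The sketch below assumes an offset/limit fetch function (fetch_page is hypothetical; the source does not say whether /profile-matches supports paging at all) and walks pages until a short page signals the end.

```python
from typing import Callable, Iterator


def iter_all_profiles(fetch_page: Callable[[int, int], list],
                      page_size: int = 1000) -> Iterator[str]:
    """Yield every ID by walking pages until a short page signals the end.

    fetch_page(offset, limit) is a hypothetical data-access function.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size
```

Because results stream through a generator, nothing forces the full set into memory, which is the property the Postgres-driven precompute relies on.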
Network side (B) is loaded once into work_info_map + education_info_map keyed on company/school. ICP side (A) streams in batches via fetchProfiles(batch_info), probing the maps. Different storage decisions per side.
- Sources
- fc connection_svc_agent.ts:505 (runTask batch loop)
- fc connection_svc_agent.ts:644-705 (build maps from network)
- fc connection_svc_agent.ts:1141 (storeOverlappingProfiles — write IDs only)
- Implication
- Same shape applies in Getro: CollectionOrgCurrentSharedContact is the small in-memory-equivalent index (Postgres lookup table); the large side (network contacts) streams via Sidekiq batches.
Output is newline-delimited PRIDs to a fileserver. No hydrated profiles in the result; consumers re-fetch on demand. Avoids storing huge payloads.
- Source
- fc connection_svc_agent.ts:1141-1169
- Implication
- Mirror in Getro: ContactCoemploymentEdge stores contact IDs and bridge IDs only. UI fetches contact details on drill-in.
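The IDs-only output format is simple enough to show in full. A minimal sketch, assuming plain string IDs and any text stream as the destination; the function names are hypothetical.

```python
import io


def write_ids(ids, out) -> int:
    """Write one ID per line; return the number of IDs written."""
    count = 0
    for pid in ids:
        out.write(f"{pid}\n")
        count += 1
    return count


def read_ids(src):
    """Stream IDs back one at a time, skipping blank lines."""
    for line in src:
        pid = line.strip()
        if pid:
            yield pid


# Round trip through an in-memory buffer standing in for the fileserver.
buf = io.StringIO()
written = write_ids(["prid-1", "prid-2", "prid-3"], buf)
buf.seek(0)
restored = list(read_ids(buf))
```

The result file stays proportional to the number of matches, not to profile size, and a consumer can hydrate only the rows a user actually opens.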
3. Live connection path (per-profile)
The user-facing endpoint Findem uses to populate the "Connections" panel on a candidate-detail page in app-next. Combines stored social edges + reverse-upload lookup + computed overlap. Useful as a model for 007's drill-in.
POST /hm/api/profile with type: 'LoadConnections'. Merges (a) explicit candidate.connections[] + reverse upload lookup, (b) evalOptimalReach overlap results, (c) optional LinkedIn nests. Sorted by score; "Social" tier hardcoded at score 2000 to top-rank explicit edges.
- Backend
- fc profile_api_handler.ts:1251 (dispatch)
- fc profile_api_handler.ts:1876 (handler entry)
- fc profile_api_handler.ts:2000 (Java path)
- fc profile_api_handler.ts:2032 (TS path fallback)
- fc candidate_upload_manager.ts:777 (reverse upload lookup)
- Frontend
- an src/components/MegaEnrichedProfile/Panels/Connections.tsx:305 (useGetConnectionsQuery)
- an src/services/matches.ts:474 (RTK Query definition)
- an src/components/MegaEnrichedProfile/AboutTab/Connection.tsx (row renderer)
- Implication
- 007 drill-in mirrors this shape: read precomputed edges + per-pair caches, merge at request time, return ranked list. Score 2000 trick: adopt for direct connections so they always rank above intro paths.
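The request-time merge with a pinned top tier can be sketched as follows. The constant mirrors the score of 2000 described above; the function name and input shapes are hypothetical, and the LinkedIn nests source is omitted.

```python
DIRECT_SCORE = 2000  # mirrors the hardcoded "Social" tier score


def merge_connections(direct_ids, overlap_scores):
    """Merge explicit edges with computed overlaps into one ranked list.

    direct_ids: contact IDs with an explicit (stored) connection.
    overlap_scores: {contact_id: overlap score} from precomputed edges.
    """
    merged = dict(overlap_scores)
    for cid in direct_ids:
        merged[cid] = DIRECT_SCORE  # explicit edge wins over any computed score
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

The trick only works while computed overlap scores stay far below the pinned value, so any change to the scoring range should revisit the constant.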
Findem uses macros (saved searches) tagged categories.includes('Connection') + is_private to define "your network." Auto-picks "Connections - Employee Connection" or "linkedin connections" as defaults. Fully customer-configurable.
- Source
- fc profile_api_handler.ts:1962-1996
- Implication
- Getro does not need a macro engine. The network is defined statically by UserContactCollection with source = 'shared', scoped per collection_id. Simpler model; ship without rebuilding macros.
4. Search endpoint — does NOT include connections
The workhorse search API (POST /pub/api/sandbox/matches). Worth understanding because 007 reviewers will ask "can we just use this?" — the answer is no, and this section documents why.
Takes inline ICP requirements + ~30 filter knobs. Returns matched profiles with hydration, logo population, CRM source decoration. Does not call any connection function. No evalOptimalReach, no computeOverlapScores. Connections are computed downstream when the user drills in.
- Source
- fc pub_svc_matches.ts:105 (route)
- fc pub_svc_matches.ts:715 (handleSandboxMatches)
- fc sandbox_matches.ts:57 (fetchSandboxMatchResultsInternal)
- Implication
- 007 cannot piggy-back on this endpoint. Connection paths require a separate computation (precomputed edges + summary rollup, per spec-technical §6).
After search results are returned, a context-aware decoration pass annotates each profile with CRM-specific fields (SandboxProfileUtils.populateCrmProfileSources). Lean search + late decoration.
- Source
- fc pub_svc_matches.ts:859-862
- Implication
- Apply the same idea to 007: the org-list endpoint returns bare summaries; a decoration pass joins SharedListNetworkSummary for the connection counts. Decouples search from connection enrichment.
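A late decoration pass is just a lookup join applied after the lean query returns. A minimal sketch, assuming org rows are dictionaries with an id field and the summary rollup is available as an ID-to-count mapping; both shapes and the function name are hypothetical.

```python
def decorate_with_connection_counts(orgs, summaries):
    """Attach connection counts to bare org rows after the search returns.

    orgs: list of dicts from the lean org-list query.
    summaries: {org_id: connection_count}, e.g. read from a summary rollup.
    Orgs with no summary row get a count of 0. Input rows are not mutated.
    """
    return [
        {**org, "connection_count": summaries.get(org["id"], 0)}
        for org in orgs
    ]
```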
5. Profile data model — what's stored vs what's computed
Findem's ICandidateProfile has connection-related fields that look richer than they are. This section documents the actual data so we don't over-rely on what Findem nominally has.
Field exists on every profile but populated only when a user manually uploads their LinkedIn export via LiConnectionUploader. Coverage likely <10% of profiles. Tagged per-uploader ('linkedin connection <upload_id>'); never refreshed; not exposed via pub_api.
- Sources
- fc datamodels.ts:269 (ICandidateProfile.connections — optional field)
- fc datamodels.ts:280-289 (IConnectionInfo schema)
- fc candidate_upload_ingestor.ts:1853 (the only writer)
- Implication
- Don't rely on candidate.connections[] as a primary signal. Confirm production coverage via the Mongo aggregate query before any decision; if <20%, deprioritize entirely.
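The aggregate query itself is not reproduced in this document, so the pipeline below is one possible form, not the agreed query. It uses standard MongoDB aggregation operators and counts profiles whose connections array is non-empty; the collection to run it against is not named here and is left open. A pure-Python mirror is included so the counting logic and the 20% decision rule can be checked locally.

```python
# One possible aggregation pipeline (MongoDB syntax). Missing or null
# `connections` fields are treated as empty arrays via $ifNull.
COVERAGE_PIPELINE = [
    {"$group": {
        "_id": None,
        "total": {"$sum": 1},
        "with_connections": {"$sum": {"$cond": [
            {"$gt": [{"$size": {"$ifNull": ["$connections", []]}}, 0]},
            1,
            0,
        ]}},
    }},
]


def coverage(profiles) -> float:
    """Pure-Python mirror of the pipeline: share of profiles with connections."""
    total = with_connections = 0
    for p in profiles:
        total += 1
        if p.get("connections"):  # missing, None, and [] all count as empty
            with_connections += 1
    return with_connections / total if total else 0.0


def should_deprioritize(fraction: float, threshold: float = 0.20) -> bool:
    """Decision rule from the implication above: below 20% means deprioritize."""
    return fraction < threshold
```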
Schema has a score field on the summary type, but it's not written anywhere in the codebase. The actual score is computed at query time inside match_engine, not stored on the profile.
- Source
- fc datamodels.ts:3146-3157
- Implication
- Findem does not have stored relationship-strength scores. 008 cannot fetch this from Findem; we compute it ourselves. (See spec-technical §5 rule engine.)
Enum includes Education = 0, WorkExperience = 1, Publication = 2, Patent = 3. Findem already models four bridge types even if the kernel only uses two.
- Source
- fc connection_datamodels.ts
- Implication
- Getro's edge-type list (coemployment, coeducation, coinvestment, board-peer) doesn't need to map 1:1 with Findem's. Our list comes from Getro product needs; Findem's enum is reference, not constraint.
6. Findem capability gaps (F1–F9)
Capabilities Getro needs from Findem to fill out 008/007 v2/v3. Status reflects current understanding; some answers may have changed since last sync with Findem.
| ID | Capability | Used by | Status |
|---|---|---|---|
| F1 | Person lookup by email — canonical identity resolution | 008 dedup oracle (DR-02) | Pending Findem |
| F3 | Enrich-by-email-only (no LinkedIn handle required) | 008 thin V1 fallback | Pending Findem |
| F4 | User enrichment — same APIs as contact enrichment, applied to team members | 008 W3/K6 work overlap; 007 v2 path generation | Pending Findem |
| F6 | Investor / cap-table data on companies and people | 007 v3 investor edges; 008 investor-overlap signal | Pending Findem |
| F7 | Profile-update webhooks for cache invalidation | 008 EnrichedEmailLookup invalidation (DR-03) | Pending Findem |
| F8 | Expose candidate.connections[] via pub_api with ext_src filter | Was: potential Cold/Known signal source | Downgrade: coverage too sparse (see F-D-1) |
| F9 | Expose match_engine overlap scoring as a service for arbitrary PRID pairs | Could replace Getro reimplementing the kernel | Open question |
Reframe of F8
Original F8 asked Findem to expose LinkedIn-source connections[] via pub_api. Investigation (see F-D-1) suggests coverage is likely <10% of profiles, pending confirmation against production, and that the data is scoped per uploader, not global. Reframe to: "Confirm production coverage of candidate.connections[]; if Findem starts ambient enrichment with global LinkedIn data, revisit."
7. Implications summary
Bringing it all together — what Findem gives us, what we build ourselves, and where the lines are.
| Area | What Findem provides | What Getro builds |
|---|---|---|
| Overlap algorithm | The kernel formula (F-K-1, F-K-2, F-K-3) | Port into WorkOverlapCalculator + CoemploymentEdgeMaintainer |
| Network definition | Macro-based (F-L-2) | UCC-based (simpler, no macro engine) |
| Connection batch precompute | connection_svc with 15k cap (F-C-1, F-C-2) | Sidekiq workers with no cap; per-cache reconciliation |
| Connection drill-in (live) | LoadConnections runtime merge (F-L-1) | Mirror the merge shape; read precomputed Postgres edges |
| Search endpoint | sandbox/matches — does NOT include connections (F-S-1) | Separate per-list endpoint; connection counts via summary rollup |
| Profile data | Sparse candidate.connections[] (F-D-1); no stored strength score (F-D-2) | Use Findem profile data for enrichment (work/edu history); do not depend on connections[] |
| Investor + board data | Pending F6 | 007 v3 + 008 Phase 10 — gated on F6 answer |
One-line takeaway
Findem provides the algorithm; Getro provides the storage and scale. Reuse the overlap-scoring formulas verbatim. Reuse architectural patterns (master/agent split, two-population asymmetry, late decoration). Do not depend on Findem's storage shapes (in-memory maps, 15k caps, sparse upload tables) — those don't fit Getro's list-view trigger or coverage requirements.