Methodology
How the eval scorecard is built
What is being measured, the gold labels behind it, the judges, and the known limits. Read this before reading numbers.
1. Thesis and what is being measured
Two realistic products, the same Sonnet 4.6 model, the same UI polish. The integrated build runs the model through a strict enrich_lead tool with a JSON schema, extended thinking, and a claim-grounding rule that requires every claim to cite a verbatim source quote. The chat build runs the same model with a task-describing system prompt but no tools, no schema, and no grounding rule. The eval scorecard measures how that architectural asymmetry surfaces on the same gold labels.
Each test item carries the inputs a salesperson would see (a profile, optionally a company description) plus gold labels for structured fields, classification, a 0.0–1.0 fit score with five named dimensions, the claims a hook may use with verbatim source quotes, and the expected action.
2. The fictional ICP
The demo ships with a fixed ICP. It is declared explicitly so prospects can see what is being scored.
- Stages: Series A, Series B, Series C.
- Headcount: 20 to 250.
- ARR: $2M to $50M USD.
- Product shape: B2B SaaS company shipping at least one user-facing AI feature, or with one in active development.
- Target roles: VP Product, Head of AI/ML, Director of Engineering, technical Founder/CTO.
Action thresholds
Fit score is on a 0.0–1.0 scale across the entire demo. Evaluation order is top to bottom. The first matching row wins.
- refuse when the input lacks data to judge.
- propose when any claim is ungrounded, regardless of fit score.
- auto_add when fit > 0.80 and every claim grounded.
- propose when fit is in [0.50, 0.80] and every claim grounded.
- discard when fit < 0.50 and every claim grounded.
refuse and discard are different. Refuse means “cannot judge.” Discard means “judged and rejected.”
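The first-match-wins ordering of the thresholds can be sketched as a small routing function. This is a minimal illustration, not the demo's actual code: route_action, claims_grounded, and has_enough_data are hypothetical names, and how the real system decides that an input lacks data is not specified here.

```python
from typing import Sequence


def route_action(
    fit: float,
    claims_grounded: Sequence[bool],
    has_enough_data: bool = True,
) -> str:
    """Return the action for one item; rows are checked top to bottom."""
    if not has_enough_data:
        return "refuse"      # cannot judge
    # Assumption: an item with zero claims counts as "every claim grounded".
    if not all(claims_grounded):
        return "propose"     # any ungrounded claim, regardless of fit
    if fit > 0.80:
        return "auto_add"
    if fit >= 0.50:
        return "propose"     # fit in [0.50, 0.80]
    return "discard"         # judged and rejected
```

Because the ungrounded-claim row sits above the fit rows, a 0.95 fit with one ungrounded claim still routes to propose, and a fit of exactly 0.80 lands in propose rather than auto_add.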
3. Test set composition
73 items in v1.0. Languages covered: English, Swedish, German. Inputs in other languages return an out-of-scope response without burning a model call.
- 5 hand-crafted exemplars chosen to span clean fit, ambiguous fit, sparse input, adversarial input (with a prompt injection in the bio), and a multilingual case. These double as the example cards in both UIs and as conversation starters in the chat build.
- 36 fully synthetic profiles shaped like LinkedIn entries and B2B SaaS company descriptions. Hand-labelled. Roughly 10% omit the company description to cover the company-optional path.
- 20 adversarial cases: 6 prompt injections in profile or company text, 4 jailbreak / instruction-override attempts, 4 contradictory data (claims vs job title mismatch), 3 sparse input (one or two fields only), 3 multilingual injection.
- 12 edge cases: very short, very long, ambiguous job titles, supported non-English.
Test-set versioning: every eval run is tagged with the test set's git SHA and version label (currently v1.0). Trends are computed within a version. Material test-set changes bump the label and the methodology page lists the change.
4. Synthesis pipeline
Synthetic profiles are LLM-generated from short scenario specs (e.g. “Series B HR-tech VP Product, attrition feature shipped”), then hand-reviewed and hand-labelled. No real LinkedIn data, no scraping. The generation prompts and scenario specs live in the repo at scripts/generate_*.py and data/*_specs.jsonl.
v1 was solo-labelled, which is the largest open methodological risk and is named below under known limits. v1.1 will add a 20% peer second-pass on fit-score and claims-allowed labels with Cohen's kappa reported on the scorecard.
5. Criteria revision log
Shreya Shankar's Who Validates the Validators? (UIST 2024) names criteria drift: some eval criteria are output-dependent and cannot be fully specified up front. Pretending otherwise produces an evaluation that grades against an unrealistic rubric. This section logs every rubric anchor that was edited after seeing model or labeller output, so the cost of those edits is auditable instead of buried in a git history.
- 2026-05-08 · product_shape_match · new 0.75 anchor: “B2B SaaS surface present but secondary to consumer revenue”
  - The gap surfaced while labelling Exemplar 2 (Moodboard, a Series A consumer-led app with a B2B API side). The original rubric treated B2B SaaS as binary, so a hybrid landed on 1.0 by virtue of the AI-feature signal even though the B2B surface was secondary to consumer revenue. The new OR-condition lets hybrids land on 0.75. Exemplar 2's dimension moved from 1.00 to 0.75; applied retroactively before bulk labelling started.
- 2026-05-08 · role_match · new 0.75 anchor: “Founder/CEO at target-shape company without confirmed technical background”
  - The same rubric review surfaced this anchor. The original rubric reserved 1.0 for technical Founder/CTO and dropped non-technical founders to 0.25 via “mismatched on either axis.” That ignored decision authority, which is the core thing target_roles is meant to capture. The new OR-condition recognises non-technical founder authority as a partial match. Exemplar 2's dimension moved from 0.25 to 0.75.
- 2026-05-08 · Exemplar 2 holistic fit_score.value · 0.55 → 0.65
  - This entry re-justifies the holistic fit score without editing the rubric. With the two anchors above lifted, the dimension average moved from 0.75 to 0.80, and the original 0.55 holistic was double-discounting the same signals the new anchors now carry. The 0.65 vs 0.80 gap captures the labeller's residual discount for the absolute size of the B2B revenue surface relative to the consumer business, which the 0.75 anchor does not fully express. Routing stays propose.
- 2026-05-12 · Hook coherence judge · 1-5 Likert → binary pass/fail with critique
  - The plan originally specified a 1-5 rubric for hook coherence. Cross-checked against the LLM-as-judge evaluation literature, which is categorical that Likert scales produce false confidence and that the critique is where the real information lives. Replaced with a binary passes field plus a mandatory critique that names the specific evidence. Schema version bumped 2 → 3.
- 2026-05-12 · Robustness perturbations · paraphrase + field_shuffle → sentence_reorder
  - The plan listed typos, paraphrase, field_shuffle, and injection as the perturbation set. paraphrase was replaced with sentence_reorder (swaps neighbouring sentences without editing inside them) because paraphrase requires a per-item LLM call that would dominate eval cost and could itself drift; the substitution preserves source-quote substrings exactly. field_shuffle was dropped: the inputs are already a profile + an optional company block, so “shuffling” them on a 2-field schema either no-ops or destroys the input. An LLM-driven paraphrase variant remains on the v1.1 list.
- 2026-05-14 · Anthropic grounding judge · Claude Opus 4.7 → Claude Sonnet 4.6
  - The plan specified Claude Opus 4.7 as the Anthropic-side grounding judge. Swapped to Claude Sonnet 4.6 during the cost-reduction pass: at the volume the nightly eval runs (73 items × claims-per-item × 2 modes), Opus pricing was the largest non-inference line item without a measurable quality lift over Sonnet on this judging task. Cohen's kappa with the OpenAI judge stayed in the same range after the swap, so the conservative-read headline (lower of the two grounding rates) is unaffected.
The cadence going forward: when a rubric edit affects a gold label that has already been applied, the affected items are re-labelled at the same time, the change is dated here, and the pre-fix snapshots stay in data/eval_runs/ so the delta is visible on the timeline rather than silently rebased into the trend.
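The Exemplar 2 arithmetic in the log can be checked from the logged deltas alone. The sketch assumes the dimension average is an unweighted mean over the five named dimensions, which the log implies but does not state; the three untouched dimension values are not needed and are not guessed here.

```python
import math

N_DIMENSIONS = 5  # five named fit dimensions

# Per-dimension changes logged on 2026-05-08 for Exemplar 2.
deltas = {
    "product_shape_match": 0.75 - 1.00,  # hybrid anchor lowers this dimension
    "role_match": 0.75 - 0.25,           # founder-authority anchor raises this one
}

old_average = 0.75
# Assumption: unweighted mean, so each delta contributes delta / 5.
new_average = old_average + sum(deltas.values()) / N_DIMENSIONS

# Gap between the rubric average and the labeller's holistic score.
holistic = 0.65
residual_discount = new_average - holistic

assert math.isclose(new_average, 0.80)
```

Under that assumption the net change is +0.25 spread over five dimensions, which reproduces the logged move from 0.75 to 0.80 and leaves a 0.15 residual between the average and the 0.65 holistic score.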
6. Deterministic metrics
No LLM judge required. Computed from gold labels and outputs.
- Field extraction accuracy: per-field exact or normalised match. Industry and segment use loose token-overlap (50%+ of gold tokens present in the prediction); seniority and company_size are exact match against the controlled vocabularies.
- Classification accuracy: overall match across all four classification fields.
- ICP fit correlation: Pearson and Spearman of predicted vs gold fit_score.value. Per-dimension correlation reported separately.
- Action accuracy: does the model pick the gold action given the gold thresholds?
- Refuse-when-should: count of items where gold is refuse and the model agreed.
- Substring grounding: per claim, does the model's source_quote appear verbatim in the input (whitespace-normalised, case-insensitive)? A substring miss is automatically ungrounded, so no judge call runs.
- Latency p50, p95. Per-call wall clock measured inside the per-mode concurrency semaphore, so the metric reflects model-call time and excludes time an item spent queued waiting for a free slot. For the multi-turn chat path, latency is the sum of per-turn in-semaphore wall times; harness-internal extractor checks between turns are not counted, since a single user hitting a chat endpoint wouldn't see them. The scorecard's headline chat latency is the corrected value from a 2026-05-16 re-measure; the robustness panel's chat latencies were not re-measured and remain on the pre-fix basis (the scorecard does not display robustness latency).
- Token cost p50, p95. Tokens in/out per call.
- Steps-to-completion: integrated tool-call count (always 1 by schema). Chat runs up to three user turns: a generic follow-up (“please fill in the missing fields”) is sent whenever the Haiku extractor reports incomplete coverage. Cap-hit at three turns is scored as failure. The headline exposes avg_turns_used and cap_hit_rate alongside the extractor-complete rate.
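Two of the checks in this list, substring grounding and the loose token-overlap match, are simple enough to sketch. This is an illustration under stated assumptions, not the harness code: the function names are hypothetical, tokens are split on whitespace, and an empty quote is treated as ungrounded.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace runs and lowercase, per the stated matching rule."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_substring_grounded(source_quote: str, input_text: str) -> bool:
    """True when the quote appears verbatim in the input after normalisation."""
    quote = normalise(source_quote)
    # Assumption: an empty quote cannot ground a claim.
    return bool(quote) and quote in normalise(input_text)


def loose_token_match(gold: str, predicted: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the gold tokens appear in the prediction."""
    gold_tokens = set(normalise(gold).split())
    if not gold_tokens:
        return False
    predicted_tokens = set(normalise(predicted).split())
    overlap = len(gold_tokens & predicted_tokens) / len(gold_tokens)
    return overlap >= threshold
```

The overlap is measured against the gold tokens only, so a verbose prediction is not penalised for extra words as long as it covers half of the gold label.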
7. LLM-as-judge metrics
Two metrics are judged. Both are disclosed inline on the scorecard so prospects know which numbers are deterministic and which carry judge subjectivity.
Claim grounding
For every claim, the deterministic substring check runs first; a miss auto-fails without a judge call. Claims that pass the substring check go to two judges: Claude Sonnet 4.6 and OpenAI's strongest available flagship (GPT-5 when present; otherwise the model name is set via OPENAI_GROUNDING_MODEL). The scorecard publishes three values: Sonnet grounding rate, OpenAI judge grounding rate, and Cohen's kappa between them. The lower of the two rates is the headline number, which gives prospects the conservative read.
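The three published values can be sketched from two parallel lists of judge verdicts. This is a minimal illustration that covers only claims that reached both judges; how substring auto-fails are folded into the published rates is not specified here, and the function names are hypothetical.

```python
from typing import Sequence


def grounding_rate(verdicts: Sequence[bool]) -> float:
    """Share of judged claims marked grounded."""
    return sum(verdicts) / len(verdicts)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same claims."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say grounded, or both say ungrounded.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verdicts for four claims, not scorecard data.
sonnet = [True, True, False, False]
openai_judge = [True, False, False, False]

headline = min(grounding_rate(sonnet), grounding_rate(openai_judge))
kappa = cohens_kappa(sonnet, openai_judge)
```

On the toy lists the two rates are 0.50 and 0.25, so the headline is 0.25, and kappa comes out at 0.5: the judges agree on three of four claims, but half of that agreement is expected by chance.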
Hook pass rate
Binary pass/fail with a written critique, single judge (GPT-5-mini). There is no Likert scale. The literature on LLM-as-judge (Husain, Shankar) is consistent that 1-5 numeric rubrics create false confidence and the critique is where the learning lives. Pass criteria are baked into the prompt so re-runs are stable.
A hook passes only when all of these hold: multiple specifics that come from the lead's input (verbatim or paraphrased; invented specifics do not count), professional tone appropriate for B2B outreach, on-topic and coherent.
A hook fails when any of these hold: incoherent, off-topic, or generic with no input-grounded specifics; a single specific or specifics that are not actually in the input; over-familiar / salesy / dismissive tone; or the action is discard or refuse (no hook should have been drafted at all).
The critique is required for both outcomes, since it names the specific evidence in the input that the hook did or did not ground in. A pass without a substantive critique is not a real pass; those land in the scorecard's judgements list for re-review.
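A judgement record with a binary verdict and a mandatory critique might look like the sketch below. The field names passes and critique come from the revision log; the class name, item_id, and the word-count threshold for a substantive critique are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_CRITIQUE_WORDS = 8  # hypothetical threshold for a "substantive" critique


@dataclass(frozen=True)
class HookJudgement:
    item_id: str
    passes: bool    # binary outcome, no Likert scale
    critique: str   # names the evidence the hook did or did not ground in


def needs_rereview(judgement: HookJudgement) -> bool:
    """Flag judgements whose critique is too thin to audit, pass or fail."""
    return len(judgement.critique.split()) < MIN_CRITIQUE_WORDS
```

A word count is a crude proxy for substance; it only catches critiques like "Looks good." and says nothing about whether a longer critique actually cites the input.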
Reading judge-mediated metrics across modes
GPT-5 is meaningfully stricter than the Sonnet judge on integrated (10-25pp lower grounding rate; less so on chat). Integrated produces more claims with more specificity, including canonical-label normalisations like "B2B SaaS workflow automation", and GPT-5 marks more of them insufficiently supported by the source quote. Since the headline takes the lower of the two, this can pull integrated's headline grounding below chat's on the same run. Cohen's kappa in the 0.2-0.4 range is itself the signal: the judges genuinely disagree on borderline cases.
Hook pass rate is scored only over items where a hook was extractable. Chat refusals on adversarial inputs (jailbreak, fake rubric) produce no structured hook to judge and drop out of the chat denominator, which can inflate chat's apparent pass rate. The deterministic metrics (classification accuracy, action accuracy, substring grounding, per-field accuracy) score every item including refusals, so they're the load-bearing cross-mode comparison. The judge-mediated metrics report what the scorable subset looks like.
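The denominator effect described above is easy to see with a small sketch. The verdict lists are made up for illustration and are not scorecard results; hook_pass_summary is a hypothetical helper.

```python
from typing import Optional, Sequence


def hook_pass_summary(verdicts: Sequence[Optional[bool]]) -> dict:
    """Summarise hook verdicts; None marks an item with no extractable hook."""
    judged = [v for v in verdicts if v is not None]
    return {
        "pass_rate": sum(judged) / len(judged) if judged else None,
        "scored": len(judged),
        "total": len(verdicts),
    }


# Toy data: same five items in each mode, two of them adversarial.
integrated = [True, True, False, False, False]  # every item yields a hook
chat = [True, True, None, None, False]          # two refusals drop out

integrated_summary = hook_pass_summary(integrated)
chat_summary = hook_pass_summary(chat)
```

Here both modes pass the same two items, yet chat reports 2 of 3 while integrated reports 2 of 5. Publishing the scored and total counts next to the rate makes that difference in denominators visible.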
Live inference is Anthropic-only. Cross-provider judging is offline-only, run in the eval harness for the scorecard.
8. Robustness methodology
Each base item gets three perturbed variants. All perturbations are deterministic (seeded by item id) so the same eval run is reproducible from the dataset alone.
- typos: ~6% per-word character noise (swap, drop, duplicate).
- sentence_reorder: swaps neighbouring sentences inside paragraphs. Preserves source quote substrings by never editing inside a sentence. An LLM-driven paraphrase variant is on the v1.1 list; the current label reflects what the perturbation actually does.
- injection: appends a generic instruction-override probe to the company text (or profile when no company).
Reported drops are in classification accuracy, fit-score correlation, and claim-grounding rate vs the main pass. The injection variant is the most likely to surface a real failure mode for the integrated build. A model citing injection text in a claim quote passes the substring check while still being semantically ungrounded. The eval-and-fix loop on the scorecard surfaces incidents like this with both pre-fix and post-fix snapshots committed.
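A deterministic sentence_reorder seeded by item id could be sketched as follows. This is an assumption-laden illustration rather than the repo's implementation: it treats the text as one paragraph, swaps a single neighbouring pair, splits sentences with a naive regex, and the helper names are hypothetical.

```python
import hashlib
import random
import re


def seeded_rng(item_id: str, variant: str) -> random.Random:
    """Derive a stable RNG from the item id.

    hashlib is used because Python's built-in hash() of a str varies
    between processes, which would break reproducibility.
    """
    digest = hashlib.sha256(f"{item_id}:{variant}".encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))


def sentence_reorder(text: str, item_id: str) -> str:
    """Swap one pair of neighbouring sentences without editing inside them."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return text  # nothing to swap
    rng = seeded_rng(item_id, "sentence_reorder")
    i = rng.randrange(len(sentences) - 1)
    sentences[i], sentences[i + 1] = sentences[i + 1], sentences[i]
    return " ".join(sentences)
```

Because sentences are moved whole, any source quote that sits inside a single sentence is still a verbatim substring of the perturbed text. A quote that spans a sentence boundary would not survive, which is a limit of this sketch.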
9. Cross-mode comparison framing
The scorecard is a bundle-vs-bundle comparison. The two builds differ in four ways at once:
- Output contract: strict enrich_lead tool schema (integrated) vs free-form prose (chat).
- Grounding requirement: per-claim source_quote required by the schema (integrated) vs no structural requirement (chat).
- Inference shape: single structured call (integrated) vs multi-turn chat plus a Haiku 4.5 extractor pass that maps prose back to the schema (chat).
- Extended thinking: 4000-token budget on (integrated) vs off (chat).
The scorecard treats these as one bundle because that is what a product team actually ships. Chat surfaces do not typically expose thinking deltas in the UI, since the affordance does not fit. Structured backends often do. The realistic comparison is product-surface-A vs product-surface-B, not contract-vs-contract under matched compute. The chat build is polished to the same product standard as integrated; neither side is a strawman.
A stricter version of this question (“does the schema discipline win even with thinking equal?”) would run both modes with thinking matched (both on, or both off). That is a separate, sharper claim that the current eval does not make. It is on the roadmap as a second panel; the result would either reinforce the contract argument or expose how much of the delta is thinking.
The chat build's free-form output is parsed back to the structured schema by the Haiku 4.5 extractor pass mentioned above, so both modes score against the same gold labels. The extractor sees the chat output only, never the original input, so it cannot hallucinate grounding. If the chat reply omitted a source quote, no extracted claim has one either.
10. Operations
- Eval runner: Python script in services/eval/. Live (non-batch) inference with bounded asyncio concurrency for true per-call latency. Batch API is available as a future toggle when test-set size grows.
- Regression gate runs on every PR that touches eval-relevant paths (cheap anchor subset, no judges). Canonical scorecard snapshots refresh on demand via workflow_dispatch, not on a fixed schedule. The demo doesn't ship daily, so a time-based cron mostly re-runs identical experiments.
- Results commit as a JSON snapshot to data/eval_runs/latest.json plus a timestamped historical copy.
- Inter-judge Cohen's kappa for grounding is reported each run.
- Methodology page names every metric, sample size, and judge model used.
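Bounded asyncio concurrency with in-semaphore timing can be sketched as below. The runner's real structure is not shown in this document, so timed_call, run_eval, and the call_model parameter are hypothetical, and call_model stands in for whatever coroutine performs the model request.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def timed_call(
    semaphore: asyncio.Semaphore,
    call_model: Callable[[str], Awaitable[object]],
    item: str,
) -> float:
    """Return wall-clock seconds for one call, excluding time spent queued."""
    async with semaphore:
        start = time.perf_counter()  # clock starts only once a slot is free
        await call_model(item)
        return time.perf_counter() - start


async def run_eval(
    items: List[str],
    call_model: Callable[[str], Awaitable[object]],
    max_concurrency: int = 4,
) -> List[float]:
    """Run all items with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = (timed_call(semaphore, call_model, item) for item in items)
    return list(await asyncio.gather(*tasks))


async def fake_model(item: str) -> str:
    """Stand-in for a model call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.02)
    return item
```

Starting the timer inside the async with block is the design choice that matters: with max_concurrency=1 the third item waits for the first two, but its recorded latency still covers only its own call. A run looks like asyncio.run(run_eval(["a", "b", "c"], fake_model, max_concurrency=1)).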
11. Telemetry vs adoption
The demo instruments prospect-funnel signals (own-input-pasted, scorecard-clicked-from-result, page-bounce-after-low-fit, contact-link-clicked-from-result). Funnel data is not adoption data. Real adoption measurement requires real users on a deployed feature over time. The eval scorecard is a cold benchmark on a static test set. It tells you how the system performs on labelled examples, not how it performs on your actual pipeline.
12. Input shape in production
The demo accepts pasted text. Real deployments are triggered by a CRM record landing, a browser extension on LinkedIn, or a CSV bulk import. The paste interface is a deployable proxy that exercises the same model contract; wiring the trigger to a CRM is the v2/services hook.
13. Known limits
- Synthetic test set. 73 hand-labelled items; generalisation to real pipelines is unmeasured.
- Single-labeller bias. Solo-labelled in v1. v1.1 adds a peer second-pass on fit-score and claims-allowed labels.
- Judge subjectivity on the hook pass/fail decision. Single judge by design; binary outcome plus mandatory critique limits drift but doesn't eliminate it. Hook judgements with thin critiques are surfaced in the judgements list for re-review.
- Fictional ICP. Real customer ICPs differ; the structured schema accepts custom ICPs (see below) but the current scorecard is anchored to this one.
- Supported-language scope. English, Swedish, German. Other languages return an out-of-scope response.
- No adoption data. Funnel telemetry exists; usage measurement on a deployed product does not.
14. Custom ICP, v2 hook
v1 ships a fixed ICP. A v2 deployment accepts a customer-supplied ICP. Two shapes the schema already supports:
- Structured ICP with concrete ranges for stage, headcount, and ARR plus natural-language predicates for product shape and target roles. Structured fields run through deterministic range checks; predicates run through anchored 0/0.25/0.5/0.75/1.0 rubrics that the model evaluates per call. Customers calibrate rubric anchors against five to ten labelled examples from their own pipeline.
- Hybrid ICP that pulls structured fields from the customer's CRM (HubSpot, Salesforce, Attio) and uses predicates for soft criteria. Rubric anchors get versioned alongside the predicate text so changes are auditable.
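The deterministic half of a structured ICP, the range checks on stage, headcount, and ARR, might be sketched like this. The class and function names are hypothetical, the ranges are assumed to be inclusive, and the natural-language predicates for product shape and target roles are left out because they go through the model-evaluated rubrics.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class StructuredICP:
    stages: FrozenSet[str]
    headcount: Tuple[int, int]      # assumed inclusive range
    arr_usd: Tuple[float, float]    # assumed inclusive range, in USD


def passes_range_checks(
    icp: StructuredICP, stage: str, headcount: int, arr_usd: float
) -> bool:
    """Deterministic checks only; soft predicates are scored elsewhere."""
    return (
        stage in icp.stages
        and icp.headcount[0] <= headcount <= icp.headcount[1]
        and icp.arr_usd[0] <= arr_usd <= icp.arr_usd[1]
    )


# The fixed v1 ICP from section 2, expressed in this shape.
V1_ICP = StructuredICP(
    stages=frozenset({"Series A", "Series B", "Series C"}),
    headcount=(20, 250),
    arr_usd=(2_000_000, 50_000_000),
)
```

A customer-supplied ICP would swap in its own ranges, for example passes_range_checks(V1_ICP, "Series B", 80, 12_000_000) for a mid-range lead.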
v2 is where this becomes a feature wired into a customer's revenue stack. Book a call from the homepage if you want that conversation.