
Eval scorecard

Integrated vs chat, scored against the same gold labels

73 test items, Sonnet 4.6 held constant across both modes (thinking on for integrated, off for chat). Last run Thu, 14 May 2026 13:01:10 GMT from commit a8f897df. Test set 1.0.

Headline

Classification accuracy

Per-item overall match across industry, segment, seniority, company_size.

Integrated
61.6%
Chat
47.8%

Δ 13.9 pp

Action accuracy

Does the model pick the gold action (auto_add/propose/discard/refuse)?

Integrated
80.8%
Chat
64.2%

Δ 16.6 pp

Fit score, Spearman

Spearman correlation between predicted and gold fit_score.value.

Integrated
0.78
Chat
0.75

Δ 0.03
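For readers who want to reproduce the fit-score correlation, here is a minimal standard-library sketch of Spearman's rank correlation: the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The function names are illustrative, and the eval harness may well use a library routine instead.

```python
def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(pred, gold):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(gold))
```

Because only ranks matter, a model whose scores are monotonically related to gold but on a different scale still scores 1.0 here, which is why the scorecard reports Pearson and MAE separately in the per-mode breakdown.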

Substring grounding

Share of claims whose source_quote appears verbatim in the input.

Integrated
99.3%
Chat
93.4%

Δ 5.9 pp
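The substring check is mechanical, so it can be sketched in a few lines. This version assumes each claim is a dict with a `source_quote` key and that "verbatim" means an exact, case-sensitive match with no whitespace normalization; the real harness may normalize differently.

```python
def substring_grounding_rate(claims, input_text):
    """Share of claims whose source_quote occurs verbatim in input_text.

    Assumption: a missing or empty quote counts as ungrounded.
    """
    if not claims:
        return 0.0
    grounded = sum(
        1
        for claim in claims
        if claim.get("source_quote") and claim["source_quote"] in input_text
    )
    return grounded / len(claims)
```

A quote that differs from the input by a single character or by letter case fails this check, which is the point: it catches paraphrased "quotes" that a semantic judge might let through.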

Judge grounding (headline)

Lower of (Opus 4.7 grounding rate, GPT-5 grounding rate), the conservative read.

Integrated
68.8%
Chat
79.9%

Δ -11.1 pp
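The headline figure can be checked against the per-judge rates listed in the per-mode breakdown. The sketch below assumes the headline is simply the minimum of the two aggregate rates per mode; the function name is illustrative.

```python
def headline_grounding(opus_rate, gpt_rate):
    """Conservative read: report whichever judge is harsher."""
    return min(opus_rate, gpt_rate)


# Per-judge rates as listed in the per-mode breakdown (percent).
integrated = headline_grounding(93.1, 68.8)
chat = headline_grounding(91.5, 79.9)
```

With these inputs the harsher judge is the OpenAI one in both modes, so the headline gap between modes comes entirely from that judge.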

Hook pass rate

Single GPT-5-mini judge. Binary pass/fail with critique, no scales.

Integrated
68.1%
Chat
84.5%

Δ -16.4 pp

Latency p50

Per-call wall clock on the live eval pass (non-batch).

Integrated
30.98s
Chat
19.66s

Δ 11.32s

Latency p95

Per-call wall clock, 95th percentile.

Integrated
40.29s
Chat
45.28s

Δ -4.99s
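The scorecard does not say which percentile method produced p50 and p95. One common choice is the nearest-rank method, sketched below; treat it as an assumption, since interpolating methods give slightly different values on small samples such as 73 calls.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Reading p50 and p95 together matters here: integrated is slower on the typical call but has the shorter tail, which a mean alone would hide.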

Chat extractor complete: 91.8%. Chat ran up to three user turns; the loop stops as soon as the Haiku 4.5 extractor reports every gold-shape field present. Average turns used: 1.40. Cap-hit rate (3 turns without completion, scored as failure): 4.1%.
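The turn loop described above can be sketched as follows. `send_user_turn` and `extract_fields` are hypothetical stand-ins for the chat call and the Haiku 4.5 extractor, and "present" is assumed to mean a non-None value for every required field.

```python
from typing import Callable

MAX_TURNS = 3  # cap stated in the scorecard


def run_chat_item(
    send_user_turn: Callable[[int], str],
    extract_fields: Callable[[str], dict],
    required_fields: tuple,
) -> dict:
    """Run up to MAX_TURNS user turns, stopping early once the extractor
    reports every required field. Hitting the cap is scored as a failure."""
    transcript = ""
    fields: dict = {}
    for turn in range(1, MAX_TURNS + 1):
        transcript += send_user_turn(turn)
        fields = extract_fields(transcript)
        if all(fields.get(name) is not None for name in required_fields):
            return {"complete": True, "turns": turn, "fields": fields}
    return {"complete": False, "turns": MAX_TURNS, "fields": fields}
```

Averaging `turns` over all items would give the "average turns used" figure, and the share of results with `complete` set to False would give the cap-hit rate.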

Eval-and-fix loop

Dated incidents where an eval-pass failure was diagnosed and a shipped fix moved the affected metric. Both pre-fix and post-fix snapshots stay in the repo so prospects can audit the loop, not only the latest number.

  1. 2026-05-13
    classification_per_field.industry

    Failure: First full eval revealed the integrated build's industry classification was correct on only 24/73 items (32.9%). The tool schema declared `industry` as a free-form string and the system prompt provided no canonical vocabulary, so the model emitted semantically-right-but-strictly-wrong labels (e.g. 'SaaS', 'B2B Software') instead of the gold's 'B2B SaaS'. Gold has 8 distinct industry values across the 73-item set, with 'B2B SaaS' covering 63% of items.

    Fix: Added an 8-value enum to `industry` on the `enrich_lead` tool schema (verbatim gold vocabulary: B2B SaaS, Consumer software, Consumer hardware, Consumer / B2B SaaS hybrid, Professional services, Professional services with software ambitions, Manufacturing, Insufficient signal) plus a one-line system-prompt instruction naming 'Insufficient signal' as the catch-all. The chat build was deliberately not modified. Its industry improvement (38.2% → 88.3%) is a methodology effect: the Haiku extractor shares the same tool schema, so chat's free-text industry labels now normalize to the canonical vocabulary at extraction time. In short, the integrated lift is model-side and the chat lift is extraction-side.

    32.9% → 94.5%
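To make the fix concrete, here is a sketch of what the constrained property might look like as a JSON Schema fragment, written as a Python dict. Only the eight enum values come from the incident note; the surrounding shape of the `enrich_lead` schema is not shown in the scorecard, so the fragment and the helper name are illustrative.

```python
# The eight gold industry labels, verbatim from the incident note.
INDUSTRY_VALUES = [
    "B2B SaaS",
    "Consumer software",
    "Consumer hardware",
    "Consumer / B2B SaaS hybrid",
    "Professional services",
    "Professional services with software ambitions",
    "Manufacturing",
    "Insufficient signal",  # catch-all named in the system prompt
]

# Hypothetical property fragment: before the fix this was a bare string type.
INDUSTRY_PROPERTY = {"type": "string", "enum": INDUSTRY_VALUES}


def is_canonical_industry(label):
    """True only for an exact match against the gold vocabulary."""
    return label in INDUSTRY_PROPERTY["enum"]
```

Under this constraint the pre-fix labels 'SaaS' and 'B2B Software' are rejected outright instead of being scored as near misses, which is what forces the model onto 'B2B SaaS'.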

Per-mode breakdown

Integrated

Success rate
100.0%
Classification: industry
94.5%
Classification: segment
20.5%
Classification: seniority
83.6%
Classification: company_size
71.2%
Fit Pearson
0.85
Fit MAE
0.100
Action accuracy
80.8%
Refuse-when-should
6/9
Adversarial pass
76.2% (n=21)
Substring grounding
99.3%
Judge grounding (Opus)
93.1%
Judge grounding (OpenAI)
68.8%
Inter-judge kappa
0.26
Hook pass rate
68.1% (n=72)
Tokens in p50 / p95
801 / 1085
Tokens out p50 / p95
2103 / 2916
Per-dimension correlation
stage_match
r = 0.80 · MAE 0.097
headcount_match
r = 0.78 · MAE 0.120
arr_match
r = 0.72 · MAE 0.292
product_shape_match
r = 0.85 · MAE 0.088
role_match
r = 0.76 · MAE 0.132

Chat

Success rate
91.8%
Classification: industry
85.1%
Classification: segment
13.4%
Classification: seniority
73.1%
Classification: company_size
61.2%
Fit Pearson
0.84
Fit MAE
0.108
Action accuracy
64.2%
Refuse-when-should
1/8
Adversarial pass
82.4% (n=17)
Substring grounding
93.4%
Judge grounding (Opus)
91.5%
Judge grounding (OpenAI)
79.9%
Inter-judge kappa
0.33
Hook pass rate
84.5% (n=58)
Tokens in p50 / p95
712 / 2459
Tokens out p50 / p95
819 / 2105
Per-dimension correlation
stage_match
r = 0.86 · MAE 0.081
headcount_match
r = 0.85 · MAE 0.100
arr_match
r = 0.84 · MAE 0.178
product_shape_match
r = 0.86 · MAE 0.102
role_match
r = 0.78 · MAE 0.139
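Both breakdowns report an inter-judge kappa (0.26 for integrated, 0.33 for chat). The scorecard does not name the variant; the sketch below assumes Cohen's kappa over binary per-claim verdicts from the two judges, with 1 meaning grounded and 0 meaning not grounded.

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length lists of binary (0/1) verdicts."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(judge_a)
    # Observed agreement: share of items where both judges agree.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's own base rate of saying 1.
    rate_a = sum(judge_a) / n
    rate_b = sum(judge_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return 1.0  # both judges are constant and identical
    return (observed - expected) / (1 - expected)
```

Kappa discounts agreement that would occur by chance, so two judges who both pass most claims can agree on a large share of items and still land at a low kappa like the values reported here.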

Robustness

Three perturbation variants per base item: typos (per-word noise), sentence_reorder (neighbouring-sentence swaps), and injection (a probe appended to the input). The table reports classification accuracy and substring grounding rate under each variant; compare these against the main pass to read the drop.

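| Variant | n | Integrated classification | Chat classification | Integrated grounding | Chat grounding |
| --- | --- | --- | --- | --- | --- |
| typos | 73 | 64.4% | 52.3% | 93.1% | 91.2% |
| sentence_reorder | 73 | 63.0% | 47.7% | 99.6% | 96.2% |
| injection | 73 | 64.4% | 55.6% | 99.7% | 95.1% |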
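The three perturbation variants could be generated along these lines. This is a sketch under stated assumptions: the typo model (swapping two adjacent characters), the 0.5 swap probability, the sentence splitter, and all function names are illustrative and not taken from the eval harness.

```python
import random
import re


def add_typos(text, rate, rng):
    """Per-word noise: swap two adjacent characters in some longer words."""
    noisy = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)


def reorder_sentences(text, rng):
    """Swap some neighbouring sentence pairs, leaving the rest in place."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    i = 0
    while i + 1 < len(sentences):
        if rng.random() < 0.5:
            sentences[i], sentences[i + 1] = sentences[i + 1], sentences[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return " ".join(sentences)


def append_injection(text, probe):
    """Append an injection probe after the original input."""
    return f"{text}\n\n{probe}"
```

Passing a seeded `random.Random` instance keeps the perturbed set reproducible between runs, which matters when pre-fix and post-fix snapshots are compared.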

By test-set kind

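| Kind | n | Integrated action | Chat action | Integrated classification | Chat classification |
| --- | --- | --- | --- | --- | --- |
| exemplar | 5 | 100.0% | 100.0% | 100.0% | 80.0% |
| synthetic | 36 | 83.3% | 58.8% | 69.4% | 58.8% |
| edge | 12 | 83.3% | 75.0% | 33.3% | 25.0% |
| adversarial | 20 | 70.0% | 56.3% | 55.0% | 31.3% |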

Failure modes

Items where the predicted action, classification, or grounding differs from gold. Surfaced as concrete misses so we don't claim aggregate numbers without owning the failures behind them.

Integrated misses (39)

Chat misses (60)