Item 65 · edge

Edge Very Long Strong

scenario edge_very_long_strong

Input

Theo Davenport
Head of AI at Clauselink

I came to AI through a fairly traditional NLP research path. I did my PhD at Cascadia State University on information extraction from semi-structured documents, then joined Helix Labs in Westchester for three years, where I worked on entity resolution pipelines for financial regulatory filings. The work was rigorous and I learned a lot about what production NLP actually requires — namely, that precision matters more than recall when the downstream consumer is a human who has to act on the output.

From Helix Labs I moved to a Series B startup called Contegra, which built compliance automation tooling for insurance carriers. I led the NLP team there, growing it from two engineers to eight over two years. We built a clause-level extraction system on top of BERT-era models, and I made the call early to use ensemble methods rather than fine-tuning a single large model, mostly because our labeled data was thin and the ensembles were more auditable, something insurance clients cared about deeply. We got acquired by a larger RegTech firm in 2021, and I stayed through the integration for about eight months before deciding the pace wasn't right for me.

After Contegra I spent eighteen months at Verdigris AI as a member of technical staff focused on document understanding applications, which is where I got deep exposure to the current generation of large language models and their failure modes in structured extraction tasks. I worked closely with the evals team to build benchmarks for multi-page document comprehension, and that experience shaped how I think about measuring AI quality in domains where ground truth is genuinely ambiguous — something that comes up constantly in legal text.

I joined Clauselink about two years ago because the contract review problem is one of the hardest document AI problems I know of. Legal language is adversarially ambiguous by design. My team currently has four ML engineers, two research engineers, and a data annotation lead. We run a hybrid architecture: a retrieval-augmented pipeline handles clause identification and cross-reference resolution, while a fine-tuned LLM layer handles risk classification and deviation flagging against customer-specific playbooks. The hardest ongoing challenge is maintaining accuracy as customer playbooks diverge — we've been investing heavily in per-customer adapter layers rather than full model retraining, which has meaningfully reduced our update latency.

My current focus is on two things: getting our confidence calibration to a point where the product can surface reliable uncertainty signals to reviewers, and expanding coverage from MSAs and NDAs into more bespoke agreement types like JVs and IP licensing. Both require better training data than we have, so a non-trivial part of my time goes into annotation infrastructure and quality control.

Outside of Clauselink, I advise two early-stage startups in the document intelligence space, I contribute intermittently to the open-source RetrieveStack project, and I speak at legal tech conferences a few times a year, mostly on the gap between benchmark performance and real-world production behavior in legal AI systems.

Clauselink is a B2B SaaS company focused on AI-powered contract review for enterprise legal and procurement teams. The platform enables legal professionals to extract, classify, and flag risk across commercial agreements at scale, with support for customer-specific playbooks that reflect individual negotiation standards. Clauselink's AI layer combines retrieval-augmented extraction with fine-tuned classification models, providing reviewers with structured risk summaries and deviation alerts rather than raw model output. The company serves mid-market and enterprise customers in sectors including technology, financial services, and professional services. Clauselink has raised a Series C round and is expanding its coverage beyond standard agreement types into more complex deal structures.

Gold

Expected action: auto_add
Fit score: 0.92
Industry: B2B SaaS
Segment: Legaltech / AI contract review
Seniority: Director
Company size: —
Language: en
Notes: Very long bio but rich with structural signals. Headcount not explicitly stated in profile or company; Series C scale supports in-range inference but a labeller could ask for verification.

Predictions

Integrated · miss

action propose, expected auto_add
classification miss: segment, company_size

Action: propose ✗ (gold: auto_add)
Fit: 0.79 (gold 0.92, |Δ| 0.13)
Industry: B2B SaaS ✓
Segment: Legal AI / Contract Review SaaS ✗
Seniority: Director ✓
Company size: 51-200 ✗
Grounding: 7/7 claims grounded (100.0%)
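
The scorecard above can be reproduced with a few lines of Python. This is a minimal sketch, not the grader's actual code: it assumes classification fields are compared by exact string match and that the fit gap is rounded to two decimals, neither of which is stated in this record. The names `score_prediction`, `gold`, and `pred` are hypothetical.

```python
def score_prediction(gold, pred, grounded, total):
    """Compare one prediction against the gold labels (hypothetical helper)."""
    fields = ["industry", "segment", "seniority", "company_size"]
    # Assumption: a field is a miss unless the two values match exactly.
    misses = [f for f in fields if pred[f] != gold[f]]
    return {
        "action_ok": pred["action"] == gold["action"],
        "fit_delta": round(abs(pred["fit"] - gold["fit"]), 2),
        "misses": misses,
        "grounding_pct": round(100.0 * grounded / total, 1) if total else 0.0,
    }


# Values copied from the Gold block and the Integrated prediction.
gold = {
    "action": "auto_add",
    "fit": 0.92,
    "industry": "B2B SaaS",
    "segment": "Legaltech / AI contract review",
    "seniority": "Director",
    "company_size": None,  # gold lists no company size
}
pred = {
    "action": "propose",
    "fit": 0.79,
    "industry": "B2B SaaS",
    "segment": "Legal AI / Contract Review SaaS",
    "seniority": "Director",
    "company_size": "51-200",
}

result = score_prediction(gold, pred, grounded=7, total=7)
print(result)
```

Under the exact-match assumption, any predicted company size counts as a miss when gold leaves the field empty, which is consistent with the `company_size` miss reported for both predictions.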

Claims (7)

Clauselink is a B2B SaaS company with a live, user-facing AI product combining retrieval-augmented extraction and fine-tuned classification models.
✓ in input Clauselink is a B2B SaaS company focused on AI-powered contract review for enterprise legal and procurement teams. The platform enables legal professionals to extract, classify, and flag risk across commercial agreements at scale
Clauselink has raised a Series C round, placing it squarely within the ICP funding stage range.
✓ in input Clauselink has raised a Series C round and is expanding its coverage beyond standard agreement types into more complex deal structures.
Theo Davenport holds the Head of AI title at Clauselink, an exact match for the ICP target role.
✓ in input Head of AI at Clauselink
The AI architecture in production spans a hybrid RAG pipeline and a fine-tuned LLM layer for risk classification against customer-specific playbooks.
✓ in input a retrieval-augmented pipeline handles clause identification and cross-reference resolution, while a fine-tuned LLM layer handles risk classification and deviation flagging against customer-specific playbooks
Theo's current technical priority is confidence calibration — surfacing reliable uncertainty signals to human reviewers.
✓ in input getting our confidence calibration to a point where the product can surface reliable uncertainty signals to reviewers
The team is investing in per-customer adapter layers to reduce update latency without full model retraining.
✓ in input we've been investing heavily in per-customer adapter layers rather than full model retraining, which has meaningfully reduced our update latency
Theo directly manages a team of four ML engineers, two research engineers, and a data annotation lead.
✓ in input My team currently has four ML engineers, two research engineers, and a data annotation lead.

Draft hook

Hi Theo — your note on confidence calibration caught my eye: getting a production legal-AI system to 'surface reliable uncertainty signals to reviewers' is one of the harder alignment problems in the space, especially when ground truth in legal text is genuinely ambiguous. Given that you're running a hybrid RAG + fine-tuned LLM stack and managing per-customer adapter layers at Series C scale, I'd love to share how we've seen teams at a similar stage tackle calibration without sacrificing the audit-ability your insurance-era work likely still informs. Worth a 20-minute conversation?

Reasoning

**Stage** (1.0): "Clauselink has raised a Series C round" is an exact ICP match.

**Role** (1.0): "Head of AI at Clauselink" is listed verbatim in the ICP target roles (Head of AI / Head of ML). Perfect match.

**Product shape** (1.0): Clauselink is explicitly "a B2B SaaS company" with a live hybrid RAG + fine-tuned LLM architecture delivering user-facing AI output. Textbook ICP product shape.

**Headcount** (0.65): No explicit company headcount is given. The ML org alone is 7 people ("four ML engineers, two research engineers, and a data annotation lead"). For a Series C enterprise SaaS company, total headcount plausibly falls in the 51–200 ICP range, but this is an inference, not a confirmed figure, hence a discounted score.

**ARR** (0.55): No ARR figure is disclosed. Series C enterprise SaaS can span a wide range; there is meaningful probability the company's ARR exceeds the $50M ICP ceiling, particularly given the enterprise customer base. Scored at the midpoint to reflect genuine uncertainty in both directions.

**Holistic score (0.79)**: The three confirmable dimensions (stage, role, product shape) are all perfect 1.0 matches. The two uncertain dimensions (headcount, ARR) prevent a higher holistic score. The result sits just inside the [0.50, 0.80] propose band.

**Action: propose (not auto_add)**: All claims are fully grounded with verbatim quotes, which rules out a second-tier `propose` trigger. The fit score of 0.79 is ≤ 0.80, so the auto_add threshold is not met. Under the evaluation order, the grounded-claims + [0.50, 0.80] rule fires → `propose`. A human reviewer should confirm company headcount and ARR before auto-adding; if those fall inside the ICP range, this lead would be a very strong fit.
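
The action rule described in this reasoning can be sketched as follows. This is an illustrative reading, not the pipeline's actual code: the 0.80 cutoff and the [0.50, 0.80] propose band come from the text above, while the function name `decide_action`, the requirement that auto_add also needs fully grounded claims, and the `None` return for cases the reasoning does not describe are all assumptions.

```python
AUTO_ADD_THRESHOLD = 0.80  # from the reasoning: fit must exceed 0.80
PROPOSE_FLOOR = 0.50       # lower edge of the [0.50, 0.80] propose band


def decide_action(fit, grounded, total):
    """Map a fit score and grounding counts to an action (hypothetical helper)."""
    all_grounded = total > 0 and grounded == total
    if not all_grounded:
        # The reasoning mentions a second-tier propose trigger for this case
        # but does not spell it out, so the sketch leaves it undecided.
        return None
    if fit > AUTO_ADD_THRESHOLD:
        return "auto_add"
    if fit >= PROPOSE_FLOOR:
        return "propose"
    # Behavior below 0.50 is not stated in this record.
    return None


print(decide_action(0.79, 7, 7))  # Integrated prediction
print(decide_action(0.85, 6, 6))  # Chat prediction
print(decide_action(0.92, 7, 7))  # gold fit score
```

Read this way, the Integrated action miss follows directly from the fit gap: the same rule applied to the gold fit of 0.92 yields `auto_add`, while 0.79 falls one hundredth short of the cutoff.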

Chat · miss

classification miss: segment, company_size

Action: auto_add ✓
Fit: 0.85 (gold 0.92, |Δ| 0.07)
Industry: B2B SaaS ✓
Segment: Legal AI / Contract Review ✗
Seniority: Director ✓
Company size: 51-200 ✗
Grounding: 6/6 claims grounded (100.0%)

Claims (6)

Current focus on confidence calibration — getting reliable uncertainty signals to surface to reviewers
✓ in input My current focus is on two things: getting our confidence calibration to a point where the product can surface reliable uncertainty signals to reviewers
Annotation infrastructure and data quality is a significant time investment and bottleneck
✓ in input a non-trivial part of my time goes into annotation infrastructure and quality control
Has invested in per-customer adapter layers as alternative to full model retraining to handle playbook divergence
✓ in input we've been investing heavily in per-customer adapter layers rather than full model retraining, which has meaningfully reduced our update latency
Actively expanding coverage from MSAs/NDAs to more complex agreement types (JVs, IP licensing), which requires better training data
✓ in input expanding coverage from MSAs and NDAs into more bespoke agreement types like JVs and IP licensing. Both require better training data than we have
Speaks publicly on gap between benchmark performance and production behavior in legal AI systems
✓ in input I speak at legal tech conferences a few times a year, mostly on the gap between benchmark performance and real-world production behavior in legal AI systems
Deep technical expertise in production NLP and document AI with 10+ years experience, not a generalist
✓ in input I came to AI through a fairly traditional NLP research path. I did my PhD at Cascadia State University on information extraction from semi-structured documents, then joined Helix Labs in Westchester for three years

Draft hook

Hi Theo — I've been following your work on the gap between benchmark performance and production behavior in legal AI, and your framing of that problem resonates with what we're seeing from teams in similar positions. The confidence calibration challenge you're focused on is one of the harder ones in human-in-the-loop review systems — especially when ground truth is genuinely ambiguous, which in legal text it often is by design. I'd be curious whether the bottleneck for you right now is on the model output side, the annotation side, or in how uncertainty gets surfaced in the product UI. Worth a 20-minute conversation to compare notes? Happy to share what we're seeing from other teams working on similar problems.