Methodology

How we benchmarked Climb-Wren against raw frontier AI on SAT® teaching tasks. Full numbers, anti-cherry-picked, reproducible.

Measured

Even our worst run beats every raw frontier model

We benchmarked Climb-Wren against four raw frontier AIs: Anthropic's claude-sonnet-4-5 (Climb's model), Anthropic's claude-opus-4-7 (the flagship, and the most expensive model on the market), OpenAI's gpt-5 (behind ChatGPT), and Google's gemini-2.5 (Flash tier). Four independent eval runs. The numbers below report each condition's worst single run: an anti-cherry-picked floor. If a parent or competitor re-runs the benchmark, they should expect to land at or above what we publish here.

Shape-appropriate teaching · regex-scored · 17 mode scenarios · worst of 4 runs

2.65 vs 0.35

Out of 3. Climb-Wren's worst run versus raw Opus 4.7's worst run. Opus is Anthropic's flagship and the most expensive frontier model on the market. Climb beats it by a factor of about 7.6× (2.65 ÷ 0.35), on a model class that costs 5× less. The margin over raw GPT-5 (worst 0.53) and raw Gemini-2.5 (worst 0.53) is +2.12 points. Raw Opus is the worst of the four raw models on this dimension: the most expensive model is not the best at unaided tutoring shape.

Mode-shape · un-fakeable

2.65 / 3 worst of 4 runs

+2.12 vs raw-Sonnet (0.53) · +2.30 vs raw-Opus (0.35) · +2.12 vs raw-GPT-5 (0.53) · +2.12 vs raw-Gemini (0.53)

Pure regex over the reply text. Worked example? Numbered steps. Diagnose? Ends with a question. Acknowledge? ≤ 2 sentences, no follow-up. Reassure-reset? Opens with a pause cue. No LLM judge — no judge bias. The most defensible single number we have, and the one we lead with.
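A minimal sketch of what regex-only shape checks like these could look like. The patterns, mode identifiers, and function names below are hypothetical; they are not taken from the ClimbEval harness. The page also does not state the rule for the explain mode, or how a check maps onto the 0 to 3 scale, so the sketch only returns pass or fail and refuses modes it has no rule for.

```python
import re

# Hypothetical patterns: the page names the checks, not the actual regexes.
NUMBERED_STEP = re.compile(r"^\s*(?:\d+[.)]|step\s+\d+)", re.MULTILINE | re.IGNORECASE)
PAUSE_CUE = re.compile(r"^\s*(?:let's pause|take a breath|hold on)", re.IGNORECASE)
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def count_sentences(reply: str) -> int:
    """Count terminal punctuation runs followed by whitespace or end of text."""
    return len(SENTENCE_END.findall(reply.strip()))


def shape_ok(mode: str, reply: str) -> bool:
    """Return True when the reply has the shape expected for the mode."""
    text = reply.strip()
    if mode == "worked_example":
        # Numbered steps: at least two lines that start like "1." or "Step 2".
        return len(NUMBERED_STEP.findall(text)) >= 2
    if mode == "diagnose":
        # Must end with a question.
        return text.endswith("?")
    if mode == "acknowledge":
        # At most two sentences and no follow-up question.
        return count_sentences(text) <= 2 and "?" not in text
    if mode == "reassure_reset":
        # Must open with a pause cue.
        return PAUSE_CUE.match(text) is not None
    raise ValueError(f"no shape rule sketched for mode {mode!r}")
```

Because each check is a deterministic function of the reply text, two people scoring the same transcript get the same result, which is the property the page relies on for this dimension.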

Pedagogical quality

1.86 / 3 worst of 4 runs

+0.54 vs raw-Sonnet (1.32) · +0.91 vs raw-Opus (0.95) · +0.77 vs raw-GPT-5 (1.09) · +0.86 vs raw-Gemini (1.00)

Is the teaching move appropriate to the moment? Does the student walk away understanding something they didn't a minute ago? LLM-judged against a rubric, and disclosed because judge-mediated scoring has bias risk. The 4-run floor framing absorbs judge variance.

Error localization

1.41 / 3 worst of 4 runs

+0.18 vs raw-Sonnet (1.23) · +0.27 vs raw-Opus (1.14) · +0.37 vs raw-GPT-5 (1.04) · +0.41 vs raw-Gemini (1.00)

Does the reply name the root-cause subtopic? Climb wins here despite deliberately staying quiet on acknowledge and reassure-reset turns — modes where naming the subtopic is the wrong move.
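The margins quoted for the three dimensions above can be re-derived from the published worst-run floors. The sketch below copies those floors from this page; the dictionary keys and function names are illustrative labels, not identifiers from the harness.

```python
# Worst-of-4 floors as published on this page (scores out of 3).
FLOORS = {
    "mode_shape": {"climb": 2.65, "raw-sonnet": 0.53, "raw-opus": 0.35,
                   "raw-gpt-5": 0.53, "raw-gemini": 0.53},
    "pedagogy": {"climb": 1.86, "raw-sonnet": 1.32, "raw-opus": 0.95,
                 "raw-gpt-5": 1.09, "raw-gemini": 1.00},
    "error_localization": {"climb": 1.41, "raw-sonnet": 1.23, "raw-opus": 1.14,
                           "raw-gpt-5": 1.04, "raw-gemini": 1.00},
}


def margins(dimension: str) -> dict:
    """Climb's floor minus each raw model's floor, rounded to two decimals."""
    scores = FLOORS[dimension]
    return {name: round(scores["climb"] - score, 2)
            for name, score in scores.items() if name != "climb"}


def ratio(dimension: str, baseline: str) -> float:
    """Climb's floor divided by a baseline's floor, rounded to two decimals."""
    scores = FLOORS[dimension]
    return round(scores["climb"] / scores[baseline], 2)
```

Calling `margins("mode_shape")` reproduces the +2.12 and +2.30 figures, and `ratio("mode_shape", "raw-opus")` gives the multiplier behind the headline comparison.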

Climb-Wren on Opus — the ceiling

3.00 / 3 every run perfect

Mode-shape 3.00 / 3 across all 4 runs · Pedagogy 2.09 (worst) → 2.23 (best)

When we route Wren's scaffolding through Opus 4.7 instead of Sonnet 4.5, mode-shape is perfect on every single run. Pedagogy lifts another ~0.1 above Wren-on-Sonnet. The system is what wins — but it scales with the brain. We ship on Sonnet because the gap is small and the price is 5× lower. A premium tier remains an option.

How the measurement works

Twenty-two scenarios — seventeen of them tagged with a teaching mode (worked example, diagnose, explain, acknowledge, reassure-reset), five generic. Three student-history fixtures (forgotten-mastery, baseline-mid, distractor-trap-rusher). For each scenario, six assistants reply: Climb-Wren on Sonnet, Climb-Wren on Opus, plus raw-Sonnet, raw-Opus, raw-GPT-5, and raw-Gemini (no system prompt, no tools, no engine recommendation — just the user message). The Climb-Wren paths also get full session history, the engine's mode recommendation, and tools wired to the database.
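One way to picture the evaluation grid described above is as a list of (scenario, condition) requests. This is a sketch under assumptions: the condition spellings, field names, and request layout are invented for illustration and do not come from the ClimbEval harness. It only encodes what the paragraph states: raw conditions see just the user message, while the Climb-Wren conditions also get session history, the mode recommendation, and tools.

```python
from dataclasses import dataclass
from typing import Optional

# The six conditions named above; the identifier spellings are illustrative.
WREN_CONDITIONS = ("wren-on-sonnet", "wren-on-opus")
RAW_CONDITIONS = ("raw-sonnet", "raw-opus", "raw-gpt-5", "raw-gemini")


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    user_message: str
    fixture: str          # e.g. "forgotten-mastery"
    mode: Optional[str]   # None for the five generic scenarios


def build_request(scenario: Scenario, condition: str) -> dict:
    """Assemble what one assistant sees for one scenario."""
    if condition in RAW_CONDITIONS:
        # Raw baselines: no system prompt, no tools, no recommendation.
        return {"condition": condition, "user_message": scenario.user_message}
    if condition in WREN_CONDITIONS:
        return {
            "condition": condition,
            "user_message": scenario.user_message,
            "session_history": scenario.fixture,
            "mode_recommendation": scenario.mode,
            "tools_enabled": True,
        }
    raise ValueError(f"unknown condition {condition!r}")


def plan_run(scenarios: list) -> list:
    """One request per scenario per condition."""
    conditions = WREN_CONDITIONS + RAW_CONDITIONS
    return [build_request(s, c) for s in scenarios for c in conditions]
```

With 22 scenarios and 6 conditions, `plan_run` yields 132 replies to score per run.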

Four scoring dimensions per reply. One is pure regex (mode-shape) and immune to judge bias. Three rely on an LLM judge against a written rubric, and we disclose this. To absorb judge variance and provider non-determinism, the full eval is run four independent times and we publish the worst per-condition score across runs (4-run floor). A skeptic re-running the benchmark should expect to land at or above these numbers.
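The 4-run floor amounts to taking a minimum over runs. The sketch below assumes each run is summarized as a mapping from condition to per-dimension mean scores, and that the floor is taken separately for each dimension; both the data layout and the function name are assumptions, since the page does not show the harness's aggregation code.

```python
from collections import defaultdict


def floor_scores(runs: list) -> dict:
    """Worst score per (condition, dimension) across independent runs.

    `runs` is a list of {condition: {dimension: mean_score}} mappings,
    one mapping per full eval run.
    """
    floors = defaultdict(dict)
    for run in runs:
        for condition, dimensions in run.items():
            for dimension, score in dimensions.items():
                previous = floors[condition].get(dimension)
                # Keep the lowest score seen so far for this cell.
                floors[condition][dimension] = (
                    score if previous is None else min(previous, score)
                )
    return {condition: dict(cells) for condition, cells in floors.items()}
```

Publishing the minimum rather than the mean or the best run means a lucky run cannot raise the reported number; it can only be lowered by additional runs.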

The system around Climb-Wren is what changes — not the model underneath. Climb ships on Sonnet 4.5. Across all four raw brand baselines and all four runs, the system wins.

Reproducible: swift run ClimbEval --live against the open-source ClimbEval harness. Live API cost per full 6-condition × 22-scenario run: ~$3. Total spend for the 4-run aggregate: ~$13.

What we don't claim

n = 17 mode scenarios × 4 runs is a real sample size, but not a meta-analysis. The worst-of-4 framing is anti-cherry-picked: a skeptic running the eval should expect to land at or above what we publish.
The control conditions are "frontier model alone," not "ChatGPT.com's actual UI." Your kid can paste a question into ChatGPT and get a response. They just won't get one that knows what they got wrong three weeks ago and chose the right teaching shape for this exact moment.
Pedagogical-quality and error-localization scores rely on an LLM judge. Mode-shape (the headline) is regex-scored with no judge involvement — that's where Climb's margin is widest and most defensible.
Gemini was tested on the 2.5-Flash tier (Pro requires a Tier 1 billing promotion that is still propagating). Flash already trails Climb by 2.12 on mode-shape; Pro is the larger model and unlikely to close the gap on shape-appropriate teaching, but we'll re-publish with Pro once it's accessible.
Climb-Wren-on-Opus is the absolute ceiling (perfect mode-shape every run), but we ship Climb on Sonnet because the marginal gain at 5× the cost doesn't justify a higher product price for the average student. We'd rather your kid have Wren than a 5× markup for a ceiling effect.
Climb-Wren scored fractionally below raw-GPT-5 on prior-context recall in some runs. Disclosed; not hidden. Climb's edge isn't in name-dropping past episodes — it's in picking the right teaching shape and finding the right anchor at the right moment.

Why these design choices

The research base

Climb's design is grounded in the educational-psychology literature on what actually moves learning outcomes. Every claim below links to the underlying work.

One-on-one tutoring is the gold standard. Climb's Wren approximates it.

Bloom (1984) found that one-on-one tutoring produces a two-standard-deviation improvement over conventional classroom instruction. The mechanism is not more time. It is that the tutor models the student's mental state, identifies the specific misconception, and intervenes against that misconception directly. Wren does this through persistent memory, mastery-state queries, and diagnostic questions that ask the student to surface their reasoning before the explanation lands.

Bloom, B. (1984). The 2 Sigma Problem. Overview · Carnegie Learning's Cognitive Tutor lineage: Anderson, Corbett, Koedinger & Pelletier — Cognitive Tutors: Lessons Learned.

Mild confusion correlates with learning gain. The tutor's job is not to dissolve it.

Kapur's productive-failure work shows that students who struggle with a problem before receiving instruction outperform students given direct instruction first, by effect sizes equivalent to two to three years of typical schooling on transfer tasks. Wren's diagnostic mode is engineered around this finding. When she asks "walk me through how you got there," she sustains productive disequilibrium long enough for the student to construct the right concept themselves, rather than resolving the confusion immediately.

Kapur, M. — Productive Failure (overview). Roediger & Karpicke on the testing effect: The Power of Testing Memory (2006).

Spaced retrieval beats massed cramming, by substantial margins.

Bjork's "desirable difficulties" framework, supported by decades of cognitive-science replication, shows that spaced retrieval practice produces durable learning while massed cramming produces short-term recognition that does not survive the actual test. Climb's Leitner queue is a working spaced-repetition implementation; the adaptive sampler weights against recency to enforce spacing.

Bjork & Bjork — Making Things Hard on Yourself, But in a Good Way. Background: Desirable difficulty.
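For readers unfamiliar with Leitner scheduling, here is a minimal sketch of the classic scheme: a correct answer promotes an item one box, a miss sends it back to box 1, and higher boxes are reviewed at longer intervals. The box count, interval table, and names are hypothetical and are not Climb's actual parameters. The recency-weighted adaptive sampler mentioned above is not modeled.

```python
from dataclasses import dataclass

# Hypothetical review intervals per box, in days.
INTERVAL_DAYS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}
MAX_BOX = 5


@dataclass
class Card:
    item_id: str
    box: int = 1
    due_day: int = 0


def review(card: Card, correct: bool, today: int) -> Card:
    """Promote on a correct answer, reset to box 1 on a miss, then reschedule."""
    card.box = min(card.box + 1, MAX_BOX) if correct else 1
    card.due_day = today + INTERVAL_DAYS[card.box]
    return card


def due_cards(cards: list, today: int) -> list:
    """Cards whose review date has arrived, oldest due date first."""
    due = [card for card in cards if card.due_day <= today]
    return sorted(due, key=lambda card: (card.due_day, card.item_id))
```

The spacing effect comes from the interval table: an item answered correctly several times in a row is seen less and less often, while a missed item returns to daily review.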

Extrinsic rewards can crowd out intrinsic motivation. So we do not add them.

Lepper, Greene & Nisbett (1973) demonstrated the over-justification effect: bolting extrinsic rewards (XP, badges, points) onto an activity in which a student might develop intrinsic interest reduces that intrinsic motivation when the rewards are removed. The implication for an app preparing students for the SAT® exam is direct: if we want a student to develop genuine interest in mastering the material, layering XP and badges on top is the most counterproductive thing we can do. Self-determination theory (Deci & Ryan) names the actual drivers of sustained motivation — autonomy, competence, relatedness — none of which are points.

Lepper, Greene & Nisbett (1973): Undermining children's intrinsic interest with extrinsic reward · Ryan & Deci (2000): Self-Determination Theory and the Facilitation of Intrinsic Motivation.

The standard gamification trio does not measurably improve learning competence.

A 2023 meta-analysis of educational gamification found that points, badges, and leaderboards — the most-borrowed mechanics in ed-tech — do not measurably improve felt competence. What does move competence: appropriate-difficulty challenges with visible feedback. That is precisely what Beta-posterior mastery, Leitner spacing, and Wren's anchored explanations deliver, without badges.

2023 meta-analysis of educational gamification — Educational Technology Research and Development.
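The page does not describe how Beta-posterior mastery is computed, so the following is only a textbook Beta-Bernoulli sketch of what such an estimate can look like: each skill keeps counts of correct and incorrect answers, and the posterior mean serves as the mastery estimate. The uniform prior and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mastery:
    # Beta(1, 1) is a uniform prior; the prior Climb uses is not stated.
    alpha: float = 1.0  # 1 + number of correct answers observed
    beta: float = 1.0   # 1 + number of incorrect answers observed

    def update(self, correct: bool) -> "Mastery":
        """Bayesian update for one answered question."""
        if correct:
            return Mastery(self.alpha + 1, self.beta)
        return Mastery(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        """Posterior mean: the estimated probability of a correct answer."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Posterior variance: shrinks as more answers are observed."""
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total * total * (total + 1))
```

Tracking the variance alongside the mean lets a system tell "probably mastered" apart from "not enough evidence yet", which matters when choosing an appropriately difficult next question.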

Push notifications reduce learning performance. So Climb does not send any.

A 2022 study found that mobile push notifications, even non-engagement-oriented ones, measurably reduce learning performance. Climb sends no push notifications. The spaced-repetition queue creates a natural daily rhythm based on memory science; we do not need to interrupt the student to drive return visits.

Push notifications and learning performance — Computers & Education: Open, 2022.

Try it

See it on a real student

The methodology above describes how Climb behaves. The product page describes what it does, who it's for, and how to get TestFlight access.
