The SWE-bench Verified leaderboard turned over four times in 2025 alone. Anthropic's Claude Sonnet 4.5 pushed the public ceiling past 77%, OpenAI's GPT-5 agent runs matched it, and Cognition's Devin shipped solo runs that closed real GitHub issues at scale. Cursor ($7.6B raised, $29.3B valuation) and GitHub Copilot still own the IDE seat, but the open question for 132 tracked startups is whether anything outside coding ever reaches the same reliability bar.
The reliability ceiling
GAIA, WebArena, and OSWorld show the gap. Coding agents clear 70%+ on SWE-bench Verified, while general computer-use agents (Anthropic's Computer Use beta, OpenAI's Operator, Manus AI's autonomous browser) sit in the 40-60% band on multi-step web tasks. Decagon ($296M Series B) and Sierra ($1.27B Series C) sidestep the ceiling by narrowing scope: they run customer-facing dialogue inside a tight tool surface where every action is logged and reversible. Cognition's Devin chose the harder path and competes head-on with Cursor and Augment Code ($252M) on async engineering work, where a wrong commit costs real money.
Where capital is concentrating
The top of the funding table is mostly cross-listed labs: xAI's $56B and Mistral AI's $6.3B carry over from Foundation Models because both ship agent surfaces. Among pure agent plays, Cursor sits ahead of Cognition ($1.14B Series C, $10.2B valuation), Glean ($750M Series F), Notion AI ($343M Series C), and Imbue ($232M Series B). Andreessen Horowitz, Sequoia, Thrive, and General Catalyst lead most rounds. Cline raised a $4M seed round and ships a VS Code agent that competes with $7B-funded incumbents, a useful sign that distribution and BYO-model flexibility can still beat raw capital in this segment.
What 2026 actually decides
Whether agent reliability transfers. Coding works because compilers fail loudly, tests are cheap, and rollback is git reset. Booking a flight, filing an expense, or modifying a Salesforce record has none of those properties. The startups building eval harnesses, recovery loops, and explicit human-handoff UX — not the ones shipping flashier browser demos — are the ones likely to be alive in 18 months.
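To make the argument concrete, here is a minimal sketch of what a recovery loop with an explicit human handoff might look like once a non-coding task is given the three properties coding gets for free: a cheap check, a rollback, and a loud failure. Every name in it (Step, RunResult, run_with_recovery, the expense and CRM steps) is hypothetical and chosen for illustration; it does not describe any particular startup's product.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One agent action wrapped with a check and a rollback (all hypothetical)."""
    name: str
    apply: Callable[[dict], None]   # performs the action on shared state
    verify: Callable[[dict], bool]  # cheap check, the analogue of running tests
    undo: Callable[[dict], None]    # rollback, the analogue of git reset


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)
    handed_off: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


def run_with_recovery(steps: List[Step], state: dict, max_attempts: int = 2) -> RunResult:
    """Apply each step, verify it, roll back on failure, and escalate after max_attempts."""
    result = RunResult()
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            step.apply(state)
            if step.verify(state):
                result.completed.append(step.name)
                result.log.append(f"{step.name}: ok on attempt {attempt}")
                break
            # Fail loudly: undo the action and record the failure.
            step.undo(state)
            result.log.append(f"{step.name}: failed attempt {attempt}, rolled back")
        else:
            # Attempts exhausted: hand off to a human instead of guessing.
            result.handed_off.append(step.name)
            result.log.append(f"{step.name}: handed off to human")
            break  # later steps may depend on this one, so stop here
    return result


# Toy usage: one step succeeds on retry, one never verifies.
state = {"expense_total": 0, "attempts": 0}


def flaky_file_expense(s: dict) -> None:
    s["attempts"] += 1
    # The first attempt writes a wrong amount; the retry writes the right one.
    s["expense_total"] += 400 if s["attempts"] == 1 else 40


file_expense = Step(
    name="file_expense",
    apply=flaky_file_expense,
    verify=lambda s: s["expense_total"] == 40,
    undo=lambda s: s.update(expense_total=0),
)
update_crm = Step(
    name="update_crm",
    apply=lambda s: s.update(crm="bad"),
    verify=lambda s: s.get("crm") == "good",
    undo=lambda s: s.pop("crm", None),
)

result = run_with_recovery([file_expense, update_crm], state)
for line in result.log:
    print(line)
```

The loop itself is trivial. The hard part in practice is writing a verify and an undo for actions such as a flight booking or a Salesforce update, which is exactly the property the paragraph above says those tasks lack.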