Lemma is a reliability and observability platform built specifically for AI agents running in production. It addresses the well documented problem that agent performance silently degrades over time as user behavior drifts and new edge cases appear, with some teams seeing roughly a 40 percent quality drop within weeks of launch. Lemma was founded in 2025 by Jerry Zhang and Cole Gawin, who left college to join the Y Combinator F25 batch after watching every AI company they spoke with rebuild the same internal evaluation and prompt-tuning infrastructure from scratch.
The product closes the loop between deployment and improvement. Lemma combines automated online evaluations with real user signals such as task adherence, frustration, and recovery rate, then detects failed outcomes directly from live traffic. When a regression appears, Lemma automatically traces the root cause to a specific span or model output, proposes a fix as a prompt change, guardrail, or config update, and can ship that fix as a one-click pull request. A cluster discovery feature continuously embeds traces to surface emerging failure patterns without any manual labeling.
Lemma integrates natively with Vercel AI SDK, OpenAI Agents, Langfuse, Arize Phoenix, Azure Monitor, and LangGraph, with Claude SDK support on the roadmap. Customer reports cite roughly 90 percent less manual prompt iteration, production drift resolution in minutes rather than days, and two to five percent model performance gains per optimization cycle. Pricing is contact-sales today, which fits the platform positioning toward engineering teams running mission-critical agent deployments rather than hobby projects.