OpenAI shipped GPT-5.5, its most capable model yet, achieving 82.7% accuracy on the Terminal-Bench 2.0 coding benchmark while maintaining the same per-token latency as its predecessor.

The model excels at agentic coding, computer use, and knowledge work. On SWE-Bench Pro, which tests real-world GitHub issue resolution, GPT-5.5 reached 58.6% accuracy, solving more tasks end-to-end in a single pass than previous models.

Efficiency gains across benchmarks

GPT-5.5 delivers higher performance while using significantly fewer tokens to complete the same tasks. On Expert-SWE, OpenAI's internal evaluation for long-horizon coding tasks with a median 20-hour human completion time, the model outperformed GPT-5.4 across all metrics.

GPT-5.5 scored 73.1% on Expert-SWE and 78.7% on OSWorld-Verified, up from GPT-5.4's 68.5% and 75.0%, respectively. On CyberGym, which measures cybersecurity tasks, GPT-5.5 achieved 81.8% compared with GPT-5.4's 79.0%.
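
To put those gains side by side, here is a minimal Python sketch that computes the improvement on each benchmark in percentage points and relative to GPT-5.4's score. The scores are the ones quoted above; the function and variable names are illustrative assumptions, not anything published by OpenAI.

```python
# Scores (percent) as reported in the article: (GPT-5.5, GPT-5.4).
scores = {
    "Expert-SWE": (73.1, 68.5),
    "OSWorld-Verified": (78.7, 75.0),
    "CyberGym": (81.8, 79.0),
}


def point_gain(new, old):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_gain(new, old):
    """Gain relative to the older score, as a percentage rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# Map each benchmark to (point gain, relative gain).
gains = {
    name: (point_gain(new, old), relative_gain(new, old))
    for name, (new, old) in scores.items()
}

for name, (points, relative) in gains.items():
    print(f"{name}: +{points} points ({relative}% relative)")
```

By this calculation the gains are 4.6 points on Expert-SWE, 3.7 on OSWorld-Verified, and 2.8 on CyberGym. The relative figures are smaller than the raw point gaps might suggest, because GPT-5.4's scores were already high.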

According to Artificial Analysis's Coding Index, GPT-5.5 delivers state-of-the-art intelligence at half the cost of competing frontier coding models. The model shows particular strength in understanding system architecture and reasoning through complex debugging scenarios.

Rollout and safety measures

GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex today. GPT-5.5 Pro is available to Pro, Business, and Enterprise tiers. API access requires additional safety protocols and will launch "very soon," according to the company.

OpenAI evaluated the model across its full safety framework, working with internal and external red teams. The company collected feedback from nearly 200 trusted early-access partners before release, with targeted testing for advanced cybersecurity and biology capabilities.

Dan Shipper, founder and CEO of Every, described GPT-5.5 as "the first coding model I've used that has serious conceptual clarity." Early testers reported that the model shows a stronger ability to understand system failures and to identify where fixes need to be implemented across codebases.

The model competes directly with Anthropic's Claude Opus 4.7, which scored 69.4% on Terminal-Bench 2.0, and Google's Gemini 3.1 Pro at 68.5%.