Researchers from the University of British Columbia have released IntentGrasp, a comprehensive benchmark that exposes significant weaknesses in how large language models understand human intent across conversations and text.
The benchmark, derived from 49 open-licensed datasets spanning 12 domains, contains 262,759 training instances and two test sets designed to measure intent comprehension capabilities.
Frontier Models Struggle With Basic Intent Recognition
Extensive testing of 20 models across seven families revealed concerning performance gaps. Even frontier models like GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7 scored below 60% on the standard All Set of 12,909 test cases.
The results were more striking on the challenging Gem Set of 470 carefully curated cases. Here, 17 out of 20 models performed worse than random guessing, which achieves 15.2% accuracy.
The best-performing models reached only 25% accuracy on Gem Set, while estimated human performance sits at 81.1%. That gap of roughly 56 percentage points highlights a fundamental limitation in current AI systems.
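For readers who want to check the arithmetic behind these Gem Set figures, here is a minimal sketch. It uses only the numbers quoted above; the function names are hypothetical and not part of the IntentGrasp release.

```python
# Figures quoted in the article for the Gem Set (accuracy, in percent).
RANDOM_BASELINE = 15.2   # accuracy of random guessing
HUMAN_ESTIMATE = 81.1    # estimated human performance
BEST_MODEL = 25.0        # best-performing models

def gap_to_human(model_acc, human_acc=HUMAN_ESTIMATE):
    """Return the gap to estimated human accuracy, in percentage points."""
    return round(human_acc - model_acc, 1)

def beats_chance(model_acc, chance=RANDOM_BASELINE):
    """Return True if a model's accuracy exceeds the random-guessing baseline."""
    return model_acc > chance

print(gap_to_human(BEST_MODEL))   # gap for the best models
print(beats_chance(BEST_MODEL))   # best models are above chance
print(beats_chance(14.0))         # a hypothetical model below chance
```

The same comparison applied to each of the 20 models is what yields the count of 17 scoring below chance; the per-model scores are not listed in this article, so they are not reproduced here.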
Training Method Shows Promise
The researchers developed Intentional Fine-Tuning (IFT), which trains models specifically on intent understanding tasks using the IntentGrasp training data.
IFT delivered substantial improvements, boosting performance by over 30 F1 points on All Set and more than 20 points on Gem Set across tested models.
Leave-one-domain-out experiments demonstrated strong cross-domain generalization, suggesting the approach could work across different applications and contexts.
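A leave-one-domain-out experiment holds out every instance from one domain for testing and trains on the rest, repeating once per domain. The sketch below illustrates the splitting step only. It assumes each instance is a dictionary with a `domain` field, which is a hypothetical schema and not necessarily how the IntentGrasp data is laid out.

```python
from collections import defaultdict

def leave_one_domain_out(instances, domain_key="domain"):
    """Yield (held_out_domain, train, test) splits, one per domain.

    `domain_key` is an assumed field name, used here for illustration.
    """
    # Group instances by their domain label.
    by_domain = defaultdict(list)
    for inst in instances:
        by_domain[inst[domain_key]].append(inst)

    # Each domain takes a turn as the unseen test domain.
    for held_out in sorted(by_domain):
        train = [
            inst
            for domain, items in by_domain.items()
            if domain != held_out
            for inst in items
        ]
        test = list(by_domain[held_out])
        yield held_out, train, test

# Toy data with made-up domains and texts, for illustration only.
toy = [
    {"domain": "travel", "text": "book me a flight"},
    {"domain": "travel", "text": "cancel my hotel"},
    {"domain": "banking", "text": "check my balance"},
    {"domain": "health", "text": "schedule a checkup"},
]

for held_out, train, test in leave_one_domain_out(toy):
    print(held_out, len(train), len(test))
```

With 12 domains this produces 12 train/test splits. Strong scores on each held-out domain are what would support the cross-domain generalization claim.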
The benchmark and training code are available on Hugging Face and GitHub, respectively.
The research team plans to expand IntentGrasp with additional domains and more complex intent scenarios as the field develops better evaluation methods.