Academic researchers have developed TurnGate, a defense system designed to detect the precise moment when multi-turn conversations with AI models become dangerous enough to enable harmful actions.

The system addresses a growing vulnerability where attackers spread malicious intent across multiple seemingly innocent conversation turns rather than exposing harmful objectives in a single prompt. Current commercial models with advanced safety measures remain susceptible to these distributed attacks.

TurnGate identifies what researchers call the "harm-enabling closure point" — the earliest turn where delivering a response would provide sufficient information for harmful action. The approach aims to prevent premature blocking of legitimate exploratory conversations while stopping genuinely dangerous exchanges.

Multi-Turn Intent Dataset enables training

The research team constructed the Multi-Turn Intent Dataset (MTID) to support training and evaluation. The dataset contains branching attack scenarios, matched benign conversations, and annotations marking the earliest turns that enable harm.

Testing showed TurnGate substantially outperformed existing detection methods while maintaining low rates of incorrectly blocking benign conversations. The system demonstrated effectiveness across different domains, attack methods, and target AI models.

The vulnerability stems from attackers' ability to gradually build toward harmful objectives through conversations that appear innocent when examined turn-by-turn. Traditional safety systems often fail to detect the cumulative risk as conversations progress.

Researchers from multiple institutions collaborated on the project, including teams focused on AI safety and security. The work highlights ongoing challenges in defending deployed language models against sophisticated attack strategies.

The research code and dataset are available on GitHub for further development by the AI safety community. The findings suggest that turn-level monitoring represents a promising approach for detecting distributed malicious intent in AI conversations.