Academic researchers have released MIST (Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic dataset designed to train AI assistants for voice-controlled smart home environments.

The dataset addresses a key challenge in IoT device control: combining speech recognition with real-world constraints like device locations, states, and multi-turn conversations. Current voice assistants struggle with complex scenarios that require understanding both what users say and the physical context of their requests.

MIST generates synthetic multi-turn conversations where users interact with smart home devices through voice commands. The dataset includes scenarios like adjusting thermostats in specific rooms, coordinating multiple devices, and handling interruptions or clarifications during tasks.
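To make the idea of state-aware, multi-turn device control concrete, here is a minimal Python sketch of a clarification turn. It is not MIST's actual schema or API: the `Thermostat` and `Home` classes, the `set_temperature` tool, and the room names are all hypothetical, chosen only to show why a request such as "set it to 23 degrees" cannot be resolved without knowing which devices exist and where they are.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Thermostat:
    """Hypothetical device record: one thermostat tied to one room."""
    room: str
    target_c: float


@dataclass
class Home:
    """Hypothetical home state that persists across conversation turns."""
    thermostats: dict[str, Thermostat] = field(default_factory=dict)

    def set_temperature(self, target_c: float, room: str | None = None) -> str:
        """Illustrative tool call; asks for a room when the request is ambiguous."""
        if room is None:
            if len(self.thermostats) != 1:
                # More than one candidate device: the assistant must clarify.
                rooms = ", ".join(sorted(self.thermostats))
                return f"Which room? Options: {rooms}"
            room = next(iter(self.thermostats))
        if room not in self.thermostats:
            return f"No thermostat in {room}"
        self.thermostats[room].target_c = target_c
        return f"Set {room} thermostat to {target_c} C"


home = Home({
    "bedroom": Thermostat("bedroom", 20.0),
    "kitchen": Thermostat("kitchen", 21.0),
})

# Turn 1: "Set it to 23 degrees" names no room, so the call is ambiguous.
reply_1 = home.set_temperature(23.0)
# Turn 2: "The bedroom" resolves it; the target carries over from turn 1.
reply_2 = home.set_temperature(23.0, room="bedroom")

print(reply_1)
print(reply_2)
```

In this toy version the caller carries the temperature from the first turn into the second, and only the bedroom thermostat changes. A real assistant would have to track that context itself, which is the kind of state tracking the benchmark is described as testing.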

Researchers from multiple institutions tested both open-weight and closed-weight multimodal large language models on MIST tasks. They found a significant performance gap between open-weight models and closed-weight systems from major AI labs.

Even frontier closed-weight models showed substantial room for improvement on the benchmark. The results suggest current AI systems aren't ready for complex, real-world smart home interactions that require spatial reasoning and state tracking.

The team released both the dataset and an extensible framework for generating similar benchmarks. This allows other researchers to create domain-specific versions for different IoT environments or device types.

"The rise of Internet of Things devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences," the researchers wrote in their paper.

The work highlights a growing challenge as smart homes become more sophisticated. Users expect natural conversations with AI assistants, but current systems often fail when requests involve multiple devices, spatial relationships, or context from previous interactions.

The paper is available on arXiv, and the release includes code that researchers can build on for their own IoT-focused AI training needs.