Chinese researchers have released VITA-QinYu, the first end-to-end spoken language model designed to generate expressive speech beyond natural conversation, including role-playing and singing.

The model, developed by a team of 11 researchers led by Jiacheng Xu, uses a hybrid speech-text architecture that combines interleaved text-audio modeling with multi-codebook audio tokens. This design enables richer paralinguistic representation while maintaining clear separation between modalities to prevent interference.
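To make the idea concrete, the following is a minimal Python sketch of what interleaving text tokens with multi-codebook audio frames could look like. It is an illustration only, not the team's implementation: the names `TextToken`, `AudioFrame`, and `interleave`, the chunk sizes, and the two-codebook layout are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class TextToken:
    """A single text token (hypothetical representation)."""
    token_id: int


@dataclass(frozen=True)
class AudioFrame:
    """One audio step holding one code per codebook (hypothetical representation)."""
    codes: Tuple[int, ...]


Step = Union[TextToken, AudioFrame]


def interleave(
    text_ids: Sequence[int],
    audio_frames: Sequence[Sequence[int]],
    text_chunk: int = 2,   # assumed chunk size, not taken from the paper
    audio_chunk: int = 3,  # assumed chunk size, not taken from the paper
) -> List[Step]:
    """Alternate chunks of text tokens and audio frames in one sequence.

    Each step is tagged by type, so the two modalities stay distinguishable
    even though they share a single sequence.
    """
    if text_chunk < 1 or audio_chunk < 1:
        raise ValueError("chunk sizes must be positive")
    # Every frame must carry the same number of codebook entries.
    if len({len(frame) for frame in audio_frames}) > 1:
        raise ValueError("all audio frames must use the same number of codebooks")

    sequence: List[Step] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_frames):
        sequence.extend(TextToken(i) for i in text_ids[t:t + text_chunk])
        t += text_chunk
        sequence.extend(AudioFrame(tuple(f)) for f in audio_frames[a:a + audio_chunk])
        a += audio_chunk
    return sequence


if __name__ == "__main__":
    # Three text tokens and four audio frames, each frame with two codebooks.
    for step in interleave([10, 11, 12], [[1, 2], [3, 4], [5, 6], [7, 8]]):
        print(step)
```

In this sketch each audio step carries several codes at once, one per codebook, which is one way a model could represent more acoustic detail per frame than a single token stream would allow.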

VITA-QinYu was trained on 15,800 hours of synthesized data covering natural conversation, role-playing, and singing scenarios. The researchers developed a comprehensive data generation pipeline to create this training corpus.

Performance benchmarks show significant improvements

The model outperformed existing spoken language models by 7 percentage points on objective role-playing benchmarks. For singing generation, VITA-QinYu surpassed peer models by 0.13 points on a 5-point Mean Opinion Score (MOS) scale.

VITA-QinYu also achieved state-of-the-art performance in conversational tasks, exceeding prior models by 1.38 percentage points on the C3 benchmark and 4.98 percentage points on the URO benchmark for accuracy and fluency.

The research addresses a gap in current AI voice technology, which typically focuses on natural conversation rather than expressive speech that conveys personality, mood, or performance elements like comforting tones or musical humming.

The team has open-sourced both the code and the trained models, along with an easy-to-use demo that offers full-stack support for streaming and full-duplex interaction. The researchers present the work as a step toward AI systems that can match human expressiveness in speech generation.

The architecture's clear boundary between text and audio processing is a deliberate design choice, intended to avoid interference between the two modalities when generating expressive speech.