Researchers from Alibaba Group introduced Qwen-Audio, which addresses the scarcity of pre-trained audio models that handle diverse tasks. A hierarchical, tag-based multi-task framework is designed to avoid the interference that arises when co-training on heterogeneous datasets. Qwen-Audio achieves impressive performance across benchmark tasks without task-specific fine-tuning. Qwen-Audio-Chat, built upon Qwen-Audio, supports multi-turn dialogues and diverse audio-centric scenarios, demonstrating its universal audio understanding abilities.
Qwen-Audio overcomes the limitations of previous audio-language models by handling diverse audio types and tasks. Unlike prior work that focuses on speech alone, Qwen-Audio incorporates human speech, natural sounds, music, and songs, allowing co-training on datasets of varying granularity. The model excels at speech perception and recognition tasks without task-specific modifications. Qwen-Audio-Chat extends these capabilities to align with human intent, supporting multilingual, multi-turn dialogues over audio and text inputs and showcasing robust, comprehensive audio understanding.
LLMs excel at general-purpose language tasks but lack audio comprehension. Qwen-Audio addresses this by scaling pre-training to cover more than 30 tasks and diverse audio types. A multi-task framework mitigates interference between tasks, enabling knowledge sharing. Qwen-Audio performs impressively across benchmarks without task-specific fine-tuning. Qwen-Audio-Chat, an extension, supports multi-turn dialogues and diverse audio-centric scenarios, showcasing comprehensive audio interaction capabilities in LLMs.
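The tag-based conditioning described above can be pictured as prefixing the decoder with a coarse-to-fine sequence of special tokens: output type first, then language, then task, then formatting options. The sketch below is illustrative only; the tag names are invented placeholders, not the model's actual vocabulary.

```python
def build_prompt(language: str, task: str, timestamps: bool) -> str:
    """Build a hierarchical tag prefix for the decoder (illustrative tags only).

    Shared tags (e.g. the language tag) let related tasks reinforce each
    other, while task-specific tags keep differing output formats from
    interfering with one another during co-training.
    """
    tags = [
        # Coarsest level: is the output a transcription or an analysis?
        "<|startoftranscription|>" if task == "transcribe" else "<|startofanalysis|>",
        # Then the language of the output text.
        f"<|{language}|>",
        # Then the specific task.
        f"<|{task}|>",
        # Finally, fine-grained output options such as word-level timestamps.
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)
```

For example, `build_prompt("en", "transcribe", False)` yields `<|startoftranscription|><|en|><|transcribe|><|notimestamps|>`, while an audio-captioning request would start from the analysis tag instead.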
Qwen-Audio and Qwen-Audio-Chat are models for universal audio understanding and flexible human interaction. Qwen-Audio adopts a multi-task pre-training approach, optimizing the audio encoder while freezing language model weights. In contrast, Qwen-Audio-Chat employs supervised fine-tuning, optimizing the language model while fixing audio encoder weights. The training process includes multi-task pre-training and supervised fine-tuning. Qwen-Audio-Chat enables versatile human interaction, supporting multilingual, multi-turn dialogues from audio and text inputs, showcasing its adaptability and comprehensive audio understanding.
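The two-stage scheme described above — optimizing the audio encoder against a frozen language model, then fine-tuning the language model against a frozen encoder — can be sketched in PyTorch. The module classes here are minimal stand-ins for illustration, not the actual Qwen-Audio architecture.

```python
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    """Toy stand-in for an audio-language model: encoder + LLM."""
    def __init__(self):
        super().__init__()
        self.audio_encoder = nn.Linear(80, 512)    # placeholder for the audio encoder
        self.language_model = nn.Linear(512, 512)  # placeholder for the LLM

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Freezing in PyTorch means excluding parameters from gradient updates.
    for p in module.parameters():
        p.requires_grad = trainable

model = AudioLanguageModel()

# Stage 1 (multi-task pre-training): train the audio encoder, freeze the LLM.
set_trainable(model.audio_encoder, True)
set_trainable(model.language_model, False)

# ... run multi-task pre-training here ...

# Stage 2 (supervised fine-tuning): freeze the encoder, train the LLM.
set_trainable(model.audio_encoder, False)
set_trainable(model.language_model, True)
```

Splitting the optimization this way lets each stage target one component: the encoder learns a broadly useful audio representation first, and the fine-tuning stage then aligns the language model with instructions without disturbing that representation.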
Qwen-Audio demonstrates remarkable performance across diverse benchmark tasks, surpassing counterparts without task-specific fine-tuning. It consistently outperforms baselines by a substantial margin on tasks such as AAC (automatic audio captioning), SRWT (speech recognition with word-level timestamps), ASC (acoustic scene classification), SER (speech emotion recognition), AQA (audio question answering), VSC (vocal sound classification), and MNA (music note analysis). The model establishes state-of-the-art results on CochlScene, ClothoAQA, and VocalSound, showcasing robust audio understanding capabilities. Qwen-Audio's consistently superior results across these evaluations highlight its effectiveness on challenging audio tasks.
The Qwen-Audio series introduces large-scale audio-language models with universal understanding across diverse audio types and tasks. Developed through a multi-task training framework, these models facilitate knowledge sharing and overcome interference from varying textual labels across datasets. Achieving impressive performance across benchmarks without task-specific fine-tuning, Qwen-Audio surpasses prior work. Qwen-Audio-Chat extends these capabilities, enabling multi-turn dialogues and supporting diverse audio scenarios, showcasing robust alignment with human intent and facilitating multilingual interactions.
Future work on Qwen-Audio includes expanding its capabilities to additional audio types, languages, and specific tasks. Refining the multi-task framework or exploring alternative knowledge-sharing approaches could further reduce interference during co-training. Investigating task-specific fine-tuning could enhance performance on individual tasks. Continuous updates based on new benchmarks, datasets, and user feedback aim to improve universal audio understanding. Qwen-Audio-Chat will continue to be refined to align with human intent, support multilingual interactions, and enable dynamic multi-turn dialogues.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.