Qwen Open-Sources Advanced ASR And Forced Alignment Models With Multi-Language Capabilities

Alibaba Cloud announced that it has open-sourced its Qwen3-ASR and Qwen3-ForcedAligner AI models, providing advanced tools for speech recognition and forced alignment.
The Qwen3-ASR family includes two all-in-one models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and transcription across 52 languages and accents, leveraging large-scale speech data and the Qwen3-Omni foundation model.
Internal testing indicates that the 1.7B model delivers state-of-the-art accuracy among open-source ASR systems, while the 0.6B version balances performance and efficiency and can transcribe 2,000 seconds of speech in a single second at high concurrency.
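To put the throughput figure in perspective, the short Python sketch below converts "2,000 seconds of speech in a single second" into a speedup factor and a real-time factor (RTF). The function names are illustrative and not part of any Qwen tooling, and the calculation assumes the figure refers to aggregate throughput across concurrent requests.

```python
def speedup_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of wall-clock time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Wall-clock time spent per second of audio; lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_clock_seconds / audio_seconds


# Figure quoted in the article: 2,000 seconds of speech in one second.
print(speedup_factor(2000, 1))    # 2000.0
print(real_time_factor(2000, 1))  # 0.0005
```

At that rate, one hour of audio (3,600 seconds) would take roughly 1.8 seconds of aggregate processing time, under the same assumption.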
The Qwen3-ForcedAligner-0.6B model uses a non-autoregressive LLM approach to align text and speech in 11 languages, outperforming leading forced-alignment solutions in both speed and accuracy.
Alibaba Cloud has also released a comprehensive inference framework under the Apache 2.0 license, supporting streaming, batch processing, timestamp prediction, and fine-tuning, aimed at accelerating research and practical applications in audio understanding.
Qwen3-ASR And Qwen3-ForcedAligner Models Demonstrate Leading Accuracy And Efficiency
Alibaba Cloud has released performance results for its Qwen3-ASR and Qwen3-ForcedAligner models, demonstrating leading accuracy and efficiency across diverse speech recognition tasks.
The Qwen3-ASR-1.7B model achieves state-of-the-art results among open-source systems, outperforming commercial APIs and other open-source models in English, multilingual, and Chinese dialect recognition, including Cantonese and 22 regional variants.
It maintains reliable accuracy in challenging acoustic conditions, such as low signal-to-noise environments, child or elderly speech, and even singing voice transcription, achieving average word error rates of 13.91% in Chinese and 14.60% in English with background music.
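For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference, divided by the number of words in the reference. The Python sketch below shows that standard definition using a word-level edit distance. It is a generic illustration, not the evaluation code behind the reported figures, and it assumes whitespace tokenization (Chinese text is often scored per character instead).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # 33.33%
```

On this scale, a WER of 14.60% corresponds to roughly one word-level error for every seven reference words.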
The smaller Qwen3-ASR-0.6B balances accuracy and efficiency, delivering high throughput and low latency under high concurrency, and can transcribe up to 5 hours of speech in online asynchronous mode at a concurrency of 128.
Meanwhile, the Qwen3-ForcedAligner-0.6B outperforms leading end-to-end forced alignment models including Nemo-Forced-Aligner, WhisperX, and Monotonic-Aligner, offering superior language coverage, timestamp accuracy, and support for diverse speech and audio lengths.
The post Qwen Open-Sources Advanced ASR And Forced Alignment Models With Multi-Language Capabilities appeared first on Metaverse Post.
