MoLE : Mixture of Language Experts for Multi-Lingual Automatic Speech Recognition

Author	Yoohwan Kwon, Soo-Whan Chung
Publication	International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Year	2023
Link	[Paper] [arXiv]

ABSTRACT

Multi-lingual speech recognition aims to distinguish linguistic expressions in different languages and integrate acoustic processing simultaneously. In contrast, current multi-lingual speech recognition research follows a language-aware paradigm, mainly aimed at improving recognition performance rather than discriminating language characteristics. In this paper, we present a multi-lingual speech recognition network named Mixture-of-Language-Experts (MoLE), which processes speech in a variety of languages. Specifically, MoLE analyzes the linguistic expression of input speech in an arbitrary language and activates a language-specific expert using a lightweight language tokenizer. The tokenizer not only activates experts but also estimates the reliability of the activation. Based on this reliability, the outputs of the activated expert and the language-agnostic expert are aggregated into a language-conditioned embedding for efficient speech recognition. The proposed model is evaluated in a five-language scenario, and the experimental results show that the structure is advantageous for multi-lingual recognition, especially for speech in low-resource languages.
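
The routing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the aggregation formula, so the code below assumes that the reliability is the softmax probability of the selected language and that the two expert outputs are combined by a convex combination weighted by that reliability. The names `route`, `mole_layer`, `language_experts`, and `agnostic_expert` are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(language_logits: Sequence[float]) -> Tuple[int, float]:
    """Pick the most likely language and return (index, reliability).

    Assumption: reliability is the softmax probability of the chosen
    language. The abstract only says the tokenizer estimates it.
    """
    probs = softmax(language_logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]


def mole_layer(
    features: Sequence[float],
    language_logits: Sequence[float],
    language_experts: Sequence[Expert],
    agnostic_expert: Expert,
) -> Tuple[Vector, int, float]:
    """Blend the activated language expert with the language-agnostic expert.

    Assumption: the blend is reliability * specific + (1 - reliability) * shared.
    """
    index, reliability = route(language_logits)
    specific = language_experts[index](features)  # only one expert is run
    shared = agnostic_expert(features)
    mixed = [
        reliability * s + (1.0 - reliability) * a
        for s, a in zip(specific, shared)
    ]
    return mixed, index, reliability
```

Under these assumptions, a confident language prediction lets the language-specific expert dominate the embedding, while an uncertain prediction falls back toward the language-agnostic expert.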
