Intra-Class Variation Reduction of Speaker Representation in Disentanglement Framework

Author	Yoohwan Kwon, Soo-Whan Chung, Hong-Goo Kang
Publication	INTERSPEECH
Month	October
Year	2020
Link	[Paper]

ABSTRACT

In this paper, we propose an effective training strategy to extract robust speaker representations from a speech signal. One of the key challenges in speaker recognition tasks is to learn latent representations or embeddings containing solely speaker characteristic information in order to be robust in terms of intra-speaker variations. By modifying the network architecture to generate both speaker-related and speaker-unrelated representations, we exploit a learning criterion which minimizes the mutual information between these disentangled embeddings. We also introduce an identity change loss criterion which utilizes a reconstruction error to different utterances spoken by the same speaker. Since the proposed criteria reduce the variation of speaker characteristics caused by changes in background environment or spoken content, the resulting embeddings of each speaker become more consistent. The effectiveness of the proposed method is demonstrated through two tasks; disentanglement performance, and improvement of speaker recognition accuracy compared to the baseline model on a benchmark dataset, VoxCeleb1. Ablation studies also show the impact of each criterion on overall performance.

Share on

Twitter Facebook LinkedIn

Soo-Whan Chung

Intra-Class Variation Reduction of Speaker Representation in Disentanglement Framework

Share on

You may also enjoy

Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation

HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders

MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-level Feature Fusion

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech