Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval

Authors Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang
Publication IEEE Journal of Selected Topics in Signal Processing
Volume 14
Issue 3
Month March
Year 2020
Link [Paper] [Github]

ABSTRACT

This paper proposes a new strategy for learning effective cross-modal joint embeddings using self-supervision. We set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant data in one domain given an input in another. The method builds on recent advances in learning representations from cross-modal self-supervision using contrastive or binary cross-entropy loss functions. To investigate the robustness of the proposed learning strategy across multi-modal applications, we perform experiments on two applications: audio-visual synchronisation and cross-modal biometrics. The audio-visual synchronisation task requires temporal correspondence between modalities to obtain a joint representation of phonemes and visemes, and the cross-modal biometrics task requires a common representation of speaker identity given face images and audio tracks. Experiments show that the performance of systems trained using the proposed method far exceeds that of existing methods on both tasks, whilst allowing significantly faster training.
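
The abstract describes training cross-modal embeddings with a cross-entropy objective over matching candidates. The snippet below is a minimal sketch of one plausible form of such a multi-way matching loss: an anchor embedding from one modality is scored against several candidate embeddings from the other, and softmax cross-entropy selects the true match. The function name, the use of negative Euclidean distance as the matching score, and the tensor shapes are illustrative assumptions, not taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def multiway_matching_loss(anchor, candidates, target):
    """Multi-way cross-modal matching loss (illustrative sketch).

    anchor:     (B, D)    embeddings from one modality (e.g. audio)
    candidates: (B, N, D) N candidate embeddings from the other
                          modality, one of which is the true match
    target:     (B,)      index of the matching candidate
    """
    # Negative Euclidean distance serves as the logit: the closer a
    # candidate lies to the anchor, the higher its matching score.
    dists = torch.cdist(anchor.unsqueeze(1), candidates).squeeze(1)  # (B, N)
    logits = -dists
    # Softmax cross-entropy over the N candidates picks out the match.
    return F.cross_entropy(logits, target)

if __name__ == "__main__":
    # Random features stand in for audio and visual encoder outputs.
    B, N, D = 8, 5, 512
    audio = torch.randn(B, D)
    video = torch.randn(B, N, D)
    labels = torch.randint(0, N, (B,))
    print(multiway_matching_loss(audio, video, labels).item())
```

Framing the objective as a single softmax over N candidates, rather than N independent binary decisions, is one way such a formulation can speed up training: every forward pass supervises all candidates jointly.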