Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation

Author	Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang
Publication	International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Month	May
Year	2019
Link	[Paper] [GitHub]

ABSTRACT

This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronisation. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on the recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to a binary classification (matching or non-matching) problem as proposed by recent papers; (2) we demonstrate that performance of this method far exceeds the existing baselines on the synchronisation task; (3) we use the learnt embeddings for visual speech recognition in self-supervision, and show that the performance matches the representations learnt end-to-end in a fully-supervised manner.
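The multi-way matching idea in contribution (1) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the exact loss, so the code below assumes negative Euclidean distance as the similarity and a softmax cross-entropy over N candidate audio segments; the function name `multiway_matching_loss` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def multiway_matching_loss(video_emb, audio_embs, target_idx):
    """Cross-entropy over N candidate audio segments for one video clip.

    video_emb:  shape (D,)   embedding of the video clip
    audio_embs: shape (N, D) embeddings of N candidate audio segments
    target_idx: index of the candidate that is in sync with the clip
    Returns (loss, predicted_idx).
    """
    # Assumption: similarity is the negative Euclidean distance.
    dists = np.linalg.norm(audio_embs - video_emb, axis=1)
    logits = -dists

    # Numerically stable log-softmax over the N candidates.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())

    loss = -log_probs[target_idx]
    predicted_idx = int(np.argmax(logits))
    return float(loss), predicted_idx


# Toy example: three candidate audio embeddings, the first one matches.
video = np.array([1.0, 0.0])
audios = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
loss, pred = multiway_matching_loss(video, audios, target_idx=0)
```

A binary formulation would score each (video, audio) pair independently as matching or non-matching. Here all N candidates compete in a single softmax, so the embedding is trained to rank the synchronised segment above every alternative at once.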


