In this post, we describe the end-to-end process of training speech recognition systems with wav2vec 2.0, using unlabeled audio together with only a tiny dataset of transcribed audio.