Facebook AI Releases Revolutionary Speech Recognition Model

Speech recognition underpins a wide variety of technologies in use today, so there has long been a need for a model that works well with little labeled training data by relying on self-supervised pre-training. To meet that need, Facebook AI developed wav2vec, which also broadens the use of AI to classify data on its platform.

Researchers working on speech-to-text conversion have long pursued an Automatic Speech Recognition (ASR) model that trains faster and produces more accurate results. However, current ASR models require a huge amount of labeled speech data to reach even a moderate level of accuracy. Transcribed audio of this kind is scarce for most of the world's 7,000 languages and dialects, many of which are not widely spoken, which makes it difficult to train any model on them.

To address this, researchers built a framework that learns representations from raw audio through self-supervision. To advance ASR research, Facebook AI has open-sourced the new wav2vec 2.0 algorithm, which uses self-supervision to learn low-resource languages and dialects. The new model can enable ASR with just 10 minutes of transcribed data.

Fundamental mechanism 

The self-supervised model first encodes raw speech audio with a multi-layer convolutional neural network.

Spans of the resulting latent representations are then masked and fed to a transformer network, which builds representations that capture information from the entire sequence.

This architecture lets the model learn contextual representations over continuous speech and capture dependencies across the whole sequence of latent representations.
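
The masking of latent spans can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name `mask_spans` and the values for `start_prob` and `span_length` are hypothetical and are not taken from this article.

```python
import random

def mask_spans(num_steps, start_prob, span_length, rng=random):
    """Return a list of booleans marking which latent time steps are masked.

    Every step is chosen as a span start with probability start_prob, and the
    span_length steps from each start onward are masked. Spans may overlap.
    """
    masked = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_steps)):
                masked[t] = True
    return masked

# Hypothetical values chosen only for illustration.
mask = mask_spans(num_steps=50, start_prob=0.1, span_length=5, rng=random.Random(0))
print("".join("#" if m else "." for m in mask))
```

Because spans can overlap, the share of masked steps is usually lower than `start_prob * span_length` would suggest.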

The central idea is to learn representations in a setting where a large amount of labeled or unlabeled data is available, and then use those representations to improve performance on tasks for which only a limited amount of data exists.

Earlier work did use unsupervised learning for speech, but its results were not applied to improving supervised speech recognition models.

The proposed model instead applies the results of unsupervised learning to improve a supervised speech recognition model, which takes advantage of the easy availability of unlabeled data.

The model uses a convolutional neural network that takes raw audio as input and computes a general representation that can then be fed to a speech recognition system.

How does it work?

Wav2vec goes beyond frame-wise phoneme classification and applies the learned representations to improve supervised ASR models.

Wav2vec trains the model to tell original audio data apart from modified versions of it.


The objective of the model is to predict future samples from a given signal context. To do this, the model takes raw audio as input and passes it through two neural networks: an encoder network and a context network.
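
One way to picture this objective is as a contrastive loss: the prediction made from the context should score high against the true future latent vector and low against distractor vectors drawn from elsewhere. The sketch below is a simplified, single-time-step version under stated assumptions. The function names are hypothetical, the distractor terms are simply summed, and any learned projection of the context vector is omitted.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(prediction, true_future, distractors):
    """Loss for one time step.

    Rewards similarity between the prediction and the true future latent,
    and penalizes similarity between the prediction and each distractor.
    """
    positive = log_sigmoid(dot(prediction, true_future))
    negative = sum(log_sigmoid(-dot(prediction, d)) for d in distractors)
    return -(positive + negative)

# Toy vectors: the prediction points the same way as the true future latent.
good = contrastive_loss([2.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [0.0, 2.0]])
# Same prediction, but now it matches a distractor instead of the true latent.
bad = contrastive_loss([2.0, 0.0], [-2.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
print(good < bad)
```

The loss is small when the model can pick out the real future sample and large when it confuses a distractor for it.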

The encoder network embeds the audio signal in a latent space, and the context network combines multiple time-steps of the encoder output. The output of the encoder network is a low-frequency feature representation.
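
To see why the encoder output is "low-frequency", it helps to work out how much a stack of strided convolutions downsamples the waveform. The sketch below does that arithmetic. The kernel sizes, strides, and 16 kHz sample rate are assumptions used for illustration; this article does not state them.

```python
def encoder_geometry(kernel_sizes, strides):
    """Return (total_stride, receptive_field) in input samples for stacked 1-D convolutions."""
    total_stride = 1
    receptive_field = 1
    for kernel, stride in zip(kernel_sizes, strides):
        # Each layer widens the receptive field by (kernel - 1) steps of the
        # current spacing between its inputs, then multiplies that spacing.
        receptive_field += (kernel - 1) * total_stride
        total_stride *= stride
    return total_stride, receptive_field

# Assumed layer settings and sample rate, for illustration only.
kernel_sizes = (10, 8, 4, 4, 4)
strides = (5, 4, 2, 2, 2)
sample_rate = 16_000

total_stride, receptive_field = encoder_geometry(kernel_sizes, strides)
print(f"one latent vector every {1000 * total_stride / sample_rate:.1f} ms")
print(f"each latent vector sees {1000 * receptive_field / sample_rate:.1f} ms of audio")
```

With these assumed settings, the encoder emits one vector per 160 input samples, each summarizing 465 samples of audio, so the context network works on a far shorter sequence than the raw waveform.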

Each layer in both the encoder and the context network consists of a causal convolution with 512 channels, a group normalization layer, and a ReLU non-linearity. Both networks are normalized across the feature and temporal dimensions for each sample, a normalization scheme chosen because it is invariant to the offset and scaling of the input.
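
Two properties from this description are easy to check in code: a causal convolution never looks at future inputs, and normalizing over all features and time steps of a sample cancels out any offset or scaling of the input. The following is a minimal pure-Python sketch with hypothetical function names. It works on a single channel for the convolution and uses plain lists rather than the 512-channel PyTorch layers.

```python
import math

def causal_conv1d(signal, kernel):
    """Single-channel causal convolution: output[t] depends only on signal[:t + 1]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # left-pad so no future samples are used
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]

def normalize_sample(features, eps=1e-5):
    """Normalize one sample over all channels and time steps at once.

    features is a list of channels, each a list of values over time.
    """
    values = [v for channel in features for v in channel]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in channel] for channel in features]

# Changing the last input leaves all earlier outputs untouched.
a = causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
b = causal_conv1d([1.0, 2.0, 3.0, 100.0], [0.5, 0.5])
print(a[:3] == b[:3])

# Shifting and scaling the input barely changes the normalized output.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
shifted = [[3.0 * v + 7.0 for v in channel] for channel in x]
n1, n2 = normalize_sample(x), normalize_sample(shifted)
print(max(abs(p - q) for c1, c2 in zip(n1, n2) for p, q in zip(c1, c2)) < 1e-4)
```

The small `eps` term keeps the division stable for near-constant inputs, which is why the invariance holds up to a tiny tolerance rather than exactly.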


The datasets used for building the model are WSJ, TIMIT, and Librispeech, and the final models are evaluated by word error rate (WER) and letter error rate (LER).
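
Both metrics are commonly defined as the edit distance between the reference transcript and the model output, divided by the length of the reference, counted in words for WER and in characters for LER. A minimal sketch under that definition follows. The function names are hypothetical, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def letter_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(letter_error_rate("cat", "cut"))
```

In the first example one of six reference words is missing, and in the second one of three letters is wrong.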

For the acoustic model, the wav2letter++ toolkit is used for both training and evaluation. The TIMIT task uses the character-based setup of wav2letter++, with the output projected to a 39-dimensional phoneme probability. The WSJ task uses the 17-layer wav2letter++ model with gated convolutions.

Next, the emissions from the acoustic model are decoded using a lexicon together with a separate language model trained on the WSJ language modeling data.

The decoding hyperparameters for WSJ are tuned with a random search. The pre-training models are implemented in PyTorch in the fairseq toolkit.

Training uses the Adam optimizer with a cosine learning rate schedule that is annealed over 40,000 update steps for both WSJ and the clean Librispeech training data, and over 400,000 steps for full Librispeech. The first wav2vec variant is trained on 8 GPUs, with the audio sequences on each GPU adding up to at most 1.5M frames.
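
A cosine schedule lowers the learning rate smoothly from its peak to a final value along half a cosine wave. The sketch below shows the usual formula. The peak learning rate of `1e-3` and the final value of zero are placeholder assumptions, since this article gives only the step counts.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_lr=0.0):
    """Learning rate at a given step of a cosine annealing schedule."""
    progress = min(step, total_steps) / total_steps  # clamp once annealing is done
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# 40,000 steps as quoted for WSJ and clean Librispeech; peak_lr is hypothetical.
for step in (0, 10_000, 20_000, 30_000, 40_000):
    print(step, cosine_lr(step, total_steps=40_000, peak_lr=1e-3))
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, which avoids the abrupt drops of a step schedule.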

The sequences are grouped by length and then cropped to either 150,000 frames or the length of the shortest sequence in the batch, whichever is smaller. Cropping removes audio from either the beginning or the end of a sequence.

The cropping offsets are chosen at random and re-sampled every epoch.
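
The cropping rule can be sketched as follows. This is an illustration rather than the fairseq code: the function name `crop_batch` is hypothetical, and the sketch assumes a uniformly random start offset per sequence, which can trim audio from either end or from both.

```python
import random

def crop_batch(sequences, max_frames=150_000, rng=random):
    """Crop every sequence in a batch to a common length at a random offset.

    The target length is the smaller of max_frames and the shortest sequence.
    """
    target = min(max_frames, min(len(seq) for seq in sequences))
    cropped = []
    for seq in sequences:
        offset = rng.randint(0, len(seq) - target)  # inclusive on both ends
        cropped.append(seq[offset:offset + target])
    return cropped

# Toy batch with short "audio" so the effect is visible; a small cap stands in for 150,000.
batch = [list(range(10)), list(range(7)), list(range(12))]
print(crop_batch(batch, max_frames=5, rng=random.Random(0)))
```

Calling the function again at the start of each epoch with fresh random offsets means the model sees a different window of each long recording over time.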

The results were impressive. The Facebook AI team pre-trained the model on just 1,000 hours of unlabeled speech samples from the LibriSpeech dataset and then trained a speech recognition model on 81 hours of labeled speech from WSJ1.

Compared with Deep Speech 2, a baseline system trained on 12,000 hours of labeled data that reaches a WER of 3.1%, wav2vec achieved a WER of 2.43%.
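
To put those two error rates side by side, the relative reduction can be computed directly from the figures quoted above. The helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fraction by which the improved error rate undercuts the baseline."""
    return (baseline - improved) / baseline

reduction = relative_reduction(baseline=3.1, improved=2.43)
print(f"{100 * reduction:.1f}% relative WER reduction")
```

This works out to a relative reduction of roughly a fifth against the 12,000-hour baseline, a separate comparison from any improvement measured against a model without pre-trained representations.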

Further testing showed that the wav2vec-trained model also performed better than pre-training on the labeled version of the LibriSpeech dataset, with a 30% improvement in WER over a model without pre-trained representations.


An analysis of the results showed that increasing the number of negative samples helps only up to ten samples, after which performance plateaus.

Further studies indicated that predicting more than 12 steps into the future only increases training time without improving performance. Choosing a crop size of 150,000 frames gave the best performance, whereas not restricting the maximum length gave the worst.

Applications of Wav2Vec

This research opens many doors for models based on self-supervised training, which can bring ASR to languages that have only limited labeled data and annotated speech samples. It can be used to generate captions for videos and to detect policy violations. It can also make speech recognition less English-centric and extend it to the vast pool of languages spoken around the world. Acoustic-event detection and keyword spotting are further possible uses.

The development of wav2vec has only just begun, which leaves plenty of room for progress. Wav2vec relies on a fully convolutional network that can easily be parallelized over time on modern hardware, unlike the recurrent network models used in previous research. The goal is to use wav2vec to produce better audio representations for a wide range of training techniques. This is likely to improve the use of AI to flag harmful data and keep platforms safe.


Although wav2vec is a revolutionary model that needs just 10 minutes of transcribed training data and carries the results of unsupervised training over to supervised models, it still has a long way to go. Facebook AI has added the model as a simple, lightweight extension to its open-source modeling toolkit, fairseq, in the hope that the wider AI community will develop it further.

