- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries
[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large”, and “extra large” configurations) are added. In addition, support is added for pretrained weights from [wav2vec 2.0](https://arxiv.org/abs/2006.11477), [Unsupervised Cross-lingual Representation Learning](https://arxiv.org/abs/2006.13979), and [HuBERT](https://arxiv.org/abs/2106.07447).
These pretrained weights can be used for feature extraction and downstream task adaptation.
python
>>> import torchaudio
>>>
>>> # Build the model and load the pretrained weights.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to a downstream task.
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
python
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load the pretrained weights.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
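Because `ctc_decode` above is hypothetical, the following is a minimal sketch of what a greedy (best-path) CTC decoder could look like. It is not part of torchaudio. The function name, the choice of index 0 as the blank token, and the decision to drop indices 1 to 3 as special tokens are assumptions made for this label set; the input is a plain nested list so the sketch runs without any dependencies.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0, ignore=(1, 2, 3)):
    """Greedy (best-path) CTC decoding for a single utterance.

    frame_scores: per-frame score rows, shape (time, num_labels),
        for example ``emissions[0].tolist()``.
    labels: tuple of label strings, e.g. the output of ``bundle.get_labels()``.
    blank: index treated as the CTC blank (assumed to be 0 here).
    ignore: indices of other special tokens to drop (assumed 1-3 here).
    """
    # 1. Pick the highest-scoring label index in every frame.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]

    # 2. Collapse consecutive repeats, then drop blank and special tokens.
    skip = {blank, *ignore}
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index not in skip:
            chars.append(labels[index])
        previous = index

    # 3. "|" marks a word boundary in this label set.
    return "".join(chars).replace("|", " ").strip()


labels = ('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I',
          'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P',
          'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')

# Synthetic one-hot "emissions" spelling H H <blank> E L <blank> L O | W O R L D.
path = [11, 11, 0, 5, 15, 0, 15, 8, 4, 18, 8, 13, 15, 14]
frames = [[1.0 if i == k else 0.0 for i in range(len(labels))] for k in path]

transcript = greedy_ctc_decode(frames, labels)
print(transcript)  # HELLO WORLD
```

With a real model output, `emissions[0].tolist()` could be passed as `frame_scores`. Greedy decoding ignores any language model, so a beam search decoder will usually give better transcripts.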
[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point for creating a pipeline with a set of pretrained weights. They are available under the `torchaudio.pipelines` module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
python
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_processor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (`torchaudio.functional.rnnt_loss` or `torchaudio.transforms.RNNTLoss`) supports float16 and float32 logits, has autograd and TorchScript support, and runs on both CPU and GPU. The GPU path uses a custom CUDA kernel implementation for improved performance.
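To make concrete what this loss computes, here is a dependency-free reference sketch of the standard RNN transducer forward recursion for a single utterance. It is an illustration only and is not torchaudio's implementation; the function names and the nested-list input layout are assumptions chosen for readability. The input is already log-normalized, indexed as `log_probs[t][u][k]` over time step, number of emitted labels, and class.

```python
import math


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def rnnt_loss_reference(log_probs, targets, blank):
    """Negative log-likelihood of `targets` for one utterance.

    log_probs[t][u][k]: log-probability of emitting class k at time step t
        after u target labels have been emitted (T x (U + 1) x classes).
    targets: list of U label indices.
    blank: index of the blank class.
    """
    T, U = len(log_probs), len(targets)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                # Arrive by emitting blank at (t - 1, u): advance in time.
                step = alpha[t - 1][u] + log_probs[t - 1][u][blank]
                alpha[t][u] = _logaddexp(alpha[t][u], step)
            if u > 0:
                # Arrive by emitting the next target label at (t, u - 1).
                step = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]]
                alpha[t][u] = _logaddexp(alpha[t][u], step)
    # Every alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


# Toy check: 3 classes, 2 time steps, 1 target label, uniform probabilities.
# There are 2 valid alignments, each with probability (1/3)^3.
V, T, targets = 3, 2, [1]
uniform = [[[math.log(1.0 / V)] * V for _ in range(len(targets) + 1)]
           for _ in range(T)]
loss = rnnt_loss_reference(uniform, targets, blank=0)
```

In the toy case the loss equals `-log(2/27)`, which is a convenient sanity check when comparing against an optimized implementation.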
[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. Three solutions are available (`ref_channel`, `stv_evd`, `stv_power`), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged inside the method). An online option recursively updates the parameters for streaming audio.
Please refer to the [MVDR tutorial](https://pytorch.org/audio/0.10.0/tutorials/mvdr_tutorial.html).
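For orientation, the mask-based MVDR formulation commonly found in the literature can be written as follows. This is a textbook sketch under stated assumptions, not a description of torchaudio's exact implementation, and all symbols are notation introduced here: $\mathbf{Y}(t,f)$ is the multi-channel STFT, $M_S$ and $M_N$ are the speech and noise masks, $\mathbf{u}$ is a one-hot vector selecting the reference channel, and $\mathbf{v}$ is a steering vector (for example, estimated from $\mathbf{\Phi}_{SS}$ by eigenvalue decomposition or by power iteration).

```latex
\begin{aligned}
\mathbf{\Phi}_{\nu\nu}(f) &= \frac{\sum_{t} M_{\nu}(t,f)\,\mathbf{Y}(t,f)\,\mathbf{Y}^{\mathsf{H}}(t,f)}{\sum_{t} M_{\nu}(t,f)},
  \qquad \nu \in \{S, N\} \\[4pt]
\mathbf{w}_{\text{ref}}(f) &= \frac{\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{\Phi}_{SS}(f)}{\operatorname{tr}\!\left(\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{\Phi}_{SS}(f)\right)}\,\mathbf{u} \\[4pt]
\mathbf{w}_{\text{stv}}(f) &= \frac{\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{v}(f)}{\mathbf{v}^{\mathsf{H}}(f)\,\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{v}(f)} \\[4pt]
\hat{S}(t,f) &= \mathbf{w}^{\mathsf{H}}(f)\,\mathbf{Y}(t,f)
\end{aligned}
```

The first weight formula corresponds to a reference-channel solution and the second to a steering-vector solution; in both cases the enhanced single-channel spectrum is obtained by applying the weights to the multi-channel input.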
GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
Additional Features
`torchaudio.functional.lfilter` now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (1604)
- When saving FLAC format with the “soundfile” backend, `PCM_24` (the previous default) could cause warping. The default has been changed to `PCM_16`, which does not suffer from this issue.
Ops
- Default to native complex type when returning raw spectrogram (1549)
- When `power=None`, `torchaudio.functional.spectrogram` and `torchaudio.transforms.Spectrogram` now default to `return_complex=True`, which returns a Tensor of native complex type (such as `torch.cfloat` or `torch.cdouble`). To use a pseudo complex type, pass the resulting tensor to `torch.view_as_real`.
- Remove deprecated kaldi.resample_waveform (1555)
- Please use `torchaudio.functional.resample`.
- Replace waveform with specgram in SlidingWindowCmn (1859)
- The argument name was corrected to `specgram`.
- Ensure integer input frequencies for resample (1857)
- Sampling rates were previously cast to integers silently in the resampling implementation. Resampling now requires integer sampling rates so that the expected resampling quality is ensured.
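If sampling rates in existing code arrive as floats (for example, parsed from metadata), one option is to normalize them explicitly before resampling. The helper below is a hypothetical sketch, not a torchaudio function; it converts whole-number values to `int` and refuses anything that would lose precision.

```python
def as_integer_rate(rate):
    """Return `rate` as an int, refusing values that would lose precision.

    Hypothetical helper for illustration; not part of torchaudio.
    """
    if isinstance(rate, bool):
        raise TypeError("sampling rate must be a number, not a bool")
    if isinstance(rate, int):
        return rate
    if isinstance(rate, float) and rate.is_integer():
        return int(rate)
    raise ValueError(f"sampling rate must be a whole number of Hz, got {rate!r}")


orig_freq = as_integer_rate(44100.0)
new_freq = as_integer_rate(16000)

# The integer rates can then be passed on, for example:
# resampled = torchaudio.functional.resample(waveform, orig_freq, new_freq)
```

Failing early at the call site makes a fractional rate such as `22050.5` visible as a data problem instead of a resampling error deeper in the pipeline.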
Wav2Vec2
- Update `extract_features` of Wav2Vec2Model (1776)
- The previous implementation returned outputs from the convolutional feature extractor. To match the behavior of the original fairseq implementation, the method now returns the outputs of the intermediate transformer layers. To get the previous behavior, please use `Wav2Vec2Model.feature_extractor()`.
- Move fine-tune specific module out of wav2vec2 encoder (1782)
- The internal structure of `Wav2Vec2Model` was updated. The `Wav2Vec2Model.encoder.read_out` module has been moved to `Wav2Vec2Model.aux`. If you have a serialized state dict, please replace the key `encoder.read_out` with `aux`.
- Updated wav2vec2 factory functions for more customizability (1783, 1804, 1830)
- The signatures of the wav2vec2 factory functions have changed. The `num_out` parameter has been renamed to `aux_num_out`, and other parameters have been added before it. Please update the code from `wav2vec2_base(num_out)` to `wav2vec2_base(aux_num_out=num_out)`.
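The state dict key change described above (replacing `encoder.read_out` with `aux`) can be sketched as a small migration helper. The function name is hypothetical, and the keys and values in the example are illustrative placeholders, not the contents of a real checkpoint; the helper works on any mapping, including one loaded from a serialized state dict.

```python
def migrate_read_out_keys(state_dict):
    """Return a copy of `state_dict` with `encoder.read_out.*` renamed to `aux.*`.

    Hypothetical helper for illustration; not part of torchaudio.
    """
    old_prefix, new_prefix = "encoder.read_out.", "aux."
    migrated = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated


# Placeholder keys and values standing in for a serialized state dict.
old_state_dict = {
    "feature_extractor.conv_layers.0.conv.weight": "tensor-0",
    "encoder.read_out.weight": "tensor-1",
    "encoder.read_out.bias": "tensor-2",
}
new_state_dict = migrate_read_out_keys(old_state_dict)
print(list(new_state_dict))
```

The helper returns a new dictionary and leaves the input untouched, so the original checkpoint can still be inspected if loading the migrated one fails.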
Deprecations
- Add `melscale_fbanks` and deprecate `create_fb_matrix` (1653)
- As `linear_fbanks` is introduced, `create_fb_matrix` is renamed to `melscale_fbanks`. The original `create_fb_matrix` is now deprecated. Please use `melscale_fbanks`.
- Deprecate `VCTK` dataset (1810)
- This dataset has been taken down and is no longer available. Please use `VCTK_092` dataset.
- Deprecate data utils (1809)
- `bg_iterator` and `diskcache_iterator` are known to not improve the throughput of data loaders. Please cease their usage.
New Features
Models
**Tacotron2**
- Add Tacotron2 model (1621, 1647, 1844)
- Add Tacotron2 loss function (1764)
- Add Tacotron2 inference method (1648, 1839, 1849)
- Add phoneme text preprocessing for Tacotron2 (1668)
- Move Tacotron2 out of prototype (1714)
**HuBERT**
- Add HuBERT model architectures (1769, 1811)
Pretrained Weights and Pipelines
* Add pretrained weights for WaveRNN (1612)
* Add Tacotron2 pretrained models (1693)
* Add HuBERT pretrained weights (1821, 1824)
* Add pretrained weights from wav2vec2.0 and XLSR papers (1827)
* Add customization support to wav2vec2 labels (1834)
* Default pretrained weights to eval mode (1843)
* Move wav2vec2 pretrained models to pipelines module (1876)
* Add TTS bundle/pipelines (1872)
* Fix vocoder interface (1895)
* Fix Phonemizer download (1897)
RNN Transducer Loss
* Add reduction parameter for RNNT loss (1590)
* Rename RNNT loss C++ parameters (1602)
* Rename transducer to RNNT (1603)
* Remove gradient variable from RNNT loss Python code (1616)
* Remove reuse_logits_for_grads option for RNNT loss (1610)
* Remove fused_log_softmax option from RNNT loss (1615)
* RNNT loss resolve null gradient (1707)
* Move RNNT loss out of prototype (1711)
MVDR Beamforming
* Add MVDR module to example (1709)
* Add normalization to steering vector solutions in MVDR Module (1765)
* Move MVDR and PSD modules to transforms (1771)
* Add MVDR beamforming tutorial to example directory (1768)
Ops
* Add edit_distance (1601)
* Add PitchShift to functional and transform (1629)
* Add LFCC feature to transforms (1611)
* Add InverseSpectrogram to transforms and functional (1652)
Datasets
* Add CMUDict dataset (1627)
* Move LibriMix dataset to datasets directory (1833)
Improvements
I/O
* Make buffer size for function info configurable (1634)
Ops
* Replace deprecated AutoNonVariableTypeMode (1583)
* Remove lazy behavior from MelScale (1636)
* Simplify axis value checks (1501)
* Use at::parallel_for in lfilter core loop (1557)
* Add filterbanks support to lfilter (1587)
* Add batch support to lfilter (1638)
* Use integer rates in pitch shift resample (1861)
Models
* Rename infer method to forward for WaveRNNInferenceWrapper (1650)
* Refactor WaveRNN infer and move it to the codebase (1704)
* Make the core wav2vec2 factory function public (1829)
* Refactor WaveRNNInferenceWrapper (1845)
* Store n_bits in WaveRNN (1847)
* Replace custom padding with torch’s native impl (1846)
* Avoid concatenation in loop (1850)
* Add lengths param to WaveRNN.infer (1851)
* Add sample rate to wav2vec2 bundle (1878)
* Remove factory functions of Tacotron2 and WaveRNN (1874)
Datasets
* Fix encoding of CMUDict data reading (1665)
* Rename utterance to transcript in datasets (1841)
* Clean up constructor of CMUDict (1852)
Performance
* Refactor transforms.Fade on GPU computation (1871)
**CUDA**
Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000]
-- | -- | -- | --