\* Trained on the training data of MUSDB-HQ dataset.
\*\* Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to adefossez for the guidance.
The ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds a pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6 dB SDR improvement and 15.3 dB Si-SNR improvement on the Libri2Mix test set.
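The following is a minimal sketch of how the pipeline might be used; the bundle name `CONVTASNET_BASE_LIBRI2MIX`, the three-second dummy mixture, and the tensor shapes are illustrative assumptions rather than an exact recipe from this release.

```python
import torch
from torchaudio.pipelines import CONVTASNET_BASE_LIBRI2MIX  # assumed bundle name

bundle = CONVTASNET_BASE_LIBRI2MIX
model = bundle.get_model()

# Dummy two-speaker mixture: (batch, channel, time) at the bundle's sample rate.
mixture = torch.randn(1, 1, bundle.sample_rate * 3)

with torch.inference_mode():
    separated = model(mixture)  # (batch, num_sources, time)

print(separated.shape)
```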
[Beta] Datasets and Metadata Mode for SUPERB Benchmarks
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the [SUPERB benchmark](https://superbbenchmark.org/). Furthermore, these datasets support metadata mode through a `get_metadata` function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms; a short sketch follows the list below.
Datasets with metadata functionality:
- LIBRISPEECH ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.LIBRISPEECH.html#torchaudio.datasets.LIBRISPEECH))
- LibriMix ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.LibriMix.html#torchaudio.datasets.LibriMix))
- QUESST14 ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.QUESST14.html#torchaudio.datasets.QUESST14))
- SPEECHCOMMANDS ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.SPEECHCOMMANDS.html#torchaudio.datasets.SPEECHCOMMANDS))
- (new) FluentSpeechCommands ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.FluentSpeechCommands.html#torchaudio.datasets.FluentSpeechCommands))
- (new) Snips ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.Snips.html#torchaudio.datasets.Snips))
- (new) IEMOCAP ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.IEMOCAP.html#torchaudio.datasets.IEMOCAP))
- (new) VoxCeleb1 ([Identification](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.VoxCeleb1Identification.html#torchaudio.datasets.VoxCeleb1Identification), [Verification](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.VoxCeleb1Verification.html#torchaudio.datasets.VoxCeleb1Verification))
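As a rough illustration of metadata mode, the sketch below uses `LIBRISPEECH`; the root directory and subset are placeholders, and the field order is assumed to mirror the dataset's regular `__getitem__` output, with a file path in place of the decoded waveform.

```python
from torchaudio.datasets import LIBRISPEECH

# Placeholder root directory; set download=True to fetch the subset if missing.
dataset = LIBRISPEECH("./data", url="test-clean", download=True)

# get_metadata returns the same fields as __getitem__, except that the first
# element is the path to the audio file instead of the decoded waveform.
path, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(0)
print(path, sample_rate, transcript)
```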
[Beta] Custom Language Model support in CTC Beam Search Decoding
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the `torchaudio.models.decoder.CTCDecoderLM` wrapper.
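Below is a minimal sketch of a custom language model wrapper, assuming a user-supplied `language_model` object with a `score(token_index)` method that returns a log probability; the silence index and the caching scheme are illustrative rather than prescribed by the API.

```python
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class CustomLM(CTCDecoderLM):
    """Wrap a user-provided language model so the CTC decoder can query it."""

    def __init__(self, language_model):
        super().__init__()
        self.language_model = language_model  # assumed to expose score(token_index) -> float
        self.sil = -1                         # assumed index of the silence/boundary token
        self.states = {}                      # cache of per-hypothesis scores

    def start(self, start_with_nothing: bool = False):
        state = CTCDecoderLMState()
        self.states[state] = self.language_model.score(self.sil)
        return state

    def score(self, state: CTCDecoderLMState, token_index: int):
        outstate = state.child(token_index)
        if outstate not in self.states:
            self.states[outstate] = self.language_model.score(token_index)
        return outstate, self.states[outstate]

    def finish(self, state: CTCDecoderLMState):
        return self.score(state, self.sil)
```

An instance of such a wrapper can then be passed as the `lm` argument of `ctc_decoder` in place of a KenLM file path.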
[Beta] StreamWriter
`torchaudio.io.StreamWriter` is a class for encoding media, including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
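A minimal sketch of audio encoding with `StreamWriter` is shown below; the file name, sample rate, and single-chunk write are illustrative assumptions.

```python
import torch
from torchaudio.io import StreamWriter

sample_rate = 16_000
waveform = torch.randn(sample_rate, 1)  # one second of mono noise, shape (time, channel)

writer = StreamWriter(dst="output.wav")
writer.add_audio_stream(sample_rate=sample_rate, num_channels=1)

with writer.open():
    # Encoding is chunk-by-chunk; here the whole waveform is written in one chunk.
    writer.write_audio_chunk(0, waveform)
```

The same class exposes `add_video_stream` and `write_video_chunk` for video streams.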
Backward-incompatible changes
- [BC-breaking] Fix momentum in transforms.GriffinLim (2568)
The `GriffinLim` implementations in transforms and functional used the `momentum` parameter differently, resulting in inconsistent results between the two implementations. The `transforms.GriffinLim` usage of `momentum` is updated to resolve this discrepancy.
- Make `torchaudio.info` decode audio to compute `num_frames` if it is not found in metadata (2740).
In such cases, `torchaudio.info` may now return non-zero values for `num_frames`.
Bug Fixes
- Fix random Gaussian generation (2639)
`torchaudio.compliance.kaldi.fbank` with the dither option produced output that differed from Kaldi's because it used a skewed, rather than Gaussian, distribution for dither. This release updates it to correctly use a random Gaussian distribution instead.
- Update download link for speech commands (2777)
The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated so that the whole dataset is downloaded correctly.
New Features
IO
- Add metadata to source stream info (2461, 2464)
- Add utility function to fetch FFmpeg library versions (2467)
- Add YUV444P support to StreamReader (2516)
- Add StreamWriter (2628, 2648, 2505)
- Support in-memory decoding via Tensor wrapper in StreamReader (2694)
- Add StreamReader Tensor Binding to src (2699)
- Add StreamWriter media device/streaming tutorial (2708)
- Add StreamWriter tutorial (2698)
Ops
- Add ITU-R BS.1770-4 loudness recommendation (2472)
- Add convolution operator (2602)
- Add additive noise function (2608)
Models
- Hybrid Demucs model implementation (2506)
- Docstring change for Hybrid Demucs (2542, 2570)
- Add NNLM support to CTC Decoder (2528, 2658)
- Move hybrid demucs model out of prototype (2668)
- Move conv_tasnet_base doc out of prototype (2675)
- Add custom lm example to decoder tutorial (2762)
Pipelines
- Add SourceSeparationBundle to prototype (2440, 2559)
- Adding pipeline changes, factory functions to HDemucs (2547, 2565)
- Create tutorial for HDemucs (2572)
- Add HDEMUCS_HIGH_MUSDB (2601)
- Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (2669)
- Move Hybrid Demucs pipeline to beta (2673)
- Update description of HDemucs pipelines
Datasets
- Add fluent speech commands (2480, 2510)
- Add musdb dataset and tests (2484)
- Add VoxCeleb1 dataset (2349)
- Add metadata function for LibriSpeech (2653)
- Add Speech Commands metadata function (2687)
- Add metadata mode for various datasets (2697)
- Add IEMOCAP dataset (2732)
- Add Snips Dataset (2738)
- Add metadata for Librimix (2751)
- Add file name to returned item in Snips dataset (2775)
- Update IEMOCAP variants and labels (2778)
Improvements
IO
- Replace `runtime_error` exception with `TORCH_CHECK` (2550, 2551, 2592)
- Refactor StreamReader (2507, 2508, 2512, 2530, 2531, 2533, 2534)
- Refactor sox C++ (2636, 2663)
- Delay the import of kaldi_io (2573)
Ops
- Speed up resample with kernel generation modification (2553, 2561)
The kernel generation for resampling is optimized in this release. The table below illustrates the performance improvement over the previous release for the `torchaudio.functional.resample` function using the sinc resampling method, on a `float32` tensor with two channels and one second of audio.
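For reference, the snippet below shows a minimal, hypothetical way to time one such conversion; the sample rates, iteration count, and timing method are illustrative and not the benchmark setup behind the numbers that follow.

```python
import time

import torch
import torchaudio.functional as F

# One second of 2-channel float32 audio, resampled with the default sinc method.
orig_freq, new_freq = 8_000, 16_000
waveform = torch.randn(2, orig_freq, dtype=torch.float32)

start = time.monotonic()
for _ in range(100):
    F.resample(waveform, orig_freq=orig_freq, new_freq=new_freq)
elapsed_ms = (time.monotonic() - start) / 100 * 1e3
print(f"{orig_freq} -> {new_freq} Hz: {elapsed_ms:.3f} ms per call")
```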
CPU
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
| ----- | ----- | ----- | ----- | ----- |