Recipe
- Fixed DDP training in HuBERT recipes (3068)
If `shuffle` is set to `True` in `BucketizeBatchSampler`, the seed is the same across nodes only for the first epoch. In later epochs, each `BucketizeBatchSampler` object generates a differently shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists differ across nodes. In the 2.0.0 release, the issue is fixed by using the same RNG seed on all nodes.
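To illustrate why a shared seed matters, here is a minimal sketch that is not the recipe's actual code. The function `epoch_batches` and its parameters are hypothetical. It assumes each rank derives the shuffle seed only from a base seed and the epoch number, so all ranks compute the same permutation and receive shards of equal length.

```python
import random


def epoch_batches(batches, seed, epoch, rank, world_size):
    """Return the batches a given rank would process in a given epoch.

    Hypothetical helper for illustration only; it is not part of torchaudio.
    """
    order = list(batches)
    # The RNG seed depends on (seed, epoch) but never on the rank, so every
    # rank shuffles into the identical order.
    random.Random(seed + epoch).shuffle(order)
    # Drop the tail so the batch count divides evenly among ranks.
    usable = len(order) - len(order) % world_size
    # Each rank takes a strided slice of the shared order.
    return order[rank:usable:world_size]


shards = [
    epoch_batches(range(10), seed=0, epoch=3, rank=r, world_size=4)
    for r in range(4)
]
```

Because every rank ends up with the same number of batches, no rank waits in a collective operation for a peer that has already finished its epoch.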
I/O
- Fixed signature mismatch on `_fail_info_fileobj` (3032)
- Remove unnecessary AVFrame allocation (3021)
This fixes the memory leak reported in `torchaudio.io.StreamReader`.
New Features
Ops
- Added CUDA kernel for `torchaudio.functional.lfilter` (3018)
- Added data augmentation ops (2801, 2809, 2829, 2811, 2871, 2874, 2892, 2935, 2977, 3001, 3009, 3061, 3072)
Introduces `AddNoise`, `Convolve`, `FFTConvolve`, `Speed`, `SpeedPerturbation`, `Deemphasis`, and `Preemphasis` in `torchaudio.transforms`, and `add_noise`, `fftconvolve`, `convolve`, `speed`, `preemphasis`, and `deemphasis` in `torchaudio.functional`.
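As a rough illustration of what a pre-emphasis and de-emphasis pair computes, the sketch below implements the conventional first-order filters in plain Python. It does not call torchaudio. The function names, the list-based signal, and the coefficient values are assumptions for illustration; the library's own ops work on tensors and may differ in detail.

```python
def preemphasis(signal, coeff=0.97):
    """Conventional pre-emphasis: y[i] = x[i] - coeff * x[i - 1]."""
    out = list(signal)
    for i in range(1, len(signal)):
        out[i] = signal[i] - coeff * signal[i - 1]
    return out


def deemphasis(signal, coeff=0.97):
    """Inverse filter: x[i] = y[i] + coeff * x[i - 1]."""
    out = list(signal)
    for i in range(1, len(signal)):
        # Uses the already reconstructed previous output sample.
        out[i] = signal[i] + coeff * out[i - 1]
    return out


x = [1.0, 0.5, -0.25, 0.0]
emphasized = preemphasis(x, coeff=0.5)
restored = deemphasis(emphasized, coeff=0.5)
```

Applying the two filters in sequence with the same coefficient recovers the input, which is why they are typically used as a pair.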
Models
- Added WavLM model (2822, 2842)
- Added XLS-R models (2959)
Pipelines
- Added WavLM bundles (2833, 2895)
- Added pre-trained pipelines for XLS-R models (2978)
I/O
- Added rgb48le and CUDA p010 support (HDR/10bit) to StreamReader (3023)
- Added `fill_buffer` method to `torchaudio.io.StreamReader` (2954, 2971)
- Added `buffer_chunk_size=-1` option to `torchaudio.io.StreamReader` (2969)
When `buffer_chunk_size=-1`, `StreamReader` does not drop any buffered frames. Together with the `fill_buffer` method, this is a recommended way to load an entire media file.
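```python
from torchaudio.io import StreamReader

reader = StreamReader("video.mp4")
# Keep every decoded frame instead of dropping old chunks.
reader.add_basic_audio_stream(buffer_chunk_size=-1)
reader.add_basic_video_stream(buffer_chunk_size=-1)
# Decode the whole file into the buffers, then fetch everything at once.
reader.fill_buffer()
audio, video = reader.pop_chunks()
```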
- Added PTS support to `torchaudio.io.StreamReader` (2975)
`torchaudio.io.StreamReader` now gives the PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.
```python
reader = StreamReader(...)
reader.add_basic_audio_stream(...)
reader.add_basic_video_stream(...)
for audio_chunk, video_chunk in reader.stream():
    # Fetch timestamp
    print(audio_chunk.pts)
    print(video_chunk.pts)
    # Chunks behave the same as torch.Tensor.
    audio_chunk.mean(dim=1)
```
- Added playback function `torchaudio.io.play_audio` (3026, 3051)
You can play audio with the `torchaudio.io.play_audio` function. (macOS only)
- Added new dispatcher (3015, 3058, 3073)
Other
- Add utility functions to check information about FFmpeg (2958, 3014)
The following functions are added to `torchaudio.utils.ffmpeg_utils`. They can be used to query the dynamically linked FFmpeg libraries.
- `get_demuxers()`
- `get_muxers()`
- `get_audio_decoders()`
- `get_audio_encoders()`
- `get_video_decoders()`
- `get_video_encoders()`
- `get_input_devices()`
- `get_output_devices()`
- `get_input_protocols()`
- `get_output_protocols()`
- `get_build_config()`
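The getters listed above could be combined into small capability checks. The sketch below is one hedged example: `supports_audio_codec` is a hypothetical helper, and it assumes each getter returns a container whose members (or keys) are short codec names. The module is passed in as an argument so the sketch can run against a stand-in object with made-up contents, without FFmpeg installed.

```python
from types import SimpleNamespace


def supports_audio_codec(ffmpeg_utils, name):
    """Report whether `name` is listed as an audio decoder and encoder.

    `ffmpeg_utils` is expected to be `torchaudio.utils.ffmpeg_utils`.
    Assumption: the getters return containers of short codec names.
    """
    return {
        "decode": name in ffmpeg_utils.get_audio_decoders(),
        "encode": name in ffmpeg_utils.get_audio_encoders(),
    }


# Stand-in object with invented contents, used only to exercise the helper.
fake_utils = SimpleNamespace(
    get_audio_decoders=lambda: {"flac": "FLAC", "mp3": "MP3"},
    get_audio_encoders=lambda: {"flac": "FLAC"},
)
result = supports_audio_codec(fake_utils, "mp3")
```

With torchaudio and FFmpeg available, the real module would be passed instead, for example `supports_audio_codec(torchaudio.utils.ffmpeg_utils, "flac")`.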
Recipes
- Add modularized SSL training recipe (2876)
Improvements
I/O
- Refactor StreamReader/Writer implementation
- Refactored StreamProcessor interface (2791)
- Refactored Buffer implementation (2939, 2943, 2962, 2984, 2988)
- Refactored AVFrame to Tensor conversions (2940, 2946)
- Refactored and optimized yuv420p and nv12 processing (2945)
- Abstracted away AVFormatContext from constructor (3007)
- Removed unused/redundant things (2995)
- Replaced `torchaudio::ffmpeg` namespace with `torchaudio::io` (3013)
- Merged `pop_chunks` implementations (3002)
- Cleaned up private methods (3030)
- Moved drain method to private (2996)
- Added logging to `torchaudio.io.StreamReader/Writer` (2878)
- Fixed the threads used by FilterGraph to 1 (2985)
- Fixed the default threads used by decoder to 1 in `torchaudio.io.StreamReader` (2949)
- Moved libsox integration from `libtorchaudio` to `libtorchaudio_sox` (2929)
- Added query methods to FilterGraph (2976)
Ops
- Added logging to MelSpectrogram and Spectrogram (2861)
- Fixed filtering function fallback mechanism (2953)
- Enabled log probs input for RNN-T loss (2798)
- Refactored extension modules initialization (2968)
- Updated the guard mechanism for FFmpeg-related features (3028)
- Updated the guard mechanism for `cuda_version` (2952)
Models
- Renamed generator to vocoder in HiFiGAN model and factory functions (2955)
- Enforces contiguous tensor in CTC decoder (3074)
Datasets
- Validates the input path in LibriMix dataset (2944)
Documentation
- Fixed docs warnings for conformer w2v2 (2900)
- Updated model documentation structure (2902)
- Fixed document for MelScale and InverseMelScale (2967)
- Updated highlighting in doc (3000)
- Added installation / build instruction to doc (3038)
- Redirect build instruction to official doc (3053)
- Tweak docs around IO (3064)
- Improved docstring about input path to LibriMix (2937)
Recipes
- Simplify train step in Conformer RNN-T LibriSpeech recipe (2981)
- Update WER results for CTC n-gram decoding (3070)
- Update ssl example (3060)
- Fix import bug in `global_stats.py` (2858)
- Fixes examples/source_separation for WSJ0_2mix dataset (2987)
Tutorials
- Added mel spectrogram visualization to Streaming ASR tutorial (2974)
- Fixed mel spectrogram visualization in TTS tutorial (2989)
- Updated data augmentation tutorial to use new operators (3062)
- Fixed hybrid demucs tutorial for CUDA (3017)
- Updated hardware accelerated video processing tutorial (3050)
Builds
- Fixed `USE_CUDA` detection (3005)
- Fixed `USE_ROCM` detection (3008)
- Added M1 Conda builds (2840)
- Added M1 Wheels builds (2839)
- Added CUDA 11.8 builds (2951)
- Switched CI to CUDA 11.7 from CUDA 11.6 (3031, 3034)
- Added python 3.11 support (3039, 3071)
- Updated C++ standard to 17 (2973)
Tests
- Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (2910)
- Fix CI tests on gpu machines (2982)
- Remove function input parameters from data aug functional tests (3011)
- Reduce the sample rate of some tests (2963)
Style
- Fix type of arguments in torchaudio.io classes (2913)