Torchaudio

Latest version: v2.6.0


Page 1 of 16

2005.08100

The release adds an implementation of Conformer ([docs](https://pytorch.org/audio/0.11.0/models.html#conformer)), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.

Backward-incompatible changes
Ops
- Removed deprecated `F.magphase`, `F.angle`, `F.complex_norm`, and `T.ComplexNorm`. (1934, 1935, 1942)
  - Utility functions for pseudo complex types were deprecated in 0.10 and are removed in 0.11. For details of the migration plan, please refer to 1337.
- Dropped pseudo complex support from `F.spectrogram`, `T.Spectrogram`, `F.phase_vocoder`, and `T.TimeStretch` (1957, 1958)
  - Support for the pseudo complex type was deprecated in 0.10 and is removed in 0.11. For details of the migration plan, please refer to 1337.
- Removed deprecated `create_fb_matrix` (1998)
  - `create_fb_matrix` was replaced by `melscale_fbanks` in release 0.10 and is removed in 0.11. Please use `melscale_fbanks`.
Datasets
- Removed deprecated VCTK (1825)
  - The original VCTK archive file is no longer accessible. Please migrate to the `VCTK_092` class for the latest version of the dataset.
- Removed deprecated dataset utils (1826)
  - The undocumented methods `diskcache_iterator` and `bg_iterator` were deprecated in 0.10 and are removed in 0.11. Please stop using them.
Models
- Removed unused dimension from pretrained Wav2Vec2 ASR (1914)
  - The final linear layer of Wav2Vec2 ASR models included dimensions (`<s>`, `<pad>`, `</s>`, `<unk>`) that were not related to ASR tasks and not used. These dimensions were removed.
Build
- Dropped support for Python 3.6 (2119, 2139)
  - Following the end of the Python 3.6 lifecycle, torchaudio dropped support for Python 3.6.

New Features
RNN-T Emformer
* Introduced Emformer (1801)
* Added Emformer RNN-T model (2003)
* Added RNN-T beam search decoder (2028)
* Cleaned up Emformer module (2091)
* Added pretrained Emformer RNN-T streaming ASR inference pipeline (2093)
* Reorganized RNN-T components in prototype module (2110)
* Added integration test for Emformer RNN-T LibriSpeech pipeline (2172)
* Registered RNN-T pipeline global stats constants as buffers (2175)
* Refactored RNN-T factory function to support num_symbols argument (2178)
* Fixed output shape description in RNN-T docstrings (2179)
* Removed invalid token blanking logic from RNN-T decoder (2180)
* Updated stale prototype references (2189)
* Revised RNN-T pipeline streaming decoding logic (2192)
* Cleaned up Emformer (2207)
* Applied minor fixes to Emformer implementation (2252)

Conformer
* Introduced Conformer (2068)
* Removed subsampling and positional embedding logic from Conformer (2171)
* Moved ASR features out of prototype (2187)
* Passed bias and dropout args to Conformer convolution block (2215)
* Adjusted Conformer args (2223)

Datasets
* Added DR-VCTK dataset (1819)

Models
* Added HuBERT pretrain model to enable training from scratch (2064)
* Added feature mean square value to HuBERT Pretrain model output (2128)

Pipelines
* Added wav2vec2 ASR French pretrained from voxpopuli (1919)
* Added wav2vec2 ASR Spanish pretrained model from voxpopuli (1924)
* Added wav2vec2 ASR German pretrained model from voxpopuli (1953)
* Added wav2vec2 ASR Italian pretrained model from voxpopuli (1954)
* Added wav2vec2 ASR English pretrained model from voxpopuli (1956)

Build
* Added CUDA-11.5 builds to torchaudio (2067)

Improvements
I/O
* Fixed load behavior for 24-bit input (2084)

Ops
* Added OpenMP support (1761)
* Improved MVDR stability (2004)
* Relaxed dtype for MVDR (2024)
* Added warnings in mu_law* for the wrong input type (2034)
* Added parameter p to TimeMasking (2090)
* Removed unused vars from RNN-T loss (2142)
* Removed complex32 dtype in F.griffinlim (2233)

Datasets
* Deprecated data utils (2073)
* Updated URLs for libritts (2074)
* Added subset support for TEDLIUM release3 dataset (2157)

Models
* Replaced dropout with Dropout (1815)
* Inplace initialization of RNN weights (2010)
* Updated to xavier_uniform and avoid legacy data.uniform_ initialization (2018)
* Allowed Tacotron2 decode batch_size 1 examples (2156)

Pipelines
* Added tool to convert voxpopuli model (1923)
* Refactored wav2vec2 pipeline util (1925)
* Allowed the customization of axis exclusion for ASR head (1932)
* Tweaked wav2vec2 checkpoint conversion tool (1938)
* Added melkwargs setting for MFCC in HuBERT pipeline (1949)

Documentation
* Added 0.10.0 to version compatibility matrix (1862)
* Removed MACOSX_DEPLOYMENT_TARGET (1880)
* Updated intersphinx inventory (1893)
* Updated compatibility matrix to include LTS version (1896)
* Updated CONTRIBUTING with doc conventions (1898)
* Added anaconda stats to README (1910)
* Updated README.md (1916)
* Added citation information (1947)
* Updated CONTRIBUTING.md (1975)
* Doc fixes (1982)
* Added tutorial to CONTRIBUTING (1990)
* Fixed docstring (2002)
* Fixed minor typo (2012)
* Updated audio augmentation tutorial (2082)
* Added Sphinx gallery automatically (2101)
* Disabled matplotlib warning in tutorial rendering (2107)
* Updated prototype documentations (2108)
* Added custom CSS to make signatures appear in multi-line (2123)
* Updated prototype pipeline documentation (2148)
* Tweaked documentation (2152)

Tests
* Refactored integration test (1922)
* Enabled integration tests on CI (1939)
* Removed facebook folder in wav2vec unit tests (2015)
* Temporarily skipped threadpool test (2025)
* Revised Griffin-Lim transform test to reduce execution time (2037)
* Fixed CircleCI test failures (2069)
* Do not auto-skip tests on CI (2127)
* Relaxed absolute tolerance for Kaldi compat tests (2165)
* Added tacotron2 unit test with different batch_size (2176)

Build
* Updated GPU resource class (1791)
* Updated the main version to 0.11.0 (1793)
* Updated windows cuda installer 11.1.0 to 11.1.1 (1795)
* Renamed build_tools to tools (1812)
* Limit Windows GPU testing to CUDA-11.3 only (1842)
* Used cu113 for unittest_windows_gpu (1853)
* USE_CUDA in windows and reduce one vcvarsall (1854)
* Check torch installation before building package (1867)
* Install tools from conda instead of brew (1873)
* Cleaned up setup.py (1900)
* Moved TorchAudio conda package to use pytorch-mutex (1904)
* Updated smoke test docker image (1905)
* Fixed formatting CIRCLECI_TAG when building docs (1915)
* Fetch third party sources automatically (1966)
* Disabled SPHINXOPT=-W for local env (2013)
* Improved installing nightly pytorch (2026)
* Improved cuda installation on windows (2032)
* Refactored the library loading mechanism (2038)
* Cleaned up libtorchaudio customization logic (2039)
* Refactored and functionize the library definition (2040)
* Introduced helper function to define extension (2077)
* Standardized the location of third-party source code (2086)
* Show lint diff with color (2102)
* Updated third party submodule setup (2132)
* Suppressed stderr from subprocess in setup.py (2133)
* Fixed header include (2135)
* Updated ROCM version 4.1 -> 4.3.1 and 4.5 (2186)
* Added "cu102" back (2190)
* Pinned flake8 version (2191)

Style
* Removed trailing whitespace (1803)
* Fixed style checks (1913)
* Resolved lint warning (1971)
* Enabled CLANGFORMAT (1999)
* Fixed style checks in examples/tutorials (2006)
* OSS config for lint checks (2066)
* Excluded sphinx-gallery examples (2071)
* Reverted linting exemptions introduced in 2071 (2087)
* Applied arc lint to pytorch audio (2096)
* Enforced lint checks and fix/mute lint errors (2116)

Other
* Replaced issue templates with new issue forms (1802)
* Notify merger if PR is incorrectly labeled (1937)
* Added script to collect PRs between commits (1943)
* Fixed PR labeling requirement (1946)
* Refactored collecting-PR script for release note (1951)
* Fixed bandit failure (1960)
* Renamed bug fix label (1961)
* Updated PR label notifier (1964)
* Reverted "Update PR label notifier (1964)" (1965)
* Consolidated network utils (1974)
* Added PR collecting script (2008)
* Re-sync with internal repository (2017)
* Updated script for getting PR merger and labels (2030)
* Fixed third party archive fetch job (2095)
* Use python:3.X Docker image for build doc (2151)
* Updated PR labeling workflow (2160)
* Fixed librosa calls (2208)

Examples
Ops
* Removed the MVDR tutorial in examples (2109)
* Abstracted BucketizeSampler to be usable outside of HuBERT example (2147)
* Refactored BucketizeBatchSampler and HuBERTDataset (2150)
* Removed multiprocessing from audio dataset tutorial (2163)

Models
* Added training recipe for RNN-T Emformer ASR model (2052)
* Added global stats script and new json for LibriSpeech RNN-T training recipe (2183)

Pipelines
* Added preprocessing scripts for HuBERT model training (1911)
* Supported multi-node training for source separation pipeline (1968)
* Added bucketize sampler and dataset for HuBERT Base model training pipeline (2000)
* Added librispeech inference script (2130)

Other
* Added unmaintained warnings (1813)
* Replaced `torch.quantization` with `torch.ao.quantization` (1823)
* Use download.pytorch.org for asset URL (2182)
* Added deprecation path for renamed training type plugins (11227)
* Renamed DDPPlugin to DDPStrategy (11142)


Improved Autograd Support

Along with the Complex Tensor Migration and Filtering Improvement work mentioned above, more tests were added to verify autograd support. The following operations are now guaranteed to support autograd up to second order.

Functionals

* `lfilter`
* `allpass_biquad`
* `biquad`
* `band_biquad`
* `bandpass_biquad`
* `bandreject_biquad`
* `bass_biquad`
* `equalizer_biquad`
* `treble_biquad`
* `highpass_biquad`
* `lowpass_biquad`

Transforms

* `AmplitudeToDB`
* `ComputeDeltas`
* `Fade`
* `GriffinLim`
* `TimeMasking`
* `FrequencyMasking`
* `MFCC`
* `MelScale`
* `MelSpectrogram`
* `Resample`
* `SpectralCentroid`
* `Spectrogram`
* `SlidingWindowCmn`
* `TimeStretch`[*](#note-complex)
* `Vol`

**NOTE**:

1. Autograd test for transforms also covers the following functionals.
   * `amplitude_to_DB`
   * `spectrogram`
   * `griffinlim`
   * `resample`
   * `phase_vocoder`[*](#note-complex)
   * `mask_along_axis_iid`
   * `mask_along_axis`
   * `gain`
   * `spectral_centroid`
2. <a name="note-complex"></a>`torchaudio.transforms.TimeStretch` and `torchaudio.functional.phase_vocoder` call `atan2`, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.
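
The behavior described in the second note can be seen numerically with the standard library alone. The sketch below is plain Python, not torchaudio code, and `phase` is a hypothetical helper. It evaluates `atan2` for two inputs that lie within 1e-12 of zero on opposite sides: a tiny change in the input moves the computed phase by π, whereas the same perturbation far from zero barely changes it.

```python
import math

def phase(real, imag):
    """Phase angle of real + imag*j, computed as atan2(imag, real)."""
    return math.atan2(imag, real)

eps = 1e-12

# Two inputs that differ by only 2e-12 but sit on opposite sides of zero.
left = phase(-eps, 0.0)
right = phase(eps, 0.0)
jump_near_zero = abs(left - right)

# Far from zero, a perturbation of the same size barely moves the phase.
far_a = phase(1.0, 0.0)
far_b = phase(1.0, 2 * eps)
jump_far = abs(far_a - far_b)

print(jump_near_zero, jump_far)
```

This is only an illustration of why gradients through the phase are unreliable near zero; it does not reproduce the autograd tests themselves.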

[Beta] Resampling Improvement

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.

* Kaiser window has been added for a wider range of resampling quality.
* `rolloff` parameter has been added for anti-aliasing control.
* `torchaudio.transforms.Resample` precomputes the kernel using `float64` precision and caches it for even faster operation.
* A new entry point, `torchaudio.functional.resample`, has been added, and the original entry point, `torchaudio.compliance.kaldi.resample_waveform`, is deprecated.
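
As a rough illustration of how these pieces relate, the following plain-Python sketch derives the quantities a windowed-sinc resampler typically needs. It is not torchaudio's implementation, and `resample_plan` and its return fields are hypothetical. It assumes that `rolloff` scales the low-pass cutoff as a fraction of the Nyquist frequency of the lower sampling rate, and that the two rates are reduced by their greatest common divisor.

```python
import math

def resample_plan(orig_freq, new_freq, num_samples, rolloff=0.99):
    """Describe a resampling job without doing any filtering (illustrative only)."""
    if orig_freq == new_freq:
        # Nothing to do when the sampling rate is unchanged.
        return None
    gcd = math.gcd(orig_freq, new_freq)
    nyquist = min(orig_freq, new_freq) / 2
    return {
        "up": new_freq // gcd,            # reduced upsampling factor
        "down": orig_freq // gcd,         # reduced downsampling factor
        "cutoff_hz": rolloff * nyquist,   # lower rolloff leaves more anti-aliasing margin
        "output_samples": math.ceil(num_samples * new_freq / orig_freq),
    }

# One second of 16 kHz audio resampled to 44.1 kHz.
plan = resample_plan(16000, 44100, num_samples=16000)
print(plan)
```

Under these assumptions, lowering `rolloff` moves the cutoff further below Nyquist, which reduces aliasing at the cost of attenuating more of the highest frequencies.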

The following table illustrates the performance improvements from the previous release by comparing the time it takes for `torchaudio.transforms.Resample` to process a `float32` tensor with two channels and a duration of one second.

CPU

<table>
<tr>
<td>torchaudio version
</td>
<td>8k → 16k [Hz]
</td>
<td>16k → 8k
</td>
<td>16k → 44.1k
</td>
<td>44.1k → 16k
</td>
</tr>
<tr>
<td>0.9
</td>
<td><p style="text-align: right">

46.7

</td>
</tr>
</table>

Unit: msec

Improved Windows Support

torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was previously available only on Linux and macOS. In this release, Windows packages also come with the C++ module.

The C++ module in the Windows package includes the efficient filtering implementation mentioned above. However, the `"sox_io"` backend and `torchaudio.functional.compute_kaldi_pitch` are not included.

I/O Functions Migration

Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from `"sox"` to `"sox_io"`, and a similar API change was applied to the `"soundfile"` backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to [903](https://github.com/pytorch/audio/issues/903).

Backward Incompatible Changes

I/O

* Deprecated backends and functions were removed (1311, 1329, 1362)
  * Please see 903 for the migration.
* Added validation of the number of channels when saving GSM (1384)
  * Please make sure that the signal has only one channel when saving as GSM.

Ops

* Removed deprecated `normalized` argument from `torchaudio.functional.griffinlim` (1369)
  * This argument was never used. Please remove the argument from your call.
* Renamed `torchaudio.functional.sliding_window_cmn` arg for correctness (1347)
  * The first argument is supposed to be a spectrogram. If you have used the keyword argument `waveform=...`, please change it to `specgram=...`
* Changed `torchaudio.transforms.Resample` to precompute and cache the resampling kernel. (1499, 1514)
  * To use the transform on devices other than CPU, please move the instantiated object to the target device.

    ```python
    import torch
    import torchaudio

    resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
    resampler.to(torch.device("cuda"))
    ```

Dataset

* Removed deprecated arguments from CommonVoice (1534)
  * `torchaudio` no longer supports programmatic download of Common Voice dataset. Please remove the arguments from your code.

Deprecations

* Deprecated the use of pseudo complex type (1445, 1492)
  * `torchaudio` is adopting the native complex type, and the use of the pseudo complex type and the related utility functions is now deprecated. Please refer to 1337 for the migration process.
* Deprecated `torchaudio.compliance.kaldi.resample_waveform` (1533)
  * Please use `torchaudio.functional.resample`.
* `torchaudio.transforms.MelScale` now expects valid `n_stft` value (1515)
  * Please provide a valid value to `n_stft`.

New Features

[Beta] Wav2Vec2.0

* Added wav2vec2.0 model (1529)
* Added wav2vec2.0 HuggingFace importer (1530)
* Added wav2vec2.0 fairseq importer (1531)
* Added speech recognition C++ example (1538)
  * Please refer to the [C++ example](https://github.com/pytorch/audio/tree/master/examples/libtorchaudio/speech_recognition) for details.

Filtering

* Added C++ implementation of `torchaudio.functional.lfilter` (1319)
* Added autograd support to `torchaudio.functional.lfilter` (1310, 1441)

[Beta] Resampling

* Added `torchaudio.functional.resample` (1402)
* Added `rolloff` parameter (1488)
* Added kaiser window support to resampling (1509)
* Added kernel caching mechanism in `torchaudio.transforms.Resample` (1499, 1514, 1556)
* Skip resampling when sampling rate is not changed (1537)

Native Complex Tensor

* Added complex tensor support to `torchaudio.functional.phase_vocoder` and `torchaudio.transforms.TimeStretch` (1410)
* Added `return_complex` to `torchaudio.functional.spectrogram` and `torchaudio.transforms.Spectrogram` (1366, 1551)

Improvements

I/O

* Added file path to I/O error messages (1523)
* Added `__str__` override to `AudioMetaData` for easy print (1339)
* Fixed uninitialized variable in `sox/utils.cpp` (1306)
* Replaced UB sox conversion macros with tensor op (1370)
* Removed `check_length` from `validate_input_file` (1312)

Ops

* Added warning for non-integer resampling frequencies (1490)
* Adopted native complex tensors in `torchaudio.functional.griffinlim` (1368)
* Prohibited scripting `torchaudio.transforms.MelScale` when `n_stft` is invalid (1505)
* Added input dimension check to VAD (1513)
* Added HTK-compatible option to Mel-scale conversion (593)
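
For reference, the HTK convention for Mel-scale conversion is commonly written as m = 2595 · log10(1 + f / 700). The following is a minimal sketch of that formula and its inverse in plain Python. The function names are hypothetical and this is not torchaudio's code.

```python
import math

def hz_to_mel_htk(freq_hz):
    """Convert a frequency in Hz to Mel using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

mel = hz_to_mel_htk(1000.0)
roundtrip_hz = mel_to_hz_htk(mel)
print(mel, roundtrip_hz)
```

With this formula, 1000 Hz maps to roughly 1000 Mel, which is the usual sanity check for the HTK variant.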

Models

* Added vanilla DeepSpeech model (1399)

Datasets

* Fixed checksum for the YESNO dataset (1405)

Misc

* Added missing transforms to `__all__` (1458)
* Removed `reference_cast` in `make_boxed_from_unboxed_functor` (1300)
* Removed unused normalized constant from `torchaudio.transforms.GriffinLim` (1433)
* Removed unused helper function (1396)

Examples

* Added libtorchaudio C++ example (1349)
* Refactored libtorchaudio example (1486)
* Replaced `librosa`'s Mel scale conversion with `torchaudio`’s in WaveRNN example (1444)

Build

* Updated `config.guess` to support source build in recent architectures (1484)
* Explicitly disabled wavpack when building SoX (1462)
* Added ROCm support to source build (1411)
* Added Windows C++ binary build (1345, 1371)
* Made kaldi selective in build (1342)
* Made sox selective (1338)

Testing

* Added autograd test for `torchaudio.functional.lfilter` and `biquad` variants (1400, 1438)
* Added autograd test for transforms (overview: 1414)
  * `torchaudio.transforms.FrequencyMasking` (1498)
  * `torchaudio.transforms.SlidingWindowCmn` (1482)
  * `torchaudio.transforms.MelScale` (1467)
  * `torchaudio.transforms.Vol` (1460)
  * `torchaudio.transforms.TimeStretch` (1420)
  * `torchaudio.transforms.AmplitudeToDB` (1447)
  * `torchaudio.transforms.GriffinLim` (1421)
  * `torchaudio.transforms.SpectralCentroid` (1425)
  * `torchaudio.transforms.ComputeDeltas` (1422)
  * `torchaudio.transforms.Fade` (1424)
  * `torchaudio.transforms.Resample` (1416)
  * `torchaudio.transforms.MFCC` (1415)
  * `torchaudio.transforms.Spectrogram` / `MelSpectrogram` (1340)
* Added test for a batch of different items in the functional batch consistency test. (1315)
* Added test for validating `torchaudio.functional.lfilter` shape (1360)
* Added TorchScript test for `torchaudio.functional.resample` (1516)
* Added TorchScript test for `torchaudio.functional.phase_vocoder` (1379)
* Added steps to save and load the scripted object in TorchScript (1446)
* Added GPU support to functional tests (1475)
* Added GPU support to transform librosa compatibility test (1439)
* Added GPU support to functional librosa compatibility test (1436)
* Improved HTTP fetch test reliability (1512)
* Refactored functional batch consistency test (1341)
* Refactored test classes for complex (1491)
* Refactored sox_io load test (1394)
* Refactored Kaldi compatibility tests (1359)
* Refactored functional test (1435, 1463)
* Refactored transform tests (1356)
* Refactored librosa compatibility test (1350)
* Refactored sox compatibility test (1344)
* Refactored librosa compatibility test (1259)
* Removed the use of I/O functions in batch consistency test (1521)
* Removed skipIfNoSoxBackend (1390)
* Removed VAD from batch consistency tests (1451)
* Replaced deprecated `floor_divide` with `div` (1455)
* Replaced `torch.assert_allclose` with `assertEqual` (1387)
* Shortened `torchaudio.functional.lfilter` autograd tests input size (1443)
* Updated `torchaudio.transforms.InverseMelScale` comparison test (1437)

Bug Fixes

* Updated `torchaudio.transforms.TimeMasking` and `torchaudio.transforms.FrequencyMasking` to perform out-of-place masking (1481)
* Annotated `power` of `torchaudio.transforms.MelSpectrogram` as float only (1572)
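
The masking fix above matters when the same spectrogram is reused after augmentation. The following is a minimal plain-Python sketch of out-of-place masking, where nested lists stand in for a spectrogram and `mask_time` is a hypothetical helper rather than torchaudio's API. The function returns a new object and leaves its input unchanged.

```python
def mask_time(specgram, start, width, value=0.0):
    """Return a copy of `specgram` with time steps [start, start + width) masked."""
    return [
        [value if start <= t < start + width else x for t, x in enumerate(row)]
        for row in specgram
    ]

spec = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
masked = mask_time(spec, start=1, width=2)

print(masked)  # masked copy
print(spec)    # original values are still intact
```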

Performance

* Adopted `torch.nn.functional.conv1d` in `torchaudio.functional.lfilter` (1318)
* Added C++ implementation of `torchaudio.functional.overdrive` (1299)

Documentation

* Updated docs (1550)
* Reformatted resample docs (1548)
* Updated resampling documentation (1519)
* Added the clarification that `sox_effects.apply_effects_tensor` is CPU-only (1459)
* Removed instructions on using external sox (1365, 1281)
* Added navigation with left/right arrow keys (1336)
* Fixed docstring of `sliding_window_cmn` (1383)
* Updated contributing guide (1372)
* Fixed broken links in contribution guide (1361)
* Added Windows build instructions (1440)
* Fixed typo (1471, 1397, 1293)
* Added WER to readme in wav2letter pipeline (1470)
* Fixed wav2letter usage example (1060)
* Added Google Analytics support (1466)


Complex Tensor Migration

`torchaudio` has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, `torchaudio` adopted the convention of using an extra dimension to represent real and imaginary parts. In PyTorch 1.6, new dtypes such as `torch.cfloat` and `torch.cdouble` were introduced to represent complex values natively. (In the following, we refer to `torchaudio`’s original convention as pseudo complex types, and PyTorch’s native dtype as native complex types.)
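
To make the two conventions concrete, here is a minimal sketch using only the Python standard library (not torchaudio code). It stores the same values first as trailing (real, imaginary) pairs, as in the pseudo complex convention, and then as native complex numbers. The helper names are hypothetical. In PyTorch, the corresponding conversion is `torch.view_as_complex`, with `torch.abs` and `torch.angle` giving magnitude and phase.

```python
import cmath

# Pseudo complex convention: an extra trailing dimension of size 2
# holds the real and imaginary parts of each value.
pseudo = [[3.0, 4.0], [0.0, -2.0], [-1.0, 0.0]]

def to_native(pairs):
    """Convert [real, imag] pairs to Python complex numbers (hypothetical helper)."""
    return [complex(re, im) for re, im in pairs]

def magphase(values):
    """Return (magnitudes, phases) of native complex values (hypothetical helper)."""
    return [abs(v) for v in values], [cmath.phase(v) for v in values]

native = to_native(pseudo)
mag, phase = magphase(native)

print(native)
print(mag)
print(phase)
```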

As the native complex types have become mature and stable, `torchaudio` has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing and receiving the native complex type directly. Users can choose to keep using the pseudo complex type or opt in to the native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of the migration plan, please refer to 1337.

Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to process a `float32` tensor with two channels and 256 frames.

CPU

<table>
<tr>
<td>torchaudio version
</td>
<td><code>Spectrogram</code>
</td>
<td><code>TimeStretch</code>
</td>
<td><code>GriffinLim</code>
</td>
</tr>
<tr>
<td>0.9
</td>
<td><p style="text-align: right">

17.6

</td>
</tr>
</table>

Unit: msec

CUDA

<table>
<tr>
<td>torchaudio version
</td>
<td>8k → 16k
</td>
<td>16k → 8k
</td>
<td>16k → 44.1k
</td>
<td>44.1k → 16k
</td>
</tr>
<tr>
<td>0.9
</td>
<td>
</td>
</tr>
</table>