Torchaudio

Latest version: v2.6.0



0.11.0

- Emformer ([paper](https://arxiv.org/abs/2010.10759)) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
- Voxpopuli pre-trained pipelines
- HuBERTPretrainModel for training HuBERT from scratch
- Conformer model for speech recognition
- Drop Python 3.6 support

[Beta] Emformer RNN-T
To support streaming ASR use cases, the release adds implementations of Emformer ([docs](https://pytorch.org/audio/0.11.0/models.html#emformer)), an RNN-T model that uses Emformer ([emformer_rnnt_base](https://pytorch.org/audio/0.11.0/models.html#emformer-rnnt-base)), and an RNN-T beam search decoder ([RNNTBeamSearch](https://pytorch.org/audio/0.11.0/models.html#rnntbeamsearch)). It also includes a pipeline bundle ([EMFORMER_RNNT_BASE_LIBRISPEECH](https://pytorch.org/audio/0.11.0/pipelines.html#emformer-rnnt-base-librispeech)) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech; together, these allow streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.

[Beta] HuBERT Pretrain Model
Masked prediction training of the HuBERT model requires the masked logits, the unmasked logits, and the feature norm as outputs. The logits are used for the cross-entropy losses and the feature norm for the penalty loss. The release adds [HuBERTPretrainModel](https://pytorch.org/audio/0.11.0/models.html#hubertpretrainmodel) and corresponding factory functions ([hubert_pretrain_base](https://pytorch.org/audio/0.11.0/models.html#hubert-pretrain-base), [hubert_pretrain_large](https://pytorch.org/audio/0.11.0/models.html#hubert-pretrain-large), and [hubert_pretrain_xlarge](https://pytorch.org/audio/0.11.0/models.html#hubert-pretrain-xlarge)) to enable training from scratch.

0.10.2

This is a minor release compatible with [PyTorch 1.10.2](https://github.com/pytorch/pytorch/releases/tag/v1.10.2).

There is no feature change in torchaudio from 0.10.1. For the full feature set of v0.10, please refer to the [v0.10.0 release notes](https://github.com/pytorch/audio/releases/tag/v0.10.0).

0.10.1

This is a minor release that is compatible with [PyTorch 1.10.1](https://github.com/pytorch/pytorch/releases/tag/v1.10.1) and includes a small bug fix, improvements, and documentation updates. No new features are added.

Bug Fix
- #2050 Allow whitespace as `TORCH_CUDA_ARCH_LIST` delimiter
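
As a rough illustration of what accepting whitespace as a delimiter means, the sketch below splits an architecture list on semicolons or whitespace. The helper name `parse_arch_list` is hypothetical and this is not torchaudio's build code; it also assumes that `;` is the other accepted delimiter.

```python
import re


def parse_arch_list(value: str) -> list:
    """Split a TORCH_CUDA_ARCH_LIST-style string on ';' or whitespace.

    Hypothetical helper for illustration only; the real build scripts
    may parse the variable differently.
    """
    # Treat any run of semicolons and/or whitespace as a single delimiter.
    return [item for item in re.split(r"[;\s]+", value.strip()) if item]


archs = parse_arch_list("7.0 7.5;8.0+PTX")
```

Under these assumptions, `"7.0 7.5;8.0+PTX"` and `"7.0;7.5;8.0+PTX"` yield the same three entries.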

Improvement
- #2054 Fetch third party source code automatically
  - The build process now fetches third party source code (git submodules and CMake external projects).
- #2059 Improve documentation

For the full feature set of v0.10, please refer to [the v0.10.0 release notes](https://github.com/pytorch/audio/releases/tag/v0.10.0).

0.10

0.10.0

- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries

[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support is added for pretrained weights from [wav2vec 2.0](https://arxiv.org/abs/2006.11477), [Unsupervised Cross-lingual Representation Learning](https://arxiv.org/abs/2006.13979) and [HuBERT](https://arxiv.org/abs/2106.07447).

These pretrained weights can be used for feature extraction and downstream task adaptation.
```python
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
```

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
```python
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
```
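
Because the `ctc_decode` call above is hypothetical, a minimal greedy (best-path) CTC decoder is sketched below in plain Python to show what such a step could look like. It assumes that index 0 (`'<s>'`) acts as the CTC blank and that `'|'` marks word boundaries; both are assumptions for illustration, and the function name `ctc_greedy_decode` is not part of torchaudio.

```python
LABELS = ('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I',
          'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P',
          'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')


def ctc_greedy_decode(emission, labels, blank=0):
    """Greedy CTC decoding for one utterance.

    emission: sequence of frames, each a sequence of per-class scores.
    blank: index of the CTC blank token (assumed to be 0 here).
    """
    # 1. Pick the highest-scoring class for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # 2. Collapse consecutive repeats, then drop blanks.
    tokens = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            tokens.append(labels[index])
        previous = index
    # 3. Turn the word-boundary symbol into spaces.
    return "".join(tokens).replace("|", " ").strip()


# Toy one-hot emission; "-" stands for the blank frame in this example.
path = [LABELS.index(c) if c != "-" else 0 for c in "HH-EL-LOO|WORLD"]
emission = [[1.0 if k == i else 0.0 for k in range(len(LABELS))] for i in path]
transcript = ctc_greedy_decode(emission, LABELS)
```

With a real model output, one utterance of the emission tensor could be converted to nested lists (for example with `.tolist()`) before being passed in. A beam search with a language model would generally give better transcripts than this greedy sketch.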

[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the `torchaudio.pipelines` module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
```python
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
```

[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (`torchaudio.functional.rnnt_loss` or `torchaudio.transforms.RNNTLoss`) supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU; the GPU path has a custom CUDA kernel implementation for improved performance.
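
For readers unfamiliar with the loss itself, the following plain-Python sketch computes the RNN transducer negative log-likelihood for a single utterance with the standard forward recursion over the time/label lattice. It is an illustration only and not torchaudio's implementation: it takes log-probabilities rather than logits, handles no batching, and the name `rnnt_loss_single` and the `blank=0` default are assumptions made here.

```python
import math


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss_single(log_probs, targets, blank=0):
    """Negative log-likelihood of `targets` under an RNN-T lattice.

    log_probs[t][u][k]: log-probability of symbol k at time step t after
    u target symbols have been emitted (shape T x (U + 1) x K).
    """
    T, U = len(log_probs), len(targets)
    # alpha[t][u]: log-probability of reaching lattice node (t, u).
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            stay = emit = -math.inf
            if t > 0:  # arrive by emitting blank (advance in time)
                stay = alpha[t - 1][u] + log_probs[t - 1][u][blank]
            if u > 0:  # arrive by emitting the next target symbol
                emit = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]]
            alpha[t][u] = logsumexp2(stay, emit)
    # Every alignment ends with a final blank from the last node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


# Toy input: T=2 frames, one target symbol, uniform over K=3 classes.
uniform = [[[math.log(1 / 3)] * 3 for _ in range(2)] for _ in range(2)]
loss = rnnt_loss_single(uniform, targets=[1])
```

In the toy case there are two valid alignments, each with probability (1/3)^3, so the loss equals log(13.5).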

[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. Three solutions are available (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged inside the method). An online option recursively updates the parameters for streaming audio.
Please refer to the [MVDR tutorial](https://pytorch.org/audio/0.10.0/tutorials/mvdr_tutorial.html).
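
For orientation, the reference-channel (ref_channel) solution is commonly written as below. The notation is an assumption introduced here, not taken from these notes: $\mathbf{\Phi}_{SS}(f)$ and $\mathbf{\Phi}_{NN}(f)$ denote the speech and noise power spectral density matrices estimated with the masks, $\mathbf{u}$ is a one-hot vector selecting the reference channel, and $\mathbf{Y}(f, t)$ is the multi-channel spectrum.

$$
\mathbf{w}(f) = \frac{\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{\Phi}_{SS}(f)}{\operatorname{tr}\!\left(\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{\Phi}_{SS}(f)\right)}\,\mathbf{u},
\qquad
\hat{S}(f, t) = \mathbf{w}^{\mathsf{H}}(f)\,\mathbf{Y}(f, t)
$$

The stv_evd and stv_power solutions instead estimate a steering vector from the speech PSD matrix; see the tutorial for details.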

GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.

Additional Features
`torchaudio.functional.lfilter` now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.

Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (1604)
  - When saving FLAC format with the “soundfile” backend, `PCM_24` (the previous default) could cause warping. The default has been changed to `PCM_16`, which does not suffer from this.
Ops
- Default to native complex type when returning raw spectrogram (1549)
  - When `power=None`, `torchaudio.functional.spectrogram` and `torchaudio.transforms.Spectrogram` now default to `return_complex=True`, which returns a Tensor of native complex type (such as `torch.cfloat` and `torch.cdouble`). To use a pseudo complex type, pass the resulting tensor to `torch.view_as_real`.
- Remove deprecated kaldi.resample_waveform (1555)
  - Please use `torchaudio.functional.resample`.
- Replace waveform with specgram in SlidingWindowCmn (1859)
  - The argument name was corrected to `specgram`.
- Ensure integer input frequencies for resample (1857)
  - Sampling rates were previously cast to integers silently in the resampling implementation; integer sampling rates are now required to ensure the expected resampling quality.
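
To illustrate why integer sampling rates matter for resampling, the sketch below reduces a pair of rates to the smallest integer ratio, which is only well defined for integers. The helper `resample_ratio` is hypothetical and is not torchaudio's code; it only mirrors the idea of rejecting non-integer rates instead of truncating them silently.

```python
import math


def resample_ratio(orig_freq, new_freq):
    """Return (orig, new) reduced by their greatest common divisor.

    Hypothetical helper: raises instead of silently truncating
    non-integer sampling rates such as 44100.5.
    """
    if not (isinstance(orig_freq, int) and isinstance(new_freq, int)):
        raise TypeError("sampling rates must be integers")
    gcd = math.gcd(orig_freq, new_freq)
    return orig_freq // gcd, new_freq // gcd


ratio = resample_ratio(44100, 16000)
```

Callers holding a float rate can convert it explicitly (for example with `int(round(rate))`) so that any loss of precision is a visible decision.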
Wav2Vec2
- Update `extract_features` of Wav2Vec2Model (1776)
  - The previous implementation returned outputs from convolutional feature extractors. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use `Wav2Vec2Model.feature_extractor()`.
- Move fine-tune specific module out of wav2vec2 encoder (1782)
  - The internal structure of `Wav2Vec2Model` was updated. The `Wav2Vec2Model.encoder.read_out` module is moved to `Wav2Vec2Model.aux`. If you have a serialized state dict, please replace the key `encoder.read_out` with `aux`.
- Updated wav2vec2 factory functions for more customizability (1783, 1804, 1830)
  - The signatures of the wav2vec2 factory functions are changed. The `num_out` parameter has been renamed to `aux_num_out`, and other parameters are added before it. Please update the code from `wav2vec2_base(num_out)` to `wav2vec2_base(aux_num_out=num_out)`.
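
The key replacement for serialized state dicts described above could be sketched as follows. The helper name `migrate_read_out_keys` is hypothetical, and the sketch assumes that the affected keys start with the prefix `encoder.read_out.` (for example `encoder.read_out.weight`); it works on any plain mapping, so the values may be tensors.

```python
def migrate_read_out_keys(state_dict,
                          old_prefix="encoder.read_out.",
                          new_prefix="aux."):
    """Return a copy of `state_dict` with `encoder.read_out.*` renamed to `aux.*`.

    Hypothetical helper; the exact key names in a given checkpoint
    should be checked before relying on it.
    """
    migrated = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated


# Stand-in values; a real state dict would map these keys to tensors.
old_state = {
    "encoder.read_out.weight": "W",
    "encoder.read_out.bias": "b",
    "encoder.transformer.layer_norm.weight": "g",
}
new_state = migrate_read_out_keys(old_state)
```

The migrated mapping can then be passed to the model's `load_state_dict` as usual.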

Deprecations
- Add `melscale_fbanks` and deprecate `create_fb_matrix` (1653)
  - As `linear_fbanks` is introduced, `create_fb_matrix` is renamed to `melscale_fbanks`. The original `create_fb_matrix` is now deprecated. Please use `melscale_fbanks`.
- Deprecate `VCTK` dataset (1810)
  - This dataset has been taken down and is no longer available. Please use the `VCTK_092` dataset.
- Deprecate data utils (1809)
  - `bg_iterator` and `diskcache_iterator` are known not to improve the throughput of data loaders. Please stop using them.

New Features
Models
**Tacotron2**
- Add Tacotron2 model (1621, 1647, 1844)
- Add Tacotron2 loss function (1764)
- Add Tacotron2 inference method (1648, 1839, 1849)
- Add phoneme text preprocessing for Tacotron2 (1668)
- Move Tacotron2 out of prototype (1714)

**HuBERT**
- Add HuBERT model architectures (1769, 1811)

Pretrained Weights and Pipelines
* Add pretrained weights for wavernn (1612)

* Add Tacotron2 pretrained models (1693)

* Add HUBERT pretrained weights (1821, 1824)

* Add pretrained weights from wav2vec2.0 and XLSR papers (1827)

* Add customization support to wav2vec2 labels (1834)

* Default pretrained weights to eval mode (1843)

* Move wav2vec2 pretrained models to pipelines module (1876)

* Add TTS bundle/pipelines (1872)

* Fix vocoder interface (1895)

* Fix Phonemizer download (1897)

RNN Transducer Loss
* Add reduction parameter for RNNT loss (1590)

* Rename RNNT loss C++ parameters (1602)

* Rename transducer to RNNT (1603)

* Remove gradient variable from RNNT loss Python code (1616)

* Remove reuse_logits_for_grads option for RNNT loss (1610)

* Remove fused_log_softmax option from RNNT loss (1615)

* RNNT loss resolve null gradient (1707)

* Move RNNT loss out of prototype (1711)

MVDR Beamforming
* Add MVDR module to example (1709)

* Add normalization to steering vector solutions in MVDR Module (1765)

* Move MVDR and PSD modules to transforms (1771)

* Add MVDR beamforming tutorial to example directory (1768)

Ops
* Add edit_distance (1601)

* Add PitchShift to functional and transform (1629)

* Add LFCC feature to transforms (1611)

* Add InverseSpectrogram to transforms and functional (1652)

Datasets
* Add CMUDict dataset (1627)

* Move LibriMix dataset to datasets directory (1833)

Improvements
I/O
* Make buffer size for function info configurable (1634)

Ops
* Replace deprecated AutoNonVariableTypeMode (1583)

* Remove lazy behavior from MelScale (1636)

* Simplify axis value checks (1501)

* Use at::parallel_for in lfilter core loop (1557)

* Add filterbanks support to lfilter (1587)

* Add batch support to lfilter (1638)

* Use integer rates in pitch shift resample (1861)

Models
* Rename infer method to forward for WaveRNNInferenceWrapper (1650)

* Refactor WaveRNN infer and move it to the codebase (1704)

* Make the core wav2vec2 factory function public (1829)

* Refactor WaveRNNInferenceWrapper (1845)

* Store n_bits in WaveRNN (1847)

* Replace custom padding with torch’s native impl (1846)

* Avoid concatenation in loop (1850)

* Add lengths param to WaveRNN.infer (1851)

* Add sample rate to wav2vec2 bundle (1878)

* Remove factory functions of Tacotron2 and WaveRNN (1874)

Datasets
* Fix encoding of CMUDict data reading (1665)

* Rename utterance to transcript in datasets (1841)

* Clean up constructor of CMUDict (1852)

Performance
* Refactor transforms.Fade on GPU computation (1871)

**CUDA**
Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000]
-- | -- | -- | --
