Highlights
This release supports Python 3.9.
I/O Improvements
Continuing from the previous release, torchaudio further improves its audio I/O mechanism. In this release, there are five major updates.
1. Backend migration.
We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface of the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface have been marked as deprecated. They are still accessible, though their use is strongly discouraged. For details on the migration, please refer to 903.
1. File-like object support.
We have added file-like object support to the I/O functions and sox_effects. You can perform the `info`, `load`, `save` and `apply_effects_file` operations on file-like objects.
```python
import io
import tarfile

import boto3
import requests
import torchaudio

# URL, TAR_PATH, SAMPLE_TAR_ITEM, S3_BUCKET and S3_KEY are placeholders for your own resources.

# Query audio metadata over HTTP.
# Only the first few kB are fetched.
with requests.get(URL, stream=True) as response:
    metadata = torchaudio.info(response.raw)

# Load audio from a TAR file.
# There is no need to extract the TAR file.
with tarfile.open(TAR_PATH, mode='r') as tarfile_:
    fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
    waveform, sample_rate = torchaudio.load(fileobj)

# Save to a bytes buffer.
# Using BytesIO, you can perform in-memory encoding/decoding.
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

# Apply effects (lowpass filter / resampling) while loading audio from S3.
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
```
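File-like support works because a reader only needs an object with a `read()` method, and a metadata query can stop once the header has been parsed. The sketch below illustrates that idea with Python's standard `wave` module rather than torchaudio. `make_wav_bytes`, `CountingReader` and `query_metadata` are hypothetical helpers written only for this illustration, and `CountingReader` stands in for a streaming body such as `response.raw`.

```python
import io
import struct
import wave


def make_wav_bytes(samples, sample_rate=8000):
    """Encode int16 samples as a mono 16-bit PCM WAV file, entirely in memory."""
    buffer_ = io.BytesIO()
    with wave.open(buffer_, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer_.getvalue()


class CountingReader:
    """Hypothetical stand-in for a streaming HTTP body: exposes read() only and counts bytes."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data


def query_metadata(fileobj):
    """Parse the WAV header from any object with read(); the sample data is not consumed."""
    with wave.open(fileobj, "rb") as r:
        return {
            "sample_rate": r.getframerate(),
            "num_channels": r.getnchannels(),
            "num_frames": r.getnframes(),
            "bits_per_sample": 8 * r.getsampwidth(),
        }


payload = make_wav_bytes([0] * 16000)  # about 32 kB of audio data
stream = CountingReader(io.BytesIO(payload))
metadata = query_metadata(stream)
```

After the call, `stream.bytes_read` covers only the header, a tiny fraction of `len(payload)`, which is the same reason a metadata query over HTTP only needs to fetch the start of the file.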
1. [Beta] Codec Application.
Built upon the file-like object support, we added the `functional.apply_codec` function, which degrades audio data by applying audio codecs supported by the “sox_io” backend, entirely in memory.
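```python
import torchaudio.functional as F

# waveform and sample_rate are assumed to hold audio loaded earlier.

# Apply MP3 codec
degraded = F.apply_codec(
    waveform, sample_rate, format="mp3", compression=-9)
# Apply GSM codec
degraded = F.apply_codec(waveform, sample_rate, format="gsm")
```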
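Conceptually, `apply_codec` is an encode/decode round trip through an in-memory buffer. As a rough, hypothetical illustration only (this is not torchaudio's implementation, which relies on libsox codecs such as MP3 and GSM), the sketch below round-trips float samples through an 8-bit PCM WAV held in a `BytesIO`, using Python's standard `wave` module. Here the degradation comes purely from 8-bit quantization; the helper name `apply_8bit_pcm_codec` is made up for this example.

```python
import io
import math
import wave


def apply_8bit_pcm_codec(samples, sample_rate):
    """Round-trip float samples in [-1.0, 1.0] through an 8-bit PCM WAV held in memory."""
    # Encode: float -> unsigned 8-bit; WAV stores 8-bit PCM with an offset of 128.
    encoded = bytes(min(255, max(0, round(s * 127) + 128)) for s in samples)
    buffer_ = io.BytesIO()
    with wave.open(buffer_, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)  # 1 byte = 8 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(encoded)
    # Decode: rewind the in-memory buffer and read the frames back as floats.
    buffer_.seek(0)
    with wave.open(buffer_, "rb") as r:
        raw = r.readframes(r.getnframes())
    return [(b - 128) / 127 for b in raw]


# Degrade a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
waveform = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
degraded = apply_8bit_pcm_codec(waveform, sample_rate)
```

The output keeps the original length, but each sample now differs from the input by up to half a quantization step (0.5 / 127).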
1. Encoding options.
We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the `format`, `encoding` and `bits_per_sample` options.
```python
import torchaudio

# waveform and sample_rate are assumed to hold audio loaded earlier.

# Save without any encoding option.
# The function picks the encoding that fits the provided data.
# For a Tensor of float32 type, that is 32-bit floating-point PCM.
torchaudio.save("data.wav", waveform, sample_rate)

# Save as 16-bit signed integer Linear PCM.
# The resulting file occupies half the storage but loses precision.
torchaudio.save(
    "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
```
1. More format support in the “sox_io” backend's save function.
We have added support for GSM, HTK, AMB, and AMR-NB formats to the “sox_io” backend's save function.
Switch to CMake-based build
Previously, torchaudio used CMake only to build third-party dependencies. Now torchaudio also uses CMake to build its C++ extension. This opens the door to integrating torchaudio into non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.
Backwards Incompatible Changes
* Removed deprecated transform and target_transform arguments from VCTK and YESNO datasets. (1120) If you were relying on the previous behavior, we recommend that you apply the transforms in the collate function.
* Removed torchaudio.datasets.utils.walk_files (1111); its internal uses were replaced by Path and glob (1069, 1101). If you relied on the function, we recommend that you use glob instead.
* Removed torchaudio.data.utils.unicode_csv_reader. (1086) If you relied on the function, we recommend that you replace it with csv.reader.
* Disabled CommonVoice download, as users are required to sign a user agreement. Please download and extract the dataset manually, and set the root argument to the subfolder for the version and language of interest; see 1082 for more details. (1018, 1079, 1080, 1082)
* Removed legacy sox effects (977, 1001). Please migrate to apply_effects_file or apply_effects_tensor.
* Switched the default backend to the ones with new interfaces (978). If you were relying on the previous behavior, you can restore it for one more release by following the instructions in 975.
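For code that used the removed helpers, a minimal migration sketch follows: `pathlib.Path.glob` for recursive file discovery and `csv.reader` with a tab delimiter for TSV metadata files such as CommonVoice's. The helper names `find_audio_files` and `read_tsv`, the directory layout, and the TSV contents are hypothetical and exist only for this illustration.

```python
import csv
import tempfile
from pathlib import Path


def find_audio_files(root, suffix=".wav"):
    """Recursively list audio files under root, sorted for reproducibility (replaces walk_files)."""
    return sorted(Path(root).glob(f"**/*{suffix}"))


def read_tsv(path):
    """Read a UTF-8 TSV file row by row with csv.reader (replaces unicode_csv_reader)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


with tempfile.TemporaryDirectory() as root:
    # Build a tiny hypothetical dataset tree.
    (Path(root) / "speaker1").mkdir()
    (Path(root) / "speaker1" / "a.wav").touch()
    (Path(root) / "speaker1" / "notes.txt").touch()
    (Path(root) / "b.wav").touch()
    tsv = Path(root) / "train.tsv"
    tsv.write_text("path\tsentence\nb.wav\tbonjour à tous\n", encoding="utf-8")

    wav_names = [p.relative_to(root).as_posix() for p in find_audio_files(root)]
    rows = read_tsv(tsv)
```

Sorting the glob results matters because `glob` makes no ordering guarantee, and a stable file order keeps dataset indices reproducible across machines.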
New Features
* Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to “sox_io” backend. (1276, 1291, 1277, 1275, 1066)
* Added encoding options (format, bits_per_sample and encoding) to save function. (1226, 1177, 1129, 1104)
* Added new attributes (bits_per_sample and encoding) to the info function return type (AudioMetaData) (1177, 1206, 1324)
* Added format override to libsox-based file input. (load, info, sox_effects.apply_effects_file) (1104)
* Added file-like object support to the “sox_io” and “soundfile” backends and to sox_effects.apply_effects_file. (1115)
* [Beta] Added the Kaldi Pitch feature. (1243, 1260)
* [Beta] Added the SpectralCentroid transform. (1167, 1216, 1316)
* [Beta] Added codec transformation apply_codec. (1200)
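The spectral centroid is the magnitude-weighted mean frequency of a spectrum frame: sum(f_k * |X_k|) / sum(|X_k|). The plain-Python sketch below computes it for a single one-sided frame. It assumes bin k of an n_fft-point FFT sits at k * sample_rate / n_fft Hz. This is an illustration of the formula, not the SpectralCentroid transform itself, and returning 0.0 for a silent frame is a choice made by this sketch.

```python
def spectral_centroid(magnitudes, sample_rate, n_fft):
    """Magnitude-weighted mean frequency (Hz) of one one-sided spectrum frame."""
    # Bin k of an n_fft-point FFT sits at k * sample_rate / n_fft Hz.
    freqs = [k * sample_rate / n_fft for k in range(len(magnitudes))]
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: this sketch returns 0.0 rather than NaN
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total


# n_fft = 8 at 8 kHz gives 5 bins: 0, 1000, 2000, 3000, 4000 Hz.
centroid = spectral_centroid([0.0, 1.0, 0.0, 1.0, 0.0], sample_rate=8000, n_fft=8)
```

Energy split evenly between the 1000 Hz and 3000 Hz bins places the centroid at 2000 Hz, while energy in a single bin places it exactly at that bin's frequency.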
Improvements
* Exposed normalization method to Mel transforms. (1212)
* Exposed additional STFT arguments to Spectrogram (892) and to MelSpectrogram (1211).
* Added support for pathlib.Path to apply_effects_file (1048) and to CMUARCTIC (1025), YESNO (1015), COMMONVOICE (1027), VCTK and LJSPEECH (1028), GTZAN (1032), SPEECHCOMMANDS (1039), TEDLIUM (1045), LIBRITTS and LIBRISPEECH (1046).
* Added SpeechCommands train/valid/test split. (966, 1012)
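Assuming the standard layout of the Speech Commands dataset, the split is defined by two text files of relative paths, `validation_list.txt` and `testing_list.txt`, with every remaining file forming the training set. The hypothetical helper below sketches that partitioning logic in plain Python. It is not torchaudio's code, and the file paths are made up.

```python
def split_speech_commands(all_files, validation_list, testing_list):
    """Partition relative file paths into training/validation/testing subsets."""
    validation = set(validation_list)  # sets give O(1) membership checks
    testing = set(testing_list)
    splits = {"training": [], "validation": [], "testing": []}
    for path in all_files:
        if path in validation:
            splits["validation"].append(path)
        elif path in testing:
            splits["testing"].append(path)
        else:
            # Anything not listed for validation or testing is used for training.
            splits["training"].append(path)
    return splits


files = ["yes/a.wav", "yes/b.wav", "no/c.wav", "no/d.wav"]
splits = split_speech_commands(files, validation_list=["yes/b.wav"], testing_list=["no/d.wav"])
```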
Internals
* Replaced if-elseif-else with switch in sox C++ code. (1270)
* Refactored C++ interface for sox_io's get_info_file (1232) and get_encodinginfo (1233).
* Added explicit functional import in __init__. (1228)
* Refactored YESNO dataset (1127), LJSPEECH dataset (1143).
* Removed Python 2.7 reference from setup.py. (1182)
* Merged flake8 configurations into single .flake8 file. (1172, 1214)
* Updated calls to torch.stft to use return_complex=True. (1096, 1013)
* Cleaned up handling of optional args in C++ with c10::optional. (1043)
* Removed unused imports in sox effects. (1052)
* Introduced functional submodule to organize functionals. (1003)
* [Testing] Refactored MelSpectrogram librosa compatibility test to decouple from other tests. (1267)
* [Testing] Moved batch tests for functionals. (1254)
* [Testing] Refactored tests for backend (1239) and for functionals (1237).
* [Testing] Removed dependency on pytest from testing (1157, 1188)
* [Testing] Refactored unit tests for VCTK (1134), SPEECHCOMMANDS (1136), LIBRISPEECH (1140), TEDLIUM (1135), LJSPEECH (1138), LIBRITTS (1139), CMUARCTIC (1147), GTZAN (1148), COMMONVOICE and YESNO (1133).
* [Testing] Removed dependency on COMMONVOICE dataset from tests. (1132)
* [Build] Fixed Python 3.9 support (1242)
* [Build] Switched to cmake for build. (1187, 1246, 1249)
* [Build] Restructured C++ code to allow per file registration of custom ops. (1221)
* [Build] Added logging to sox/CMakeLists.txt. (1190)
* [Build] Disabled C++11 ABI when necessary for libtorch compatibility. (880)
* [Build] Reorganized libsox source and build directory to accommodate additional third party code. (1161, 1176)
* [Build] Refactored sox source files and moved into dedicated subfolder. (1106)
* [Build] Enabled custom clean function for python setup.py clean. (1142)
* [CI] Documented undocumented parameters. Added CI check. (1248)
* [CI] Fixed sphinx warnings in documentation. Turned warnings into errors. (1247)
* [CI] Printed CPU info before running unit tests. (1218)
* [CI] Fixed clang-format job and fixed newly detected formatting issues. (981, 1198, 1222)
* [CI] Updated unit test base Docker image. (1193)
* [CI] Disabled the CircleCI cache, which has proven flaky. (1189)
* [CI] Disabled the TorchScript backward-compatibility test, which is known to fail. (1192)
* [CI] Stripped version suffix for pytorch. (1185)
* [CI] Ran smoke test with the CPU package for pytorch due to a known issue with CUDA 11. (1105)
* [CI] Added missing empty line at the end of config.yml. (1020)
* [CI] Added automatic documentation build and push to branch in CI. (1006, 1034, 1041, 1049, 1091, 1093, 1098, 1100, 1121)
* [CI] Ran GPU tests for all pull requests and fixed the current setup. (998, 1014, 1191)
* [CI] Skipped tests that are known to fail on macOS Python 3.6/3.7. (999)
* [CI] Changed the order of installation and aligned with Windows. (987)
* [CI] Fixed documentation rendering by using Sphinx 2.4.4. (974)
* [Doc] Added subcategories to functional documentation. (1325)
* [Doc] Added a version selector in documentation. (1273)
* [Doc] Updated compilation recommendation in README. (1263)
* [Doc] Added CONTRIBUTING.md. (1241)
* [Doc] Added instructions to install the parameterized package. (1164)
* [Doc] Fixed the return type for load functions. (1122)
* [Doc] Added missing modules and minor fixes. (1022, 1056, 1117)
* [Doc] Fixed spelling and links in README. (1029, 1037, 1062, 1110, 1261)
* [Doc] Grouped filtering functionals in documentation page. (1005, 1004)
* [Doc] Updated the compatibility matrix with torchaudio 0.7 (979)
* [Doc] Added description of prototype/beta/stable features. (968)
Bug Fixes
* Fixed amplitude_to_DB clamping behaviour on batches. (1113)
* Disabled audio device support in sox builds, since detected audio devices could interfere with the build process. (1153)
* Fixed COMMONVOICE for French where the audio file extension was missing on load. (1126)
* Disabled OpenMP support for libsox which can produce errors when used in DataLoader. (1026)
* Fixed noise_down_time argument in VAD by properly propagating it. (1017)
* Removed the print-freq option so that validation loss is computed at each epoch in the wav2letter pipeline. (997)
* Migrated from torch.rfft to torch.fft.rfft and cfloat following change in pytorch. (941)
* Fixed the interactive ASR demo to align with the latest version of fairseq. (996)
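The batch clamping fix is easiest to see on a toy example. With `top_db`, the floor should be derived from each example's own peak, not from the peak of the whole batch; otherwise a quiet example gets flattened by a loud neighbor. The plain-Python sketch below shows per-example clamping. It omits the reference-level term (db_multiplier) of amplitude_to_DB, and the function name and list-of-lists input are assumptions made for illustration.

```python
import math


def amplitude_to_db(batch, multiplier=10.0, amin=1e-10, top_db=None):
    """Convert each spectrogram in a batch to decibels, clamping with top_db per example."""
    out = []
    for spec in batch:
        # multiplier=10 suits power spectrograms; amin avoids log10(0).
        db = [multiplier * math.log10(max(v, amin)) for v in spec]
        if top_db is not None:
            # The floor comes from this example's own peak, not the batch-wide peak.
            floor = max(db) - top_db
            db = [max(d, floor) for d in db]
        out.append(db)
    return out


batch = [[1.0, 1e-3], [1e-6, 1e-9]]  # a loud example and a quiet one
result = amplitude_to_db(batch, top_db=40.0)
```

The quiet example keeps its values of about -60 dB and -90 dB. A single batch-wide floor of 0 - 40 = -40 dB would have flattened both of its values to -40 dB.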
Deprecations
* The normalized argument is unused and will be removed from griffinlim. (1036)
* The previous sox and soundfile backends remain available for one release; see 903 for details. (975)
Performance
* Added C++ lfilter core loop for faster iteration on CPU. (1244)
* Leveraged the julius resampling implementation to make resampling faster. (1087)
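The loop that lfilter's core iterates over is the IIR difference equation a[0]·y[n] = Σ_k b[k]·x[n-k] - Σ_{k≥1} a[k]·y[n-k]. Each output depends on earlier outputs, so the recursion cannot be vectorized along time, which is why a compiled loop helps on CPU. The pure-Python sketch below implements that recursion for a single 1-D signal. It is an illustration only: it handles one channel and applies no output clamping.

```python
def lfilter(waveform, a_coeffs, b_coeffs):
    """Filter a 1-D signal with the direct-form IIR difference equation."""
    if a_coeffs[0] == 0:
        raise ValueError("a_coeffs[0] must be non-zero")
    output = []
    for n in range(len(waveform)):
        # Feed-forward part: sum of b[k] * x[n - k].
        acc = sum(b * waveform[n - k] for k, b in enumerate(b_coeffs) if n - k >= 0)
        # Feedback part: subtract a[k] * y[n - k] for k >= 1; y[n] depends on earlier outputs.
        acc -= sum(a_coeffs[k] * output[n - k] for k in range(1, len(a_coeffs)) if n - k >= 0)
        output.append(acc / a_coeffs[0])
    return output


# Impulse response of a one-pole low-pass filter y[n] = x[n] + 0.5 * y[n - 1].
impulse_response = lfilter([1.0, 0.0, 0.0, 0.0], a_coeffs=[1.0, -0.5], b_coeffs=[1.0])
```

The impulse response halves at every step (1.0, 0.5, 0.25, 0.125), showing how the feedback term carries each output into the next iteration.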