New corpora
- GigaSpeech (283, thanks jimbozhang)
- DIHARD III (287, thanks desh2608)
- GALE Arabic and Mandarin (296, thanks desh2608)
- CMU and CSLU Kids (297, thanks desh2608)
- mTEDx (301, thanks m-wiesner)
- LibriTTS (306)
New features
- Reading huge manifests lazily with Apache Arrow (documentation and examples are coming) (286, 288, 289, 290, 292, 294)
- Sequential JSONL writer storing manifests on disk as they are created (302)
- Support for alignments in `SupervisionSegment` (304, 310, 313, thanks desh2608)
- PyTorch Kaldi-compatible feature extractors that support GPU, batching and autograd (307, thanks jesus-villalba)
- Reading features from URLs and writing or uploading them to URLs (e.g. S3 or GCP) (312)
- Storing waveforms of cuts on disk as audio recordings (316, thanks entn-at)
- Support for importing Kaldi's feats.scp and reading features directly from scp/ark (318)
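
The sequential JSONL writer (302) and lazy manifest reading listed above share one idea: items are written or read one at a time, so the full manifest never has to sit in memory. The sketch below illustrates that idea using only the Python standard library. It is not Lhotse's implementation and does not use Apache Arrow; `JsonlWriterSketch`, `iter_jsonl`, and the item fields are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


class JsonlWriterSketch:
    """Hypothetical helper: writes one JSON object per line as items arrive."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._file = None

    def __enter__(self) -> "JsonlWriterSketch":
        self._file = open(self.path, "w", encoding="utf-8")
        return self

    def __exit__(self, *exc_info) -> None:
        self._file.close()

    def write(self, item: Dict[str, Any]) -> None:
        # Each item is serialized as soon as it is created, so the writer
        # never accumulates the whole manifest in memory.
        self._file.write(json.dumps(item, ensure_ascii=False) + "\n")


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield one item at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Write three made-up items, then stream them back.
path = os.path.join(tempfile.mkdtemp(), "cuts.jsonl")
with JsonlWriterSketch(path) as writer:
    for i in range(3):
        writer.write({"id": f"cut-{i}", "start": 0.0, "duration": 1.5 + i})

loaded_ids = [item["id"] for item in iter_jsonl(path)]
print(loaded_ids)
```

Because `iter_jsonl` is a generator, a consumer can stop early or filter items without ever reading the rest of the file.
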
General improvements
- Add multithreading for processing AISHELL data (259, thanks pingfengluo)
- Track dev versions (291, thanks oplatek)
- Explicitly set UTF-8 encoding when reading README.md in setup.py (293, thanks entn-at)
- Auto-add link to source code in docs (295)
- Add `cut.resample()` (299)
- Fix flaky tests (300)
- Fix AMI CLI mode (303, thanks desh2608)
- Handle zero-energy error in audio mixing (305)
- Update Kaldi-related docs (308)
- Add a missing SpecAugment parameter (309)
- Fix edge cases for audio transforms (311)
- Add `drop_last` option in `*Set.split()` (315)
- Support h5py file modes in feature writers (317)
- Stop using Kaldi's reco2dur and fix some errors in bin/lhotse (318, thanks shanguanma)
- Fix a bug in the cut's number of samples (322, thanks dophist)
- Split Kaldi fields on whitespace (323, thanks dophist)
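
To show why whitespace-based field splitting matters for Kaldi-style files, here is a small sketch. It assumes the change concerns lines of the form `<key> <value>` where the separator may be a tab or several spaces; the helper `parse_kaldi_table` and the sample paths are hypothetical and are not taken from the Lhotse code base.

```python
from typing import Dict


def parse_kaldi_table(text: str) -> Dict[str, str]:
    """Hypothetical helper: map the first field of each line to the rest of it."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # split() without an explicit separator treats any run of spaces or
        # tabs as a single delimiter; maxsplit=1 keeps the remainder intact
        # (e.g. a transcript or a piped command containing spaces).
        parts = line.split(maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        table[key] = value
    return table


example = "utt1\t/data/utt1.wav\nutt2   /data/utt2.wav\nutt3 sox in.wav -t wav - |\n"
parsed = parse_kaldi_table(example)
print(parsed)
```

By contrast, splitting on a single literal space (`line.split(" ")`) would leave the tab-separated line as one field and would produce empty fields for the line with repeated spaces.
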