[experimental] Lhotse Shar -- a modular, sharded, sequential I/O data storage format
This release introduces a major experimental feature called Lhotse Shar. It is a data format inspired by WebDataset and designed for very fast sequential reading of data stored in tar file shards. It extends the ideas of WebDataset by allowing multiple types of features and metadata to be stored in separate tar archives that are iterated and loaded together with the cuts. This makes it possible to extend existing data with new fields (think different feature extractors, alignments, embeddings, etc.) without triggering a full copy of the data, as would be the case with the sequential formats previously supported by Lhotse. Preliminary benchmarking indicated that it is as fast as WebDataset with both local disks and cloud storage.
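To make the idea of per-field shards more concrete, here is a minimal sketch that uses only the Python standard library. It is not the Lhotse Shar implementation or its API: the helper names (`write_shard`, `iterate_together`, `demo`), the file names, and the plain JSONL manifest are assumptions made for illustration. It shows a cut manifest being read in lockstep with one tar archive per field, and a new `embedding` field being added as a separate archive without rewriting the audio archive.

```python
import io
import json
import tarfile
import tempfile
from contextlib import ExitStack
from pathlib import Path


def write_shard(path, items):
    """Write (name, payload_bytes) pairs into an uncompressed tar file, in order."""
    with tarfile.open(path, "w") as tar:
        for name, payload in items:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def iterate_together(cuts_path, field_paths):
    """Read a JSONL manifest and one tar shard per field sequentially, in lockstep.

    Hypothetical helper: every shard is assumed to store its items in the
    same order as the manifest, so no random access is needed.
    """
    with ExitStack() as stack:
        tars = {
            field: stack.enter_context(tarfile.open(path, "r"))
            for field, path in field_paths.items()
        }
        manifest = stack.enter_context(open(cuts_path))
        for line in manifest:
            cut = json.loads(line)
            for field, tar in tars.items():
                member = tar.next()  # next archive member in stored order
                if member is None or Path(member.name).stem != cut["id"]:
                    raise ValueError(f"{field} shard is out of sync at {cut['id']}")
                cut[field] = tar.extractfile(member).read()
            yield cut


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ids = ["cut-0", "cut-1"]
        cuts_path = root / "cuts.000000.jsonl"
        cuts_path.write_text("".join(json.dumps({"id": i}) + "\n" for i in ids))
        # Original export: one archive holding the audio payloads.
        write_shard(
            root / "audio.000000.tar",
            [(f"{i}.wav", b"audio:" + i.encode()) for i in ids],
        )
        # A field added later goes into its own archive; the audio is not copied.
        write_shard(
            root / "embedding.000000.tar",
            [(f"{i}.emb", b"emb:" + i.encode()) for i in ids],
        )
        fields = {
            "audio": root / "audio.000000.tar",
            "embedding": root / "embedding.000000.tar",
        }
        return list(iterate_together(cuts_path, fields))


if __name__ == "__main__":
    for cut in demo():
        print(cut["id"], sorted(k for k in cut if k != "id"))
```

The sketch relies on every archive preserving the manifest order, which is what allows purely sequential reads; the ID check is there only to fail loudly if a shard drifts out of sync.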
A tutorial notebook about Lhotse Shar is planned to be released later this year.
What's Changed
* Sharded tar writers for Lhotse Shar format by pzelasko in https://github.com/lhotse-speech/lhotse/pull/850
* load ark directly in KaldiReader by csukuangfj in https://github.com/lhotse-speech/lhotse/pull/862
* Add a concrete example showing how to import a Kaldi data directory by csukuangfj in https://github.com/lhotse-speech/lhotse/pull/864
* Fixing shuffling of CutSet with a single cut by Tomiinek in https://github.com/lhotse-speech/lhotse/pull/869
* Fixed an erroneous assertion by JinZr in https://github.com/lhotse-speech/lhotse/pull/874
* Small changes to make channel attribute hashable by desh2608 in https://github.com/lhotse-speech/lhotse/pull/875
* Safe extract tarballs by desh2608 in https://github.com/lhotse-speech/lhotse/pull/876
* Shar: tarfiles now also contain metadata by pzelasko in https://github.com/lhotse-speech/lhotse/pull/870
* Shar: support dynamically attaching custom non-data attributes by pzelasko in https://github.com/lhotse-speech/lhotse/pull/877
* Option not to save cuts in SharWriter by pzelasko in https://github.com/lhotse-speech/lhotse/pull/878
* Minor changes in some recipes by desh2608 in https://github.com/lhotse-speech/lhotse/pull/880
* add ssl feature extractor by DongjiGao in https://github.com/lhotse-speech/lhotse/pull/881
* Shar: a way to attach shard-specific metadata to cuts from each shard by pzelasko in https://github.com/lhotse-speech/lhotse/pull/884
* Always return integer sampling rate when reading audio by pzelasko in https://github.com/lhotse-speech/lhotse/pull/885
* Add option to split AMI segments similar to Kaldi by desh2608 in https://github.com/lhotse-speech/lhotse/pull/889
New Contributors
* JinZr made their first contribution in https://github.com/lhotse-speech/lhotse/pull/874
* DongjiGao made their first contribution in https://github.com/lhotse-speech/lhotse/pull/881
**Full Changelog**: https://github.com/lhotse-speech/lhotse/compare/v1.9...v1.10