Feat
- benchmark interval decompression on cpu with numba vs. cpu with taichi vs. gpu with taichi
- optionally decompress intervals to tracks on gpu
- initial support for stranded regions
- option to cache fasta files as numpy arrays.
- implement BigWig intervals as Rust extension.
- finishing touches on multi-track implementation. Blocked by a cryptic issue where writing genotypes somehow prevents joblib from launching new processes.
- stop overwriting by default, add option.
- transforms directly on tracks.
- intervals as array of structs for better data locality.
- let extra tracks get added via paths
- initial support for indels in tracks and WIP on also returning auxiliary genome wide tracks.
- initial sparse genos -> haplotypes and sparse hap diffs.
- wip sparse genotypes.
- properties for getting haplotypes, references, or tracks only.
- encourage num_workers <= 1 with GVL dataloader.
- freeze gvl.Dataset to prevent users from accidentally introducing invalid states.
- warn if any query contigs have no variants or no intervals associated with them.
- warn instead of error when no reference passed and genos present.
- disable overwriting by default, have no args be help.
- also report number of samples.
- add .from_table constructor for BigWigs.
- move CLI to script, include in package.
- use a table to specify bigwigs instead. fix: jittering.
- add script to write datasets to disk.
- more quality of life improvements. relax dependency version constraints.
- with_seed method
- quality of life methods for subsetting and converting to dataloaders.
- torch convenience functions fix: ensure genotypes and intervals written in sorted order wrt the BED file.
- pre-computed implementation.
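
The entries on decompressing intervals to tracks and on storing intervals as an array of structs can be illustrated with a minimal sketch. It assumes a NumPy structured dtype with hypothetical field names (`start`, `end`, `value`) and a hypothetical helper `intervals_to_track`; the package's actual layout and function names may differ.

```python
import numpy as np

# Hypothetical array-of-structs layout: each interval's fields sit next to
# each other in memory, so one interval is read from one contiguous record.
INTERVAL_DTYPE = np.dtype(
    [("start", np.int32), ("end", np.int32), ("value", np.float32)]
)


def intervals_to_track(intervals, query_start, length):
    """Expand half-open [start, end) intervals into a dense per-base track.

    Positions not covered by any interval stay at 0. Intervals are clipped
    to the query window [query_start, query_start + length).
    """
    track = np.zeros(length, dtype=np.float32)
    for itv in intervals:
        s = max(int(itv["start"]) - query_start, 0)
        e = min(int(itv["end"]) - query_start, length)
        if s < e:
            track[s:e] = itv["value"]
    return track


intervals = np.array([(10, 13, 1.5), (15, 17, 2.0)], dtype=INTERVAL_DTYPE)
track = intervals_to_track(intervals, query_start=9, length=10)
```

With a struct-of-arrays layout the same loop would touch three separate arrays per interval; the array-of-structs layout keeps each record within 12 bytes, which is one plausible reading of the "better data locality" note.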
Fix
- dependency typo
- remove taichi interval to track implementation since it did not improve performance, even on GPU
- need to subset arrays to be reverse complemented
- change argument order of subset_to to match the rest of the API. fix: simplify subset implementation.
- remove python 3.10 type hints
- dimension order on subsets.
- make variant indices absolute on write.
- sparse genotypes layout
- wrong layout of genotypes and wrong max ends computation.
- ragged array layouts for correct concatenation when writing datasets one contig at a time.
- bug where init_intervals would not initialize all available tracks.
- track_to_intervals had wrong n_intervals and thus, wrong offsets.
- bug in computing max ends.
- match serde for genome tracks.
- bug in open state management.
- bug when writing genotypes where the chromosome of the requested regions is not present in the VCF.
- bug getting intersection of samples available.
- sum wrong axis in adjust multi index.
- make GVLDataset __getitem__ API match torch Dataset API (i.e. use raveled index)
- QOL improvements.
- incorrect genotypes returned from VCF when queries have overlapping ranges.
- wrong shape.
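
The raveled-index fix above can be sketched as follows. This is only an illustration: it assumes the dataset is indexed as a (regions, samples) grid in C order, and `unravel_item` is a hypothetical helper, not a function from the package.

```python
import numpy as np


def unravel_item(idx, n_regions, n_samples):
    """Map a flat, torch-style integer index to a (region, sample) pair.

    Assumes a (n_regions, n_samples) grid in C order, which may not match
    the package's actual ordering.
    """
    if not 0 <= idx < n_regions * n_samples:
        # Sequence-style datasets signal the end of iteration with IndexError.
        raise IndexError(idx)
    region, sample = np.unravel_index(idx, (n_regions, n_samples))
    return int(region), int(sample)


pair = unravel_item(5, n_regions=3, n_samples=4)  # 5 == 1 * 4 + 1
```

A single integer index lets a map-style dataset report its length as `n_regions * n_samples`, which is what samplers expect.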
Refactor
- move construct virtual data to loader so utils import faster.
- rename util to utils.
- move write under dataset directory. perf?: move indexing operations into numba.
- move cli to script outside package, faster help message.
- break up dataset implementation into smaller files. refactor!: condense with_ methods into single with_settings() methods. feat: sel() and isel() methods for eager retrieval by sample and region.
Perf
- when opening with settings and providing a reference while return_sequences is false, don't load the reference into memory.