-----------------------
New Features
~~~~~~~~~~~~
- Add support for Python 3.10.
(:user:`tomwhite`, :pr:`813`, :issue:`801`)
- Add pedigree support. This allows parent-child relationships to be
stored in sgkit, and provides a number of new pedigree methods:
:func:`pedigree_inbreeding`, :func:`pedigree_inverse_kinship`,
and :func:`pedigree_kinship`.
(:user:`timothymillar`, :issue:`786`)
- Implement a function to calculate the VanRaden genomic relationship matrix,
:func:`genomic_relationship`.
(:user:`timothymillar`, :pr:`903`, :issue:`874`)
- Add generic functions for cohort sums and means.
(:user:`timothymillar`, :pr:`867`, :issue:`730`)
- Allow numba caching to be disabled with the environment variable
``SGKIT_DISABLE_NUMBA_CACHE``.
(:user:`timothymillar`, :pr:`870`, :issue:`869`)
- Add :func:`window_by_genome` for computing whole-genome statistics.
(:user:`tomwhite`, :pr:`945`, :issue:`893`)
- Add :func:`window_by_interval` to create windows from arbitrary intervals.
(:user:`tomwhite`, :pr:`974`)
- Add ``contig_lengths`` dataset attribute if found in the VCF file.
(:user:`tomwhite`, :pr:`946`, :issue:`464`)
- Add VCF export functions.
(:user:`tomwhite`, :pr:`953`, :issue:`924`)
- Add an ``auto_rechunk`` option to ``sgkit.save_dataset`` that rechunks the
dataset before saving it to disk when necessary, since zarr requires equal
chunk sizes. (:user:`benjeffery`, :pr:`988`, :issue:`981`)
- Implement gene-ε for gene set association analysis.
(:user:`tomwhite`, :pr:`975`, :issue:`692`)
- Add :func:`count_variant_genotypes` to count the occurrence of each possible
genotype.
(:user:`timothymillar`, :issue:`911`, :pr:`1002`)
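
The ``auto_rechunk`` entry above relies on zarr needing equal chunk sizes. As a rough illustration of that constraint (not sgkit's implementation), the sketch below checks whether the chunk sizes along one dimension are regular. The helper name ``has_uniform_chunks`` is hypothetical, and the sketch assumes that a shorter final chunk is acceptable.

```python
from typing import Sequence


def has_uniform_chunks(chunks: Sequence[int]) -> bool:
    """Return True if all chunks are equal, allowing a shorter final chunk.

    Hypothetical helper for illustration only; it is not part of sgkit.
    """
    if len(chunks) <= 1:
        return True
    first = chunks[0]
    # Every chunk except the last must match the first chunk exactly.
    if any(size != first for size in chunks[:-1]):
        return False
    # Assumption: the final chunk may be shorter, but not longer or empty.
    return 0 < chunks[-1] <= first


# Chunk sizes along one dimension, e.g. one entry of ``dask_array.chunks``.
print(has_uniform_chunks((100, 100, 50)))  # regular layout
print(has_uniform_chunks((100, 50, 100)))  # irregular layout, needs rechunking
```

A dataset whose chunks fail a check like this would need rechunking before it could be saved.
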
Breaking changes
~~~~~~~~~~~~~~~~
- Remove support for Python 3.7.
(:user:`tomwhite`, :pr:`927`, :issue:`802`)
- The ``count_a1`` parameter to :func:`sgkit.io.plink.read_plink` previously
defaulted to ``True`` but now defaults to ``False``. Furthermore, ``True``
is no longer supported since it is not clear how it should behave.
(:user:`tomwhite`, :pr:`952`, :issue:`947`)
- The ``dosage`` variable specification has been removed and all references
to it have been replaced with :data:`sgkit.variables.call_dosage_spec`
which has been generalized to include integer encodings. Additionally,
the default value for the ``dosage`` parameter in :func:`ld_matrix` and
:func:`ld_prune` has been changed from ``'dosage'`` to ``'call_dosage'``.
(:user:`timothymillar`, :pr:`995`, :issue:`875`)
- The ``genotype_count`` variable has been removed in favour of
:data:`sgkit.variables.variant_genotype_count_spec` which follows VCF ordering
(i.e., homozygous reference, heterozygous, homozygous alternate for biallelic,
diploid genotypes).
:func:`hardy_weinberg_test` now defaults to using
:data:`sgkit.variables.variant_genotype_count_spec` for the ``genotype_count``
parameter. (:user:`timothymillar`, :issue:`911`, :pr:`1002`)
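
To make the VCF ordering of genotype counts concrete, here is a minimal pure-Python sketch for the biallelic, diploid case. It is not sgkit's implementation: the function name is hypothetical, and treating negative allele values as missing is an assumption of the sketch.

```python
from typing import Iterable, List, Sequence


def count_biallelic_diploid_genotypes(calls: Iterable[Sequence[int]]) -> List[int]:
    """Count genotypes for one variant as [hom-ref, het, hom-alt].

    Hypothetical helper for illustration only; it is not part of sgkit.
    """
    counts = [0, 0, 0]
    for alleles in calls:
        # Skip calls that are not diploid, have a missing allele (assumed to
        # be encoded as a negative value), or carry a non-biallelic allele.
        if len(alleles) != 2 or any(a not in (0, 1) for a in alleles):
            continue
        # With alleles 0 (ref) and 1 (alt), the allele sum is the number of
        # alt alleles, which is also the genotype's index in VCF ordering:
        # 0/0 -> 0, 0/1 -> 1, 1/1 -> 2.
        counts[sum(alleles)] += 1
    return counts


calls = [(0, 0), (0, 1), (1, 0), (1, 1), (-1, -1), (0, 0)]
print(count_biallelic_diploid_genotypes(calls))
```

Both ``(0, 1)`` and ``(1, 0)`` land in the heterozygous slot because the count ignores allele order.
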
.. Deprecations
.. ~~~~~~~~~~~~
Improvements
~~~~~~~~~~~~
- Improve VCF parsing performance.
(:user:`benjeffery`, :pr:`933`)
- Improve default VCF compression.
(:user:`tomwhite`, :pr:`937`, :issue:`925`)
- Ensure chunking is not excessive in samples dimension.
(:user:`tomwhite`, :pr:`943`)
- Add asv benchmarks for VCF performance.
(:user:`tomwhite`, :pr:`976`)
- Add asv benchmarks for VCF compression size.
(:user:`tomwhite`, :pr:`978`)
Bug fixes
~~~~~~~~~
- Allow chunking in the samples dimension for :func:`identity_by_state`.
(:user:`timothymillar`, :pr:`837`, :issue:`836`)
- Remove VLenUTF8 from filters to avoid double encoding error.
(:user:`tomwhite`, :pr:`852`, :issue:`785`)
- Fix numpy input for ``Weir_Goudet_beta``.
(:user:`timothymillar`, :pr:`865`, :issue:`861`)
- Fix ``get_region_start`` to work with contig names that have colons and dashes.
(:user:`d-laub`, :pr:`883`, :issue:`882`)
- Fix bugs in VCF reading and writing found by Hypothesis testing.
(:user:`tomwhite`, :pr:`972`)
.. Documentation
.. ~~~~~~~~~~~~~