This release includes a variety of improvements to CNVkit's calling accuracy and robustness. All CNVkit files built with previous versions will continue to work with this version, but for best results, I recommend rebuilding your reference.cnn file(s) from the targetcoverage.cnn and antitargetcoverage.cnn files.
`coverage`:
- Output target/antitarget coverage (.cnn) files are no longer median-centered. Read depths in each bin are still log2-scaled, but the observed read depth can now be easily recovered from .cnn files.
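
As a rough illustration of recovering read depth from the log2-scaled values, here is a minimal sketch. It assumes the stored value is a plain base-2 logarithm of the bin's read depth with no extra offset. The function name and sample values are hypothetical and are not part of CNVkit.

```python
def log2_to_depth(log2_depth):
    """Convert a log2-scaled read depth back to a linear read depth.

    Assumption: the value is a plain base-2 logarithm with no offset.
    """
    return 2.0 ** log2_depth


# Hypothetical log2 values, standing in for the log2 column of a .cnn file
log2_column = [5.0, 3.0, 0.0]
depths = [log2_to_depth(value) for value in log2_column]
print(depths)
```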
`reference`, `fix`:
- Include a "flat pseudocount" in addition to the given normals, making paired tumor-normal calling much more robust and accurate.
- Perform bias corrections on the input normal samples before calculating the average and spread of log2 values.
`fix`:
- Do bias corrections before subtracting the reference, instead of after, because the reference already includes bias corrections now.
- In addition to weighting bins by spread (which can only be observed with a pooled reference), also weight by bin size and by the deviation of each bin's reference log2 value from the global median. As a result, useful bin weights are now derived from "flat" and single-normal-sample references, too.
`segment`:
- Recalculate CBS segment means using bin weights (in the R library this is simply the unweighted mean, arguably a bug).
- Set CBS segment start/end positions to match the underlying bin start/end positions.
- Improved centromere detection -- only exclude one "large gap", if any, from each chromosome.
- Tuned CBS calling parameters to improve accuracy (see benchmarks in the repo etal/cnvkit-examples).
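
The first two `segment` changes can be sketched as follows. This is only an illustration under stated assumptions, not CNVkit's actual code: the function name and the `(start, end, log2, weight)` tuple layout are hypothetical, and the bins are assumed to be sorted by position.

```python
def summarize_segment(bins):
    """Summarize one segment from its underlying bins.

    `bins` is a hypothetical list of (start, end, log2, weight) tuples,
    sorted by start position. Returns (start, end, weighted_mean_log2).
    """
    if not bins:
        raise ValueError("segment contains no bins")
    total_weight = sum(weight for _, _, _, weight in bins)
    if total_weight <= 0:
        raise ValueError("bin weights must sum to a positive value")
    # Weighted mean: each bin contributes in proportion to its weight
    weighted_sum = sum(log2 * weight for _, _, log2, weight in bins)
    # Segment boundaries snap to the first bin's start and the last bin's end
    seg_start = bins[0][0]
    seg_end = bins[-1][1]
    return seg_start, seg_end, weighted_sum / total_weight


example_bins = [
    (100, 200, 1.0, 1.0),
    (250, 400, 1.0, 1.0),
    (450, 600, -2.0, 0.0),  # zero weight: an unreliable bin is ignored
]
print(summarize_segment(example_bins))
```

With these example values the unweighted mean of the three log2 values would be 0.0, so the low-weight outlier bin would pull the segment mean down. The weighted mean ignores it.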
`diagram`:
- Label genes using the same criteria as the `gainloss` command: if segments are given, use the segment value at each gene, otherwise calculate the weighted average of bin-level log2 values within each gene.
- New option `-m`/`--min-probes` to match `gainloss`.
- Guess gender from chrX more reliably, so that the same gender is called from the bin-level (.cnr) and segmented (.cns) values given.
`scatter`, `loh`:
- When plotting allele frequencies from a VCF, if segments are given (.cns), also apply those segments to allele frequencies to show LOH regions that match CNVs.
- Skip somatic variants identified in a VCF, and try to retain only germline variants, when plotting LOH. (This is not very well standardized across callers, so please watch for bad behavior from callers other than FreeBayes and MuTect, and let me know about it!)
- `scatter` only: Added options `--y-min`, `--y-max` to set y-axis limits on the plot.
- Removed the deprecated `-r` option. Use `-c` instead.
The long-deprecated `cbs` command has been removed. Use `segment` instead.
Bugs in parsing and writing empty and 1-line VCF, BED and CNVkit files, and other VCF quirks, have now been fixed. (Thanks, chapmanb!)