This is a larger release and the first update since our [publication](http://dx.doi.org/10.1371/journal.pcbi.1004873).
CNVkit now runs under Python 3 as well as 2.7. (3, 101; thanks mpschr)
File format changes:
- New "depth" column in .cnn, .cnr, .cns
- In .cns, "weight" is the sum, not mean, of bin-level weights within the segment
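
To make the change in the .cns "weight" column concrete, here is a minimal sketch using made-up bin weights. The names `segment_weight_sum`, `segment_weight_mean` and `bin_weights` are hypothetical and not part of CNVkit; the sketch only contrasts the two aggregations.

```python
from statistics import mean

def segment_weight_sum(weights):
    """Segment weight as the sum of bin-level weights (the behavior described above)."""
    return sum(weights)

def segment_weight_mean(weights):
    """Segment weight as the mean of bin-level weights (the earlier behavior)."""
    return mean(weights)

# Hypothetical bin-level weights for two segments of different lengths
bin_weights = {
    "seg1": [0.5, 1.0, 0.75],
    "seg2": [0.5, 0.25],
}

for name, weights in bin_weights.items():
    print(name, segment_weight_sum(weights), segment_weight_mean(weights))
```

With a sum, a segment supported by more (or better-weighted) bins carries more weight than a short one, which a mean would hide.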
New script `cnn_updater.py` can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.
Algorithmic changes:
- `reference`, `gender`, `call`, `diagram`, `export`: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (116)
- `reference`, `fix`, `call`: Center log2 values by median of chromosome medians, by default. (114)
- `reference`, `metrics`, `segmetrics`: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).
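
As an illustration of centering by the median of chromosome medians, the following sketch uses plain numpy. It is not CNVkit's code: the function name is hypothetical, and it assumes every chromosome in the input contributes equally, with no special handling of sex chromosomes.

```python
import numpy as np

def center_by_chromosome_medians(chromosomes, log2):
    """Shift log2 values so the median of per-chromosome medians is zero.

    Hypothetical helper for illustration only.
    """
    chromosomes = np.asarray(chromosomes)
    log2 = np.asarray(log2, dtype=float)
    # One median per chromosome, so long chromosomes do not dominate
    chrom_medians = [np.median(log2[chromosomes == chrom])
                     for chrom in np.unique(chromosomes)]
    shift = np.median(chrom_medians)
    return log2 - shift

chroms = ["chr1", "chr1", "chr1", "chr2", "chr2", "chr3"]
values = [0.0, 0.25, 0.5, -0.5, -0.25, 1.0]
centered = center_by_chromosome_medians(chroms, values)
print(centered)
```

Compared with a single genome-wide median, this approach is less sensitive to one large chromosome carrying a gain or loss.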
These deprecated components (since 0.7.x) have been removed:
- Commands `rescale` and `loh` -- use `call` and `scatter`, respectively, instead
- Some options in `export bed` and `export theta` -- use `call` first instead
- Script `genome2access.py` -- use `cnvkit.py access` instead
Updated commands:
`batch`:
- New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
- Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between `batch` runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in `batch`.
- Add --drop-low-coverage option, which is passed to `segment` internally.
- The -p/--processes option is also passed to `coverage` and `segment` internally (see below).
`antitarget`:
- Increase the default average bin size from 100kb to 200kb.
`coverage`:
- Parallelize coverage calculation over BED rows. The number of threads can be specified with the `-p` option. (121; thanks brentp)
`segment`:
- Parallelize CBS and Haar segmentation methods across chromosomes. (123, 125; thanks brentp)
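
The general pattern of running an independent job per chromosome can be sketched with `concurrent.futures`. This is only an outline under stated assumptions: `segment_chromosome` is a hypothetical stand-in that returns a mean rather than running CBS or Haar, and a thread pool is used to keep the example self-contained, which may differ from how CNVkit dispatches its workers.

```python
from concurrent.futures import ThreadPoolExecutor

def segment_chromosome(item):
    """Hypothetical stand-in for a per-chromosome segmentation routine."""
    chrom, log2_values = item
    return chrom, sum(log2_values) / len(log2_values)

def segment_in_parallel(bins_by_chrom, processes=2):
    """Run one job per chromosome and collect the results in a dict."""
    with ThreadPoolExecutor(max_workers=processes) as pool:
        # map() yields results in input order; an exception raised in a
        # worker is re-raised here when its result is consumed.
        return dict(pool.map(segment_chromosome, bins_by_chrom.items()))

bins_by_chrom = {"chr1": [0.0, 0.5], "chr2": [-0.5, -0.5]}
result = segment_in_parallel(bins_by_chrom)
print(result)
```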
`call`:
- New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
- With VCF b-allele frequencies (`-v`, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (105; thanks mpschr)
- If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
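
One way to picture the 'cn1'/'cn2' rules above is the sketch below. It is an assumption-laden illustration, not the code in `call`: the function name is hypothetical, the BAF is simply mirrored to the range 0.5 to 1, and halves are rounded up so that 'cn1' is never smaller than 'cn2'.

```python
import math

def allelic_copy_numbers(total_cn, baf):
    """Split an integer total copy number into (cn1, cn2) with cn1 >= cn2.

    Hypothetical helper for illustration only.
    """
    if total_cn == 0:
        # No copies at all: report 0 for both alleles rather than NaN
        return 0, 0
    # Mirror the b-allele frequency so it describes the major allele
    mirrored = max(baf, 1.0 - baf)
    # Round half up so the major allele never gets the smaller share
    cn1 = math.floor(total_cn * mirrored + 0.5)
    cn2 = total_cn - cn1
    return cn1, cn2

print(allelic_copy_numbers(3, 0.33))
print(allelic_copy_numbers(0, 0.5))
```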
`scatter`:
- Add --title option.
- Allow selecting and labeling gene(s) with only segments as input.
`heatmap`, `scatter`:
- Allow saving plots in any image file format supported by matplotlib, not just PDF. The file format is determined by the output filename's extension, e.g. '.png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (120; thanks chapmanb)
`diagram`:
- Add -g/--gender option to specify sample's known gender.
`gainloss`:
- Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (107, 108; thanks mpschr)
`gender`:
- Show column headers and Y-chromosome log2 values in the output table.
`segmetrics`:
- Add stats options for mean, median, mode
- Add MSE, SEM stats as options
`metrics`, `segmetrics`:
- Add --drop-low-coverage option (like in `segment` and `gainloss`)
Internals:
- New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start:end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
- Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
- New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
- Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks kyleabeauchamp, rmcgibbo, and mpharrigan; 110)
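
For readers unfamiliar with the biweight location that descriptives.py provides, here is a minimal one-step sketch on a plain numpy array. It follows the textbook Tukey biweight definition with a tuning constant of 6; the function below is an independent illustration, and its name, signature and defaults are assumptions rather than the descriptives.py API.

```python
import numpy as np

def biweight_location(values, c=6.0):
    """One-step Tukey biweight location, a robust estimate of the center.

    Illustrative sketch only; not taken from CNVkit.
    """
    x = np.asarray(values, dtype=float)
    center = np.median(x)
    deviations = x - center
    mad = np.median(np.abs(deviations))
    if mad == 0:
        # At least half of the values equal the median; nothing to reweight
        return float(center)
    u = deviations / (c * mad)
    inside = np.abs(u) < 1  # points beyond c * MAD get zero weight
    weights = (1 - u[inside] ** 2) ** 2
    return float(center + np.sum(deviations[inside] * weights) / np.sum(weights))

print(biweight_location([1, 2, 3, 4, 100]))
```

The outlier at 100 falls outside the weighting window, so the estimate stays close to the bulk of the data instead of being pulled toward the arithmetic mean.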
Bug fixes:
- `batch`: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (55)
- `segment`:
    - Skip possible R warning text when parsing CBS output (106) and run Rscript with the --vanilla option (112; thanks jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
    - Handle zero-weight bins better (128; thanks chapmanb).
- `scatter`:
    - Handle selected segments with an empty gene name (104; thanks mpschr).
    - Don't crash on zero-length GenomicArray/CopyNumArray inputs.
- VCF parsing (now within tabio) improved:
    - More robust to missing genotype (GT) & depth (DP) fields (102)
    - Handle VCFs from MuTect2 (122)
- `export theta`: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
- `heatmap`: Avoid a possible crash if a sample is missing a chromosome.
Packaging:
- Universal wheels are enabled for installation with pip (setup.cfg).
New & updated dependencies:
- futures
- futurize
- numpy raised to version 1.9
- pandas raised to version 0.18.1
- pysam version 0.9.1.1 is specifically excluded