Cnvkit

Latest version: v0.9.11

0.8.5

New 'autobin' command, replacing the script `coverage_bin_size.py`. Fix some bugs and usability issues. Unit tests improved, especially for the 'cnvlib.genome' sub-package.

Dependencies
------------

- Pandas 0.18.1 is once again supported. Previously the minimum version was 0.19.1. (chapmanb/bcbio-nextgen#1836)
- Pysam minimum version is still 0.9.1.4, but slightly older versions in the 0.9 series may still work too. (192)

Commands
--------

`autobin`:

- New command, replacing and extending the script `coverage_bin_size.py`. The script is still included (and shares most of the same code), but is considered deprecated and will be removed in the 0.9.0 release. (170)
- In 'amplicon' and 'hybrid' modes, set the random seed so that the regions sampled for coverage are the same in every run. (191)
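
The effect of a fixed seed can be sketched as follows. This is only an illustration, not CNVkit's code: the function `sample_regions`, the seed value, and the region tuples are hypothetical.

```python
import random

def sample_regions(regions, k, seed=12345):
    """Pick up to k regions reproducibly using a private, seeded PRNG."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(regions, min(k, len(regions)))

# Hypothetical (chromosome, start, end) regions.
regions = [("chr1", i * 1000, i * 1000 + 200) for i in range(50)]
first = sample_regions(regions, 5)
second = sample_regions(regions, 5)  # same seed, so the same regions
```

Because the generator is re-created from the same seed on each call, repeated runs draw the same subset.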

`antitarget`, `autobin`, `batch`:

- Fix an issue in GenomicArray.subtract() that caused some of the expected output regions to be missing. In cases where this caused an entire chromosome to be lost, the `coverage_bin_size.py` script and the `autobin` and `batch` commands in `hybrid` mode would crash. (chapmanb/bcbio-nextgen#1799)

`batch`, `diagram`:

- Fix creation of chromosomal diagrams with `--diagram` and the `diagram` command. (190)

`export`:

- In `export seg`, use 1-based indexing in the SEG output. (197)
- Fix `export cdt` format; it was generating Java TreeView (jtv) earlier.
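
As a rough sketch of the `export seg` indexing change, assuming the internal coordinates are 0-based and half-open (as in BED) and that SEG output is 1-based and inclusive, the conversion only shifts the start. The helper name `to_seg_coords` is hypothetical.

```python
def to_seg_coords(start, end):
    """Convert a 0-based, half-open interval to 1-based, inclusive."""
    return start + 1, end

# The first 100 bases of a chromosome: [0, 100) becomes 1..100.
seg_start, seg_end = to_seg_coords(0, 100)
```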

0.8.4

This minor release focuses on improving usability and fixing some bugs.

Documentation is updated (thanks kyleabeauchamp for 186).

Dependencies
- Raise minimum pandas version from 0.18.1 to 0.19.0
- Raise minimum matplotlib version to 1.3.1

Commands

`fix`, `metrics`:
- Set PRNG seed to ensure reproducible results. The pipeline is now fully repeatable with identical results if run in serial, i.e. without `-p`.

`fix`, `reference`:
- Reduce boundary effects (expected log2 and spread values of 0 in some bins) when smoothing biases on very small gene panels, e.g. targeted amplicon sequencing of <5 genes, <100 bins. (181)

`fix`:
- Don't complain about mismatched sample IDs if antitargets are blank. This allows reusing a blank "MT" file in a shell loop for WGS and amplicon data.

`reference`:
- Make antitargets (antitarget.bed or *.antitargetcoverage.cnn) an optional argument. Previously this argument was required, so processing WGS or amplicon data, which has no off-target regions or reads, required the user to create and provide a blank BED file or appropriately named, empty .cnn files. (183)

`segment`:
- Don't log "Dropped 0 low-coverage bins". Only log when it actually drops bins.

`diagram`, `heatmap`:
- Add option `--no-shift-xy`. Shifting X and Y according to reference and sample sex was done in diagram, but not heatmap. Now it's optional in both.

`heatmap`:
- Add a legend of log2 ratio colors to the plot. (36)
- Add options `-x`/`--sample-sex` and `-y`/`--male-reference`. (172)

`gender`/`sex`:
- Rename 'gender' command to 'sex', with shim for backward compatibility. (182)
- In other commands, the `-g`/`--gender` argument is renamed to `-x`/`--sample-sex`, also with a compatibility shim. Argument values `x` and `y` are accepted in addition to `f`/`female` and `m`/`male`, respectively.
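
The accepted values could be normalized along these lines. This is a minimal sketch; the function `normalize_sex` and the choice of "female"/"male" as canonical outputs are assumptions, not CNVkit's actual implementation.

```python
def normalize_sex(value):
    """Map accepted sex argument values to a canonical 'female' or 'male'."""
    aliases = {
        "x": "female", "f": "female", "female": "female",
        "y": "male", "m": "male", "male": "male",
    }
    key = value.strip().lower()
    if key not in aliases:
        raise ValueError("Unrecognized sample sex: %r" % value)
    return aliases[key]
```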

`import-picard`:
- Deprecate searching a directory tree for files. It was a vestige of early lab work, and makes a shaky assumption about Picard CalculateHsMetrics `--PER_TARGET_COVERAGE` output filenames.

API
- The `do_*` function implementations moved to their named modules. The `do_*` functions can still be called or imported from the `cnvlib` and `cnvlib.commands` modules.
- All parsing and serialization of "chr:start-end" genomic region labels is consolidated under a new module, `cnvlib.genome.rangelabel`. These functions are used in tabio.textcoord, GenomicArray.labels(), and elsewhere to ensure consistent behavior.
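
To illustrate what parsing and serializing a "chr:start-end" label involves, here is a small standalone sketch. The names `parse_label` and `format_label` are hypothetical and are not the functions in `cnvlib.genome.rangelabel`; coordinates are passed through unchanged, whereas a real implementation may also convert between coordinate conventions.

```python
import re

# chromosome name, a colon, then integer start and end separated by a hyphen
_LABEL_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_label(label):
    """Split 'chr:start-end' into (chrom, start, end) with integer coordinates."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError("Not a chr:start-end label: %r" % label)
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))

def format_label(chrom, start, end):
    """Serialize (chrom, start, end) back to 'chr:start-end'."""
    return "%s:%d-%d" % (chrom, start, end)
```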

Internal
- `cnvlib.genome`: Handle nested bins correctly in the `merge`, `flatten`, and `intersect` modules, functions and GenomicArray methods. Verified with thorough unit tests.
- VCF: If the paired normal sample's genotypes are all 0/0 or missing, fall back to `--zygosity-freq` (inference from b-allele frequency) rather than marking all variants as somatic. Then infer and drop additional somatic SNVs based on genotype after parsing, and only if that wouldn't drop all records. This allows CNVkit to safely distinguish somatic vs. germline in VCFs from Mutect2, though Mutect2 is still not recommended. (184)

0.8.3

Bug fixes and a few usability improvements. Notably, for the whole-genome sequencing workflow (`batch -m wgs`), bin size is now inferred from a sample's genome-wide coverage depth instead of using a fixed value, which should yield better results by default.

Dependencies
- scipy: Raise minimum version to 0.15 (for the function `scipy.stats.median_test`)

New scripts
- `coverage_bin_size.py`: Quickly estimate on- and off-target read depths to suggest reasonable bin sizes to use with the `target` and `antitarget` or `batch` commands. (170)
- `guess_baits.py`: In case the baited regions for a target capture panel are not known, use sample BAM files from sequencing with that panel to infer the likely captured regions. Works either guided, given a list of potential targets (e.g. all exons in a genome), or unguided, scanning all sequencing-accessible bases in the genome to find areas with elevated coverage.

Both scripts are preliminary and may be removed in a future release.

Global changes
- Infer read lengths automatically from the given sample BAM files where needed (`coverage` and `batch`). Remove the hard-coded parameter `cnvlib.params.READ_LEN`. (74)
- Handle VCFs generated by [LoFreq](https://csb5.github.io/lofreq/). This program does not emit sample genotypes, but locus depths and allele frequencies can be found in the INFO column instead -- unusual but technically within the VCF spec. (#173)

Commands

`batch`, `coverage`, `segment`:
- The option `-p`/`--processes` can now be used without an argument to specify parallelizing across all available CPUs. The now-optional argument value is the maximum number of CPUs to use; the special value `-p 0` was previously used to specify all CPUs (this still works).
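
An option with an optional value like this can be declared with `argparse` using `nargs="?"`. The sketch below is an assumption about how such an option might be wired up, not CNVkit's actual parser; in particular, the default of 1 process and the helper `resolve_processes` are hypothetical.

```python
import argparse
import os

parser = argparse.ArgumentParser()
# nargs="?" makes the value optional: a bare "-p" stores `const`,
# while omitting "-p" entirely stores `default`.
parser.add_argument("-p", "--processes", nargs="?", type=int, const=0, default=1)

def resolve_processes(n):
    """Treat 0 (or a bare -p) as 'use all available CPUs'."""
    return (os.cpu_count() or 1) if n < 1 else n

bare = parser.parse_args(["-p"]).processes
explicit = parser.parse_args(["-p", "4"]).processes
absent = parser.parse_args([]).processes
```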

`batch`:
- Automatically estimate a reasonable average bin size in the whole-genome workflow, `-m wgs`, using a fast estimate of a given normal/control sample's genome-wide average coverage depth. (If multiple normals are given, the median-sized sample is used for this calculation.) This allows CNVkit to handle low-coverage/low-pass WGS data better by default. (170)

`coverage`:
- With `--count`, count all reads that overlap a region, but trim any portions of each read aligned outside the region from the number of bases counted. The result should now be closer to that without `--count`.
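
The trimming can be pictured with a small sketch, assuming ungapped reads given as 0-based, half-open `(start, end)` alignments. The function names and the division by region length are illustrative assumptions, not CNVkit's code.

```python
def overlap_bases(read_start, read_end, region_start, region_end):
    """Number of read bases that fall inside the region (0 if no overlap)."""
    return max(0, min(read_end, region_end) - max(read_start, region_start))

def region_depth(reads, region_start, region_end):
    """Mean depth from overlapping reads, counting only in-region bases."""
    total = sum(overlap_bases(s, e, region_start, region_end) for s, e in reads)
    return total / (region_end - region_start)

# Four 50-bp reads against the region [100, 200):
# 40, 50, 20 and 0 bases fall inside it, respectively.
reads = [(90, 140), (100, 150), (180, 230), (300, 350)]
depth = region_depth(reads, 100, 200)
```

Reads hanging over either edge contribute only their in-region bases, so the estimate is not inflated by partial overlaps.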

`scatter`:
- In chromosome-level plots, the displayed x-axis range now matches the specified region (via `-c` or `-g` + `-w`) exactly. Previously, the displayed range depended on the bin locations. (180)

Bug fixes
- `antitarget`: Handle empty off-target regions safely. (chapmanb/bcbio-nextgen#1696)
- `export theta`: Rename argument `--min-depth` to `--min-variant-depth`, matching the equivalent argument in other commands. (178; thanks myronpeto)
- `scatter`: Warn, don't crash, if a region in `--region-list` covers no bins. (174; thanks gabeng)

API changes
- New module `cnvlib.samutil` for convenience functions on BAM files, using pysam.
- New module `cnvlib.autobin` supporting the script `coverage_bin_size.py`. (170)
- Removed sub-package `cnvlib.ngfrills`, moving most functionality to `samutil` and `tabio`.
- genome.GenomicArray: New method `total_range_size`, similar to pybedtools `total_coverage()`

0.8.2

This release covers a number of internal changes to improve the stability and consistency of CNVkit, as well as new and improved command options to make more features available from the command line.

Due to a slight change in the binning procedure (see `target` and `antitarget` below), newly generated target and antitarget BED files, or a reference generated with `batch`, may not use the same bin boundaries as earlier versions. CNVkit will check these files for consistency and alert you if your BED or .cnn files do not match because of this change, e.g. running `batch` from scratch with the same panel but with two different CNVkit versions. If you want to update CNVkit mid-project, either keep using the same reference.cnn file as before for all new samples (as always), or regenerate all your *.targetcoverage.cnn and *.antitargetcoverage.cnn files to build a new reference.

Dependencies
- pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably faster and better able to handle newer VCF and gVCF features. (159)
- pysam: Raise minimum version to 0.9.1.4.

Global changes
- When extracting a sample ID from a filename, instead of trimming everything after the first '.' character, only drop known or single-part extensions. For example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding potential naming conflicts in the `batch` command when processing multiple samples. (48)
- Always sort regions by genomic coordinates after reading a file. This doesn't modify the input file in-place, but ensures the output files are always sorted the same way.
- Gender detection is more robust. It now uses Mood's median test instead of the Mann-Whitney rank test. As a fallback for edge cases, e.g. only one segment per chromosome, it compares the difference of weighted medians between autosomes and sex chromosomes.
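
The sample-ID rule in the first item of this list might look roughly like the sketch below. The tuple `KNOWN_EXTENSIONS` and the function `sample_id` are hypothetical; the actual list of recognized extensions in CNVkit is not given here.

```python
import os

# Hypothetical multi-part extensions to strip as a unit.
KNOWN_EXTENSIONS = (".vcf.gz", ".targetcoverage.cnn", ".antitargetcoverage.cnn")

def sample_id(path):
    """Derive a sample ID by dropping a known or single-part extension."""
    name = os.path.basename(path)
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            return name[:-len(ext)]
    # Otherwise drop only the last extension, keeping earlier dots intact.
    return os.path.splitext(name)[0]
```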

VCF parsing:
- Improve handling of VCFs from Mutect2 (122, 153) and bcftools (146).
- Don't reject records where FILTER is 'PASS' or '.'.
- VCF options are now consistent across the commands that can use them (`call`, `scatter`, `segment`, `export theta` and `export nexus-ogt`).
- New VCF option -z/--zygosity-freq to override VCF genotype calls. (153, 132)

Commands

`target`, `antitarget`:
- Divide bins evenly, using the same internal mechanism (the new GenomicArray.subdivide() method). Previously, subdivided regions were not always equal-sized as they should have been. Now, the coordinates of newly generated targets from a baits BED file may be a little different than before.
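
One way to divide a region into near-equal bins is sketched below. This is an illustration of the idea only; it is not the `GenomicArray.subdivide()` implementation, and the function signature is an assumption.

```python
def subdivide(start, end, avg_size):
    """Split [start, end) into bins of near-equal width close to avg_size."""
    span = end - start
    nbins = max(1, round(span / avg_size))
    # Integer edges spread evenly over the span; widths differ by at most 1.
    edges = [start + (span * i) // nbins for i in range(nbins + 1)]
    return list(zip(edges[:-1], edges[1:]))

bins = subdivide(0, 1000, 300)
```

Here a 1000-bp region with a 300-bp target size yields three bins of 333, 333 and 334 bp, rather than three 300-bp bins plus a short remainder.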

`target`:
- Drop zero-width bins (167).
- Improve assignment of gene names to targets in WGS datasets. (164)
- Accept any supported region format for --annotate, including BED, interval list and GFF, in addition to the already supported UCSC refFlat. The format is detected automatically. (163)
- Raise an error if the given annotations file (refFlat or equivalent) and the given baited/targeted intervals do not have any overlapping chromosomes.

`antitarget`:
- Set the default average bin size to 150kb. Previously, the CLI default was 200kb, but the API default was 100kb; experience shows 150kb works well.

`access`:
- Avoid a possible error when more than 1000 small regions are excluded from a single sequencing-accessible region. (150)

`coverage`:
- Fix a unicode vs. bytes incompatibility on Python 3. (147)
- Fix a crash if the input BED has more than 4 columns.

`reference`:
- Add -g/--gender option to declare the chromosomal sex of the input sample(s) (same for all), instead of detecting/guessing for each sample. (161)
- Ensure printed table of bad bins is a reasonable width. (140)

`segment`:
- With a VCF (`-v`), don't output 'cn1' and 'cn2' columns; calculate the 'baf' column the same as in `call`. (148)
- Improve memory efficiency somewhat when using a VCF. (162)
- Fix possible 1-base overlap of output segments when using the `cbs` or `flasso` methods. Specifically, the start positions were erroneously all shifted 1 base to the left before. (158)

`scatter`, `heatmap`:
- Improve rendering of genomes much smaller than the human genome, e.g. yeast, by scaling telomere padding to the total genome size. The blank space at chromosome boundaries was set to a fixed number of basepairs, but is now calculated as 0.3% of the whole genome size (sum of chromosome lengths) -- which works out the same for the human genome. (155)
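
The padding arithmetic is simple enough to sketch. The function name and the chromosome sizes below are made up for illustration; only the 0.3% figure comes from the note above.

```python
def chromosome_padding(chrom_sizes):
    """Blank space at chromosome boundaries: 0.3% of the total genome size."""
    total = sum(chrom_sizes.values())
    return total * 3 // 1000  # integer arithmetic avoids float rounding

# Hypothetical small genome of 1.5 Mb in two chromosomes.
small_genome = {"chrA": 1_000_000, "chrB": 500_000}
pad = chromosome_padding(small_genome)
```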

`scatter`:
- Add option `--segment-color`. Now you can choose 'red' if you like.

`metrics`:
- Input `-s`/`--segments` is now optional. If not given, compare bin log2 values to chromosome medians instead of segment means.

`import-theta`, `export theta`:
- Drop sex chromosomes, since THetA2 doesn't handle them well. (103, 153)

API

tabio:
- Read new formats: GFF (simply); UCSC genePred refFlat; sub-formats bed3, bed4
- Detect more formats with `tabio.read_auto`: BED, interval list, text coordinates (chr:start-end), refFlat, GFF, TSV with column names.
- Remove module `ngfrills.regions`, no longer needed.

GenomicArray:
- Moved to new sub-package 'genome'
- Rename method `select` to `filter`
- Rename method `match_to_bins` to `into_ranges` and generalize.
- New methods `flatten`, `merge`, `resize_ranges`, `subdivide`, `subtract`

In general, the 'genome' functionality can be reached by using the `tabio` sub-package to load a GenomicArray instance and use its methods directly:

    from cnvlib import tabio
    regions = tabio.read_auto(filename)
    # Generate 500bp flanking regions
    flanks = regions.resize_ranges(500).merge().subtract(regions)

0.8.1

This is primarily a bugfix release. The [documentation](https://cnvkit.readthedocs.io/) is also improved, particularly covering the cnvlib API.

API:
- For convenience in scripting, the relevant functions for running each CLI command (`cnvlib.commands.do_*`) are exported to the top level. For example: `import cnvlib; cnvlib.do_batch(...)`

Bug fixes:
- `access`: Avoid a type-validation error on Python 3. (141)
- `batch`: Parallel processing now selects an appropriate number of workers for each step of the pipeline, reducing CPU contention when processing multiple samples in parallel. (138)
- `call`: Apply the `ci` and `sem` filters before calculating b-allele frequencies and absolute copy number, as these filters can alter the final calls.
- `reference`: Safely handle an edge case in detecting gender from sample coverage depths when all bins have identical coverage depth, e.g. no coverage. (144)
- `segment`: Fix handling and segmentation of SNV allele frequencies from a VCF. Ensure output column ordering is correct. Avoid a crash that could occur when SNV segmentation produces a segment that does not cover any coverage bins. (chapmanb/bcbio-nextgen#1590)
- _cnvlib.tabio_: Improve handling of empty files, including VCFs with no samples and/or no locus records. If records and samples are present but genotypes are missing or undetectable, `scatter`, `call` and `export` would previously reject all records when filtering for SNPs, but will now accept all records instead.

0.8.0

This is a larger release and the first update since our [publication](http://dx.doi.org/10.1371/journal.pcbi.1004873).

CNVkit now runs under Python 3 as well as 2.7. (3, 101; thanks mpschr)

File format changes:
- New "depth" column in .cnn, .cnr, .cns
- In .cns, "weight" is the sum, not mean, of bin-level weights within the segment

New script `cnn_updater.py` can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.

Algorithmic changes:
- `reference`, `gender`, `call`, `diagram`, `export`: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (116)
- `reference`, `fix`, `call`: Center log2 values by median of chromosome medians, by default. (114)
- `reference`, `metrics`, `segmetrics`: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).
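
Centering by the median of chromosome medians, as in the second item above, can be sketched with the standard library. The function name and input layout are hypothetical, and the note does not say which chromosomes are included; this sketch simply uses all of them.

```python
from statistics import median

def center_by_chrom_medians(bins):
    """Shift log2 values so the median of per-chromosome medians is zero.

    `bins` is a list of (chromosome, log2) pairs.
    """
    by_chrom = {}
    for chrom, log2 in bins:
        by_chrom.setdefault(chrom, []).append(log2)
    shift = median(median(values) for values in by_chrom.values())
    return [(chrom, log2 - shift) for chrom, log2 in bins]

# Per-chromosome medians are 0.2, 0.3 and 1.0, so the shift is 0.3.
bins = [("chr1", 0.1), ("chr1", 0.3), ("chr1", 0.2),
        ("chr2", 0.2), ("chr2", 0.4),
        ("chr3", 1.0)]
centered = center_by_chrom_medians(bins)
```

Taking the median across chromosomes keeps one large aneuploid chromosome (here chr3) from dominating the shift.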

These deprecated components (since 0.7.x) have been removed:
- Commands `rescale` and `loh` -- use `call` and `scatter`, respectively, instead
- Some options in `export bed` and `export theta` -- use `call` first instead
- Script `genome2access.py` -- use `cnvkit.py access` instead

Updated commands:

`batch`:
- New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
- Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between `batch` runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in `batch`.
- Add --drop-low-coverage option, which is passed to `segment` internally.
- The -p/--processes option is also passed to `coverage` and `segment` internally (see below).

`antitarget`:
- Increase the default average bin size from 100kb to 200kb.

`coverage`:
- Parallelize coverage calculation over BED rows. The number of threads can be specified with the `-p` option. (121; thanks brentp)

`segment`:
- Parallelize CBS and Haar segmentation methods across chromosomes. (123, 125; thanks brentp)

`call`:
- New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
- With VCF b-allele frequencies (`-v`, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (105; thanks mpschr)
- If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
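
The ordering and zero-copy behavior in the last two items can be illustrated with a sketch. The formula (rounding the total copy number times the mirrored BAF) is an assumption made for illustration and may differ from what `call` actually computes; the function name is hypothetical.

```python
def allelic_copy_numbers(total_cn, baf):
    """Split an integer total copy number into (cn1, cn2) with cn1 >= cn2."""
    if total_cn == 0:
        return 0, 0  # report zeros rather than NaN, even if BAF is undefined
    major_fraction = max(baf, 1.0 - baf)  # mirror BAF to the major allele
    a = round(total_cn * major_fraction)
    b = total_cn - a
    return max(a, b), min(a, b)  # cn1 is always the larger allele count
```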

`scatter`:
- Add --title option.
- Allow selecting & labeling gene(s) w/ only segments as input.

`heatmap`, `scatter`:
- Allow saving plots in any image file format supported by matplotlib, not just PDF. The file format is determined by the output filename's extension, e.g. 'png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (120; thanks chapmanb)

`diagram`:
- Add -g/--gender option to specify sample's known gender.

`gainloss`:
- Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (107, 108; thanks mpschr)

`gender`:
- Show column headers and Y-chromosome log2 values in the output table.

`segmetrics`:
- Add stats options for mean, median, mode
- Add MSE, SEM stats as options

`metrics`, `segmetrics`:
- Add --drop-low-coverage option (like in `segment` and `gainloss`)

Internals:
- New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start-end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
- Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
- New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
- Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks kyleabeauchamp, rmcgibbo, and mpharrigan; 110)

Bug fixes:
- `batch`: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (55)
- `segment`:
  - Skip possible R warning text when parsing CBS output (106) and run Rscript with the --vanilla option (112; thanks jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
  - Handle zero-weight bins better (128; thanks chapmanb).
- `scatter`:
  - Handle selected segments with an empty gene name (104; thanks mpschr).
  - Don't crash on zero-length GenomicArray/CopyNumArray inputs.
- VCF parsing (now within tabio) improved:
  - More robust to missing genotype (GT) & depth (DP) fields (102)
  - Handle VCFs from MuTect2 (122)
- `export theta`: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
- `heatmap`: Avoid a possible crash if a sample is missing a chromosome.

Packaging:
- Universal wheels are enabled for installation with pip (setup.cfg).

New & updated dependencies:
- futures
- futurize
- numpy raised to version 1.9
- pandas raised to version 0.18.1
- pysam version 0.9.1.1 is specifically excluded
