Snp-pipeline

Latest version: v2.2.1

2.0.0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Changes Impacting Backwards Compatibility:**

* Moved the bam file creation functionality from the call_sites command to the
map_reads command. The map_reads command takes fastq files as input (as before) and
produces a finished bam file. The call_sites command now only creates a pileup and finds
high-confidence variant sites.
* Added local realignment around indels. This is an optional step enabled by default.
When enabled, local realignment creates a dependency on Picard and GATK. See
:ref:`local-realignment-label`.
* The pipeline uses multiple CPU threads when running SAMtools.
* SAMtools version 1.4 or higher is required. Older versions are no longer supported.
* When running in an HPC environment, the pipeline is configured to use 20 CPU threads by
default when mapping reads to the reference. You can change this by configuring the
:ref:`CpuCoresPerProcessOnHPC-label` parameter.
* When running on a workstation, the pipeline is configured to use 8 CPU threads by
default when mapping reads to the reference. You can change this by configuring the
:ref:`CpuCoresPerProcessOnWorkstation-label` parameter.
* Changed the algorithm used to compute the Average Insert Size metric. The new algorithm uses
SAMtools stats. In most cases the average insert size will be larger than before.
* Added a new metric called ``Percent Proper Pair``, which measures the percentage of all reads that
are aligned to the reference in the proper orientation and within the expected paired-end distance.
See :ref:`metrics-usage-label`.
* Fixed a bug causing non-compliance with the VCF version 4.1 specification. Prior to this
fix, the VCF files sometimes wrongly contained ``-`` characters in the ALT field to represent
deletions. VCF files are now generated according to the VCF version 4.2 specification, and deletions,
if present, are represented with ``*`` characters.
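To check older VCF files for the non-compliant ``-`` alleles described in the last item above, a small scan of the ALT column is enough. The following is a minimal sketch, not part of the pipeline: the function name ``find_invalid_alt_records`` and the sample records are hypothetical, and it assumes plain tab-delimited VCF text.

```python
def find_invalid_alt_records(vcf_lines):
    """Return (line_number, alt) pairs for data lines whose ALT field has a '-' allele."""
    invalid = []
    for line_number, line in enumerate(vcf_lines, start=1):
        # Skip blank lines, meta-information lines, and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # Not a complete VCF data line.
        alt = fields[4]
        # ALT may hold several comma-separated alleles.
        if "-" in alt.split(","):
            invalid.append((line_number, alt))
    return invalid


# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr1\t200\t.\tC\t-\t.\tPASS\t.\n",
    "chr1\t300\t.\tT\t*\t.\tPASS\t.\n",
]
print(find_invalid_alt_records(example))  # [(4, '-')]
```

Files produced by version 2.0.0 or later should yield an empty list, since deletions are written as ``*``.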

**Other Changes:**

* Increased the configurable map quality threshold to exclude poorly mapped reads from analysis.
See :ref:`SamtoolsSamFilter-ExtraParams-label`.
* Enhanced the SNP density filter to find dense regions of SNPs in multiple window sizes, each with
a different number of allowed SNPs. See :ref:`FilterRegions-ExtraParams-label`.
* Changed the SAMtools mpileup options to include read alignments that are not properly paired.
This change increases the number of detected SNPs. It also increases the effectiveness of the
density filter by causing the removal of SNPs in dense regions that would not otherwise have been
detected. See :ref:`SamtoolsMpileup-ExtraParams-label`.
* Increased the minimum required variant-supporting depth to call variants in phase 1 with VarScan.
See :ref:`VarscanMpileup2snp_ExtraParams-label`.
* Increased the minimum required supporting depth to make a call in phase 2 with the consensus caller.
See :ref:`CallConsensus-ExtraParams-label`.
* Added a ``--threads`` option to the map_reads script. This should only be used when building custom :ref:`step-by-step-workflows`.
* Updated the included datasets.
* Documented the tested versions of other software used by the pipeline. See :ref:`installation-label`.
* Fixed compatibility with Python 3 when running with Grid Engine.
* Fixed merge_vcf failure when merging many VCF files. Increased the number of open file descriptors when needed.
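The multi-window SNP density filter mentioned in the list above can be pictured with a sliding window over sorted SNP positions. This is only a sketch of the general idea under stated assumptions, not the pipeline's implementation: the function name, the ``(window_size, max_snps)`` pair format, and the rule that a window is dense when it holds more than ``max_snps`` SNPs are all assumptions made for illustration.

```python
def dense_snp_positions(positions, windows):
    """Return the sorted positions that fall in a dense region for any window.

    positions: iterable of integer SNP positions on one contig.
    windows:   iterable of (window_size, max_snps) pairs; a run of SNPs spanning
               fewer than window_size bases is dense if it has more than max_snps.
    """
    positions = sorted(positions)
    dense = set()
    for window_size, max_snps in windows:
        start = 0
        for end in range(len(positions)):
            # Advance the left edge until the span fits inside the window.
            while positions[end] - positions[start] >= window_size:
                start += 1
            # Too many SNPs inside this window: flag all of them.
            if end - start + 1 > max_snps:
                dense.update(positions[start:end + 1])
    return sorted(dense)


# Hypothetical positions and thresholds for illustration only.
positions = [10, 12, 15, 500, 900, 905, 2000]
print(dense_snp_positions(positions, [(10, 2)]))             # [10, 12, 15]
print(dense_snp_positions(positions, [(10, 2), (1000, 4)]))  # [10, 12, 15, 500, 900, 905]
```

Using several windows lets a tight threshold catch small clusters while a wider window catches broader regions with an elevated SNP count.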

1.0.1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Updated usage instructions and expected result files for the Agona and Listeria datasets.

1.0.0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Changes Impacting Backwards Compatibility:**

* Some configuration parameter names are changed. If you have been using a customized
configuration file, you should begin using a new configuration file.
* Simplified the configuration of multi-threading. Replaced the configuration parameters
MaxConcurrentCollectSampleMetrics, MaxConcurrentCallConsensus, and MaxConcurrentPrepSamples
with a single new configuration parameter ``MaxCpuCores``. See also :ref:`faq-performance-label`.
* The configuration file is no longer an executable bash script. However, you can still
substitute environment variables with the $VAR_NAME notation.
* Log file names are changed to harmonize with cfsan_snp_pipeline sub-command names.
* Grid and Torque job names are changed to match cfsan_snp_pipeline sub-command names.
* Deprecated all the old step-by-step scripts. These will be removed in a future release:

  * copy_snppipeline_data.py
  * prepReference.sh
  * alignSampleToReference.sh
  * prepSamples.sh
  * snp_filter.py
  * create_snp_list.py
  * call_consensus.py
  * mergeVcf.sh
  * create_snp_matrix.py
  * calculate_snp_distances.py
  * create_snp_reference_seq.py
  * collectSampleMetrics.sh
  * combineSampleMetrics.sh

* You may safely continue using ``run_snp_pipeline.sh``. It is not deprecated and will not be removed in future releases.
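Because the configuration file is no longer executed by bash, ``$VAR_NAME`` references have to be expanded by whatever reads the file. The sketch below shows one way this kind of substitution can work, using ``os.path.expandvars`` from the Python standard library. It is not the pipeline's own parser: ``read_config``, ``ExampleWorkDir``, and ``EXAMPLE_TMP`` are hypothetical names, and only ``MaxCpuCores`` comes from the notes above.

```python
import os


def read_config(text):
    """Parse NAME=value lines and expand $VAR_NAME references from the environment."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Ignore blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        # Drop optional surrounding double quotes, then substitute variables.
        value = value.strip().strip('"')
        params[name.strip()] = os.path.expandvars(value)
    return params


# EXAMPLE_TMP and ExampleWorkDir are hypothetical, for illustration only.
os.environ["EXAMPLE_TMP"] = "/scratch/tmp"
config_text = '''
# Example configuration fragment
MaxCpuCores=8
ExampleWorkDir="$EXAMPLE_TMP/work"
'''
print(read_config(config_text))
# {'MaxCpuCores': '8', 'ExampleWorkDir': '/scratch/tmp/work'}
```

Note that ``os.path.expandvars`` leaves references to undefined variables unchanged rather than replacing them with an empty string, which differs from bash.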

**Other Changes:**

* Sweeping changes under the hood replacing the main run_snp_pipeline shell script with equivalent
Python code.
* Added a new helper utility, ``qarrayrun``, to simplify creating and running array jobs on Grid
Engine or Torque.

0.8.2

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Fix samtools sort compatibility with samtools 0.1.19.

0.8.1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Fix collect metrics failure when the fastq sequence id line is missing the machine or flowcell.

0.8.0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Changes Impacting Backwards Compatibility:**

* Changed the collectSampleMetrics script to only accept input files in the sample directory,
not in arbitrary locations.
* Changed the combineSampleMetrics script to write to metrics.tsv by default, not stdout.
* Leading zeros are stripped from MiSeq flowcell identifiers in the metrics files.
* Added a dependency on Picard. You need to install Picard and change your CLASSPATH.
See :ref:`installation-label`.
* Removed the unused create_snp_pileup.py script.

**Bug Fixes:**

* Fixed the machine and flow cell reporting in the metrics file when the fastq read names are not
in the original Illumina format.
* Fixed the calculation of average pileup depth in the metrics file. The formula previously
included whitespace characters when calculating the length of the reference. The correct
average depth is slightly greater than previously calculated.
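The effect of the average pileup depth fix can be seen with a toy reference. The sketch below assumes the reference is FASTA text and that the total depth is already known. It does not reproduce the pipeline's actual code, and ``reference_length`` and ``average_depth`` are hypothetical helper names.

```python
def reference_length(fasta_text):
    """Count bases in FASTA text, ignoring header lines and all whitespace."""
    return sum(
        len("".join(line.split()))
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def average_depth(total_pileup_depth, fasta_text):
    """Average depth = summed depth over all positions / number of reference bases."""
    return total_pileup_depth / reference_length(fasta_text)


# Toy reference: 8 bases on two lines, so the sequence text holds 2 newline characters.
fasta_text = ">ref\nACGT\nACGT\n"
naive_length = len(fasta_text.replace(">ref\n", ""))  # Counts the newlines too: 10

print(reference_length(fasta_text))   # 8
print(average_depth(80, fasta_text))  # 10.0
print(80 / naive_length)              # 8.0, the understated value
```

Counting whitespace inflates the denominator, which is why the corrected average depth comes out higher.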

**Other Changes:**

* Sweeping changes under the hood replacing most shell scripts with equivalent Python code.
Repackaged the SNP Pipeline as a single executable with multiple sub-commands. The old scripts
still exist for backwards compatibility and are rewritten as one-liners calling the new
replacement commands. The main executable program is called :ref:`cmd-ref-cfsan-snp-pipeline`.
* Added the capability to remove duplicate reads from BAM files prior to creating the pileup and
calling SNPs. See :ref:`remove-duplicate-reads-label`. This change introduces a dependency on
``Picard`` and will require changing your CLASSPATH. See :ref:`installation-label`. You can
disable this step and keep the duplicate reads by configuring ``RemoveDuplicateReads=false``
in the configuration file.
* Added a new metric to count the number of duplicate reads in each sample.
* Capture read-group metadata in the SAM/BAM files during the read mapping step.
* Added a new configuration parameter, ``BcftoolsMerge_ExtraParams``, to allow customizing the
snpma.vcf files created when merging the consensus VCF files. See :ref:`configuration-label`.
* Removed the hard-coded wall-clock run-time limits for Torque and Sun Grid Engine jobs. Added
default limits (12 hours) to the configuration file. You can change the runtime limits for
all SNP Pipeline job steps with the ``Torque_QsubExtraParams`` or ``GridEngine_QsubExtraParams``
configuration parameters.
* Log the SNP Pipeline version in the header of all the log files.
* Changed the composition of the included Salmonella Agona data set to remove the excessively large
sample ERR178930 and include a more diverse set of isolates from different geographic locations,
different environmental sources, and different types of sequencing instruments.
