Somaticseq

Latest version: v3.10.0

Page 3 of 10

3.6.0

* Rewrote the XGBoost routine to use the xgboost library in Python (somaticseq/somatic_xgboost.py, which also requires the pandas library). Also made it the default algorithm for SomaticSeq because xgboost in Python is orders of magnitude faster than AdaBoost in R. You can still use ada in R by invoking `-algo ada` in the command.
* Worked around VarDict's latest output VCF files, which are incompatible with bedtools, by removing the incompatible lines (i.e., when ALT has \<DUP\>, \<DEL\>, or \<INV\> but the INFO column has no END field). An extra step was added to somaticseq/combine_callers.py for this; it may be removed later if it becomes unnecessary.
* Finally removed the legacy SomaticSeq.Wrapper.sh and ssSomaticSeq.Wrapper.sh scripts (replaced by somaticseq_parallel.py since v3.0.0).
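
The VarDict workaround described above can be illustrated with a minimal sketch. This is not the code in somaticseq/combine_callers.py; the function names `is_bedtools_safe` and `drop_incompatible` are hypothetical, and the sketch assumes plain tab-separated VCF lines with ALT in the fifth column and INFO in the eighth.

```python
# Symbolic ALT alleles named in the release note.
SYMBOLIC_ALTS = {"<DUP>", "<DEL>", "<INV>"}


def is_bedtools_safe(vcf_line):
    """Return False for a record with a symbolic ALT but no END key in INFO."""
    if vcf_line.startswith("#"):
        return True  # header lines are always kept
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return True  # assumption: leave short or malformed lines untouched
    alt, info = fields[4], fields[7]
    has_symbolic = any(allele in SYMBOLIC_ALTS for allele in alt.split(","))
    # Compare exact INFO keys so that a key such as ENDPOS does not count as END.
    info_keys = {item.split("=", 1)[0] for item in info.split(";")}
    return not (has_symbolic and "END" not in info_keys)


def drop_incompatible(lines):
    """Keep only the lines that pass is_bedtools_safe."""
    return [line for line in lines if is_bedtools_safe(line)]


example = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\t.\tA\t<DEL>\t50\tPASS\tSVTYPE=DEL;SVLEN=-20\n",
    "chr1\t200\t.\tA\t<DUP>\t50\tPASS\tSVTYPE=DUP;END=260\n",
    "chr1\t300\t.\tA\tT\t50\tPASS\tDP=30\n",
]
kept = drop_incompatible(example)
```

In this made-up input only the `<DEL>` record without END is dropped; the header, the `<DUP>` record with END, and the ordinary SNV are kept.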

3.5.1

* Fixed a minor bug where num_caller in somaticseq/somatic_vcf2tsv.py and somaticseq/single_sample_vcf2tsv.py was not reset properly when there are multiple variant calls at the same genomic position. As a result, some variant calls that should not have been written to the .tsv file (because num_caller=0) were written anyway, because num_caller kept adding counts from the previous variant call at the same genomic coordinate. However, the features were still reported correctly, so the classification results should stay the same.
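
The class of bug described above can be shown with a toy sketch. It is not the actual vcf2tsv code: the record layout, the `calls` flags, and both function names are hypothetical, and the sketch assumes the counter was being reset per position rather than per variant.

```python
def kept_variants_buggy(records):
    """Reset the counter only when the position changes (the faulty pattern)."""
    kept = []
    last_pos = None
    num_caller = 0
    for rec in records:
        if rec["pos"] != last_pos:
            num_caller = 0
            last_pos = rec["pos"]
        # Counts from an earlier variant at the same position leak in here.
        num_caller += sum(rec["calls"])
        if num_caller > 0:
            kept.append(rec["alt"])
    return kept


def kept_variants_fixed(records):
    """Compute the counter from scratch for every variant."""
    kept = []
    for rec in records:
        num_caller = sum(rec["calls"])
        if num_caller > 0:
            kept.append(rec["alt"])
    return kept


# Two candidate variants at the same coordinate; only the first was called.
records = [
    {"pos": ("chr1", 100), "alt": "T", "calls": [1, 1, 0]},
    {"pos": ("chr1", 100), "alt": "G", "calls": [0, 0, 0]},
]
buggy = kept_variants_buggy(records)   # keeps both "T" and "G"
fixed = kept_variants_fixed(records)   # keeps only "T"
```

The second variant has no supporting caller, yet the faulty version keeps it because the count of 2 from the first variant carries over.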

3.5.0

* Replaced z-scores from scipy's ranksums with p-values from scipy's mannwhitneyu, mostly because mannwhitneyu corrects for discrete values. **Thus, models built prior to this version are no longer compatible with it due to different features.**
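
To make the difference between the two features concrete, here is a standard-library sketch of a plain rank-sum z-score next to a Mann-Whitney p-value that uses the normal approximation with tie and continuity corrections. It assumes that "corrects for discrete values" refers to those two corrections. It is not scipy's implementation and not SomaticSeq's feature code, and all function names are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def rank_sum_z(x, y):
    """z-score of the rank sum of x, without tie or continuity correction."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (sum(ranks[:n1]) - expected) / sd


def mann_whitney_p(x, y):
    """Two-sided p-value with tie-corrected variance and continuity correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 1.0  # every value is identical, so there is no evidence of a shift
    z = max(abs(u1 - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(variance)
    return two_sided_p(z)


# Made-up base qualities: heavily tied, as discrete read-level values often are.
ref_reads = [30, 30, 40, 40, 20]
alt_reads = [20, 30, 30, 30, 40]
z_feature = rank_sum_z(ref_reads, alt_reads)
p_feature = mann_whitney_p(ref_reads, alt_reads)
```

A z-score and a p-value live on different scales, which is one way to see why a model trained on the old feature cannot be reused with the new one.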

3.4.2

* Modified the linguistic sequence complexity calculation to limit the substring length to 20 bp. This decreases runtime without sacrificing accuracy.
* Fixed a bug where the indels nearest to a position were not calculated properly when there are additional insertions and soft-clipped bases in a read.

3.4.1

* Fixed a minor bug where the number of indels within 3 bp of the variant site double-counted the indels adjacent to the variant site. Not a major issue, since it is a very minor feature (one of the lowest-ranked in feature importance).

3.4.0

* Added [linguistic sequence complexity (LC)](https://doi.org/10.1093/bioinformatics/18.5.679) as a feature: 80-bp window adjacent to and spanning the variant position. For the adjacent windows, the lower value (between right and left) is retained. **Therefore, be careful with models trained before this release: the feature set has changed.**
* Fixed a bug in xgboost mode where training and prediction modes used different feature sets.
* Changed the ada model file name to have "ada" in it.
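
As an illustration of the LC feature, the sketch below uses one common definition of linguistic complexity: the number of distinct substrings observed, divided by the maximum number possible for a sequence of that length over a 4-letter alphabet. It is not necessarily the exact formula or windowing that SomaticSeq uses. The function names, the `max_k` cap (set to 20 to mirror the 20-bp substring limit introduced in 3.4.2), and the 0-based window boundaries are assumptions.

```python
def max_vocabulary(seq_length, max_k, alphabet_size=4):
    """Upper bound on distinct substrings of length 1..max_k."""
    return sum(
        min(alphabet_size ** k, seq_length - k + 1) for k in range(1, max_k + 1)
    )


def linguistic_complexity(seq, max_k=20):
    """Observed distinct substrings divided by the maximum possible."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0  # assumption: treat an empty window as minimal complexity
    max_k = min(max_k, n)
    observed = 0
    for k in range(1, max_k + 1):
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
    return observed / max_vocabulary(n, max_k)


def lc_features(ref_seq, pos, window=80):
    """Return (adjacent, spanning) LC for a 0-based position in ref_seq."""
    left = ref_seq[max(0, pos - window):pos]
    right = ref_seq[pos + 1:pos + 1 + window]
    half = window // 2
    spanning = ref_seq[max(0, pos - half):pos + half]
    # Keep the lower of the two flanking values, as the release note describes.
    adjacent = min(linguistic_complexity(left), linguistic_complexity(right))
    return adjacent, linguistic_complexity(spanning)


# Made-up reference: a homopolymer on the left, a short repeat on the right.
ref = "A" * 80 + "C" + "ACGT" * 20
adjacent_lc, spanning_lc = lc_features(ref, pos=80)
```

Capping the substring length keeps the number of k-mer sets per window fixed, which is consistent with the runtime improvement reported for 3.4.2.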
