Scarap

Latest version: v1.0.0

0.4.0

**Features:**

* The `build` and `search` modules have been improved.
  * Both modules now use MMseqs2 profile searches instead of HMMER profile searches. This results in a big speed increase and removes the HMMER dependency from SCARAP.
  * The `build` module can now build a core genome database; it takes the arguments `--core-prefilter`, `--core-filter` and `--max-core-genes`.
  * The `build` module now always trains score cutoffs on all orthogroups that were supplied (and selected).
* The `core-pipeline` module has been improved and has been renamed to `core`.
  * Of course, the `core` module benefits from the improvements of the `build` and `search` modules.
  * The final core genes are now selected on the seed genomes instead of on all genomes. This means that the `core` module is now scalable to larger datasets. To compensate for this, the default number of seed genomes has been increased to 100. Instead of a "seedfilter" and "allfilter", a user now specifies a "prefilter" and "filter" with the arguments `--core-prefilter` and `--core-filter` (just like with the `build` module).
  * The prefilter and filter are now based on single-copy occurrence instead of overall occurrence.
  * It is now possible to specify a maximum number of core genes to extract with the parameter `--max-core-genes`.
* The `clust` module has been renamed to `sample`.
  * The argument `--max_clusters` has been renamed to `--max-genomes`.
* The `supermatrix` module was renamed to `concat` and now allows for core gene selection with the arguments `--core-filter` and `--max-core-genes`.
* The `pan-pipeline` module has been removed.
  * Its functionality can now be achieved by combining `pan` and `build`.
* Some argument names were improved.
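
For scripts written against an older release, the renames listed above can be captured in a small lookup table. The following is only a sketch: `translate_args` is a hypothetical helper that is not part of SCARAP, and it covers only the renames spelled out in these notes (other argument names also changed but are not listed here). It assumes the module name is the first element of the argument list and that options are passed as separate tokens.

```python
# Module and argument renames taken from the 0.4.0 notes.
# None marks a module that was removed rather than renamed.
MODULE_RENAMES = {
    "core-pipeline": "core",
    "clust": "sample",
    "supermatrix": "concat",
    "pan-pipeline": None,
}
ARGUMENT_RENAMES = {"--max_clusters": "--max-genomes"}


def translate_args(old_args):
    """Translate a pre-0.4.0 argument list to 0.4.0 names (hypothetical helper)."""
    if not old_args:
        return []
    module, rest = old_args[0], old_args[1:]
    if module in MODULE_RENAMES:
        new_module = MODULE_RENAMES[module]
        if new_module is None:
            raise ValueError(
                f"module '{module}' was removed in 0.4.0; combine 'pan' and 'build'"
            )
        module = new_module
    # Arguments that are not in the table are passed through unchanged.
    return [module] + [ARGUMENT_RENAMES.get(arg, arg) for arg in rest]
```

For example, `translate_args(["clust", "--max_clusters", "50"])` returns `["sample", "--max-genomes", "50"]`.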

**Bug fixes:**

* Tabs in fasta headers no longer cause problems.
* Fixed a bug in the `core` module where a Python error was thrown when a folder with faa files was used as input.
* Fixed a bug in the `sample` module: the last line of seeds.txt did not contain a newline character.

0.3.2

**This is a small update with some user-friendliness improvements and bug fixes.** It may be worth it to update to this version, because SCARAP v0.3.1 couldn't deal with MMseqs2 release 13 (February 24th, 2021).

**Features:**

* A folder with fasta files is now allowed as input (as an alternative to a file with paths to individual fasta files).
* An error is no longer thrown when the detected MMseqs2 version is unknown, and MMseqs2 release 13 is now recognized.
* SCARAP now checks whether gene names are unique across fasta files.
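
To make the two input modes and the uniqueness check concrete, here is a minimal sketch. It is not SCARAP's actual code: the function names are hypothetical, the gene name is assumed to be the first whitespace-delimited token of each fasta header, and only uncompressed files are read to keep the example short.

```python
from collections import defaultdict
from pathlib import Path


def list_fasta_paths(source):
    """Return fasta paths from a folder, or from a text file with one path per line."""
    source = Path(source)
    if source.is_dir():
        return sorted(p for p in source.iterdir() if p.is_file())
    lines = source.read_text().splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def find_duplicate_genes(fasta_paths):
    """Map every gene name that occurs more than once to the files containing it."""
    seen = defaultdict(list)
    for path in fasta_paths:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # Assumption: the name is the first token of the header.
                    fields = line[1:].split()
                    name = fields[0] if fields else ""
                    seen[name].append(str(path))
    return {name: files for name, files in seen.items() if len(files) > 1}
```

An empty result from `find_duplicate_genes(list_fasta_paths("genomes/"))` would mean that all gene names are unique; `genomes/` is a placeholder path.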

**Bug fixes:**

* Fasta headers without spaces no longer result in an error.
* Having zero splittable superfamilies no longer results in an error.
* Gene identifiers with "|" are now dealt with correctly.

0.3.1

With my PhD defense approaching quickly, I'm wrapping up some enhancements to Progenomics and **releasing them in a new version: v0.3.1**. The main improvements are a new module that can cluster huge genome datasets in linear time and fixes for some rare but annoying bugs in the `pan` module. In addition, the toolkit receives a brand-new name: **SCARAP (short for scalable and rapid pangenomes)**. The main reasons for this name change are that the name Progenomics was a bit too generic and that it could be confused with the [progenomes](https://progenomes.embl.de/) database of the EMBL. SCARAP can of course do more than infer pangenomes, but that is one of its main features. Also, I like the sound of SCARAP, and so did my colleagues at the [Lebeer Lab](https://lebeerlab.com/). SCARAP fulfills most of the criteria for a good software name: it is short, unique and easy to remember (or so I hope).

**Features:**

* A `clust` module has been added that can cluster genome datasets in linear time, given a desired number of clusters and/or an ANI/AAI-like cutoff.
* A `fetch` module has been added that extracts sequences of a pangenome into a fasta file per orthogroup.
* An "S" pangenome inference strategy has been added that runs the superclustering step only, without cluster splitting.
* The `pan` module has been made compatible with MMseqs2 releases 11 and 12 (but is now no longer compatible with releases before that).
* All modules are now able to read gzipped fasta files.
* The output of the `checkgroups` module now also contains the total occurrence of the orthogroups (next to their single-copy occurrence).
* Invalid command-line arguments are now corrected when possible (e.g. 50 to 0.50 for a percentage).
* Logging has been improved for many modules.
* Version checks were added for MMseqs2 and MAFFT.
* Large temporary files are now removed for the `pan` and `supermatrix` modules.
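
The argument correction mentioned in the list above (50 becoming 0.50) could look roughly like the following. This is a sketch of the idea only, not SCARAP's implementation: `correct_fraction` is a hypothetical name, and the rule that values above 1 and up to 100 are treated as percentages is an assumption made for illustration.

```python
def correct_fraction(value):
    """Return a fraction in [0, 1], rescaling values that look like percentages."""
    value = float(value)
    if 0.0 <= value <= 1.0:
        return value
    if 1.0 < value <= 100.0:
        # Assumption: a value such as 50 was meant as 50 %, i.e. 0.50.
        return value / 100.0
    raise ValueError(f"cannot interpret {value} as a fraction or percentage")
```

A function like this can be passed as the `type` of an `argparse` argument, so that the correction happens while the command line is parsed.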

**Bug fixes:**

* Some rare but important bugs of the `pan` module were fixed:
  * Error in the cluster splitting process when all sequences of a cluster had the same representative
  * Error when the first characters of an amino acid sequence made it look like a DNA sequence
  * Occasional false negative split of high-copy families
* Some other, small bugs
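
The DNA look-alike problem is easy to reproduce in isolation, because the letters A, C, G, T and N are also valid amino acid codes. The sketch below only illustrates that failure mode under stated assumptions; it is not the detection logic of SCARAP or MMseqs2, and `looks_like_dna` is a hypothetical helper.

```python
DNA_LETTERS = set("ACGTN")


def looks_like_dna(sequence, prefix_length=None):
    """Return True when every inspected character is a nucleotide letter.

    With prefix_length=None the whole sequence is inspected.
    """
    inspected = sequence.upper()[:prefix_length]
    return bool(inspected) and set(inspected) <= DNA_LETTERS


# A protein that happens to start with Ala, Thr, Gly and Cys residues.
protein = "ATGCCATGMKVLAAGIVGL"
prefix_guess = looks_like_dna(protein, prefix_length=6)  # misclassified as DNA
full_guess = looks_like_dna(protein)  # correctly rejected
```

Inspecting only a short prefix classifies this protein as DNA, whereas inspecting the complete sequence does not.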

0.3.0

Progenomics now has its own builtin pangenome strategies!

* You can still set the pangenome inference strategy with the `-d` command-line option.
* The new default strategy, called "FH", is very fast in comparison to existing tools such as OrthoFinder or SonicParanoid. In addition, it scales more or less linearly with the number of input genomes.
* A variant of the FH strategy, called "H", is not scalable but is even faster on relatively small datasets (~60 prokaryotic genomes or fewer).
* You can still use OrthoFinder for pangenome inference by setting the strategy to "O-B" (for OrthoFinder with BLAST) or "O-D" (for OrthoFinder with DIAMOND).

0.2.0

This is a major overhaul of the entire toolkit:

* The interface has been rewritten, with simpler tasks and shorter and more intuitive task names
* The following tasks are now available: `pan`, `build`, `search`, `checkgenomes`, `checkgroups`, `filter` and `supermatrix`
* The following pipelines are now available: `pan-pipeline` and `core-pipeline`
* The R dependencies have been removed
* Dependencies are now checked before running
* Logging has been improved
* All tasks now assume unzipped fasta files

We're still not at major version 1 because task names can still change, as well as some of the functionality (for example, the way in which HMMER score cutoffs are trained).

0.1.0

This is the first public release of Progenomics! It can handle the following tasks:

* construct_profile_db
* prepare_candidate_scgs
* select_scgs
* select_genomes
* construct_scg_matrix
* construct_supermatrix
* calculate_scnis
* nucleotide_supermatrix_from_scg_matrix
