With my PhD defense approaching quickly, I'm wrapping up some enhancements to Progenomics and **releasing them in a new version: v0.3.1**. The main improvements are a new module that can cluster huge genome datasets in linear time and fixes for some rare but annoying bugs in the `pan` module. The toolkit also receives a brand-new name: **SCARAP (short for scalable and rapid pangenomes)**. The main reasons for this change are that the name Progenomics was a bit too generic and that it was easily confused with the [progenomes](https://progenomes.embl.de/) database of the EMBL. SCARAP can of course do more than infer pangenomes, but that is one of its main features. Also, I like the sound of SCARAP, and so did my colleagues at the [Lebeer Lab](https://lebeerlab.com/). SCARAP fulfills most of the criteria for a good software name: it is short, unique and easy to remember (or so I hope).

**Features:**
* A `clust` module has been added that can cluster genome datasets in linear time, given a desired number of clusters and/or an ANI/AAI-like cutoff.
* A `fetch` module has been added that extracts sequences of a pangenome into a fasta file per orthogroup.
* An "S" pangenome inference strategy has been added that runs the superclustering step only, without cluster splitting.
* The `pan` module has been made compatible with MMseqs2 releases 11 and 12 (but is now no longer compatible with releases before that).
* All modules are now able to read gzipped fasta files.
* The output of the `checkgroups` module now also contains the total occurrence of the orthogroups (next to their single-copy occurrence).
* Invalid command-line arguments are now corrected when possible (e.g. 50 to 0.50 for a percentage).
* Logging has been improved for many modules.
* Version checks were added for MMseqs2 and MAFFT.
* Large temporary files are now removed for the `pan` and `supermatrix` modules.
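
Two of the features above, reading gzipped fasta files and correcting a percentage such as 50 to 0.50, can be illustrated with a minimal Python sketch. This is not SCARAP's actual code: the function names `open_maybe_gzipped`, `read_fasta` and `normalize_fraction` are hypothetical, and the choice to detect compression from the gzip magic bytes is an assumption.

```python
import gzip


def open_maybe_gzipped(path):
    """Open a fasta file in text mode, whether or not it is gzip-compressed."""
    # Assumption: detect gzip by its two magic bytes (0x1f 0x8b)
    # instead of trusting the file extension.
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) tuples from a plain or gzipped fasta file."""
    header, chunks = None, []
    with open_maybe_gzipped(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize_fraction(value):
    """Return a fraction in [0, 1], rescaling values given as percentages."""
    value = float(value)
    if 0 <= value <= 1:
        # Already a fraction (note that 1 is read as 100%, not 1%).
        return value
    if 1 < value <= 100:
        # Looks like a percentage, e.g. 50 becomes 0.50.
        return value / 100
    raise ValueError(f"cannot interpret {value} as a fraction or percentage")
```

With these helpers, `read_fasta("genome.faa.gz")` and `read_fasta("genome.faa")` would yield the same records for the same content, and `normalize_fraction(50)` would return `0.5`. The file names here are placeholders.
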

**Bug fixes:**
* Some rare but important bugs of the `pan` module were fixed:
    * Error in the cluster splitting process when all sequences of a cluster had the same representative
    * Error when the first characters of an amino acid sequence made it look like a DNA sequence
    * Occasional false negative split of high-copy families
* Some other, small bugs
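
The sequence-type bug is easy to picture: A, C, G, T and N are also valid amino acid codes (alanine, cysteine, glycine, threonine and asparagine), so a protein can start with a stretch that reads like DNA. The Python sketch below shows one way to guard against that by judging the whole sequence instead of a prefix. It is only an illustration, not necessarily how the `pan` module was fixed; the function name `looks_like_dna` and the 0.9 threshold are assumptions.

```python
DNA_LETTERS = set("ACGTN")


def looks_like_dna(sequence, threshold=0.9):
    """Guess whether a sequence is DNA from its full length, not a prefix."""
    letters = [char for char in sequence.upper() if char.isalpha()]
    if not letters:
        return False
    # Fraction of letters that belong to the DNA alphabet.
    dna_fraction = sum(char in DNA_LETTERS for char in letters) / len(letters)
    return dna_fraction >= threshold


# A made-up protein whose first ten residues happen to be A, C, G or T.
protein = "ATGCCGTAGC" + "MKVLLIFDEQRWYHPS" * 5
dna = "ATGCGTNNACGTTTGACA"

print(looks_like_dna(protein[:10]))  # what a prefix-only check would see
print(looks_like_dna(protein))       # the whole sequence
print(looks_like_dna(dna))
```

A check on only the first ten characters would classify the example protein as DNA, while the full-length check does not.
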