New Features
- `seismic ensembles` is a new command that performs scanning clustering, similar to DRACO (https://doi.org/10.1038/s41592-021-01075-w).
  - A region of choice (default: full) is divided into overlapping subregions of identical lengths. By default, the length is chosen so that the average read has 2 mutations (the minimum for clustering).
  - Each subregion is clustered to determine how many structures it forms.
  - Consecutive subregions that form the same number of clusters, with similar mutation rates, are joined automatically.
  - Each joined region suggests the existence of an RNA module that folds into one or more structures independently of the surrounding sequences.
- `seismic join` can now automatically determine the best way to match clusters from multiple regions, without needing a `--join-clusts` file. The regions must still have the same number of clusters (or none, if coming from the Mask step).
- Three new types of graph have been added (and are also now available through `seismic wf`):
  - `seismic graph abundance`: Abundance of each cluster, either as a fraction of the ensemble or as a number of reads in the cluster.
  - `seismic graph poscorr`: Phi correlation of the mutations between every pair of positions.
  - `seismic graph mutdist`: Histogram of the distance between the two closest mutations in each read (or 0 for each read with fewer than two mutations).
- Both `pool` and `join` now tabulate the pooled/joined datasets automatically (with the option to turn this off).
- Five new commands provide easy access to online resources:
  - `seismic biorxiv`: the preprint on bioRxiv
  - `seismic docs`: the documentation on GitHub pages
  - `seismic github`: this GitHub repository
  - `seismic pypi`: the Python Package Index page for SEISMIC-RNA
  - `seismic conda`: the Conda page for SEISMIC-RNA
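
To make the quantities plotted by `seismic graph mutdist` and `seismic graph poscorr` concrete, here is a minimal Python sketch. It is not SEISMIC-RNA's implementation: it assumes each read is represented as a set of mutated positions and that every read covers both positions being compared, and the function names `min_mutation_gap` and `phi_correlation` are hypothetical.

```python
from math import sqrt


def min_mutation_gap(mutated_positions):
    """Distance between the two closest mutations in one read.

    Returns 0 if the read has fewer than two mutations.
    """
    positions = sorted(mutated_positions)
    if len(positions) < 2:
        return 0
    return min(b - a for a, b in zip(positions, positions[1:]))


def phi_correlation(reads, pos_a, pos_b):
    """Phi coefficient between mutations at two positions.

    `reads` is an iterable of sets of mutated positions (an assumed
    representation); every read is assumed to cover both positions.
    """
    n11 = n10 = n01 = n00 = 0
    for muts in reads:
        a, b = pos_a in muts, pos_b in muts
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either position is always or never mutated.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom
```

A histogram of `min_mutation_gap` over all reads corresponds to the quantity described for `mutdist`, and evaluating `phi_correlation` for every pair of positions corresponds to the quantity described for `poscorr`.
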
Bug Fixes
- In `seismic relate`, the algorithm that finds ambiguous indels has been redesigned to make it non-recursive, so that it will no longer take extreme amounts of time to process reads with indels in long stretches of low-quality base calls.
- Fixed a bug in `seismic align` and `seismic relate` where processing multiple FASTQ/BAM files with the same sample and reference names could cause crashes or files to be overwritten. Now, if this situation is detected at the beginning, an error is raised to protect the data.
- Fixed a bug in `seismic mask` and `seismic cluster` where processing datasets in multiple output directories with the same sample, reference, and region names could cause crashes or files to be overwritten.
- Fixed a bug where running `seismic mask` with multiple regions of the same Relate dataset simultaneously would cause all but one of those regions to crash.
- Fixed a bug in `seismic mask` where if 0 positions remained at the end of one iteration, it would fail to mask out the remaining reads.
- In `seismic mask`, `-s` has been renamed to `-i`.
- In `seismic join` for clustered datasets, if the joined mask report already exists, `seismic join` now checks that the regions listed in it match the regions to be joined in the cluster report, and raises an error if they do not.
- When correcting observer bias, the algorithm now issues a warning for `ValueError: Jacobian inversion yielded zero vector` instead of crashing.
- In `seismic cluster`, errors that occur when calculating the jackpot quotient or creating the graphs now trigger warning messages rather than crashes.
- Graphs that take two tables (`corroll`, `delprof`, and `scatter`) now always sort the names of their two samples alphabetically, so that they don't generate multiple sample directories for the same graphs depending on the order of their arguments.
- Replaced `static const char` variables with macros for backwards compatibility with older versions of C.
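
The up-front check described above for `seismic align` and `seismic relate` (refusing to start when two inputs share a sample and reference name) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: inputs are assumed to have been reduced to `(sample, reference)` tuples already, and `check_unique_inputs` is a hypothetical name.

```python
from collections import Counter


def check_unique_inputs(sample_ref_pairs):
    """Raise ValueError if any (sample, reference) pair occurs more than once.

    Checking before any processing starts means no output file is
    written, so nothing can be overwritten by a colliding input.
    """
    counts = Counter(sample_ref_pairs)
    duplicates = sorted(pair for pair, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Multiple inputs share sample and reference names: {duplicates}"
        )
```

Failing before any work is done is the key design choice here: a collision detected midway through a run would leave partially written outputs behind.
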
Logging
- There are now eight levels, including a new level ACTION (for writes to the filesystem and shell commands).
- The verbosity arguments now range from `-vvvv` (log everything to console) to `-qqqq` (log nothing to console).
- Many of the logging messages have been made more concise for easier comprehension.
- SEVERE has been renamed to FATAL.
- COMMAND has been renamed to STATUS.
Other Changes
- To speed up `seismic relate`, batches of read names and the table of relationships per read are no longer written by default, since they are rarely needed but writing them takes a relatively long time.
- The `meson.build` file now includes compiler flags `-O2` and `-DNDEBUG` to make the machine code for `seismic relate` more efficient.
- The `pool` and `join` commands now require specifying the name of the pooled sample and joined region, respectively, to avoid confusion over what the new sample/region is named.
- When calculating the BIC, the threshold for the reads-to-parameters ratio has been relaxed so that fewer warnings are issued.
- In the API, run functions now accept generic `Iterable[str | Path]` arguments where they previously expected only `tuple[str, ...]` arguments.
- Unit tests now test all types of graphs, and run in double verbose mode on the command line.
**Full Changelog**: https://github.com/rouskinlab/seismic-rna/compare/v0.22.3...v0.23.0