- Major update of database format and search code to improve overall memory useage. Most standard runs with LUCA-level database will run on a machine with 16GB RAM.
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many _k_-mers as we have observed by chance, for each family that has a match to a given query.
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
---
Brief overview of major changes to OMAmer
The OMAmer placement algorithm consists of two steps: placing a query sequence into a protein family (root level HOG in OMA), before placing it into a sub-family. The original OMAmer publication focused on providing better and faster subfamily-level assignments than methods based on closest-sequence. Recently, the group has developed [OMArk](https://github.com/DessimozLab/OMark), a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than required for OMArk, which meant that the largest gene families were much smaller and less diverse in _k_-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the _k_-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.
In order to allow for this, we developed a scoring mechanism based on the binomial distribution. For each family, we estimated the probability of a random _k_-mer matching. We can then compute the $\textrm{Binomial}(N_{\textrm{query}}, P_{\textrm{family}})$ distribution for each family with matches (with probability $P_{\textrm{family}}$), with the number of draws ($N_{\textrm{query}}$) being the number of _k_-mers in the query sequence. Computing the complementary CDF (survival function), we can compute the probability of matching at least as many _k_-mers matches as we have observed by chance, for each family that has a match. Note: the results of this test are computed in negative-log units (natural log) for accuracy.
This is then used to filter the list of families which have an overlap with the query sequence (argument “``--family-alpha``”, default $10^{-6}$), to give us a list of candidate families. Candidates are then ordered by a normalised _k_-mer count, in the same way as the original algorithm. The expected count is now computed using the binomial approximation, with any ties broken based on the proportion of the query sequence covered by matching _k_-mers, then by the p-value computed above. By default, only the top family is taken forward. Sub-family placement is as in the original manuscript. Further software optimisation was performed, but did not affect the underlying method. As an example, it is now possible to run using the LUCA database in under 12GB of memory, whereas before this was using in excess of 40GB.