Contributors
- Richard Morris
- AoboHill made their first contribution!
- pithirat-horvichien made their first contribution!
- First contributions from Robert McArthur and Vijini Mallawaarachchi!
- Kath Caley, massive contributions to the sequence annotation refactor and the documentation and ...!
- Thanks to Dr Minh Bui, author of IQ-tree, for a sample extended newick output
from IQ-tree!
ENH
- Added automated changelog management to the project
- Serializable deprecation of function and method arguments using a decorator
- Added `SequenceCollection.distance_matrix()` method. This method provides
a mechanism for k-mer based approximation of genetic distances
between sequence pairs. It is applicable only to DNA or RNA moltypes.
Sequences are converted into 10-mers and the Jaccard distance is computed
between them. This distance is converted into an estimate of a proportion
distance using a 10th-degree polynomial. (That polynomial was derived from
regression to distances from 116 mammal alignments.) The final step is applying
JC69 to these approximated proportion distances.
- Robert and Vijini added a function for computing the matching distance
statistic between tree topologies. It is available as
`cogent3.phylo.tree_distance.lin_rajan_moret`.
This is a better behaved statistic than Robinson-Foulds. The original
author was Dr Yu Lin who tragically passed away in 2022. He was our
dear friend, colleague and mentor.
Lin et al. 2012 "A Metric for Phylogenetic Trees Based on Matching"
IEEE/ACM Transactions on Computational Biology and Bioinformatics
vol. 9, no. 4, pp. 1014-1022, July-Aug. 2012
- Major rewrite of annotation handling! In short, we now use an in-memory SQLite
database to store all annotations from data sources such as GFF or GenBank. New
classes provide an interface to this database to support adding and querying records
that match certain conditions. The new database is added to `Sequence` or `Alignment`
instances on a `.annotation_db` attribute. When sequences are part of a collection
(`SequenceCollection` or `Alignment`) they share the same database. Features are now
created on demand via the `Sequence` or `Alignment` instances and behave much as the
original `_Annotatable` subclasses did. There are notable exceptions to this, as
outlined in the deprecated and discontinued sections.
This approach brings a massive performance boost in terms of both speed and memory.
A microbial genome sequence and all its annotations can be loaded in less than a second.
- A new `cogent3.load_annotations()` function allows loading an annotation
db instance from one, or more, flatfiles. If you provide an existing annotation
db instance, the records are added to that db.
- Capture extended newick formatted node data. This is stored in
`TreeNode.params["other"]` as a raw string.
- The `tree_align()` function now uses a new approximation method for faster
estimation of distances for obtaining a guide tree. This is controlled by
the `approx_dists` argument. The additional argument `iters` can be used to
do multiple iterations, using genetic distances computed from the alignment
produced by the preceding iteration to generate the new guide tree.
If `approx_dists` is `False`, or the moltype of the chosen model is not a nucleic acid
compatible type, distances are computed by the slower method of performing
all pairwise alignments first to estimate the distances.
- Added new alignment quality measures as apps, and the ability to invoke them
from the `Alignment.alignment_quality()` method. The new apps are the
Information content measure of Hertz and Stormo (denoted `ic_score`), a
variant on the sum of pairs measure of Carrillo and Lipman
(denoted `sp_score`), and the log-likelihood produced by the cogent3
progressive-HMM aligner (denoted `cogent3_score`). If these apps cannot
compute a score (e.g. the alignment has only 1 sequence), they return a
`NotCompleted` instance. Instances of that class evaluate to `False`.
- Added optional argument `lower` to `app.model()`. This provides a global
mechanism for setting the lower bound on rate and length parameters.
- `load_unaligned_seqs()` now handles glob patterns. If the filename is a glob
pattern, assumes a listing of files containing a single sequence. The `load_seq()`
function is applied to each file and a single `SequenceCollection` is returned. To
see progress, set `show_progress=True`.
- `Table.joined(col_prefix)` argument allows users to specify the prefix of
columns corresponding to the second table, i.e.
`result = table.inner_join(table2, col_prefix="")`
ensures `result.header` is the sum of `table.header` and `table2.header`
(minus the index column).
- Added `trim_stop` argument to `get_translation()` methods. This means
translating DNA to protein can be done with one method call, instead of
two.
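
As a rough illustration of the k-mer approach described for `SequenceCollection.distance_matrix()` above, the sketch below computes a Jaccard distance between the k-mer sets of two sequences and shows the JC69 correction applied to a proportion distance. It is not the cogent3 implementation: all function names are hypothetical, and the fitted 10th-degree polynomial that maps a Jaccard distance to a proportion distance is not reproduced, so `jc69()` takes a proportion supplied directly by the caller.

```python
import math


def kmer_set(seq, k=10):
    """Return the set of all overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        # two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def jc69(p):
    """JC69 distance from a proportion of differing sites p.

    The correction is only defined for p < 0.75.
    """
    if p >= 0.75:
        raise ValueError("JC69 is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

In this sketch the polynomial step would sit between `jaccard_distance()` and `jc69()`; its coefficients come from the regression mentioned above and are left out here.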
DOC
- Thanks to Katherine Caley for awesome new docs describing the
new feature and annotation DB capabilities!
Deprecations
- Removed the original Composable app classes and related decorators for
user based apps. `user_function` and `appify` are replaced by the
`define_app` decorator.
- The function `TreeAlign` is to be deprecated in 2023.9 and replaced with `tree_align`.
- Every method that has "annotation" in it is now deprecated with a replacement
indicated by their deprecation warnings. Typically, there's a new method with the
name "feature" in it.
- `<collection>.has_terminal_stops()` is being deprecated for
`<collection>.has_terminal_stop()`, because it returns True if a single
sequence has a terminal stop.
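
The rename pattern behind these deprecations can be sketched as follows: the old name emits a `DeprecationWarning` and forwards to its replacement. This is a minimal sketch, not cogent3 code. `ToyCollection` is a hypothetical stand-in, and its stop codon check is deliberately simplified (standard genetic code, last three characters only).

```python
import warnings


class ToyCollection:
    """Hypothetical stand-in for a sequence collection (not a cogent3 class)."""

    STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

    def __init__(self, seqs):
        self.seqs = seqs  # mapping of name -> DNA string

    def has_terminal_stop(self):
        """True if at least one sequence ends with a stop codon."""
        return any(s[-3:].upper() in self.STOPS for s in self.seqs.values())

    def has_terminal_stops(self):
        """Old name: warn, then forward to the replacement method."""
        warnings.warn(
            "has_terminal_stops() is deprecated, use has_terminal_stop()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.has_terminal_stop()
```

To find remaining calls to deprecated names in your own code, one option is to run your tests with `warnings.simplefilter("error", DeprecationWarning)` so that each such call raises.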
Discontinued
- Removed methods on `TreeNode` that are recursive variants of existing
methods. `TreeNode.copy_recursive()`, `TreeNode.traverse_recursive()` and
`TreeNode.get_newick_recursive()` all have standard implementations that can
be used instead.
- `PhyloNode` inherits from `TreeNode` and is distinguished from it only by
having a length attribute on nodes. All methods that rely on length
have now been moved to `PhyloNode`. These methods are `PhyloNode.get_distances()`,
`PhyloNode.set_max_tip_tip_distance()`, `PhyloNode.get_max_tip_tip_distance()`,
`PhyloNode.max_tip_tip_distance()`, `PhyloNode.compare_by_tip_distances()`.
One exception is `TreeNode.get_newick()`. When `with_distance=True`, this
method grabs the "length" attribute from nodes.
- All methods that do not depend on the length attribute are moved to `TreeNode`.
These methods are `TreeNode.balanced()`, `TreeNode.same_topology()`,
`TreeNode.unrooted_deepcopy()`, `TreeNode.unrooted()`, `TreeNode.rooted_at()`,
`TreeNode.rooted_with_tip()`.
- The `SequenceCollection.annotate_from_gff()` method now accepts file
paths only.
- Renaming a sequence in a sequence collection is not applied
to annotations. Users need to modify names prior to binding
annotations.
- Dropping support for attaching / detaching individual annotation
instances from an alignment.
- Backwards incompatible change! `Sequence` and `Alignment` no longer inherit from
`_Annotatable`, so the methods and attributes from that mixin class are no longer
available. (As there was no migration strategy, please let us know if it broke
your code and you need help updating it.)
Major differences include: the `.annotations` attribute is gone; individual
annotations can no longer be copied; annotations are not updated on sequence
operations (you need to re-query).
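
To illustrate the query-on-demand model that replaces stored annotation objects, here is a minimal sketch using the standard library `sqlite3` module. The table layout and the function names are assumptions made for this example. They are not the schema or the API that cogent3 uses.

```python
import sqlite3


def make_feature_db(records):
    """Create an in-memory SQLite table of (seqid, biotype, name, start, stop)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature "
        "(seqid TEXT, biotype TEXT, name TEXT, start INTEGER, stop INTEGER)"
    )
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?)", records)
    return conn


def query_features(conn, seqid, biotype=None):
    """Return (name, start, stop) rows for one sequence, ordered by start.

    Rows are produced at query time, so the query has to be repeated
    whenever up-to-date results are needed.
    """
    sql = "SELECT name, start, stop FROM feature WHERE seqid = ?"
    args = [seqid]
    if biotype is not None:
        sql += " AND biotype = ?"
        args.append(biotype)
    return conn.execute(sql + " ORDER BY start", args).fetchall()


# hypothetical records, as might be parsed from a GFF or GenBank file
db = make_feature_db(
    [
        ("seq1", "gene", "geneA", 0, 300),
        ("seq1", "CDS", "cdsA", 30, 270),
        ("seq2", "gene", "geneB", 10, 90),
    ]
)
```

Because every sequence in a collection would point at the same connection in this sketch, a record added once is visible to queries made for any of them.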
<a id='changelog-2023.2.12a1'></a>