This is the second major release of Mikado. It contains **backwards-incompatible changes** to the data structures used in the program;
as such, all users are **warmly invited to update the program as soon as possible**. Due to these changes, old runs might need to be redone
(e.g. for Mikado serialise).
This release has focused on making Mikado capable of integrating not just transcript assemblies, but rather a mixture of transcript assemblies
and _ab initio_ gene annotations. We also made it possible to flag certain sets of transcripts for Mikado as being of *reference quality*, and improved the ability
to pass external metrics (e.g. expression values) to Mikado. In practice, these changes **make Mikado a robust program to integrate gene annotations from multiple sources
into a coherent, final gene annotation**. Mikado had already been used in this capacity
[for the annotation of _T. aestivum_](https://science.sciencemag.org/content/361/6403/eaar7191); the changes in this version build upon that early work.
Following these changes, we plan to use Mikado in this capacity in a fully automated gene annotation pipeline. Please also note that, due to our work
on this new product, **we are planning to retire Daijin in the near future and its development is now discontinued**.
Aside from numerous bug fixes, this release brings the following highlights:
- Mikado now uses [TOML](https://github.com/toml-lang/toml) as its default configuration language.
- Many parts of `mikado`, especially in `serialise` and `pick`, have been rewritten to be much more performant. Specifically:
  - `mikado pick` underwent a strict code revision to remove quadratic (or worse) bottlenecks. This allows `mikado pick` to run on much denser, larger inputs without prohibitive computational resources or run times.
  - `mikado serialise` is now fully parallelised for both ORF and BLAST loading ([280](https://github.com/EI-CoreBioinformatics/mikado/issues/280)).
  - `mikado serialise` can now load data from custom-field tabular BLAST data, rather than only from XML files ([280](https://github.com/EI-CoreBioinformatics/mikado/issues/280)).
  - Both steps now use temporary SQLite3 databases for fast inter-process data exchange.
- Mikado will now function correctly with soft-masked genomes.
- Mikado pick now will **backtrack** during the picking stage. This prevents loci from being missed due to chaining in the early stages of the algorithm.
- Mikado is now capable of padding transcripts in a locus so that they share the same 5' and 3' ends, if appropriate.
This leads to more coherent gene models, and can help recover gene models that are present only in fragmentary form,
by piggybacking on other, more complete models. This padding behaviour is now the **default** in Mikado.
- The Mikado database (for Mikado serialise) and the GF index (used by Mikado compare) have been overhauled and are **not** backwards compatible.
- Mikado compare is now fully multi-processed.
- Mikado compare **can now consider fuzzy matching for the introns**.
This helps e.g. in evaluating the results from noisy long reads, such as those from Nanopore. Briefly, when activated,
Mikado compare will consider an intron to match a reference intron if the two are within the specified number of bases of each other. A similar fuzzy logic applies to intron chains.
- Mikado can now load arbitrary numerical or boolean external metrics for all transcripts. They are not limited any longer to floats between 0 and 1.
- In coding loci, alternative transcript events are now required to share the same coding frame.
- Mikado now provides only two scoring files ("*plant.yaml*" and "*mammalian.yaml*").
"*Plant.yaml*" should function also for insect or fungal species, but we have not tested it extensively.
Old scoring files can be found under "HISTORIC".
- Mikado now accepts a **random seed**, specified as a 32-bit integer. This makes it possible to produce fully reproducible runs.
- Mikado will now exit without hanging in case of a crash during a multi-processed run.
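
As an illustration of the external-metrics point above: values that already lie between 0 and 1 can be used as they are, while any other numerical or boolean value has to be rescaled before it can contribute to a score. The following Python sketch shows one way this could work. The function name `normalise_metric` and the choice of min-max rescaling are assumptions made for the example, not a description of Mikado's internals.

```python
def normalise_metric(values):
    """Return the values as floats, rescaled to [0, 1] only when needed.

    Hypothetical helper for illustration; it is not part of Mikado's API.
    """
    values = [float(v) for v in values]  # booleans become 0.0 / 1.0
    if not values:
        return values
    # Values already within [0, 1] can be used raw.
    if all(0.0 <= v <= 1.0 for v in values):
        return values
    low, high = min(values), max(values)
    if high == low:
        # A constant, out-of-range metric carries no ranking information.
        return [0.0 for _ in values]
    # Assumed rescaling strategy: plain min-max normalisation.
    return [(v - low) / (high - low) for v in values]


if __name__ == "__main__":
    print(normalise_metric([0.2, 0.9]))     # already usable as-is
    print(normalise_metric([0, 5, 10]))     # e.g. expression values, rescaled
    print(normalise_metric([True, False]))  # boolean metric
```

With a scheme like this, an expression estimate can be supplied alongside metrics that are natively expressed as fractions.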
With this release, we are also officially dropping support for Python 3.4. Python 3.5 will not be automatically tested for, as many Conda dependencies are not up-to-date, complicating the TRAVIS setup.
Contributors to this release:
- Gemy George Kaithakottil (gemygk)
- Christian Schudoma (cschu)
- David Swarbreck (swarbred)
Acknowledgements for contributing by bug reports and suggestions:
- Tom Mathers (tommathers)
- AsclepiusDoc
- Justin S (codeandkey)
- zebrafish-507
- Dr Robert King (rob123king)
- mndavies286
- Ole Tørresen (Thieron)
- Ferdinand Marlétaz (fmarletaz)
- Luohao Xu (lurebgi)
- Sagnik Banerjee (sagnikbanerjee15)
- lijing28101
- Lawrence Percival Alwyn (for the suggestion on random seeds)
Detailed list of bugfixes and improvements:
General
- Many internal algorithms of `mikado pick` have been rewritten to avoid quadratic bottlenecks. This allows Mikado to analyse datasets that are much denser or richer, without the processing time getting out of hand.
- `mikado pick` is now much more efficient in using multiple processors.
- Mikado has now been tested to be compatible with Python 3.7.
- Mikado can now specify a static random seed, ensuring full reproducibility of the runs ([183](https://github.com/EI-CoreBioinformatics/mikado/issues/183))
- Mikado will now correctly terminate all child processes in the event of a crash, and exit without hanging ([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205))
- Mikado now always uses PySam, instead of PyFaidx, to fetch chromosomal regions (e.g. during prepare and pick).
This speeds up and lightens the program, as well as making tests more manageable.
- Made logging more sensible and informative for all three steps of the pipeline (prepare, serialise, pick)
- Mikado now supports the BED12+1 format (i.e. a BED12 format with GFF-like attributes in the 13th field)
- Mikado can now use alternative translation tables among those provided by [NCBI through BioPython](ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt). The default is "0", i.e. the Standard table
but with only the canonical "ATG" being accepted as a valid start codon. ([34](https://github.com/EI-CoreBioinformatics/mikado/issues/34)).
Please note that this is still a **global** value; it is not yet possible to specify a subset of chromosomes functioning with a different table.
- Now Mikado correctly considers the phase (instead of the incorrect frame) for GTFs. This makes it
compatible with EnsEMBL and [GenomeTools](http://genometools.org/) or [GffRead](https://github.com/gpertea/gffread), among others ([#135](https://github.com/EI-CoreBioinformatics/mikado/issues/135))
- Fixed a bug that prevented Mikado from dealing correctly with soft-masked genomes ([139](https://github.com/EI-CoreBioinformatics/mikado/issues/139))
- Increased coverage of the unit tests to approximately 83% ([137](https://github.com/EI-CoreBioinformatics/mikado/issues/137))
- Created proper Docker and Singularity recipes for Mikado ([149](https://github.com/EI-CoreBioinformatics/mikado/issues/149), [#164](https://github.com/EI-CoreBioinformatics/mikado/issues/164))
- Fixed an incorrect algorithm for merging overlapping intervals ([150](https://github.com/EI-CoreBioinformatics/mikado/issues/150))
- Improved Mikado performance by removing the default overloading of `__getattribute__` in the *Transcript* class ([153](https://github.com/EI-CoreBioinformatics/mikado/issues/153), [#154](https://github.com/EI-CoreBioinformatics/mikado/issues/154))
- The configuration file has been overhauled for simplicity's sake ([158](https://github.com/EI-CoreBioinformatics/mikado/issues/158))
- Dropped the by-now obsolete "nosetests" suite for testing, moving to the modern and maintained "pytest".
- Now Mikado will be forced to run in single-threaded mode if the user is asking for debugging-level logs.
This is to prevent a [re-entrancy race condition that causes deadlocks](https://codewithoutrules.com/2017/08/16/concurrency-python/).
- During configure and prepare, Mikado can now flag some transcripts as coming from a "reference".
Transcripts flagged like this **will never be modified nor dropped during a mikado prepare run**, unless generic or
critical errors are registered. Moreover, if source scores are provided, Mikado will preferentially keep one identical
transcript from those that have the highest *a priori* score. This makes it possible to e.g. prioritise PacBio or reference
assemblies during prepare ([141](https://github.com/EI-CoreBioinformatics/mikado/issues/141)).
  - Please note that this change **does not affect the final picking**, but rather is just a mechanism for allowing Mikado to accept pass-through data.
  - If you wish to prioritise reference transcripts, please directly assign a source score higher than 0 to these sets.
  - Alternatively, use the `--only-update-reference` flag to have Mikado only try to add ASEs to known loci (see under *Mikado pick*)
- Mikado runs should now be fully reproducible, by specifying a seed. One will be generated automatically by Mikado
when launching the configuration, so that repeated runs using the same configuration file will be deterministically identical.
- [136](https://github.com/EI-CoreBioinformatics/mikado/issues/136): documentation has been updated to reflect the changes in the latest releases.
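
One of the fixes listed above concerns the merging of overlapping intervals. As a reminder of what such a routine is expected to do, here is a generic sort-and-sweep sketch in Python. It is not Mikado's own code; the function name `merge_intervals` is hypothetical, and intervals are assumed to be closed with integer coordinates.

```python
def merge_intervals(intervals):
    """Merge overlapping closed intervals given as (start, end) tuples.

    Generic illustration only; not taken from the Mikado code base.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlap with the last merged interval: extend it. Taking the
            # maximum matters when the new interval is nested in the old one.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_intervals([(15, 30), (10, 20), (40, 50)]))
    print(merge_intervals([(1, 100), (5, 10), (50, 60)]))
```

A common mistake in this kind of routine is to overwrite the end of the current interval without taking the maximum, which silently truncates the result when one interval is nested inside another.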
Mikado prepare
- Mikado will now always strip the CDS when a transcript is reversed ([126](https://github.com/EI-CoreBioinformatics/mikado/issues/126)).
- Mikado prepare now will *not* consider redundant transcripts that have the same cDNA but *different* CDS
([127](https://github.com/EI-CoreBioinformatics/mikado/issues/127)).
- When checking for redundancy, Mikado prepare will now consider whether a transcript is *contained* within another and *shares its intron chain in its entirety*.
This makes it possible to drastically reduce the number of inputs to the other steps ([270](https://github.com/EI-CoreBioinformatics/mikado/issues/270)).
- Mikado prepare now makes it possible to decide *per-source* whether redundant transcripts should be kept or discarded ([270](https://github.com/EI-CoreBioinformatics/mikado/issues/270)).
- Mikado prepare will now ascertain whether a CDS has a valid start and/or stop codon ([132](https://github.com/EI-CoreBioinformatics/mikado/issues/132)) and will retain the original phase values ([#133](https://github.com/EI-CoreBioinformatics/mikado/issues/133)).
- Mikado prepare will now preferentially keep "reference" transcripts and transcripts with a higher source score, in this order.
Reference transcripts will never be discarded for failing a requirements check ([141](https://github.com/EI-CoreBioinformatics/mikado/issues/141)).
- Fixed a bug that prevented Mikado prepare from correctly handling GTFs without a `transcript` feature line ([196](https://github.com/EI-CoreBioinformatics/mikado/issues/196)).
- Mikado prepare now can accept models that lack any exon features but still have valid CDS/UTR features - this is necessary for some protein prediction tools.
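
The containment-based redundancy check mentioned in this list can be pictured with a small Python sketch. It reflects one reading of the rule (the contained transcript lies within the span of the other and both have an identical intron chain) and is not Mikado's implementation. The function names and the representation of transcripts as lists of closed `(start, end)` exons are assumptions made for the example.

```python
def introns(exons):
    """Derive the intron chain from a list of closed (start, end) exons."""
    exons = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(exons, exons[1:])]


def is_contained_redundant(candidate, container):
    """True if `candidate` lies within `container` and shares its intron chain.

    Hypothetical helper. Strand and chromosome checks are omitted for brevity,
    and monoexonic models (empty intron chains) are compared on span alone.
    """
    cand, cont = sorted(candidate), sorted(container)
    within = cont[0][0] <= cand[0][0] and cand[-1][1] <= cont[-1][1]
    return within and introns(cand) == introns(cont)


if __name__ == "__main__":
    full = [(100, 200), (300, 400), (500, 650)]
    shorter = [(150, 200), (300, 400), (500, 600)]     # same introns, shorter ends
    different = [(150, 200), (300, 420), (500, 600)]   # second intron differs
    print(is_contained_redundant(shorter, full))
    print(is_contained_redundant(different, full))
```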
Mikado serialise
- Use of temporary SQLite databases for inter-process communication in Mikado serialise, with consequent speedup ([97](https://github.com/EI-CoreBioinformatics/mikado/issues/97))
- Fixed bugs related to Prodigal ORFs on the negative strand ([181](https://github.com/EI-CoreBioinformatics/mikado/issues/181))
- For each BLAST HSP, Mikado will now also store whether it contains an in-frame stop codon.
- Mikado serialise is now much faster when serialising the ORFs or BLAST data.
This is due to better multiprocessing and to having moved to Cython the most expensive steps ([280](https://github.com/EI-CoreBioinformatics/mikado/issues/280))
- Mikado serialise is now able to use *tabular* BLAST data as input, not just XML.
The tabular output should contain the standard columns plus, *at the end*, the following two:
  - ppos
  - btop
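
As a sketch of the tabular layout described above, the Python snippet below builds a BLAST+ `-outfmt` specification and parses one row of the resulting table. It assumes that "the standard columns" means the twelve default fields of BLAST+ tabular output; the names `OUTFMT` and `parse_hit` are hypothetical, and the example row is made up purely to show the column order.

```python
# The twelve default columns of BLAST+ tabular output (assumed to be the
# "standard columns"), followed by the two extra ones at the end.
STANDARD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
EXTRA_COLUMNS = ["ppos", "btop"]
ALL_COLUMNS = STANDARD_COLUMNS + EXTRA_COLUMNS

# Value to pass to the BLAST+ "-outfmt" option (quote it on the command line).
OUTFMT = "6 " + " ".join(ALL_COLUMNS)


def parse_hit(line):
    """Split one tab-separated row into a column-name -> string mapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ALL_COLUMNS):
        raise ValueError(f"Expected {len(ALL_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(ALL_COLUMNS, fields))


if __name__ == "__main__":
    print(OUTFMT)
    # Invented row, for illustration only.
    row = "tx1\tprot1\t98.0\t50\t1\t0\t1\t150\t1\t50\t1e-20\t95.1\t99.0\t20AS29\n"
    print(parse_hit(row)["btop"])
```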
Mikado pick
- For the external scores, Mikado can now accept any type of numerical or boolean value. Mikado will determine at
serialisation time whether a particular score can be used raw (i.e. its values all lie between 0 and 1)
or whether it has to be forcibly scaled.
  - This allows Mikado to use e.g. transcript expression as a valid metric.
- Mikado is now capable of correctly padding the transcripts so as to make their ends uniform within a single locus. This will
also have the effect of trying to enlarge the ORF of a transcript if it is truncated to begin with. Please note that
padded transcripts will add terminal *exons* rather than just extending their terminal ends. This should prevent the
creation of faux retained introns. Moreover, now the padding procedure will explicitly find and discard transcripts
that would become invalid after padding (e.g. because they end up with a far too long UTR, or retained introns).
If some of the invalid transcripts had been used as template for the expansion, Mikado will remove the offending
transcripts and restart the procedure ([129](https://github.com/EI-CoreBioinformatics/mikado/issues/129),
[142](https://github.com/EI-CoreBioinformatics/mikado/issues/142)). Moreover:
  - Mikado will remove fully redundant transcripts (i.e. 100% identical ones) after padding ([208](https://github.com/EI-CoreBioinformatics/mikado/issues/208))
  - As a consequence of this change, Transcript objects have been modified to expose the following methods related to the internal interval tree:
    - find/search (to find intersecting exonic or intronic intervals)
    - find_upstream (to find all intervals upstream of the requested one in the transcript)
    - find_downstream (to find all intervals downstream of the requested one in the transcript)
  - Moreover, Transcript objects no longer have the unused "cds_introntree" property. Combined CDS and CDS introns are now present in the "cds_tree" object.
  - Again as a consequence, Locus objects now have a new private method, `_swap_transcript`, that allows two Transcript
    objects with the same ID to be exchanged within the locus. This is essential to allow the Locus to recalculate most
    scores and metrics (e.g. all the exons or introns in the locus).
- Fixed a bug which caused some loci to crash at the last part of the picking stage.
- After picking, loci will be either coding or non-coding - no admixture.
- Solved a bug which led Mikado to recalculate the phases for each model during picking, potentially creating mistakes
for models truncated at the 5' end ([138](https://github.com/EI-CoreBioinformatics/mikado/issues/138)).
- Transcript padding has been overhauled, and bugs related to it have been fixed ([124](https://github.com/EI-CoreBioinformatics/mikado/issues/124),
[142](https://github.com/EI-CoreBioinformatics/mikado/issues/142)).
- During scoring, it is now possible to specify conditions **related to a different metric** as a filtering option; moreover,
for the purposes of scoring, Mikado will now ignore transcripts that have not passed the minimum filter.
See [130](https://github.com/EI-CoreBioinformatics/mikado/issues/130) and the documentation for details.
- Mikado pick now will backtrack if it realises that some loci have been lost due to chaining.
Previously, Mikado could have missed loci if they were lost between the sublocus and monosublocus stages.
Now Mikado implements a basic backtracking recursive algorithm that should ensure no locus is missed.
This check happens during the last stage of the picking. ([131](https://github.com/EI-CoreBioinformatics/mikado/issues/131))
- Now all coding transcripts of a Mikado pick locus will share the same frame. Moreover,
**Mikado will now calculate the CDS overlap percentage based on the primary transcript CDS length**, not the minimum
CDS length between primary and candidate. Please note that the change **regarding the frame** also affects the monosublocus stage.
Mikado still considers only the primary ORFs for the overlap. ([134](https://github.com/EI-CoreBioinformatics/mikado/issues/134))
- Fixed a bug that made Mikado pick forget the original phases of transcripts when not loading them from a database ([138](https://github.com/EI-CoreBioinformatics/mikado/issues/138)).
- Mikado pick will never discard a reference transcript for failing the requirements check. Moreover,
**it is now possible to instruct Mikado to only update a reference** rather than trying to come up with an annotation on its own.
When so instructed, Mikado pick will ignore any locus without a reference transcript, consider those as pass-through, and try to add
new transcripts that are compatible with the known loci ([148](https://github.com/EI-CoreBioinformatics/mikado/issues/148)).
- Mikado now contains only two scoring files, *plant.yaml* and *mammalian.yaml* ([155](https://github.com/EI-CoreBioinformatics/mikado/issues/155)).
- Mikado pick now uses the [WAL](https://www.sqlite.org/wal.html) method for faster dispatching of data and to avoid crashes
([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205)).
- Corrected a long-standing bug that made Mikado lose track of some fragments during the fragment removal phase.
Somewhat confusingly, Mikado printed those loci into the output, but reported in the log file that there was a
"missing locus". Mikado is now able to correctly keep track of them and remove them.
- Corrected issues that caused a crash due to the data exchange databases being locked ([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205))
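
The change in how the CDS overlap percentage is calculated (see the item on coding frames in this list) is easiest to see with numbers. The Python sketch below compares the two denominators: the primary transcript CDS length, and the minimum CDS length between primary and candidate. It is an illustration under assumptions, not Mikado's code; the function names and the representation of a CDS as a list of closed, non-overlapping `(start, end)` segments are hypothetical.

```python
def cds_length(segments):
    """Total length of a CDS given as closed (start, end) segments."""
    return sum(end - start + 1 for start, end in segments)


def overlap_bases(first, second):
    """Number of bases shared by two CDSs (segments within each must not overlap)."""
    return sum(
        max(0, min(end1, end2) - max(start1, start2) + 1)
        for start1, end1 in first
        for start2, end2 in second
    )


def cds_overlap(primary, candidate, denominator="primary"):
    """CDS overlap percentage, using either the primary or the minimum length."""
    shared = overlap_bases(primary, candidate)
    if denominator == "primary":
        base = cds_length(primary)
    else:
        base = min(cds_length(primary), cds_length(candidate))
    return 100.0 * shared / base


if __name__ == "__main__":
    primary = [(100, 399)]    # 300 coding bases
    candidate = [(250, 349)]  # 100 coding bases, fully inside the primary CDS
    print(cds_overlap(primary, candidate, "primary"))
    print(cds_overlap(primary, candidate, "minimum"))
```

In this example a short candidate that is fully embedded in the primary CDS scores 100% with the minimum-length denominator, but only about a third with the primary-length denominator, which is the stricter behaviour described above.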
Mikado compare
- Mikado compare now reports statistics related to **non-redundant introns and intron chains**. This provides a better picture of the prediction in some instances, e.g. when analysing IsoSeq/ONT runs.
- Mikado compare can now optionally consider "fuzzy matches" for the introns. This means that two transcripts might be considered a "match" even if their introns
are slightly staggered. This helps e.g. when assessing imperfect data such as Nanopore reads, where the experimenter usually knows that the per-base precision is quite low.
- Switched to the lighter [msgpack](https://github.com/msgpack/msgpack-python) from ujson, with increase in performance, for the Mikado index ([#168](https://github.com/EI-CoreBioinformatics/mikado/issues/168))
- Mikado compare has been greatly improved ([166](https://github.com/EI-CoreBioinformatics/mikado/issues/166)), with the addition of:
  - proper multiprocessing
  - faster startup times
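
The "fuzzy match" idea for introns can be sketched in a few lines of Python. This is a minimal illustration, assuming that the tolerance is applied independently to each splice site and that intron chains are compared position by position; the function names and the `fuzz` parameter are hypothetical and do not correspond to Mikado's internal API or command line.

```python
def introns_match(predicted, reference, fuzz=0):
    """True if both intron boundaries are within `fuzz` bases of the reference."""
    return (abs(predicted[0] - reference[0]) <= fuzz
            and abs(predicted[1] - reference[1]) <= fuzz)


def chains_match(predicted_chain, reference_chain, fuzz=0):
    """True if two intron chains have the same length and match intron by intron."""
    return (len(predicted_chain) == len(reference_chain)
            and all(introns_match(pred, ref, fuzz)
                    for pred, ref in zip(predicted_chain, reference_chain)))


if __name__ == "__main__":
    reference = [(201, 299), (401, 499)]
    noisy = [(203, 298), (400, 501)]  # splice sites off by one or two bases
    print(chains_match(noisy, reference))          # exact matching
    print(chains_match(noisy, reference, fuzz=2))  # tolerant matching
```

With `fuzz=0` the comparison reduces to exact intron matching, so the tolerant mode is strictly opt-in in this sketch.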
Daijin
- Daijin now supports the `--use-conda` command line switch, to seamlessly download and install the necessary packages.
Other
- The `add_transcript_feature.py` script has been improved. It now automatically splits chimeric transcripts
and corrects mistakes related to the intron size, mostly to deal with Nanopore reads ([123](https://github.com/EI-CoreBioinformatics/mikado/issues/123))
- Fixed some parsing errors for GTFs created by converting from BAM files ([157](https://github.com/EI-CoreBioinformatics/mikado/issues/157))
- Mikado util convert now functions with BAM files ([197](https://github.com/EI-CoreBioinformatics/mikado/issues/197))
- `mikado util grep -v` now also works for GTFs ([203](https://github.com/EI-CoreBioinformatics/mikado/issues/203))
- [209](https://github.com/EI-CoreBioinformatics/mikado/issues/209): now `daijin` supports conda environments. Moreover, we test the assemble part properly to ensure its correct functioning.