This version introduces the `--weighted-mean` option to the `tetra-freq`, `gc-content` and `coverage` modules. When this argument is used, the mean values are weighted by the contig length. The following example illustrates the value of this option.
The histograms below show the values of the GC content, tetranucleotide profile and coverage of the contigs of a MAG generated from real-world metagenomic data:
![Histograms of GC content, tetranucleotide profile and coverage of the MAG contigs](https://user-images.githubusercontent.com/22940964/75643086-ac59c900-5bf2-11ea-8bbb-9e378c9e9a4b.png)
From these data, it seems reasonable to assume that this MAG contains contigs from two different species with distinct genomic features. One species (blue) is highly abundant and is represented by a few long contigs (28) with a total length of 4771.18 Kbp. The other species (gray) has low abundance and is represented by many short contigs (73) with a total length of 201.62 Kbp.
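The imbalance between the two groups can be checked with a few lines of arithmetic. The sketch below uses only the contig counts and total lengths quoted above; the variable names are illustrative and are not part of the tool.

```python
# Contig counts and total lengths (Kbp) quoted in the text above.
abundant = {"contigs": 28, "length_kbp": 4771.18}
rare = {"contigs": 73, "length_kbp": 201.62}

total_contigs = abundant["contigs"] + rare["contigs"]
total_length = abundant["length_kbp"] + rare["length_kbp"]

# Share of the bin held by the low-abundance species, by count and by length.
rare_share_by_count = rare["contigs"] / total_contigs
rare_share_by_length = rare["length_kbp"] / total_length

print(f"{rare_share_by_count:.1%} of contigs, {rare_share_by_length:.1%} of length")
# prints: 72.3% of contigs, 4.1% of length
```

The low-abundance species therefore dominates any statistic that counts each contig once, even though it accounts for only a small fraction of the sequence.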
Without the `--weighted-mean` option, the mean GC content, TNF and coverage values are closer to those of the low-abundance species, because it contributes the greater number of contigs. As a result, the contigs belonging to the more abundant species are flagged as contaminants:
• Reading genome bin
genome length: 101 contigs, 4972.8 Kbp
• Reading flagged contigs
phylo-markers: no output file found
clade-markers: no output file found
conspecific: no output file found
tetra-freq: 60 contigs, 4871.38 Kbp
gc-content: 6 contigs, 844.38 Kbp
coverage: no output file found
known-contam: no output file found
• Removing flagged contigs
removed: 60 contigs, 4871.38 Kbp
remains: 41 contigs, 101.42 Kbp
cleaned bin: out.fa
When the `--weighted-mean` option is turned on, the mean values are closer to those of the more abundant species, and the contigs belonging to the low-abundance species are removed instead:
• Reading genome bin
genome length: 101 contigs, 4972.8 Kbp
• Reading flagged contigs
phylo-markers: no output file found
clade-markers: no output file found
conspecific: no output file found
tetra-freq: 73 contigs, 201.62 Kbp
gc-content: 66 contigs, 180.68 Kbp
coverage: 73 contigs, 201.62 Kbp
known-contam: no output file found
• Removing flagged contigs
removed: 73 contigs, 201.62 Kbp
remains: 28 contigs, 4771.18 Kbp
cleaned bin: out.fa
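The effect of weighting can be illustrated with a minimal sketch. This is not the tool's actual implementation: the `mean` helper is hypothetical, the GC fractions (0.60 and 0.40) are invented for illustration, and each contig is assumed to have its group's average length. Only the contig counts and total lengths come from the example above.

```python
def mean(values, weights=None):
    """Arithmetic mean, optionally weighted (here: by contig length)."""
    if weights is None:
        return sum(values) / len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# ASSUMPTION: GC fractions are made up; real contigs vary within each group.
gc = [0.60] * 28 + [0.40] * 73
# ASSUMPTION: every contig has its group's average length (Kbp).
lengths = [4771.18 / 28] * 28 + [201.62 / 73] * 73

unweighted = mean(gc)          # each contig counts once
weighted = mean(gc, lengths)   # each contig counts in proportion to its length

print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
# prints: unweighted: 0.455, weighted: 0.592
```

Under these assumptions the unweighted mean sits near the 73 short contigs, while the length-weighted mean sits near the 28 long contigs that make up most of the bin, which matches the behaviour seen in the two runs above.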