- Determine alignment sortedness and index state from the header and by checking that the index file exists.
This makes it possible to recover alignments when the index file was deleted. In such cases, the alignment
can be sorted again, which is preferable to losing the alignment data.
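The check described above can be sketched as follows. This is a minimal illustration, not Goby's code:
the class name, the method name, and the ".index" extension appended to the basename are assumptions,
and the two header flags are assumed to have been read from the header elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper; the names here are illustrative and not part of Goby. */
final class AlignmentIndexCheck {
    private AlignmentIndexCheck() {
    }

    /**
     * Returns true only when the header declares the alignment sorted and indexed
     * AND the index file is actually present on disk.
     *
     * @param headerSorted  sortedness flag taken from the header (assumed input)
     * @param headerIndexed index flag taken from the header (assumed input)
     * @param basename      alignment basename; the ".index" suffix is an assumption
     */
    static boolean canUseIndex(boolean headerSorted, boolean headerIndexed, String basename) {
        boolean indexFileExists = Files.isRegularFile(Paths.get(basename + ".index"));
        // A deleted index file overrides what the header claims.
        return headerSorted && headerIndexed && indexFileExists;
    }
}
```

When canUseIndex returns false for an alignment whose header claims it is indexed, the caller can fall back
to reading the entries sequentially and sort the alignment again.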
- New mode simulate-reads generates reads artificially against a reference sequence. We use this mode
to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces
the expected results.
- Show phred scores in DisplaySequenceVariants (tab + base)
- Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale
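To illustrate what transferring scores without changing the quality scale means, here is a small sketch.
The enum below is a hypothetical stand-in, not Goby's QualityEncoding. It assumes the common ASCII offsets
(33 for Sanger, 64 for Illumina 1.3+) and models PHRED as an identity mapping.

```java
/** Hypothetical stand-in for a quality encoding; not Goby's QualityEncoding class. */
enum QualityScaleSketch {
    SANGER(33),    // Phred+33 ASCII encoding
    ILLUMINA(64),  // Phred+64 ASCII encoding (Illumina 1.3+)
    PHRED(0);      // value is already a Phred score: pass it through unchanged

    private final int asciiOffset;

    QualityScaleSketch(int asciiOffset) {
        this.asciiOffset = asciiOffset;
    }

    /** Converts a stored quality value to a Phred score by removing the ASCII offset. */
    byte toPhred(byte stored) {
        return (byte) (stored - asciiOffset);
    }
}
```

With this sketch, QualityScaleSketch.SANGER.toPhred((byte) 'I') and QualityScaleSketch.PHRED.toPhred((byte) 40)
yield the same Phred score.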
- Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better,
and handles quality score conversions more flexibly. The old mode is still available as
sam-to-compact-old for comparison. The new mode has slightly different command line parameters.
- Added a discover-sequence-variants mode format 'methylation' to estimate methylation rates for RRBS and
Methyl-Seq alignments.
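For readers unfamiliar with the estimate: in bisulfite-converted reads an unmethylated cytosine is read as T,
while a methylated cytosine is still read as C. A per-site rate can therefore be estimated as the fraction of
reads showing C. The sketch below shows this arithmetic only. The class and method names are hypothetical, and
it is not a description of how Goby computes or reports the rate.

```java
/** Hypothetical helper showing the basic per-site methylation rate arithmetic. */
final class MethylationRateSketch {
    private MethylationRateSketch() {
    }

    /**
     * @param methylatedCount   reads showing C at the site (cytosine protected from conversion)
     * @param unmethylatedCount reads showing T at the site (cytosine converted)
     * @return fraction of methylated observations, or NaN when the site has no coverage
     */
    static double rate(int methylatedCount, int unmethylatedCount) {
        if (methylatedCount < 0 || unmethylatedCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        int total = methylatedCount + unmethylatedCount;
        if (total == 0) {
            return Double.NaN; // no observations: the rate is undefined
        }
        return (double) methylatedCount / total;
    }
}
```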
- Dramatically improved TMH loading times for large alignments.
- Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates
the code unnecessarily and is error-prone (because we had two ways to store read length in the previous
versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the
header to the alignment entries transparently. Use this mode from a pre-1.9.4 release if you need to
migrate a 1.6- alignment to work with Goby 1.9.5+.
- Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH
query index was smaller than the first query index in the alignment.
- Changed discover-sequence-variants mode to filter out alignment entries whose read mapped to multiple locations
in the reference (as determined by the aligner argument, e.g., -n for gsnap).
- Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl.
- ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory.
The factory makes it possible to plug in alignment readers that filter entries as they are read. The default
factory produces readers that return all entries. However, if the NonAmbiguousAlignmentReader factory is
installed, the concat reader returns only entries for which the read did not match other locations in the
genome. Other filtering behaviour can be implemented in a subclass of AlignmentReader (see
NonAmbiguousAlignmentReader for an example) and a factory created to return instances of this class.
This mechanism is used to filter out entries whose reads match several locations on the reference sequence.
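The factory mechanism can be pictured with the self-contained sketch below. Every type in it (SketchEntry,
SketchReader, SketchReaderFactory, and so on) is a hypothetical stand-in written for this note. None of these
signatures are Goby's, and the real AlignmentReader interface is considerably larger.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical alignment entry: a query index plus an ambiguity flag. */
final class SketchEntry {
    final int queryIndex;
    final boolean ambiguous; // true when the read matched more than one location

    SketchEntry(int queryIndex, boolean ambiguous) {
        this.queryIndex = queryIndex;
        this.ambiguous = ambiguous;
    }
}

/** Hypothetical reader abstraction: simply an iterator over entries. */
interface SketchReader extends Iterator<SketchEntry> {
}

/** Hypothetical factory: decides which reader implementation wraps each source. */
interface SketchReaderFactory {
    SketchReader create(List<SketchEntry> source);
}

/** Default behaviour: returns every entry. */
final class AllEntriesReader implements SketchReader {
    private final Iterator<SketchEntry> delegate;

    AllEntriesReader(List<SketchEntry> source) {
        delegate = source.iterator();
    }

    public boolean hasNext() {
        return delegate.hasNext();
    }

    public SketchEntry next() {
        return delegate.next();
    }
}

/** Filtering behaviour: skips entries whose read matched several locations. */
final class NonAmbiguousSketchReader implements SketchReader {
    private final Iterator<SketchEntry> delegate;
    private SketchEntry lookahead;

    NonAmbiguousSketchReader(List<SketchEntry> source) {
        delegate = source.iterator();
    }

    public boolean hasNext() {
        // Advance until an unambiguous entry is buffered or the source is exhausted.
        while (lookahead == null && delegate.hasNext()) {
            SketchEntry candidate = delegate.next();
            if (!candidate.ambiguous) {
                lookahead = candidate;
            }
        }
        return lookahead != null;
    }

    public SketchEntry next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        SketchEntry result = lookahead;
        lookahead = null;
        return result;
    }
}

/** Concatenates several sources, reading each one through the installed factory. */
final class ConcatSketchReader {
    private ConcatSketchReader() {
    }

    static List<Integer> readAll(SketchReaderFactory factory, List<List<SketchEntry>> sources) {
        List<Integer> queryIndices = new ArrayList<>();
        for (List<SketchEntry> source : sources) {
            SketchReader reader = factory.create(source);
            while (reader.hasNext()) {
                queryIndices.add(reader.next().queryIndex);
            }
        }
        return queryIndices;
    }
}
```

Calling ConcatSketchReader.readAll(AllEntriesReader::new, sources) returns every entry, while passing
NonAmbiguousSketchReader::new drops the ambiguous ones. The concatenating code does not change, which is the
point of making the factory configurable.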
- Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands
for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101.
The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation can also be
used to parse plain TSV files, or VCF files that do not include the fixed VCF columns. It therefore supports
an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information
about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group
attribute on column fields. This makes it possible to indicate that fields are part of the same group.
Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple
column fields as a group (for instance to hide or show an entire group of fields).
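As an illustration of the Group attribute, the sketch below parses the attributes of one structured
meta-information line into a map. It does not use or reproduce the VCFParser API. The class name is
hypothetical, and the exact spelling of the Group attribute in the example line (Group=DEPTH appended after
Description) is an assumption about how such a line might look.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for one "##KEY=<a=b,c=d,...>" meta-information line. */
final class VcfMetaLineSketch {
    private VcfMetaLineSketch() {
    }

    static Map<String, String> parseAttributes(String line) {
        int open = line.indexOf('<');
        int close = line.lastIndexOf('>');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("not a structured meta-information line: " + line);
        }
        String body = line.substring(open + 1, close);
        Map<String, String> attributes = new LinkedHashMap<>();
        StringBuilder token = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i <= body.length(); i++) {
            boolean atEnd = i == body.length();
            char c = atEnd ? ',' : body.charAt(i);
            if (!atEnd && c == '"') {
                inQuotes = !inQuotes; // drop the quote itself, remember the state
                continue;
            }
            if (c == ',' && (atEnd || !inQuotes)) {
                // A comma outside quotes (or the end of the body) closes one key=value pair.
                putAttribute(attributes, token.toString());
                token.setLength(0);
            } else {
                token.append(c);
            }
        }
        return attributes;
    }

    private static void putAttribute(Map<String, String> attributes, String token) {
        int equals = token.indexOf('=');
        if (equals > 0) {
            attributes.put(token.substring(0, equals).trim(), token.substring(equals + 1).trim());
        }
    }
}
```

A user interface could then collect the columns that share the same "Group" value and show or hide them
together.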
- FDR mode now supports VCF input and output files. See the option --vcf to activate processing of VCF-formatted
files.
- Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants
when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced.
- Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. The issue primarily
dealt with insertions, deletions, and left and/or right padding.
- Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct
read_index and ref_position.