- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- Added a mode sam-comparison to compare a source SAM/BAM file with one that was generated after sam-to-compact then
compact-to-sam.
- Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour
of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries
in memory to re-sort these entries by genomic position. This feature is used when importing already sorted SAM/BAM
files to create sorted Goby alignments, if the files contain spliced alignments that would cause mis-ordering during
conversion.
- Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does
better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The
default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int
- Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file.
The concatenate mode automatically reassigns read_origin indices (see field read_origin_index) to prevent conflicts
when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin
information, and let the client decide what origins/groups are equivalent given the type of analysis at hand.
Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with
sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
- Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact
to enable this.
- Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
- Support bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2
dynamic option).
- Renamed SortMode to Sort1Mode. Renamed SortLargeMode to SortMode.
- Added SortLargeMode which can sort compact alignments of any size, multithreaded.
- Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode
by throwing various input BAM files at it, sorted or not, and fixed the bugs we found. For instance, the --sorted option
would not work in some 1.9 versions of Goby after samtools/picard changed the semantics of the record comparator Goby
relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM
files to sorted Goby alignments.
- Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that
will make it easier to diagnose problems on a command line without having to scroll back up the console.
- Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true
when the logging system is not initialized. For SamHelper, this means calling String.format millions of times to
create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode
when logging is not properly configured.
- Refactor dynamic options with a central registry, and make GobyDriver handle option parsing.
This removes the parsing code that was duplicated in each mode that needed dynamic options.
- methylation region can now estimate empirical p-values. Empirical p-values require biological replicates in at least
one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null
distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is
used to estimate the p-value of observing the between-group differences. Such empirical p-values can control FWER
in the strong sense.
- Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant
sites were found in a moving window that ends at the site (significance is judged according to a configurable
threshold on the empirical p-value).
- New empirical-p mode to estimate p-values from data in text files. This makes it easier to derive p-values for
simulated data or counts generated by tools other than Goby.
- Make it possible to open Goby alignments through HTTP. Simply specify a URL as the basename argument to the Goby
tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files
currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
- Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
- vcf methylation format: removed space in name of C and Cm group INFO fields.
- Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
- Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
- Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and
to limit the number of alignment entries to output (-n).
- The RandomAccessSequenceCache had problems with bases that weren't G/A/T/C/N. Such bases would be skipped silently,
causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where an
R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following
the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following
bases. Please note that the problem was in a library used by RandomAccessSequenceCache. We updated the library in
this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
- last-to-compact: add option to substitute some bases with others in the aligned read.
- Add test and fix for a bug where iteration went back to the start of the alignment file, even though IterateAlignments
was created for a slice of the input. The problem only affected the IterateAlignments class because it was calling
reposition(0,0) and the method did not enforce slice limits.
- The code base was simplified by removing the now obsolete align mode.
- Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries
would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her
data on GobyWeb.
- DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The Java BitSet implementation
was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g.,
2,080,948 which can be stored with about 230 MB).
- AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be
stored in this field.
- The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of
the quality scores contained in the file.
- Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the
output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
- version mode now prints an official version number if the jar contains a VERSION.txt file.
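
The BufferedSortingAlignmentWriter entry above describes keeping a number of entries in memory so they can be
re-sorted by genomic position. The sketch below illustrates that general idea with a bounded min-heap. It is not
the Goby implementation: the Entry and BoundedSortingBuffer classes, their fields and their methods are hypothetical
names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for an alignment entry (not the Goby AlignmentEntry class). */
final class Entry {
    final int targetIndex;
    final int position;

    Entry(int targetIndex, int position) {
        this.targetIndex = targetIndex;
        this.position = position;
    }
}

/** Hypothetical sketch: holds up to `capacity` entries and releases them in genomic order. */
final class BoundedSortingBuffer {
    private final int capacity;
    // Min-heap ordered by target index, then by position on the target.
    private final PriorityQueue<Entry> heap = new PriorityQueue<>(
            Comparator.comparingInt((Entry e) -> e.targetIndex)
                      .thenComparingInt(e -> e.position));
    // Stands in for the underlying writer that receives the re-sorted entries.
    private final List<Entry> output = new ArrayList<>();

    BoundedSortingBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Buffers an entry; once the buffer is over capacity, the smallest entry is written out. */
    void append(Entry e) {
        heap.add(e);
        if (heap.size() > capacity) {
            output.add(heap.poll());
        }
    }

    /** Flushes the remaining buffered entries in sorted order and returns everything written. */
    List<Entry> close() {
        while (!heap.isEmpty()) {
            output.add(heap.poll());
        }
        return output;
    }
}
```

With a capacity of 2, appending positions 10, 30, 20, 40, 35 on the same target yields 10, 20, 30, 35, 40. A buffer
of this kind only restores the order when no entry arrives more than `capacity` entries later than its sorted
position, so the capacity has to be chosen with the expected amount of mis-ordering in mind.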
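
The empirical p-value entries above describe two passes: a null distribution is first observed from pairs of samples
in the same group, and a between-group difference is then compared against it. The sketch below shows one way this
could look for a single numeric statistic per sample. It is only an illustration under stated assumptions: the
EmpiricalP class and its methods are hypothetical, the statistic is reduced to an absolute difference, and the
add-one estimator is a common convention that is assumed here rather than taken from Goby.

```java
/** Hypothetical sketch of a two-pass empirical p-value estimate (not the Goby code). */
final class EmpiricalP {
    /** Pass 1: absolute differences between all pairs of replicates in the same group. */
    static double[] nullDistribution(double[] replicates) {
        int n = replicates.length;
        double[] diffs = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                diffs[k++] = Math.abs(replicates[i] - replicates[j]);
            }
        }
        return diffs;
    }

    /**
     * Pass 2: fraction of null differences at least as large as the observed
     * between-group difference. The +1 terms (an assumption of this sketch)
     * keep the estimate strictly above zero.
     */
    static double pValue(double[] nullDiffs, double observed) {
        int atLeastAsLarge = 0;
        for (double d : nullDiffs) {
            if (d >= observed) {
                atLeastAsLarge++;
            }
        }
        return (atLeastAsLarge + 1.0) / (nullDiffs.length + 1.0);
    }
}
```

For example, four replicates give six within-group differences. An observed between-group difference larger than all
six then yields an estimate of 1/7, which shows why the resolution of such p-values depends on how many within-group
comparisons are available.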