- Added a mode to calculate counts and perform differential expression
analysis for transcript runs (alignment-to-transcript-counts).
Transcript runs are performed against a cDNA library. They find matches
through exon-exon junctions represented in the input cDNA
library. They are a faster alternative to mapping the genome and
exon-exon boundaries separately. The disadvantage is that these searches
will only map to transcripts represented in the input library.
- Changes to fasta-to-compact mode:
- Add parallel processing in fasta-to-compact mode. Use the --parallel
flag to activate.
- Compact-reads files are now regenerated only if they do not
exist or are older than the input file.
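
The regeneration rule above can be sketched as a timestamp comparison. This is a minimal illustration, not Goby's actual code: the class and method names are hypothetical, and it assumes that file modification times are the only criterion.

```java
import java.io.File;

// Hypothetical helper illustrating the "regenerate only when missing or stale" rule.
final class CompactReadsUpToDate {

    // Pure decision logic: regenerate when the output is missing
    // or was last modified before the input.
    static boolean needsRegeneration(boolean outputExists,
                                     long inputModified,
                                     long outputModified) {
        if (!outputExists) {
            return true;
        }
        return outputModified < inputModified;
    }

    // Convenience overload that reads the timestamps from the file system.
    static boolean needsRegeneration(File input, File output) {
        return needsRegeneration(output.exists(),
                input.lastModified(), output.lastModified());
    }
}
```

Keeping the decision in a pure method makes the rule easy to check without touching the file system.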
- Added a mode to write a read set to text format (set-to-text). The output
will show the multiplicity of each query index. ReadSets can be
efficiently created with tally-reads as before.
- Changes to CompactAlignmentToAnnotationCountsMode:
- Added new option --write-annotation-counts boolean, defaults to
true. If set to false the annotation counts intermediate files
will not be written.
- Lines where "average count group *" values are ALL NaN or <= 0 will
not be written, so that lines that add nothing to the output are
omitted.
- Added new option --omit-non-informative-columns, defaults to false.
If set to true, columns in which all of the data is non-informative
(values are ALL NaN or <= 0) will be omitted.
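
The "non-informative" test used for lines and columns can be sketched as follows. This is an illustration under one reading of the rule (every value is either NaN or <= 0). The class and method names are hypothetical and are not taken from Goby.

```java
// Hypothetical helper: decides whether a line or column carries no information.
final class NonInformativeFilter {

    // Returns true when every value is NaN or <= 0.
    // An empty array is treated as non-informative (an assumption of this sketch).
    static boolean isNonInformative(double[] values) {
        for (double v : values) {
            if (!Double.isNaN(v) && v > 0) {
                return false; // at least one informative value
            }
        }
        return true;
    }
}
```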
- Support for alternative global normalization methods. We currently
provide an implementation of the upper quartile normalization method
by Bullard et al (BUQ) and the normalization method provided in
Goby 1.4 (CAC, normalize by the number of alignment records in a sample).
See the --normalization-methods argument. New normalization methods
can be used with Goby by creating an implementation of the
NormalizationMethod interface,
and adding a jar on the classpath that defines a ServiceProvider
(see build.xml goby-jar target for an example of how this is done).
When several normalization methods are given as an argument
to --normalization-methods Goby will produce derived statistics
for each normalization method and append them as new columns in
the summary stats output. This makes it easy to compare alternative
normalization methods on the same dataset.
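
To give a sense of what an upper quartile normalization does, here is a simplified sketch. It is not Goby's BUQ implementation and does not use the NormalizationMethod interface. The class name, the choice to consider only positive counts within one sample, and the nearest-rank percentile are all assumptions made for illustration.

```java
import java.util.Arrays;

// Simplified, hypothetical sketch of upper quartile scaling for one sample.
final class UpperQuartileSketch {

    // 75th percentile (nearest-rank) of the strictly positive counts.
    // Returns NaN when the sample has no positive count.
    static double upperQuartile(double[] counts) {
        double[] positive = Arrays.stream(counts)
                .filter(c -> c > 0)
                .sorted()
                .toArray();
        if (positive.length == 0) {
            return Double.NaN;
        }
        int rank = (int) Math.ceil(0.75 * positive.length);
        return positive[rank - 1];
    }

    // Divides every count by the sample's upper quartile.
    static double[] normalize(double[] counts) {
        double uq = upperQuartile(counts);
        double[] scaled = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            scaled[i] = counts[i] / uq;
        }
        return scaled;
    }
}
```

Scaling by a quantile instead of the total count reduces the influence of a few very highly expressed features on the normalization factor.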
- Added support for sequence variations:
- Changed the compact alignment format to support recording sequence
variations.
- The new mode display-sequence-variations provides text output of
sequence variations in several formats.
- The new mode sequence-variation-stats will print statistics about
sequence variations found in a set of alignments.
- Added support for quality scores:
- Changed fasta-to-compact and compact-to-fasta to read and write with
the Sanger or Illumina quality encoding.
- Modified aligners to indicate which format they require (bwa needs
fastq format, lastag fasta format, lastal fastq format). This will
need extensive testing as some of these changes can affect gobyweb.
We use the FASTQ-SANGER encoding to communicate with lastal.
We don't yet support the Solexa quality score encoding (it is a bit
obsolete anyway).
Please note that the output format in compact-to-fasta now defaults to
Fasta format. This format has no quality scores, and consequently, we
now never write quality scores when Fasta is requested. The aligners
that need quality scores must request FASTQ format explicitly.
See also:
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://last.cbrc.jp/last/doc/last-manual.txt (look for FASTQ-SANGER)
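
As an illustration of the difference between the two supported encodings, the sketch below converts a quality string from Illumina to Sanger encoding. It assumes that "Illumina" means Phred scores with an ASCII offset of 64 and "Sanger" means Phred scores with an offset of 33, as described in the FASTQ references above. The class and method names are hypothetical and this is not the conversion code used by Goby.

```java
// Hypothetical converter between two Phred-based quality encodings.
final class QualityEncodingSketch {

    static final int SANGER_OFFSET = 33;   // Phred+33
    static final int ILLUMINA_OFFSET = 64; // Phred+64 (assumed meaning of "Illumina")

    // Re-encodes each character from Phred+64 to Phred+33.
    static String illuminaToSanger(String illuminaQualities) {
        StringBuilder sanger = new StringBuilder(illuminaQualities.length());
        for (int i = 0; i < illuminaQualities.length(); i++) {
            int phred = illuminaQualities.charAt(i) - ILLUMINA_OFFSET;
            if (phred < 0) {
                throw new IllegalArgumentException(
                        "Not a Phred+64 quality character at position " + i);
            }
            sanger.append((char) (phred + SANGER_OFFSET));
        }
        return sanger.toString();
    }
}
```

For example, the Illumina character `h` (Phred 40) becomes the Sanger character `I`.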
- Changes to the Compact format:
- Store target/reference sequence lengths in the alignment header. This
information is helpful when calculating statistics such as RPKMs
(transcript-level searches).
- Store constant query lengths as one integer. Goby 1.4.1 stored one
length for each read, which can consume a lot of memory when the
number of reads is very large. This change saves memory and storage.
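
The target lengths stored in the header are what an RPKM calculation needs. A minimal sketch of the standard formula (reads per kilobase of target per million mapped reads) is shown below. The class, method, and parameter names are hypothetical, and this is not presented as Goby's own implementation.

```java
// Hypothetical helper computing RPKM from a read count and a target length.
final class RpkmSketch {

    // RPKM = 10^9 * readsOnTarget / (targetLength * totalMappedReads)
    static double rpkm(long readsOnTarget, int targetLength, long totalMappedReads) {
        if (targetLength <= 0 || totalMappedReads <= 0) {
            throw new IllegalArgumentException(
                    "targetLength and totalMappedReads must be positive");
        }
        return 1e9 * readsOnTarget / ((double) targetLength * totalMappedReads);
    }
}
```

With 10 reads on a 1,000 bp transcript and 1,000,000 mapped reads in the sample, this gives an RPKM of 10.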