------------------------------
Highlights
- Support for `BigWig`_ files
- Reimplementation of `BigBed`_ file support
- Simplification of syntax / removal of annoyances in both command-line
scripts and in infrastructure
Added/Changed
.............
File formats
""""""""""""
- Support for `BigWig`_ files. ``BigWigReader`` reads `BigWig`_ files, and
``BigWigGenomeArray`` handles them conveniently.
- ``BigBedReader`` has been reimplemented using Jim Kent's C library, making
it far faster and more memory efficient.
- ``BigBedReader.search()`` added to search indexed fields included in BigBed
files, e.g. to find transcripts with a given `gene_id` (if `gene_id` is included
as an extension column and indexed). To see which fields are searchable,
use ``BigBedReader.indexed_fields``.
Infrastructure
""""""""""""""
- Simplified file opening. All readers can now take filenames in addition
to open filehandles. No need to wrap filenames in lists any more.
For example:

.. code-block:: python

    # old way to open GTF2 file
    >>> data = GTF2_TranscriptAssembler(open("some_file.gtf"))

    # new way. Also works with BED_Reader, GTF2_Reader, GFF3_TranscriptAssembler, and others
    >>> data = GTF2_TranscriptAssembler("some_file.gtf")

    # old way to get read alignments from BAM files
    >>> alignments = BAMGenomeArray(["some_file.bam","some_other_file.bam"])

    # new way
    >>> alignments = BAMGenomeArray("some_file.bam","some_other_file.bam")

    # old way to open a tabix-indexed file
    >>> data = BED_Reader(pysam.tabix_iterator(open("some_file.bed.gz"),pysam.asTuple()),tabix=True)

    # new way
    >>> data = BED_Reader("some_file.bed.gz",tabix=True)

To maintain backward compatibility, the old syntax still works.
- ``BAMGenomeArray`` can now use mapping functions that return multidimensional
arrays. As an example we added ``StratifiedVariableFivePrimeMapFactory``,
which produces a 2D array of counts at each position in a region (columns),
stratified by read length (rows).
- Reformatted & colorized warning output to improve legibility
- ``read_pl_table()`` convenience function for reading tables written
by command-line scripts into DataFrames, preserving headers, formatting,
etc.
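
The stratified layout described for ``StratifiedVariableFivePrimeMapFactory`` (rows are read lengths, columns are positions) can be illustrated with a minimal pure-Python sketch. This is not the plastid implementation: the ``Read`` tuple and the ``stratified_five_prime_counts`` function are hypothetical names introduced here for illustration only, and the sketch assumes each read is counted once at its 5' end.

```python
from collections import namedtuple

# Hypothetical stand-in for an aligned read; not a plastid or pysam type.
Read = namedtuple("Read", ["five_prime_pos", "length"])


def stratified_five_prime_counts(reads, region_start, region_end, min_length, max_length):
    """Count 5' ends in [region_start, region_end), stratified by read length.

    Returns a list of rows, one per read length from min_length to max_length
    inclusive; each row holds one count per position in the region.
    """
    n_rows = max_length - min_length + 1
    n_cols = region_end - region_start
    counts = [[0] * n_cols for _ in range(n_rows)]
    for read in reads:
        in_region = region_start <= read.five_prime_pos < region_end
        in_lengths = min_length <= read.length <= max_length
        if in_region and in_lengths:
            row = read.length - min_length
            col = read.five_prime_pos - region_start
            counts[row][col] += 1
    return counts


# Five toy reads; the last one falls outside the region and is ignored.
reads = [Read(100, 28), Read(100, 28), Read(102, 29), Read(104, 30), Read(250, 28)]
counts = stratified_five_prime_counts(reads, 100, 105, 28, 30)
for length, row in zip(range(28, 31), counts):
    print(length, row)
```

Summing the rows of such an array recovers the ordinary one-dimensional count vector, so a 2D mapping of this kind keeps the read-length information without losing the per-position totals.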
Command-line scripts
""""""""""""""""""""
- All script output metadata now includes command as executed, for easier
re-running and record keeping
- Scripts using count files get ``--sum`` flag, enabling users to
set effective sum of counts/reads used in normalization and RPKM
calculations
- ``psite``
- ``--constrain`` option added to ``psite`` to improve performance on
noisy or low count data.
- No longer saves intermediate count files by default. ``--keep`` option
added to retain them.
- ``metagene``
- Fixed/improved color scaling in heatmap output. Color values are now
capped at the 95th percentile of nonzero values, improving contrast
- Added warnings for files that appear not to contain UTRs
- Like ``psite``, no longer saves intermediate count files by default.
``--keep`` option added to retain them.
- ``phase_by_size`` can now optionally use an ROI file from the
``metagene generate`` subprogram. This improves accuracy in higher
eukaryotes by preventing double-counting of codons when more than
one transcript is annotated per gene.
- ``cs chart``: the file containing the list of genes is now optional. If not
given, all genes are included in comparisons.
- ``reformat_transcripts`` is now able to export extended BED columns
(e.g. `gene_id`) if the input data has useful attributes. This is particularly
useful when working with large transcript annotations in GTF2/GFF3 format:
they can now be exported to BED format and converted to BigBed format,
allowing random access and low memory usage, while preserving gene-transcript
relationships.
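
The color capping described for the ``metagene`` heatmaps can be sketched as follows. This is an illustration only, not the code used by ``metagene``: the function name ``cap_at_percentile`` is hypothetical, the sketch works on a flat list rather than a 2D array, and it assumes the nearest-rank definition of a percentile (the script may well use a different interpolation).

```python
def cap_at_percentile(values, percentile=95):
    """Clip values at the given integer percentile of the nonzero entries.

    Returns (capped_values, cap). If there are no nonzero entries, the
    values are returned unchanged and cap is None.
    """
    nonzero = sorted(v for v in values if v != 0)
    if not nonzero:
        return list(values), None
    # Nearest-rank percentile, computed with integer arithmetic
    # (ceiling of percentile * n / 100, at least 1).
    rank = max(1, (percentile * len(nonzero) + 99) // 100)
    cap = nonzero[rank - 1]
    return [min(v, cap) for v in values], cap


# Two zeros, the values 1..20, and one extreme outlier.
values = [0, 0] + list(range(1, 21)) + [1000]
capped, cap = cap_at_percentile(values)
print(cap, max(capped))
```

Zeros are excluded when computing the cap so that sparse data do not drag the threshold toward zero, and the single outlier no longer stretches the color scale across the rest of the values.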
Fixed
.....
- Version parsing bug in setup script.
- ``deprecated`` function decorator now gives ``FutureWarning`` instead
of ``DeprecationWarning``
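
For readers unfamiliar with the distinction: Python's default warning filters hide ``DeprecationWarning`` in most contexts, while ``FutureWarning`` is shown to end users. The changelog does not state the motivation, so treating visibility as the reason is an assumption. A minimal sketch of a decorator with this behavior, not the actual ``plastid`` implementation, might look like:

```python
import functools
import warnings


def deprecated(func):
    """Wrap func so that each call emits a FutureWarning."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated and will be removed in a future release" % func.__name__,
            FutureWarning,
            stacklevel=2,  # attribute the warning to the caller, not the wrapper
        )
        return func(*args, **kwargs)

    return wrapper


@deprecated
def old_function(x):
    # Hypothetical function used only to demonstrate the decorator.
    return x + 1


# Record warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_function(1)

print(result, caught[0].category.__name__)
```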
Deprecated
..........
- ``--norm_region`` option of ``psite`` and ``metagene`` has been deprecated
and will be removed in ``plastid`` v0.5. Instead, use ``--normalize_over``,
which performs the same role, except coordinates are specified relative to the
landmark of interest, rather than the entire window. This change is more
intuitive to many users, and saves them mental math. If both ``--norm_region``
and ``--normalize_over`` are specified, ``--normalize_over`` will be used.
- ``BigBedReader.custom_fields`` has been replaced with ``BigBedReader.extension_fields``
- ``BigBedReader.chrom_sizes`` has been replaced with ``BigBedReader.chroms``
for consistency with other data structures
- ``BPlusTree`` and ``RTree`` classes, which will be removed in ``plastid`` v0.5
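
The "mental math" that ``--normalize_over`` removes can be made concrete with a small sketch. It assumes that window-relative coordinates (as used by ``--norm_region``) are measured from the left edge of the window, and that the landmark sits a fixed number of positions downstream of that edge. The function and parameter names below are hypothetical and are not part of ``plastid``.

```python
def landmark_to_window_coords(normalize_over, upstream_flank):
    """Convert a (start, end) pair relative to the landmark into
    coordinates relative to the left edge of the window.

    Assumes the landmark lies upstream_flank positions from the
    window's left edge.
    """
    start, end = normalize_over
    return (start + upstream_flank, end + upstream_flank)


# A window that begins 50 nt upstream of the landmark: normalizing over
# positions 20 to 100 past the landmark corresponds to window positions
# 70 to 150.
window_coords = landmark_to_window_coords((20, 100), upstream_flank=50)
print(window_coords)
```

Under these assumptions, changing the size of the upstream flank shifts the window-relative coordinates but leaves the landmark-relative ones untouched, which is why the latter are easier to reuse across analyses.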