singlecellmultiomics Changelog

0.1.11

This release contains a bugfix that resolves molecules being under- or over-counted when using `bamToCountTable.py --dedup`.

BuysDB (49):
Added 3bp context profiler
Added annotated chic molecule tagging method
Added consensus options to multiprocessing
Added context extraction method
Added covariate extraction methods
Added covariate_key generator
Added customisable offset for wig export
Added custom methylation contexts
Added DS-methylation extraction script
Added no-qcfail flag
Added prob_to_phred method
Added --r1only to chic workflow
Added recalibration functions
Added simple mutation profiler
Added -tagthreads parameter to control how many threads are used for tagging
Add options and tests to only count R1, or R2
Always use both mates
Check for min_mq not being defined
Check if all files are indexed
Cleaned up code and added examples
Clean up handles to prevent memory leaks
Clean up labels
Clip output confidences to 0-62 phred range
Close plots to reduce memory footprint
Create plots and aggregate by strand and mate
Fixed tag descriptions
Fix phred score calculation
Improved handling when passing over deletions. Refactoring
**Never count half counts on properly formatted bam files**
Optimisations: do not perform pileups in single bp binsize mode, perform pruning in thread.
Parameter passing fixes, and dealing with some globals
Properly import the indexing function
Pseudoread super call
Raise error when wrong fasta file is supplied
Refactoring
Removed DS check, added verbosity flag, fixed indentation
Removed required reference path
Renamed bamFileTabulator in Readme to bamTabulator
Revert to mean normalisation when median fails
Run on a single node for slurm (-N 1)
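
The entries above mention a `prob_to_phred` method and clipping of output confidences to the 0-62 Phred range. As a rough illustration only, the conversion could look like the sketch below. It assumes the input is an error probability and uses the standard definition Q = -10 * log10(p); the function name `prob_to_phred_sketch` and the rounding behaviour are hypothetical and not taken from the package.

```python
import math

MIN_PHRED = 0
MAX_PHRED = 62  # upper bound named in the changelog entry


def prob_to_phred_sketch(error_probability: float) -> int:
    """Convert an error probability to a Phred score clipped to 0-62.

    Hypothetical helper for illustration; not the package's implementation.
    """
    if error_probability <= 0:
        # A zero error probability would give an infinite score, so cap it.
        return MAX_PHRED
    phred = -10 * math.log10(error_probability)
    # Clip to the allowed range before returning an integer score.
    return int(min(MAX_PHRED, max(MIN_PHRED, round(phred))))
```

Clipping keeps very confident calls from producing scores outside the range chosen for the output.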

0.1.9

Summary:
BuysDB (331):
Updated download URL
Added get_samples_from_bam function
Added get_sample_to_read_group_dict functions
Added sample extraction script
Moved extract_samples method to function which can be unit tested
Added tests for extract_samples
Clean up all files created during testing. (Some .bai files remained)
Removed tf requirement
bugfix: no newline after map index rows
Added scartrace module to Molecule
Added scartrace module to Fragment
Import submodules
Added scartrace to bamtagmultiome
scartrace: Check if read is mapped before looking into the alignment
Added allele cache flag
Bamfilter: fixed header formatting
Use cigar in deduplication, fixed 84
Added test for 84
Use --no_umi_cigar_processing to disable the new behaviour
Fixed test case
Fixed test case file
Simplified sorted_bam handle
Removed use of getPairGenomicLocations and allowed fragments to decide their span.
Added get_safe_span() method to fragment which reports the span excluding primers
Added region parameters to AlleleResolver to reduce memory footprint
Added documentation
Also use region parameters when fetching from cache
Added pileup module
Added check_eject_every=None option to MoleculeIterator
Updated example
Fixed module reference
Version increment
Extract base-calls for fragments mapping to multiple contigs
Use the contig of the random primer in IVT deduplication
Pass kwargs to pysam pileup and set higher max_depth
Variant masking tool now runs for multiple contigs in parallel and will not crash when the VCF does not match the fasta file completely
Reading the vcf using 4 threads per process
Added support for non-properly paired reads
Added resolve_unproperly_paired_reads to bamtagmultiome
Set more decompression threads and fixed description
Set program ID tag in PG header line
The BI tag was used to identify the cell index, but it clashes with GATK. It is now changed to lowercase bi.
Added forwards compat
Added --slurm flag to submission.py.
Set job name
Fixed BI tag compat
Automatically convert BI to bi tag
Fixed bug accessing tag dict
Started work on slurm/sge/local wrapper
submission.py is now slurm compatible, added API to submit and hold jobs
Added scheduler selection argument to bamtagmultiome
fixed import
Removed references to args
Return job_id
Added slurm wrapper for snakemake
Added description to iterator class inputs
Fixed typo
Added legacy scripts
Made legacy scripts PEP8 compliant
Added job_name argument to submit_job
job_alias is now optional
Changed passed arg
Parse scientific notated locations during bed parsing
Use chromosome index in job script name
Perform explicit cd to working dir
Set job name of final job
Fixed job_name
Display id of last job
Show job ids of intermediate jobs
Use after: in slurm dependency submission
Addition to previous
Check for hold being None
Strip hold input
Use afterok instead of after
Strip job ids
Use one UUID for a single bamtagmultiome run
Show holding command
Concatenate all job ids in one dependency command
Use : as job separator
Changed argument order
Pass None to API when hold is empty
Set job name when using CLI
Swapped prefix and hard job name
job_alias
Added utf8 header
Generate random job name if not specified
Prefix job for sge compat
Typo fixes
Job name is now properly set when supplied. File names are timestamped if not specified.
Demux.py: create unique glue job name
Added script to match bam file with bqsr report
More descriptive error message when autodetection fails
Added memory management parameters for molecule iteration
Added parameters to Molecule to cap the amount of associated fragments
Fixed? the SLURM wrapper for snakemake workflows
Added slurm wrapper to setup
Parse job runtime from resources
Added SLURM command example to scmo_workflow.py
Added MUTECT2 workflow
Tweaked resources
Added first pass variant calling
sge and slurm wrapper now use the same API calls
Report job id
Set correct index name
Fixes 101
Extraction
Added germline variant filters
Some syntax fixes
Added germline filter message and header
SNV filter
Added extra uuid4
Write intermediate results
Added -filterMP flag to bamToCountTable
Double dash
Fixed tests
Added threads to bamcnv
Updated test cases with blacklist argument
Added CS2 demux without hexamer
Set class name
Added CELSeq2_c8_u6_NH to strat loader
Added test case for CELSEQ demux. Fixed hexamer setting of NH.
Fall back on using qsub when sbatch is not available
Added workflow for SCMO (not featurecounts) celseq2 analysis
Fixed exon gtf script name in description
Added capture_locations argument
Added hash function to SingleEndTranscript (speed benefit)
Re-ordered demux methods
Added genomic plot class
Added bamFeatures module
Indentation fix
Fixed broken indent
Updates for chic
Added script to split bamfile by tag
Added skip_contig option to bamtagmultiome
Added demux tests
Added compat for already demultiplexed index
Fixed cell-readcount plot
Added get_contig_size to bamprocessing utils
Fast multi-processing count table generation
FeatureCountsFullLengthFragment fragment class added
Fixed variable declaration
Added linting script
Added full length featurecounts dedup option to bamtagmultiome (fl_feature_counts)
Removed unused imports
Allow pysam.FastaFile as argument
Added method to reset axis of a contig
Swapped dictionary indexing
Added key_tags argument
Allow pysam handle
Scale axis and despine
Fixed ax reference
Added dedup option
Added genome coverage plot to library stats
Added bam_is_processed_by_program function
Autodetect which bam file should be used if not supplied
Added more arguments to configure memory limits
Don't use multiprocessing when one thread is requested
Added variant extraction to workflow
Added cn clustermap
Added more comments and only check sample when read is used
Make sure the contigs are in the correct order
Added live counting function
Added lowess count correction
Added script for extracting and plotting cn
Added progress indication
Fix print statement
Removed incorrect argument
Added missing carriage return
Bugfix: Check if gc matrix needs to be computed
Added max_fragment_size threshold
Added option to set a single read group sample id per library
Added option to allow shift in cycle
Write rejection reason tag
Added parameter to expose setting to allow cycle shift
Added read group format setting to bamtagmultiome
Added allow_cycle_shift to bamtagmultiome
allow_cycle_shift=False by default
Added test case and updated other test cases
Added overflow support to MoleculeIterator
Raise overflow error when too many fragments are being associated with a molecule
Added association limit parameters
Added callback function to MoleculeIterator to monitor progress and state
Added performance logging methods to bamtagmultiome
Correctly handle yield_invalid flag for overflow reads
Added yield_overflow parameter to MoleculeIterator
Added --no_overflow parameter to bamtagmultiome
Optimized ordering of progress indication and shows percentage deleted reads
Added verbosity settings
Added integrity status files and testing. Fixes 65
Added input_is_sorted argument
Refactored read group code
Added script to convert read group format of bam file
Made bamtagmultiome use the new read group protocol
Added get_read_group_from_read function to bamprocessing
Prevent duplicate program IDs
Added get_read_group_format function
Refactoring
Demux.py is now twice as fast.
Bugfix: pass keyword arguments in all Fragment classes
Fix kwargs
Set variant key to include ref and alt base
Added variants module
Added variant wrapper class which can be pickled
Start of postprocessing module
Added fast_compression flags to multiple functions
Added test case for writing with faster compression
Formatting
Added prototype bamtagmultiome script which uses multiple CPUs and automatically blacklists regions (scCHiC only for now)
Added more command line accessible arguments
Added functions to combine overlapping ranges
Added function to clip a list of regions between set boundaries
Added function to generate overlapping ranges excluding blacklisted regions
Added test case for blacklisted binning
Added blacklist option to bamtagmultiome_multi
Added statsmodels dependency
Minor tweaks
Bugfix: assume average GC for a region with only Ns in the reference sequence
Bugfixes
Added min_mapping_qual and debug_job_bin_bed arguments
Added min_mapping_qual to molecule iterator
Bugfix, always define total_commands
Filter fragments with large homopolymers in chic
Added allele freq filter script
Optimisations for slow disks
Bugfix: don't drop bins due to last bin with no molecules
Optimisations, smaller bins bigger clusters of jobs better tracking of blacklisted bins
Bugfix: use the paths specified by the user
Added get_read_group_to_sample_dict method
Added additional demultiplexing module for chic
Bugfix: --norejects was not always working
Added "auto" option to submission.py
Added tu tag for second UMI
Added chic with gene annotations to bamtagmultiome
Move calculate_consensus method
Create tag_multiome_single_thread method, such that it can be replaced by the multi version later
Started on read group support
Fall back to heatmap when clustermap fails
Added normalisation method argument
Refactored use of consensus model
Refactoring
Reset feature tags when re-applying
Added offset argument, which is by default -1
Refactoring
Added bp_chunked method
Allow lower version of numpy
Move merge_bams to bamProcessing
Created separate file for tagging methods
Handle tagging import and refactor
DeCamaLise
Added prefetching capability, for pre-loading genomic information for defined/current genomic region
Refactoring prefetching
Verify arguments
AlleleResolver can now be pickled
Added more tests
Added region arguments
Added verification method
Added read_all argument
Added prefetching to mapability reader
Added prefetch switch flag
Added Uninitialised class, which can be used to pass classes with cython code to Pool workers
Removed reference attribute from TAPS class. The reference handle is now accessed using the Molecule object
Added missing statsmodels dependency
Parse input arguments for uninitialised classes and initialise if necessary
Added TF as total assoc fragment tag. Resolves 132
Type hints
Added tasks generation and use of blacklist 131
First working prototype! 131
Ignore molecules without a defined cut-site. This needs some improvement later.
Use pickleable handles
Add some testcases for CHiC and multiprocessed CHiC
Added fasta handle which is pickleable
Write program header when using multiprocessing 131
Added replace_bam_header method
Added white and blacklisting for contigs 131
Added missing files
Updated setup.py
Added get_contigs_with_reads method and test-cases
Optimisation: only generate jobs for contigs which have reads mapped to them
Solves 131
Do not resolve mate pairs in qflag mode, and don't perform deduplication
Added method to calculate consensus based on majority vote
Bugfix: use reference handle which can be pickled
Bugfix (contig might not be set)
Write temp files to current directory by default
Added mj consensus method
Sortedbam: allow name of origin bam file to be passed as argument
Reset JN tag
Bugfix: prefetcher should copy instead of write to args
Additions
Added default settings for ct
Allow uncertain bases
Bugfix: CG. Fixed CHiC+TAPS
Usability fix: automatically append byValue tag if it is not specified
Solves 141. Set quality of read with no aligned bases to 0.
Added taps strand as setting to Taps
Changed default param value
Color reads based on CpG methylation
Extract single CpG calls using multiple processes
Update available tagging options
Use kwargs for all vars
Fixed nonsensical defaults
Bugfix comparison statement
Bugfix: prefetch in single thread mode
Added kwargs passing and known variant masking
More informative debug message
Optimisation, reduce iterations
Added option to write debug bam file
Fix: used a non-existent argument
Started on wrapper script to get methylation calls
Added readgroup tests
Formatting
Bugfix: header was not updated properly
Added WIG writer
Added test case for readgrouping in multiprocess mode
Added methylation track csv and wig export
Removed incorrect -c flag
Handle cases where kwargs arguments are defined but None
Auto import
Fixes 143
Added methylation module
Refactoring
Added tests
Suppress numpy warnings and remove rows with nans
Wig and distance matrix writing
Add bamProcessing/bamToMethylationCalls.py to setup
Set names of all columns
Set better defaults
solves 145
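
Several entries in the list above concern Slurm job holds: using `afterok` instead of `after`, stripping job ids, joining all ids with `:` in one dependency command, and passing `None` when the hold is empty. A minimal sketch of that logic is shown below. The function name `build_dependency_flag` is hypothetical and is not the API of submission.py; only the `--dependency=afterok:id[:id...]` syntax is taken from sbatch itself.

```python
from typing import Iterable, Optional


def build_dependency_flag(hold_job_ids: Optional[Iterable[str]]) -> Optional[str]:
    """Build a single sbatch dependency flag from a collection of job ids.

    Hypothetical helper for illustration; returns None when there is
    nothing to wait for.
    """
    if hold_job_ids is None:
        return None
    # Strip whitespace and newlines, then drop ids that end up empty.
    cleaned = [str(job_id).strip() for job_id in hold_job_ids]
    cleaned = [job_id for job_id in cleaned if job_id]
    if not cleaned:
        return None
    # afterok: start only once all listed jobs finished successfully.
    return "--dependency=afterok:" + ":".join(cleaned)
```

With `afterok`, the dependent job only starts when every listed job has exited successfully, which is usually what a multi-step tagging pipeline needs.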

Marloes (27):
change bowtie2 (and bwa) reference file
Add reference file for mapper
scartrace workflow
cs_feature_counts was named 'from_featurecounts_tagged'
config.json for celseq workflow
Snakemake file for celseq workflow
Delete Snakefile
Snakefile for celseq workflow
added librarystatistics plots
Correct reference parameter in mapper (bwa/bowtie2)
For editing heterozygous SNPs to homozygous SNPs observed in data
Update heterozygousSNPedit.py
Update setup.py
Improved BWA mapping for 2x150bp paired end reads
Improved BWA mapping for 2x150bp paired end reads
improved insert size filtering
Update config.json
Update config.json
Update Chic snakemake workflow

Jake Yeung (4):
Add script to split bam by cluster
Add blacklist option for binned mode
Allow blacklist to create count table. Last commit before I copy over a test file.
Copy over test file. This should work for bins and beds

Maria Florescu (2):
adding BAM splitting script
working splitting of BAM file with double signal and optional linear interpolation

0.1.8

Added:

[Script to split bam by cluster](https://github.com/BuysDB/SingleCellMultiOmics/commit/338ae89fa509e609699e65389808bf5b867d26a2)
[Feature counts compatibility with bamtagmultiome](https://github.com/BuysDB/SingleCellMultiOmics/pull/73)
[CHIC+T experimental](https://github.com/BuysDB/SingleCellMultiOmics/pull/70)
[Added nla_no_overhang method to bamtagmultiome](https://github.com/BuysDB/SingleCellMultiOmics/commit/c5124c15f2c1aa6cf17e0043b53a3e7579eefe46)
[Added CellReadCount plot to libraryStatistics.py](https://github.com/BuysDB/SingleCellMultiOmics/commit/26bd7da16efcae1c41481bded86f062e32b8dc46)
[Added tensorflow based consensus caller](https://github.com/BuysDB/SingleCellMultiOmics/commit/065df50f519a8d931d7c64afaaf2ea9abd653dd3)

Bugfixes:
[Try sorting at multiple locations](https://github.com/BuysDB/SingleCellMultiOmics/commit/af6f9080f871ece7f769c293ac0ddef9a3a85480)
[All submodules now have an entry in the docs](https://github.com/BuysDB/SingleCellMultiOmics/commit/180f524a18a8fd2846a988bcb94cabcdcabb6153)
[Library statistics: fixed bug where tagged.bam was not detected](https://github.com/BuysDB/SingleCellMultiOmics/commit/646dad4835378fb0ed9d03f9718ca9db5d0d4ab4)

And updated version name.

0.1.7

This is a maintenance release.

0.1.6

Bamtagmultiome
- Now verifies if a bam index is available and creates one if it's not available
- Continues without a reference when it is not indexed
- Fixed cluster jobs: the resulting read groups are now correct, and all contigs, including alternative contigs, are processed.
- Added a script to estimate mappability for digest protocols. Its output file can be supplied to bamtagmultiome to filter on mappability.
- Added write_program_tag method to write provenance information to the bam file. bamtagmultiome uses it to record the arguments used for generating the file.

Additions
Created a lot of API documentation, [for example for Fragment](https://singlecellmultiomics.readthedocs.io/en/latest/source/singlecellmultiomics.fragment.html#singlecellmultiomics.fragment.fragment.Fragment)

Added FourThiouridine (4su) class and analysis script for newly synthesized RNA.

[Added function sorted_bam_file to write straight to a sorted and indexed bam file](https://singlecellmultiomics.readthedocs.io/en/latest/source/singlecellmultiomics.bamProcessing.html#singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file)

Updates
The QC fail bit of reads is set when the associated fragment is not valid.
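
For reference, the QC fail bit is flag value 0x200 in the SAM specification. The sketch below shows the bit operation on a plain integer flag. The helper name `mark_qcfail` is hypothetical, and whether the package also clears the bit for valid fragments is not stated here, so the sketch leaves valid fragments untouched.

```python
QCFAIL_FLAG = 0x200  # SAM flag bit: not passing quality controls


def mark_qcfail(flag: int, fragment_is_valid: bool) -> int:
    """Return the SAM flag with the QC-fail bit set for invalid fragments.

    Hypothetical helper for illustration only.
    """
    if fragment_is_valid:
        # Assumption: flags of valid fragments are left as they are.
        return flag
    return flag | QCFAIL_FLAG
```

Setting the bit rather than dropping the read keeps rejected reads in the file while letting downstream tools filter them out.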

MoleculeIterator is now much faster when reading through regions with very high coverage.

MoleculeIterator can now read from an iterable yielding single end reads

BAM Sorting is now performed in local directory with uuid4 prefix to prevent flooding /tmp
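
A minimal sketch of how such a prefix could be generated is shown below. The function name `local_sort_prefix` and the `TMP.` naming are assumptions made for illustration, not the package's actual code.

```python
import os
import uuid


def local_sort_prefix(directory: str = ".") -> str:
    """Return a unique temporary-file prefix inside `directory`.

    Hypothetical helper: uuid4 makes collisions between concurrent
    sorting jobs in the same directory very unlikely.
    """
    return os.path.join(directory, f"TMP.{uuid.uuid4()}")
```

A prefix like this could then be handed to a sorter that accepts a temporary-file prefix, for example the `-T` option of `samtools sort`.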

Added deprecation warning to universalBamTagger.py

demux.py no longer allows paths containing a star.

0.1.5

Additions and improvements

scChiC
The requirement for a valid scChiC fragment is now less strict: the second mate is no longer required to map.

Methylation
Methylation calls are now stored in the molecule object
All Bismark tags are written

Molecule
Added methods to extract base-calling feature matrices per molecule and others to extract features for structural predictions
Added methods to create a consensus read from the molecule and methods to train a classifier for consensus calling
[IVT duplicates are now tagged in the bam file](https://github.com/BuysDB/SingleCellMultiOmics/issues/54)
The molecule class writes tags indicating which SNPs were used as evidence in assigning the allele

Fragment
[UMI hamming distance threshold is now configurable](https://github.com/BuysDB/SingleCellMultiOmics/issues/21)
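
To illustrate what a configurable Hamming distance threshold means for UMI comparison, here is a small sketch. The function names and the default threshold of 1 are assumptions for the example and do not describe the Fragment class's actual interface.

```python
def hamming_distance(umi_a: str, umi_b: str) -> int:
    """Count the positions at which two equal-length UMIs differ."""
    if len(umi_a) != len(umi_b):
        raise ValueError("UMIs must have the same length")
    return sum(a != b for a, b in zip(umi_a, umi_b))


def umis_match(umi_a: str, umi_b: str, max_hamming_distance: int = 1) -> bool:
    """Decide whether two UMIs are close enough to be treated as the same.

    Hypothetical helper; the default threshold of 1 is an assumption.
    """
    return hamming_distance(umi_a, umi_b) <= max_hamming_distance
```

A higher threshold tolerates more sequencing errors in the UMI but increases the chance of merging distinct molecules.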

Multiome tagger
[bamtagmultiome.py is a replacement of universalBamTagger.py, solving many issues.](https://github.com/BuysDB/SingleCellMultiOmics/issues/12)
The multiome tagger now runs all chromosomes in parallel
The path to the reference fasta is auto-detected from the BAM file
Added a --consensus option which writes a single consensus read per molecule
[Added allele specific, and SNP aware methylation calling](https://github.com/BuysDB/SingleCellMultiOmics/issues/50)

Demultiplexer
For every rejected read the RR (Rejection reason) tag is set, explaining why a read was not demultiplexed
[Added testing framework for the demultiplexer](https://github.com/BuysDB/SingleCellMultiOmics/issues/7)
[The demultiplexer now trims off and stores the sequence of random primers.](https://github.com/BuysDB/SingleCellMultiOmics/issues/54)

Bugfixes
[Sort ligation CSV rows](https://github.com/BuysDB/SingleCellMultiOmics/commit/4154cbf8e007db3ead77e2221c9594484ec73816)
Fixed plate visualisation for Celseq2
Fixed race-condition when running demultiplexing on multiple cluster jobs

Misc
[Switched to sphinx_rtd_theme in docs](https://singlecellmultiomics.readthedocs.io/en/latest/source/singlecellmultiomics.molecule.html#singlecellmultiomics.molecule.molecule.Molecule.get_base_calling_feature_matrix)
