This is the second pre-release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
New Modules for Annotation and Plotting:
- Introduced a comprehensive plotting module, including summary plots, gene plots, comparative plots, association plots, and ChimeraX plots.
Nextflow Pipeline:
- Added a minimal Nextflow pipeline to perform 3D clustering analysis across multiple cohorts and generate all relevant plots.
MANE Transcripts Support:
- Built datasets prioritizing MANE AF-predicted structures.
- Tracked transcript IDs from the input data, recording whether each one matches, mismatches, or is missing from the Oncodrive3D datasets.
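The three transcript statuses mentioned above can be illustrated with a minimal sketch. The function name, the status labels as strings, and the exact-match comparison are assumptions made for illustration; they are not taken from the Oncodrive3D code base.

```python
from typing import Optional


def transcript_status(input_id: Optional[str], dataset_id: Optional[str]) -> str:
    """Classify an input transcript ID against the ID stored in the datasets.

    Hypothetical helper: the labels and the comparison rule are assumptions.
    """
    # "missing": one of the two sides carries no transcript ID to compare.
    if not input_id or not dataset_id:
        return "missing"
    # "match": the IDs are identical (exact string comparison, an assumption).
    if input_id == dataset_id:
        return "match"
    # Anything else is reported as a mismatch.
    return "mismatch"
```

For example, `transcript_status("tx_A", "tx_B")` returns `"mismatch"`, where `tx_A` and `tx_B` are placeholder IDs.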
Mutation Filtering:
- Filtered out mutations whose wild-type (WT) amino acid does not match the residue in the structure, as well as genes whose ratio of mapping issues exceeds a threshold.
- Added an option to disable WT AA mismatch filtering, particularly useful for mouse data, where inconsistencies between VEP and UniProt isoforms occur.
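The two filtering steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the record fields (`gene`, `pos`, `wt_aa`), the function name, the `check_wt` switch, and the default threshold of `0.1` are all hypothetical.

```python
from collections import defaultdict


def filter_mutations(mutations, structure_seq, max_mismatch_ratio=0.1, check_wt=True):
    """Drop WT-mismatched mutations and genes with too many mismatches.

    mutations: list of dicts with hypothetical keys "gene", "pos" (1-based), "wt_aa".
    structure_seq: dict mapping gene -> amino-acid sequence of its structure.
    max_mismatch_ratio: assumed default; the real threshold is not stated in the notes.
    """
    if not check_wt:
        # Mismatch filtering disabled: keep every mutation as-is.
        return list(mutations)

    total = defaultdict(int)
    mismatched = defaultdict(int)
    kept = []
    for mut in mutations:
        gene, pos = mut["gene"], mut["pos"]
        seq = structure_seq.get(gene, "")
        total[gene] += 1
        # Out-of-range positions are counted as mismatches (an assumption).
        if 1 <= pos <= len(seq) and seq[pos - 1] == mut["wt_aa"]:
            kept.append(mut)
        else:
            mismatched[gene] += 1

    # Genes whose mismatch ratio exceeds the threshold are removed entirely.
    bad_genes = {g for g in total if mismatched[g] / total[g] > max_mismatch_ratio}
    return [m for m in kept if m["gene"] not in bad_genes]
```

Passing `check_wt=False` mirrors the idea of switching the mismatch filter off, which the notes recommend for mouse data.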
Direct VEP Output Support:
- Enabled direct VEP output processing, allowing filtering of transcripts based on Oncodrive3D-built datasets.
Enhanced Outputs:
- Included processed input mutations (`<cohort>.mutations.processed.tsv`), missense mutation probabilities (`<cohort>.miss_prob.processed.tsv`), and Oncodrive3D sequence dataframes (`<cohort>.seq_df.processed.tsv`).
Mouse Data Support:
- Fully enabled and tested processing of mouse data (mm39) across all steps, including dataset building, annotations, and plotting.
Bug Fixes and Improvements:
- Resolved a bug affecting the identification of the most significant volume per gene.
- Changed sorting of position-level results from rank-based (Gene, Rank) to significance-based (Gene, p-value, Score).
- Refactored `main.py`, moving code that did not belong there into module-specific scripts for better organization.
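The change in sorting of position-level results can be pictured with a small sketch. The key names (`Gene`, `Rank`, `pval`, `Score`) are hypothetical, and the sort directions (ascending p-value, then descending score) are an assumption; the notes only state the order of the sort keys.

```python
def sort_by_rank(rows):
    """Previous ordering: by gene, then by rank."""
    return sorted(rows, key=lambda r: (r["Gene"], r["Rank"]))


def sort_by_significance(rows):
    """New ordering: by gene, then p-value, then score.

    Assumes smaller p-values and larger scores come first.
    """
    return sorted(rows, key=lambda r: (r["Gene"], r["pval"], -r["Score"]))


# Toy rows with made-up values, for illustration only.
rows = [
    {"Gene": "B", "Pos": 5, "Rank": 0, "pval": 0.01, "Score": 1.0},
    {"Gene": "A", "Pos": 7, "Rank": 0, "pval": 0.05, "Score": 3.0},
    {"Gene": "A", "Pos": 2, "Rank": 1, "pval": 0.01, "Score": 1.5},
    {"Gene": "A", "Pos": 9, "Rank": 2, "pval": 0.01, "Score": 2.5},
]
ordered_positions = [r["Pos"] for r in sort_by_significance(rows)]
```

With these toy rows, positions tied on p-value within a gene are ordered by their score.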
Example usage
To run the examples provided, the `<input_path>` directory should be organized as follows:
<input_path>/
├── vep/
│   ├── <cohort_1>.vep.tsv.gz
│   └── <cohort_2>.vep.tsv.gz
└── mut_profile/
    ├── <cohort_1>.sig.json
    └── <cohort_2>.sig.json
`vep/`: Contains the [VEP](https://www.ensembl.org/info/docs/tools/vep/index.html) output files for each cohort, compressed as `.tsv.gz`.
`mut_profile/`: Contains the [Bgsignature](https://pypi.org/project/bgsignature/) output files (mutation profile in trinucleotide context) for each cohort, saved as `.sig.json`.
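Before launching the pipeline, it can help to verify that every cohort has both a VEP file and a mutation profile. The following is a minimal sketch of such a check; `check_input_layout` is a hypothetical helper, not part of Oncodrive3D, and it relies only on the directory names and file suffixes described above.

```python
from pathlib import Path

VEP_SUFFIX = ".vep.tsv.gz"
SIG_SUFFIX = ".sig.json"


def check_input_layout(input_path):
    """Pair VEP files with mutation profiles by cohort name.

    Returns three sorted lists: cohorts with both files, cohorts lacking a
    mutation profile, and cohorts lacking a VEP file.
    """
    root = Path(input_path)
    # Strip the known suffix from each file name to recover the cohort name.
    vep = {p.name[: -len(VEP_SUFFIX)] for p in (root / "vep").glob("*" + VEP_SUFFIX)}
    sig = {p.name[: -len(SIG_SUFFIX)] for p in (root / "mut_profile").glob("*" + SIG_SUFFIX)}
    return sorted(vep & sig), sorted(vep - sig), sorted(sig - vep)
```

Calling `check_input_layout("<input_path>")` and confirming that the second and third lists are empty is a quick sanity check of the layout.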
Human MANE
- `build_datasets -o <datasets_path> --mane`
- `build_annotations -o <annotations_path> -d <datasets_path>`
- `nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --vep_input true --verbose true --plot true --chimerax_plot true --mane true --seed 64 -profile container`
Mouse
- `build_datasets -o <datasets_path> --organism mouse`
- `build_annotations -o <annotations_path> -d <datasets_path> --organism mouse`
- `nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --ignore_mapping_issues true --plot true --chimerax_plot true --vep_input true -profile container`