This update features:
- a change of Scirpy's data structure to improve interoperability with the AIRR standard
- a complete rewrite of the clonotype definition module for improved performance.
This required several backwards-incompatible changes. Please read the release notes below and the updated tutorials.
Backwards-incompatible changes
Improve Interoperability by fully supporting the AIRR standard (241)
Scirpy stores receptor information in `adata.obs`. In this release, we updated the column names to match the [AIRR Rearrangement standard](https://docs.airr-community.org/en/latest/datarep/rearrangements.html#productive). Our data model is now much more flexible and allows arbitrary immune-receptor (IR) chain-related information to be imported. Use `scirpy.io.upgrade_schema()` to update existing `AnnData` objects to the latest format.
Closed issues 240, 253, 258, 255, 242, 215.
This update includes the following changes:
- `IrCell` has been replaced by `AirrCell`, which has additional functionality.
- `IrChain` has been removed. Use a plain dictionary instead.
- CDR3 information is now read from the `junction` and `junction_aa` columns instead of `cdr3_nt` and `cdr3`, respectively.
- Clonotype assignments are now stored in the `clone_id` column by default.
- `expr` and `expr_raw` are now `duplicate_count` and `consensus_count`, respectively.
- `{v,d,j,c}_gene` is now `{v,d,j,c}_call`.
- There's now an `extra_chains` column containing all IR-chains that don't fit into our [receptor model](https://icbi-lab.github.io/scirpy/ir-biology.html#receptor-model). These chains are not used by scirpy, but can be re-exported to different formats.
- `merge_with_ir` is now split up into `merge_with_ir` (to merge IR data with transcriptomics data) and `merge_airr_chains` (to merge several adatas with IR information, e.g. BCR and TCR data).
- Tutorials and documentation have been updated to reflect these changes.
- Sequences are not converted to upper case on import. Scirpy tools that consume the sequences convert them to upper case on-the-fly.
- `{to,from}_ir_objs` has been renamed to `{to,from}_airr_cells`.
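The field renames listed above can be summarized as a small lookup. The following is only an illustrative sketch: `rename_legacy_field` is a hypothetical helper, the example values are made up, and this is not how `upgrade_schema` is implemented. For existing `AnnData` objects, use `scirpy.io.upgrade_schema()` as described above.

```python
import re

# Field renames taken from the list above. The mapping helper is hypothetical.
LEGACY_TO_AIRR = {
    "cdr3_nt": "junction",
    "cdr3": "junction_aa",
    "expr": "duplicate_count",
    "expr_raw": "consensus_count",
}


def rename_legacy_field(name: str) -> str:
    """Map a legacy chain field name to its AIRR-style replacement."""
    # {v,d,j,c}_gene -> {v,d,j,c}_call
    match = re.fullmatch(r"([vdjc])_gene", name)
    if match:
        return f"{match.group(1)}_call"
    # Names without a known replacement are passed through unchanged.
    return LEGACY_TO_AIRR.get(name, name)


# Made-up chain record using the legacy field names.
legacy_chain = {"cdr3": "CASSLGTDTQYF", "v_gene": "TRBV7-2", "expr": 12}
airr_chain = {rename_legacy_field(key): value for key, value in legacy_chain.items()}
print(airr_chain)
```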
Refactor CDR3 network creation (230)
Previously, `pp.ir_neighbors` constructed a `cell x cell` network based on clonotype similarity. This led to performance issues
with highly expanded clonotypes (i.e. thousands of cells with exactly the same receptor configuration). Such cells would
form dense blocks in the sparse adjacency matrix (see issue 217). Another downside was that expensive alignment-distances had
to be recomputed every time the parameters of `ir_neighbors` were changed.
The new implementation computes distances between all _unique receptor configurations_, only considering one instance of highly expanded clonotypes.
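The idea of computing distances only between unique receptor configurations can be sketched in plain Python. This is a simplified illustration, not scirpy's implementation: it assumes a single sequence per cell, uses a toy Hamming-style metric as a stand-in for the real distance metrics, and all sequences are made up.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Toy metric: number of mismatches; unequal lengths count as maximally distant."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def unique_distances(cell_sequences):
    """Compute pairwise distances between unique sequences only."""
    # unique sequence -> number of cells carrying it
    counts = Counter(cell_sequences)
    unique = sorted(counts)
    # One distance per unordered pair of unique sequences.
    dist = {(a, b): hamming(a, b) for a, b in combinations(unique, 2)}
    return counts, dist


# One highly expanded clonotype (1000 identical cells) plus two singletons.
cells = ["CASSLGTF"] * 1000 + ["CASSLGAF", "CASRLGTF"]
counts, dist = unique_distances(cells)
print(len(cells), "cells,", len(counts), "unique sequences,", len(dist), "distances computed")
```

In this sketch the 1000 identical cells contribute a single entry, so only three distances are computed instead of one for every pair of cells.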
Closed issues 243, 217, 191, 192, 164.
This update includes the following changes:
- `pp.ir_neighbors` has been replaced by `pp.ir_dist`.
- The options `receptor_arms` and `dual_ir` have been moved from `pp.ir_neighbors` to `tl.define_clonotypes` and `tl.define_clonotype_clusters`.
- The default key for clonotype clusters is now `cc_{distance}_{metric}` instead of `ct_cluster_{distance}_{metric}`.
- `same_v_gene` now fully respects the options `dual_ir` and `receptor_arms`.
- v-genes and receptor types were previously simply appended to clonotype ids (when `same_v_gene=True`). Now clonotypes with different v-genes get assigned a different numeric id.
- Distance metric classes have been moved from `ir_dist` to `ir_dist.metrics`.
- Distance matrices generated by `ir_dist` are now square and symmetric instead of triangular.
- The default value for `dual_ir` is now `any` instead of `primary_only` (Closes 164).
- The API of `clonotype_network` has changed.
- Clonotype network now visualizes cells with identical receptor configurations as a single dot. The number of cells with identical receptor configurations is shown as point size (and optionally, as color). Clonotype network no longer supports plotting multiple colors at the same time.
| Clonotype network (previous implementation) | Clonotype network (now) |
| -------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Each dot represents a cell. Cells with identical receptors form a fully connected subnetwork | Each dot represents cells with identical receptors. The dot size refers to the number of cells |
| ![image](https://user-images.githubusercontent.com/7051479/114389098-dff87800-9b94-11eb-8dd8-cc406024eaa6.png) | ![image](https://user-images.githubusercontent.com/7051479/114389105-e2f36880-9b94-11eb-9a58-68a09e67efe7.png) |
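One of the changes listed above is that distance matrices are now square and symmetric instead of triangular. For code that still holds a triangular matrix, the relationship between the two layouts can be sketched as follows. This is a minimal illustration using nested lists and a hypothetical `symmetrize` helper. It is not part of scirpy, which works with sparse matrices.

```python
def symmetrize(upper):
    """Mirror the upper triangle of a square matrix across its diagonal."""
    n = len(upper)
    # Copy the rows so the input is left untouched.
    full = [row[:] for row in upper]
    for i in range(n):
        for j in range(i + 1, n):
            full[j][i] = upper[i][j]
    return full


# Made-up upper-triangular distances between three unique sequences.
triangular = [
    [0, 1, 2],
    [0, 0, 1],
    [0, 0, 0],
]
square = symmetrize(triangular)
print(square)
```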
Drop Support for Python 3.6
- Support Python 3.9, drop support for Python 3.6, following the numpy guidelines. (229)
Fixes
- `tl.clonal_expansion` and `tl.clonotype_convergence` now respect cells with missing receptors and return `nan` for those cells. (252)
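The behavior described above can be illustrated with a simplified sketch. It assumes that a missing receptor is represented by `None`, and the hypothetical `clonal_expansion` function below returns plain clonotype sizes, which is a simplification of what `tl.clonal_expansion` actually reports.

```python
import math
from collections import Counter


def clonal_expansion(clone_ids):
    """Return the clonotype size for each cell; cells without a receptor get nan."""
    # Count only cells that actually have a clonotype assigned.
    sizes = Counter(c for c in clone_ids if c is not None)
    return [sizes[c] if c is not None else math.nan for c in clone_ids]


# Made-up clonotype ids; the third cell has no receptor.
result = clonal_expansion(["0", "0", None, "1"])
print(result)  # [2, 2, nan, 1]
```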
Additions
- `util.graph.igraph_from_sparse_matrix` converts a sparse connectivity or distance matrix to an `igraph` object.
- `ir_dist.sequence_dist` now also works with sequence arrays that contain duplicate entries (192)
- `from_dandelion` and `to_dandelion` facilitate interaction with the [Dandelion package](https://github.com/zktuong/dandelion) (#240)
- `write_airr` writes scirpy's `adata.obs` back to the [AIRR Rearrangement format](https://docs.airr-community.org/en/latest/datarep/rearrangements.html).
- `read_airr` now tries to infer the locus from gene names, if no locus column is present.
- `ir.io.upgrade_schema` upgrades an existing scirpy `AnnData` object to be compatible with the latest version of scirpy.
- `define_clonotypes` and `define_clonotype_clusters` now print a logging message indicating where the results have been stored (215)
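To illustrate what inferring the locus from gene names can look like, here is a minimal sketch. It assumes that the first three characters of a gene call (for example `TRB` in `TRBV7-2`) identify the locus. The `infer_locus` helper is hypothetical and the heuristic actually used by `read_airr` may differ.

```python
# Locus values defined by the AIRR Rearrangement schema.
AIRR_LOCI = ("IGH", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG")


def infer_locus(*gene_calls):
    """Guess the locus from the first gene call with a recognizable prefix."""
    for call in gene_calls:
        if call:
            prefix = call[:3].upper()
            if prefix in AIRR_LOCI:
                return prefix
    # No gene call carried a known locus prefix.
    return None


print(infer_locus("TRBV7-2", "TRBJ2-3"))
print(infer_locus(None, "IGHJ4"))
```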
Minor changes
- `tqdm` now uses IPython widgets to display progress bars, if available
- the `process_map` from `tqdm` is now used to display progress bars for parallel computations instead of the custom implementation used previously [f307c2b](https://github.com/icbi-lab/scirpy/pull/230/commits/f307c2b9a6a5b3e86ca0399b6142490d15511177)
- `matplotlib`'s "grid lines" are now suppressed by default in all plots.
- Docs from the `master` branch are now deployed to `icbi-lab.github.io/scirpy/develop` instead of the main documentation website. The main website only gets updated on releases.
- Refactored the `_is_na` function that checks if a string evaluates to `None`.
- Fixed outdated documentation of the `receptor_arms` parameter (264)