Change Log
glycan_data
- Updated sugarbase database and all models
stats
- Newly added module to glycowork
- Moved all the statistics functions from `motif.processing` into this module: `cohen_d`, `mahalanobis_distance`, `mahalanobis_variance`, `variance_stabilization`, `MissForest`, `impute_and_normalize`, and `variance_based_filtering`
- Added `fast_two_sum`, `two_sum`, `expansion_sum`, `hlm`, `update_cf_for_m_n`, `jtkdist`, `jtkinit`, `jtkstat`, and `jtkx` as helper functions for the JTK test
- Added `get_BF` to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
- Added `get_alphaN` to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
- Added `pi0_tst` and `TST_grouped_benjamini_hochberg` to perform a Two-Stage adaptive Benjamini-Hochberg procedure based on groups (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175141/ or https://www.biorxiv.org/content/10.1101/2024.01.13.575531v1)
- Added `test_inter_vs_intra_group` to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise
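
To illustrate the two-stage idea behind `pi0_tst` and `TST_grouped_benjamini_hochberg`, here is a minimal, ungrouped sketch of a two-stage adaptive Benjamini-Hochberg procedure. It is not glycowork's implementation: the function names `bh_reject_count` and `two_stage_bh` are hypothetical, and the grouped variant (which estimates the null proportion per group and re-weights p-values accordingly) is deliberately left out.

```python
def bh_reject_count(pvals, q):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvals)
    count = 0
    for rank, p in enumerate(sorted(pvals), start=1):
        if p <= rank * q / m:
            count = rank  # step-up rule: keep the largest rank that passes
    return count


def two_stage_bh(pvals, q=0.05):
    """Two-stage adaptive BH (hypothetical helper, ungrouped sketch).

    Returns (list of reject flags in input order, estimated null proportion pi0).
    """
    m = len(pvals)
    q1 = q / (1 + q)                      # stage 1 runs BH at a slightly stricter level
    r1 = bh_reject_count(pvals, q1)
    if r1 == 0:
        return [False] * m, 1.0           # nothing rejected: all hypotheses treated as null
    if r1 == m:
        return [True] * m, 0.0            # everything rejected already
    pi0 = (m - r1) / m                    # estimated proportion of true nulls
    r2 = bh_reject_count(pvals, q1 / pi0)  # stage 2 relaxes the level by 1 / pi0
    threshold = sorted(pvals)[r2 - 1]     # r2 >= r1 >= 1, so the index is valid
    return [p <= threshold for p in pvals], pi0


pvals = [0.001, 0.002, 0.003, 0.04, 0.5, 0.9]
reject, pi0 = two_stage_bh(pvals, q=0.05)
```

On this toy input, the second stage can reject one hypothesis more than plain BH at the same level, because the estimated null proportion is well below 1.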
motif
regex
- Newly added module to glycowork
- Added the `get_match` function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.
processing
- Moved `cohen_d`, `mahalanobis_distance`, `mahalanobis_variance`, `variance_stabilization`, `MissForest`, `impute_and_normalize`, and `variance_based_filtering` into `glycan_data.stats` to re-focus `processing` on processing glycan sequences
- Extended `canonicalize_composition` to cases like ‘5_4_2_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
- GlycoCT and WURCS handling for universal input now encompasses more monosaccharides and more modifications
- Expanded `oxford_to_iupac` to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, and more custom linkage specifications
- `enforce_class` can now deal with free glycans regardless of whether they end in ‘-ol’ or not
annotate
- `annotate_dataset` and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
- `annotate_dataset` now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
- Added `group_glycans_core`, `group_glycans_sia_fuc`, and `group_glycans_N_glycan_type` to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
- Fixed a bug in `get_k_saccharides`, in which redundant columns were not always correctly removed
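
The “r” prefix convention for “custom_motifs” described above can be pictured with a small sketch. The helper `split_custom_motifs` is hypothetical and only shows how a mixed list could be separated into normal motifs and glyco-regular expressions. How `annotate_dataset` handles this internally is not shown here.

```python
def split_custom_motifs(custom_motifs):
    """Separate normal motifs from glyco-regular expressions (hypothetical helper).

    Assumption: an entry is a glyco-regular expression exactly when it
    starts with a lowercase "r"; the prefix is stripped from the pattern.
    """
    motifs, expressions = [], []
    for entry in custom_motifs:
        if entry.startswith("r"):
            expressions.append(entry[1:])  # drop the leading "r" marker
        else:
            motifs.append(entry)
    return motifs, expressions


custom_motifs = ["Fuc(a1-?)Gal(b1-4)GlcNAc", "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc"]
motifs, expressions = split_custom_motifs(custom_motifs)
```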
analysis
- Added `get_jtk` to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm, with the usual interface (motifs, imputation, etc.) analogous to differential expression
- `get_differential_expression`, `get_glycanova`, and `get_jtk` now use `get_alphaN` to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
- Added the “zscores” keyword argument to `get_pvals_motifs`; set “zscores” to False to perform z-score transformation if the input data are not yet z-score transformed
- For statistical calculations, `get_pvals_motifs` will now weigh the motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
- Added effect size calculations to `get_pvals_motifs`, which are reported in the output as Cohen’s d
- Changed `get_pvals_motifs` such that now both enrichments and depletions will be tested (with depletions resulting in negative effect sizes)
- Added `select_grouping` to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by `glycan_data.stats.test_inter_vs_intra_group`
- When “motifs = False” and “grouped_BH = True”, `get_differential_expression` now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]
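
As a reminder of how the sign of Cohen’s d encodes enrichment versus depletion, here is a minimal sketch using the pooled standard deviation. The function name `cohen_d_sketch` is hypothetical, the two-group layout is an assumption made for illustration, and this is not the code used by `get_pvals_motifs` or `glycan_data.stats.cohen_d`.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d_sketch(x, y):
    """Cohen's d of x relative to y, using the pooled sample standard deviation.

    Positive values mean x is higher on average (enrichment),
    negative values mean x is lower on average (depletion).
    """
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


with_motif = [2.0, 4.0, 6.0]      # toy values, not real data
without_motif = [1.0, 3.0, 5.0]   # toy values, not real data
d_enriched = cohen_d_sketch(with_motif, without_motif)
d_depleted = cohen_d_sketch(without_motif, with_motif)
```

Swapping the two groups flips the sign but not the magnitude, which is why a depletion shows up as a negative effect size.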
draw
- In `GlycoDraw`, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
- Added `plot_glycans_excel` to allow for the automated insertion of `GlycoDraw` SNFG pictures into an Excel file containing glycan sequences
graph
- `categorical_node_match_wildcard` now uses string ID for matching, instead of integer ID, which means even two graphs, generated with two different libs, can now be successfully compared via `compare_glycans` or `subgraph_isomorphism`
- `compare_glycans` and `subgraph_isomorphism` (and all functions using these functions) now support negation, by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is *highly* recommended to generate your own lib via `get_lib` if you use negation, as monosaccharides such as !Fuc are *not* within lib and will cause indexing errors.
- Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
- Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
- Fixed an issue in `graph_to_string` in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character
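
The negation and wildcard rules above can be summarized in a small string-level sketch. The helpers `mono_matches` and `linkage_matches` are hypothetical and compare single labels only. They assume that an ultimate wildcard matches any label, and they do not reproduce the graph-based matching in `compare_glycans` or `subgraph_isomorphism`.

```python
def mono_matches(pattern, mono):
    """Match one monosaccharide label against a pattern (hypothetical helper)."""
    if pattern == "Monosaccharide":
        return True                     # ultimate wildcard: assumed to match anything
    if pattern.startswith("!"):
        return mono != pattern[1:]      # negation: anything except the named monosaccharide
    return pattern == mono


def linkage_matches(pattern, linkage):
    """Match one linkage label against a pattern (hypothetical helper)."""
    if pattern == "?1-?":
        return True                     # ultimate wildcard: assumed to match any linkage
    if len(pattern) != len(linkage):
        return False
    # a "?" in the pattern matches any single character at that position
    return all(p == "?" or p == c for p, c in zip(pattern, linkage))
```

For example, `mono_matches("!Fuc", "Gal")` is true while `mono_matches("!Fuc", "Fuc")` is false, and `linkage_matches("a1-?", "a1-3")` is true while `linkage_matches("a1-?", "b1-4")` is false.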
network
- Updated pre-calculated biosynthetic networks for milk oligosaccharides
biosynthesis
- Refactored `find_diff` to make networks compatible with the automated, dynamic wildcards (i.e., ? behave as they should and don’t necessarily cause over-branching of the network)
- In `highlight_network`, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
ml
model_training
- In `training_setup`, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
- In `training_setup`, number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument
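
For readers unfamiliar with PolyLoss, here is a minimal single-sample sketch of the Poly-1 variant combined with label smoothing: cross-entropy against smoothed labels plus a term proportional to (1 - Pt). The function names, the default values for `epsilon` and `smoothing`, and the smoothing convention (`smoothing / k` spread over all classes) are assumptions for illustration and may differ from what `training_setup` uses.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def poly1_loss(logits, target, epsilon=1.0, smoothing=0.1):
    """Poly-1 loss with label smoothing for one sample (hypothetical helper)."""
    k = len(logits)
    probs = softmax(logits)
    # smoothed one-hot labels: smoothing / k everywhere, remaining mass on the target class
    labels = [smoothing / k + (1 - smoothing) * (i == target) for i in range(k)]
    cross_entropy = -sum(l * log(p) for l, p in zip(labels, probs))
    pt = sum(l * p for l, p in zip(labels, probs))  # label-weighted predicted probability
    return cross_entropy + epsilon * (1 - pt)
```

With `epsilon=0` and `smoothing=0`, this reduces to ordinary cross-entropy, which makes it easy to sanity-check against an existing loss.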