- Added `get_heatmap` to the glycoworkGUI
- Added an “About” tab to the glycoworkGUI, describing the glycowork version that it is running and pointers to the reference and documentation
- Added `get_lectin_array` to the glycoworkGUI
- Added a progress bar to lengthier operations in the glycoworkGUI
- Reduced filesize of glycoworkGUI by ~20% and filesize of glycowork by >80%
- Removed inplace operations from pandas functions, because of PDEP-8
- PyTorch (`torch`) is no longer a mandatory requirement for base glycowork. It has been moved to the requirements of the optional `glycowork[ml]` install. Attempting machine learning without that install will raise an appropriate ImportError
- `gdown` is now a mandatory requirement for glycowork, to support hosting larger files outside the package itself
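
As a rough illustration of the optional-dependency behavior described above, the following sketch imports a module on demand and raises an ImportError that names the extra to install. The helper name `require_optional` is hypothetical and is not part of glycowork; the actual package may guard its imports differently.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or raise an ImportError naming the extra.

    Hypothetical helper for illustration only; not glycowork's own code.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install glycowork[{extra}]"
        ) from err
```

With this pattern, the base package imports cleanly without `torch`, and the error only appears once a machine learning feature is actually requested.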
glycan_data
- Updated `glycan_binding` by averaging results from duplicate sequences with different formatting
- Added processed example glycomics datasets that are available via `loader.glycomics_data_loader`
- Added processed example lectin array datasets that are available via `loader.lectin_array_data_loader`
- Added a bit of fuzziness to the motifs in `motif_list` to allow for broader capture (e.g., “GalOS” instead of “Gal6S” when appropriate, or “Sia” instead of “Neu5Ac”)
- Fixed the definition of `Internal_LacNAc_type1` in `motif_list`
loader
- Added `glycomics_data_loader` as an object for requesting glycomics data. Use `dir(glycomics_data_loader)` to display the available glycomics datasets, and then request them via `glycomics_data_loader.XXX` (the same goes for lectin array data, which can be requested via `lectin_array_data_loader`)
- Added `human_skin_O_PMC5871710`, `human_skin_O_PMC5871710_BCC`, `human_skin_O_PMC5871710_SCC`, `human_colorectal_O_PMC9254241`, `human_colorectal_N_PMID26085185`, `human_colorectal_O_PMID19152289`, `human_gastric_O_PMC4816881`, `human_gastric_O_PMID28461410`, `human_gastric_O_PMC5762837`, `human_gastric_O_PMC7226152`, `human_liver_O_PMC9254241`, `human_liver_O_PMC5383776`, `human_ovarian_O_PMC4468167`, `human_prostate_O_PMC8010466`, `human_prostate_N_PMC8010466`, `human_retina_GSL_PMC5173345`, `human_leukemia_O_PMID34646384`, `human_leukemia_N_PMID34646384`, `HIV_gagtransfection_N_PMID35112714`, `HIV_gagtransfection_O_PMID35112714`, `time_series_N_PMID32149347`, `human_brain_GSL_PMID38343116`, `human_brain_N_PMID38343116`, `human_brain_O_PMID38343116`, `human_platelets_O_PMID36952551`, `human_platelets_N_PMID36952551`, `human_serum_bacteremia_N_PMID33535571`, `time_series_HMO_PMID22649065`, and `time_series_O_PMID32149347` as datasets for `glycomics_data_loader`
- Added `A549_influenza_PMID33046650` and `HEK_XBP1_PMID30305426` as datasets for `lectin_array_data_loader`
- Added `lectin_specificity` as a resource for documented lectin specificities for lectin array analysis
- Switched `glycan_binding`, `df_species`, and `df_glycan` to lazy loading to improve package import
- Added `strip_suffixes` to strip a column of string values of suffixes such as “.1”, “.2” that pandas may assign to duplicate columns
- Added `download_model` to download hosted large files, such as model weights, when needed
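
The loader entries above combine two ideas: datasets are listed with `dir(...)` and requested as attributes, and large tables are only loaded when first needed. The sketch below shows one way such an object could behave. The class name `LazyDatasetLoader` and the toy dataset are made up for illustration, and this is not how glycowork necessarily implements its loaders.

```python
class LazyDatasetLoader:
    """Expose named datasets as attributes and load each one only on first access.

    Hypothetical illustration; not glycowork's implementation.
    """

    def __init__(self, sources):
        # sources maps a dataset name to a zero-argument function that loads it
        self._sources = dict(sources)
        self._cache = {}

    def __dir__(self):
        # dir(loader) lists the available dataset names
        return sorted(self._sources)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. for dataset names
        if name.startswith("_") or name not in self._sources:
            raise AttributeError(name)
        if name not in self._cache:
            self._cache[name] = self._sources[name]()  # load once, then reuse
        return self._cache[name]
```

Nothing is read at construction time, so importing a module that creates such an object stays cheap; the cost is paid once, on the first attribute access.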
stats
- Fixed an issue in `test_inter_vs_intra_group` in which mean values were not correctly broadcast if “paired = False” and “grouped_BH = True”
- Added `get_equivalence_test` to test for significant equivalence of group means via two one-sided t-tests
- Added `clr_transformation` for the center log ratio transformation of a glycomics dataframe with the addition of scale uncertainty via a gamma parameter (see for instance https://arxiv.org/abs/2201.03616 for the theory behind this)
- For `impute_and_normalize`, the default value for “min_samples” has been changed to 0.1, which now means that at least 10% of the samples (rounded down) need to be non-zero for a glycan to be retained. Further, features for which one group only has zero values will now be imputed with 1e-5 to avoid erroneous homogenization of effects by `MissForest`
- Changed the “min_feature_variance” default from 0.01 to 0.02 in `variance_based_filtering`, which now also returns the discarded rows as a second output
- Added `replace_outliers_winsorization` to cap outliers via Winsorization
- Fixed numpy random seed to 0
- Added `anosim` for ANOSIM (Analysis of similarities) for the beta-diversity calculation in `get_biodiversity`
- Added `alpha_biodiversity_stats` for performing an ANOVA on alpha diversity metrics, if groups > 2 in `get_biodiversity`
- Fixed a warning if the standard deviation of a paired sample in `cohen_d` was exactly zero
- Added `calculate_permanova_stat` and `permanova_with_permutation` for PERMANOVA (Permutational multivariate analysis of variance) for the beta-diversity calculation in `get_biodiversity`
- Added `alr_transformation`, `get_procrustes_scores`, and `get_additive_logratio_transformation` to find the ALR reference component and perform the ALR (additive log-ratio) transformation for compositional data analysis
- Added `correct_multiple_testing` to centralize multiple testing correction and also add a warning if >90% of features are significant (in which case, Bonferroni correction will be applied to make results more conservative)
- Raised tolerance of `MissForest` from 1e-6 to 1e-5 (as it’s applied to the sum of differences, it’s still very conservative)
- Added `omega_squared` to calculate Omega squared, as an effect size for ANOVA-type analyses
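
For readers unfamiliar with the center log ratio (CLR) transformation mentioned above, here is a minimal sketch of the plain transformation for a single sample. It assumes strictly positive abundances and deliberately omits the gamma-based scale uncertainty that `clr_transformation` adds, so it should not be read as glycowork's implementation.

```python
import math


def clr(composition):
    """Plain centered log-ratio transform of one sample.

    Each value is log-transformed and the mean of the logs is subtracted,
    so the output sums to zero and does not depend on the total signal.
    """
    if any(x <= 0 for x in composition):
        raise ValueError("CLR requires strictly positive values; impute zeros first")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]
```

Because only ratios between components survive, rescaling a sample (for example from fractions to percentages) leaves the CLR values unchanged.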
motif
analysis
- Changed `get_differential_expression` to only call `TST_grouped_benjamini_hochberg` if “grouped_BH = True”, otherwise defaulting to the scipy two-stage Benjamini-Hochberg procedure
- `get_differential_expression` now also outputs equivalence tests for all cases in which the uncorrected p-value is above 0.05
- `get_differential_expression`, `get_glycanova`, `get_time_series`, and `get_jtk` now will internally CLR- or ALR-transform input glycomics data to appropriately handle compositional data. These functions also newly accept a “gamma” keyword argument to tune the scale uncertainty for lowering the potential for false-positives
- `get_heatmap` will now automatically transpose the input dataframe if it has been provided in the wrong orientation
- Added the “transform” keyword argument to `get_heatmap`, to optionally CLR/ALR-transform the input data by setting ‘transform = “CLR”’ or ‘transform = “ALR”’
- The “transform” keyword argument also exists in most other analysis functions and accepts “ALR” and “CLR”, if users wish to override the automatically inferred type of transformation (“Nothing” is accepted for not transforming data at all but this is not recommended in most circumstances)
- Changed multiple testing correction to two-stage Benjamini-Hochberg, even if no grouped Benjamini-Hochberg test is being done
- Also changed the “min_samples” default to 0.1 in `get_differential_expression` and other functions
- Changed all analysis functions to use Winsorization (`glycan_data.stats.replace_outliers_winsorization`) instead of IQR capping (`glycan_data.stats.replace_outliers_with_IQR_bounds`) for outlier treatment
- Added `get_SparCC` to perform SparCC (Sparse Correlations for Compositional Data) to find pairwise associations between glycan sequences, or motifs, between two glycomics datasets, with the typical interface of `.analysis` functions (note that you can also use a glycomics dataset together with, e.g., a metagenomics dataset, even if “motifs=True” is set)
- Removed outlier treatment in `get_pvals_motifs` to avoid removing actual effects of effect-sparse glycan array data
- Added beta-diversity measures (via Euclidean distance on CLR/ALR-transformed data) to `get_biodiversity`. This function now operates on a shopping cart principle, similar to “feature_set” in the annotation functions. The “metrics” shopping cart currently has “alpha” and “beta” as options. Beta-diversity is tested via ANOSIM (e.g., differences in central tendencies) and PERMANOVA (e.g., variations in dispersions between groups)
- In `get_heatmap`, a correct color mapping (ascending or contrastive) is now automatically chosen and applied depending on whether negative values are absent or present in the input data, respectively (‘transform = “CLR”’ will introduce negative values in the data and trigger contrastive coloring)
- Added the “custom_scale” keyword argument to `get_differential_expression`, `get_glycanova`, `get_biodiversity`, and `get_time_series`. Only use it if you know what you’re doing. Basically, if you know that the total amount of glycans goes up/down in your condition of interest (in the *condition*, not in the measurement), then provide the ratio of glycan signal as group2/group1 and that will be used for an informed scale model, as described in https://www.biorxiv.org/content/10.1101/2024.04.01.587602v1 . Alternatively, if you have more than two groups, “custom_scale” can be provided as a dictionary of type: group idx : mean(group)/min(mean(groups)). [In all these cases, “gamma” becomes a parameter describing experimental error in measuring this glycan signal]
- In `get_volcano`, the default for “x_thresh” has been changed to 0 (post-hoc filtering of results by fold-change invalidates the FDR guarantee) and a new “n” keyword argument exists to provide the sample size for applying a `get_alphaN`-calculated alpha threshold
- Added `get_roc` to calculate ROC AUC scores for all features and, optionally, plot the ROC curve of the best feature. Also works in multi-group mode (i.e., best feature to distinguish class A from all other classes) and can use “custom_scale”
- Added `get_lectin_array` to analyze lectin array data to find out what kind of glycan motifs are increasing/decreasing between conditions
- Added optional keyword arguments (`**kwargs`) to `get_volcano` that are passed directly on to the seaborn scatterplot function
- Added the “rarity_filter” keyword argument to `get_pca`, to support excluding extremely rare sequences/motifs from PCA calculation
- The `glycan_representation` file as a static embedding look-up for `plot_embeddings` has been removed from the package and is now downloaded at runtime from a hosted file
- Changed `get_differential_expression` and `get_glycanova` to re-append variance-based filtering discarded rows at the end, with a default p-value of 1.0
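
The switch from IQR capping to Winsorization noted above can be pictured with a small sketch: instead of removing extreme values, the most extreme ones on each side are capped at the nearest retained value. The function below is a generic illustration with an assumed default `limit` of 0.05; it is not `replace_outliers_winsorization` itself, whose signature and defaults may differ.

```python
def winsorize(values, limit=0.05):
    """Cap the lowest and highest `limit` fraction of values.

    Extreme values are replaced by the nearest value that is kept,
    so the number of observations stays the same.
    """
    if not 0 <= limit < 0.5:
        raise ValueError("limit must be in [0, 0.5)")
    if not values:
        return []
    ordered = sorted(values)
    k = int(limit * len(ordered))  # number of values to cap on each side
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(v, low), high) for v in values]
```

With few samples and a small `limit`, `k` rounds down to zero and the data are returned unchanged, which is the conservative behavior one usually wants for small cohorts.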
graph
- Deprecated the “wildcards_ptm” keyword argument in `compare_glycans` and `subgraph_isomorphism`. Instead, this is now inferred internally: if a monosaccharide with PTM uncertainty (e.g., “GalOS”) is present, wildcard matching is enabled automatically and allows matching to specified monosaccharides (e.g., “Gal6S”)
- Fixed an issue where `graph_to_string` sometimes returned incorrect brackets for multiple nested branches
processing
- Improved `canonicalize_iupac` by handling “*”, “Ga(”, and improperly formatted ambiguities (e.g., “Gal-GlcNAc”) in an otherwise properly formatted string. Also improved floating bit handling
- Fixed an issue in the `rescue_glycans` wrapper in which keyword arguments with empty list defaults could cause an indexing issue for wrapped functions
draw
- Added the “per_residue” keyword argument to `GlycoDraw`, which allows users to overlay a heatmap over the SNFG representation, where the “per_residue” values control the opacity (e.g., to visualize attention or any other kind of per-monosaccharide attribution)
- Added `process_per_residue` to match per-residue values to different levels of branching
- Added the “draw_method” keyword argument to `GlycoDraw`, which allows users to draw glycans on the atomic level (chemical depiction of monosaccharides, including steric information, outlined with the respective SNFG color) in 2D (“draw_method = chem2d”) as well as 3D (“draw_method = chem3d”). Note that this requires the glycowork[chem] optional installs
- Fixed an issue in `GlycoDraw` that incorrectly parsed global losses when drawing Domon-Costello fragments
- Fixed an issue in `GlycoDraw` where, if the filepath contained “svg” or “pdf”, that was sometimes read as the incorrect filepath
- Fixed an issue in `GlycoDraw` where “vertical = True” occasionally resulted in empty output files
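
The filepath fix above concerns paths in which “svg” or “pdf” appears somewhere other than the file extension. As a hedged illustration of the general pitfall (not of the actual `GlycoDraw` code), the sketch below decides the output format from the file suffix rather than from a substring match. The function name `output_format` is hypothetical.

```python
from pathlib import PurePosixPath


def output_format(filepath):
    """Return 'svg' or 'pdf' based on the file extension only, else None.

    A substring test such as `"svg" in filepath` would misfire on a path
    like 'results/svg_exports/glycan.pdf'.
    """
    suffix = PurePosixPath(filepath).suffix.lower()
    if suffix in {".svg", ".pdf"}:
        return suffix[1:]  # drop the leading dot
    return None
```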
annotate
- Added `load_lectin_lib`, `Lectin`, `create_lectin_and_motif_mappings`, and `lectin_motif_scoring` as helper functions for `analysis.get_lectin_array`
- `quantify_motifs` now also works with log2-transformed data
network
biosynthesis
- Added multiple testing correction (via two-stage Benjamini-Hochberg), `alphaN`, and a significance column to `get_differential_biosynthesis`
- Fixed an issue in which no significant results in `get_differential_biosynthesis` could error out the function
ml
models
- The model weights of the trained `LectinOracle_flex`, `LectinOracle`, `SweetNet`, and `NSequonPred` models have been removed from the package and are now downloaded at runtime from a hosted file