Glycowork

Latest version: v1.4.0

1.4.0

- Added an example workflow/tutorial for differential glycomics analysis to the Examples tab in the documentation
- Added additional tests via pytest
- Cleaned up repo with more stringent .gitignore, removing unnecessary files
- Added hover-over tooltips to the glycoworkGUI, describing how the input files should be formatted
- Exposed more keyword arguments of `get_heatmap` in GUI (CLR transformation + tick label control)
glycan_data
- Broadened the motif definition of “Mucin_elongated_core2” in `motif_list`
- Refined the motif definitions of the O-glycan core motifs in `motif_list` to prevent overlaps
- Larger (and cleaner) datasets for: `df_glycan`, `df_species`, `df_tissue`, `df_disease`, and `glycan_binding`
- Updated `lib` from 2,366 to 2,565 glycoletters
loader
- Added the `glycoproteomics_data_loader`, to request stored glycoproteomics datasets
- Added `human_milk_N_PMID34087070` and `human_keratinocytes_PMID37956981` as example datasets for `glycoproteomics_data_loader` (data are ID’ed in the “Glycosite” column in the format protein_site_composition)
- Added `HexOS` and `HexNAcOS` monosaccharide lists to be used in downstream functions
- Added `modification_map` to map which monosaccharides can be modified with which post-biosynthetic modification
- Added `DataFrameSerializer` to have a version-independent serializer for handling `df_glycan`
stats
- Added `get_glycoform_diff` to aggregate glycoforms differential expression across glycopeptides or glycoproteins via Fisher’s Combined Probability Test
- Fixed a pandas deprecation warning in `replace_outliers_winsorization` (for pandas >= 2.2.2)
- Added `get_glm` and `process_glm_results` to fit and analyze generalized linear models, with interaction terms, to grouped glycoproteomics data
- Added `partial_corr` to calculate regularized partial correlations
- Added `estimate_technical_variance` and `perform_tests_monte_carlo` to account for technical variation in glycomics data
- Added the “cap_side” keyword argument to `replace_outliers_with_IQR_bounds` and `replace_outliers_winsorization` to allow users to cap outliers on “both”, “upper”, “lower” sides; default: “both”
- Fixed the global NumPy RNG for `clr_transformation` and `alr_transformation` to ensure reproducibility
- Added the “correction_method” keyword argument to `correct_multiple_testing`, to allow users to switch between regular Benjamini-Hochberg and two-stage Benjamini-Hochberg
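To illustrate the idea behind Fisher’s Combined Probability Test mentioned for `get_glycoform_diff`, here is a minimal standalone sketch. It is not the glycowork implementation; the function name `fisher_combined_pvalue` and the example p-values are hypothetical. It combines independent p-values into one using the closed-form chi-square survival function for even degrees of freedom.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method (hypothetical helper)."""
    if not pvalues or any(not 0 < p <= 1 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Test statistic: X = -2 * sum(ln p_i), chi-square with 2k degrees of freedom under H0
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the survival function has a closed form:
    # sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Example: three glycoforms of the same (made-up) glycoprotein
stat, p_combined = fisher_combined_pvalue([0.04, 0.20, 0.01])
print(f"X2 = {stat:.3f}, combined p = {p_combined:.4f}")
```

Fisher's method assumes the combined p-values are independent, which may only hold approximately for glycoforms of the same protein.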
motif
processing
- Added support for sulfated monosaccharides to `get_possible_monosaccharides`
- Added `parse_glycoform`, `infer_features_from_composition`, and `process_for_glycoshift` as helper functions in glycoproteomics data analysis
- Expanded `canonicalize_composition` to deal with compositions of type “9 2 0 0”
- Fine-tuned `canonicalize_iupac` so that it no longer breaks the formatting of sequences ending in “GlcOP-ol”
- Added `de_wildcard_glycoletter` to retrieve a random specified monosaccharide/linkage of the general type present as a wildcard (e.g., Hex->Gal)
- Added `get_class` to return the glycan class as a string, given a glycan sequence
- If `choose_correct_isoform` is provided with isomers that have different amounts of ambiguities, it will now prioritize the isomers with the fewest ambiguities
graph
- Added support for mixing monosaccharide and modification wildcards in `compare_glycans` and `subgraph_isomorphism` (e.g., “HexNAcOS”)
- Added the `handle_negation` decorator and `subgraph_isomorphism_with_negation` to process motif annotation with restrictions (e.g., “Gal(b1-3)[!GlcNAc(b1-6)]GalNAc” to prevent annotating core2 O-glycans as core1)
- `subgraph_isomorphism` is now decorated with `handle_negation`, such that if the “motif” argument contains a negating operator (“!”), the function will actually execute `subgraph_isomorphism_with_negation`
- Added the “allowed_disaccharides” keyword argument to `get_possible_topologies` to support filtering possible extensions by physiological glycan extensions
- Added a filter to `get_possible_topologies` to maintain chemically feasible structures by checking that the same carbon does not get two linkages
- Added support for post-biosynthetic modifications in `get_possible_topologies`, e.g., accepting input such as “{6S}Gal(b1-3)[GlcNAc(b1-6)]GalNAc”, with uncertainty about where the sulfate is attached
- Refactored `graph_to_string_int` to recursively construct a depth-first search tree to construct the IUPAC-condensed string
- Supported monosaccharide-only graphs in `generate_graph_features`
- Added `deduplicate_glycans` to remove duplicate glycans (with different IUPAC strings) from a list of glycans
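The dispatch pattern described for `handle_negation` can be sketched with a toy matcher. This is only an illustration under strong assumptions: `plain_match`, `match_with_negation`, and `toy_match` are hypothetical names, and a string-based matcher that ignores branches stands in for real subgraph isomorphism.

```python
import functools
import re

BRANCH = re.compile(r"\[[^\[\]]*\]")

def strip_branches(seq):
    """Remove bracketed branches (innermost first) to leave the main chain."""
    prev = None
    while prev != seq:
        prev, seq = seq, BRANCH.sub("", seq)
    return seq

def plain_match(glycan, motif):
    """Toy matcher: checks a linear motif against the main chain only."""
    return motif in strip_branches(glycan)

def match_with_negation(glycan, motif):
    """Toy negation-aware matcher: the motif must match, the forbidden branch must be absent."""
    positive = re.sub(r"\[![^\[\]]*\]", "", motif)  # motif without the negated branch
    forbidden = motif.replace("!", "")              # motif with the branch present
    return plain_match(glycan, positive) and forbidden not in glycan

def handle_negation(func):
    """Route calls to the negation-aware matcher whenever the motif contains '!'."""
    @functools.wraps(func)
    def wrapper(glycan, motif):
        if "!" in motif:
            return match_with_negation(glycan, motif)
        return func(glycan, motif)
    return wrapper

toy_match = handle_negation(plain_match)

core1 = "Gal(b1-3)GalNAc"
core2 = "Gal(b1-3)[GlcNAc(b1-6)]GalNAc"
print(toy_match(core2, "Gal(b1-3)GalNAc"))                 # toy matcher also finds core1 in core2
print(toy_match(core2, "Gal(b1-3)[!GlcNAc(b1-6)]GalNAc"))  # negation excludes core2
```

Using a decorator keeps the call site unchanged: callers still pass a motif string, and the “!” operator alone decides which matcher runs.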
analysis
- Added the “glycoproteomics” and “level” keyword arguments to `get_differential_expression` to support the analysis of glycoproteomics data if “glycoproteomics=True”. “level” indicates whether different glycoforms should be analyzed at the level of glycopeptides or glycoproteins
- Added `get_glycoshift_per_site` to analyze whether, and in which way, glycosylation changes between conditions for each glycosylation site (controlling for protein expression etc.) via generalized linear models (GLM) adapted for compositional data (i.e., CLR-transformation)
- Added `preprocess_data` as a centralization of data preprocessing for easier maintenance
- Moved preprocessing code from `get_differential_expression`, `get_glycanova`, `get_biodiversity`, and `get_roc` into `preprocess_data`
- Fixed an issue in `clean_up_heatmap` in which sometimes the longer string instead of the longer sequence was picked for deduplication (e.g., “Internal_LewisX” vs “SialylLewisX”)
- Moved `clean_up_heatmap` into `motif.annotate`
- Added Omega-squared as an effect size output to `get_glycanova`
- Fixed an issue in `get_heatmap` in which sometimes the function did not correctly rescue an input by transposing it, if the index contained special characters
- Fixed an issue in `get_pca` in which the input of a dataframe for group specification resulted in an error
- Disabled Levene’s test in `get_differential_expression` if either group has fewer than three samples, for numerical stability
- Added the “partial_correlations” keyword argument to `get_SparCC`. If set to True, it will instead use regularized partial correlations to reduce multicollinearity and enrich associations that represent direct effects (i.e., getting rid of bystander effects)
- Added the “monte_carlo” keyword argument (default False) to `preprocess_data` and `get_differential_expression`. If True, this will simulate technical variation by sampling 128 Monte Carlo instances from a Dirichlet distribution for each sample. Only works for sequences & CLR for now. This will substantially increase runtime and be considerably more conservative in yielding significant differences between conditions. Use with caution.
- In `get_differential_expression` glycans that had been filtered out by variance filtering now still have their mean abundance and log2FC recorded in the output table
- Added the “show_all” keyword argument to `get_heatmap` to force all tick labels to display, even if they visually overlap
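The “monte_carlo” option above samples Dirichlet instances per sample. A minimal sketch of that sampling step, using only the standard library, could look as follows. The function name `dirichlet_instances`, the pseudocount of 0.5, and the example abundances are assumptions for illustration and are not taken from glycowork.

```python
import random

def dirichlet_instances(abundances, n_instances=128, pseudocount=0.5, seed=0):
    """Draw Monte Carlo instances of one sample's composition from a Dirichlet distribution."""
    rng = random.Random(seed)
    # Assumed prior: a small pseudocount keeps every concentration parameter positive
    alphas = [a + pseudocount for a in abundances]
    instances = []
    for _ in range(n_instances):
        # A Dirichlet draw is a vector of independent Gamma draws, normalized to sum to 1
        draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
        total = sum(draws)
        instances.append([d / total for d in draws])
    return instances

sample = [120, 30, 0, 50]  # made-up glycan abundances for one sample
instances = dirichlet_instances(sample)
print(len(instances), [round(x, 3) for x in instances[0]])
```

Each instance is a plausible composition for the same sample. Running the downstream test on every instance and aggregating the results is what makes this approach slower and more conservative.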
annotate
- Added `annotate_glycan_topology_uncertainty` to probe whether motifs can be annotated in the case of structural ambiguity (e.g., {Fuc(a1-3)} in N-glycans, to still annotate Lewis X)
- Expanded `annotate_dataset` to let it automatically switch between `annotate_glycan` and `annotate_glycan_topology_uncertainty`, depending on whether structural ambiguity is present in a glycan (the latter is much more costly in terms of computation)
- Added the (default: True) keyword argument “remove_redundant” to `quantify_motifs` that will call `clean_up_heatmap` on the output to remove redundant motifs
- Dynamically generated terminal motifs now have the prefix “Terminal_” in all outputs
- Resolved a recent deprecation warning from pandas in `get_k_saccharides`
- Added a warning to `annotate_dataset` that will print all features in “feature_set” that are not being recognized
- Added support for “terminal1” as a synonym for the original “terminal” in “feature_set”
draw
- Added support for the new “Terminal_” prefix in `GlycoDraw` and `annotate_figure`
tokenization
- Added support for sulfated HexA and HexN in `map_to_basic`
- Added `calculate_adduct_mass` to calculate the mass for generic molecular formulae (e.g., C2H4O2)
- Added support for chemical tags or adducts in `composition_to_mass`, `glycan_to_mass`, and `mz_to_composition` via the new “adduct” keyword argument
- Added “Pen” to `get_core`
- The default “glycan_class” in `mz_to_composition` is now “all” (but it can of course still be user-specified)
- Added the new keyword argument “extras” to `mz_to_composition`, to allow users to switch off the consideration of adducts or doubly-charged input masses (the default now is to opt out of adducts but users can add that to “extras”)
- `composition_to_mass` now copies the input dictionary to prevent any in-place modification of its keys
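As a rough illustration of what computing a mass from a generic molecular formula (e.g., C2H4O2) involves, here is a small standalone sketch. It is not `calculate_adduct_mass` itself; the helper name `formula_mass` and the limited element table are assumptions, and only simple formulae without brackets or charges are handled.

```python
import re

# Monoisotopic masses (Da) for a few common elements; extend as needed
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376200,
    "S": 31.97207117,
    "Na": 22.98976928,
}

def formula_mass(formula):
    """Sum monoisotopic masses for a simple formula such as 'C2H4O2' (hypothetical helper)."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the simple element/count grammar could not fully parse
    if "".join(element + count for element, count in tokens) != formula:
        raise ValueError(f"Unsupported formula: {formula!r}")
    return sum(MONOISOTOPIC[element] * int(count or 1) for element, count in tokens)

print(round(formula_mass("C2H4O2"), 5))
```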
network
biosynthesis
- Made network construction faster via code optimizations
- Added the “mode” keyword argument to `choose_path`, `find_diamonds`, `trace_diamonds`, and `evoprune_network` to allow for biosynthetic motif analysis to use information from relative abundances
- We now support the use of longitudinal data in `get_differential_biosynthesis` to analyze whether biosynthetic flows change over time
- Fixed an issue in `get_differential_biosynthesis` in which N-glycans with high-mannose sequences caused errors (due to the backward direction of synthesis)
- Fixed an issue in `get_differential_biosynthesis` in which N-glycans, containing many unobserved intermediate sequences, had capacity bottleneck issues
- Added the “min_default” keyword argument to `estimate_weights`, to allow class-dependent fine-tuning of the minimum capacity
- Modified `construct_network` to disallow the transfer of modified monosaccharides (e.g., GlcNAc6S), only retaining the sequential assembly in accordance with known biosynthesis (e.g., GlcNAc, then 6S)
- Added `extend_glycans`, `edges_for_extension`, and `extend_network` to extend the biosynthetic network based on observed reactions and permitted disaccharide extensions
- Deprecated `safe_max` and `find_ptm`; will be done in-line instead
ml
- Updated trained models for new `lib`
processing
- Made `dataset_to_graphs` faster if there were any duplicates in the input glycans
- Added `augment_glycan` and `AugmentedGlycanDataset` to support glycan data augmentation during training of deep learning models. Currently, the only supported data augmentation is wildcarding of monosaccharides/linkages (e.g., Gal->Hex, b1-4->?1-?) and the inverse (de-wildcarding)
- Added the keyword arguments “augment_prob” and “generalization_prob” to `split_data_to_train` to control the likelihood of augmenting a glycan and the proportion of the glycan to be (de-)wildcarded if it is augmented
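The wildcarding augmentation can be pictured with a toy function. This sketch is not `augment_glycan`: the name `wildcard_glycan`, the small `WILDCARDS` table, and the reading of “generalization_prob” as a per-monosaccharide probability are all assumptions, and linkage wildcarding is left out for brevity.

```python
import random
import re

# Hypothetical, incomplete mapping from specific monosaccharides to their wildcard type
WILDCARDS = {"Gal": "Hex", "Glc": "Hex", "Man": "Hex", "GlcNAc": "HexNAc", "GalNAc": "HexNAc"}
# Split an IUPAC-condensed string into linkages, brackets, and monosaccharides
TOKEN_SPLIT = re.compile(r"(\([^)]*\)|\[|\])")

def wildcard_glycan(glycan, augment_prob=0.5, generalization_prob=0.5, rng=random):
    """With probability augment_prob, replace some monosaccharides by their wildcard."""
    if rng.random() >= augment_prob:
        return glycan  # leave this glycan untouched
    out = []
    for token in TOKEN_SPLIT.split(glycan):
        # Only monosaccharide tokens are in the mapping; linkages and brackets pass through
        if token in WILDCARDS and rng.random() < generalization_prob:
            token = WILDCARDS[token]
        out.append(token)
    return "".join(out)

print(wildcard_glycan("Gal(b1-4)GlcNAc", augment_prob=1.0, generalization_prob=1.0))
```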
inference
- Added an `unwrap` call to `get_lectin_preds` to fix the output format
models
- Set “weights_only = True” for `torch.load` to prevent FutureWarning
model_training
- Support already one-hot encoded multilabel labels in `Poly1CrossEntropyLoss`

1.3.0

- Added `get_heatmap` to the glycoworkGUI
- Added an “About” tab to the glycoworkGUI, describing the glycowork version that it is running and pointers to the reference and documentation
- Added `get_lectin_array` to the glycoworkGUI
- Added a progress bar to lengthier operations in the glycoworkGUI
- Reduced filesize of glycoworkGUI by ~20% and filesize of glycowork by >80%
- Removed inplace operations from pandas functions, because of PDEP-8
- PyTorch (`torch`) is now no longer a mandatory requirement for base glycowork. It has been shifted to the setup requirements for the optional `glycowork[ml]` install. Trying to do machine learning without that install will result in an appropriate ImportError
- `gdown` is now a mandatory requirement for glycowork, to support hosting larger files outside the package itself
glycan_data
- Updated `glycan_binding` by averaging results from duplicate sequences with different formatting
- Added processed example glycomics datasets that are available via `loader.glycomics_data_loader`
- Added processed example lectin array datasets that are available via `loader.lectin_array_data_loader`
- Added a bit of fuzziness to the motifs in `motif_list` to allow for broader capture (e.g., “GalOS” instead of “Gal6S” when appropriate, or “Sia” instead of “Neu5Ac”)
- Fixed the definition of `Internal_LacNAc_type1` in `motif_list`
loader
- Added `glycomics_data_loader` as an object for requesting glycomics data. Use dir(glycomics_data_loader) for displaying available glycomics datasets, and then request them via glycomics_data_loader.XXX (same goes for lectin array data, which is requestable via `lectin_array_data_loader`)
- Added `human_skin_O_PMC5871710`, `human_skin_O_PMC5871710_BCC`, `human_skin_O_PMC5871710_SCC`, `human_colorectal_O_PMC9254241`, `human_colorectal_N_PMID26085185`, `human_colorectal_O_PMID19152289`, `human_gastric_O_PMC4816881`, `human_gastric_O_PMID28461410`, `human_gastric_O_PMC5762837`, `human_gastric_O_PMC7226152`, `human_liver_O_PMC9254241`, `human_liver_O_PMC5383776`, `human_ovarian_O_PMC4468167`, `human_prostate_O_PMC8010466`, `human_prostate_N_PMC8010466`, `human_retina_GSL_PMC5173345`, `human_leukemia_O_PMID34646384`, `human_leukemia_N_PMID34646384`, `HIV_gagtransfection_N_PMID35112714`, `HIV_gagtransfection_O_PMID35112714`, `time_series_N_PMID32149347`, `human_brain_GSL_PMID38343116`, `human_brain_N_PMID38343116`, `human_brain_O_PMID38343116`, `human_platelets_O_PMID36952551`, `human_platelets_N_PMID36952551`, `human_serum_bacteremia_N_PMID33535571`, `time_series_HMO_PMID22649065`, and `time_series_O_PMID32149347` as datasets for `glycomics_data_loader`
- Added `A549_influenza_PMID33046650` and `HEK_XBP1_PMID30305426` as datasets for `lectin_array_data_loader`
- Added `lectin_specificity` as a resource for documented lectin specificities for lectin array analysis
- Switched `glycan_binding`, `df_species`, and `df_glycan` to lazy loading to speed up package import
- Added `strip_suffixes` to strip a column of string values of suffixes such as “.1”, “.2” that pandas may assign to duplicate columns
- Added `download_model` to download hosted large files, such as model weights, when needed
stats
- Fixed an issue in `test_inter_vs_intra_group` in which mean values were not correctly broadcast if “paired = False” and “grouped_BH = True”
- Added `get_equivalence_test` to test for significant equivalence of group means via two one-sided t-tests
- Added `clr_transformation` for the center log ratio transformation of a glycomics dataframe with the addition of scale uncertainty via a gamma parameter (see for instance https://arxiv.org/abs/2201.03616 for the theory behind this)
- For `impute_and_normalize`, the default value for “min_samples” has been changed to 0.1, which now means that at least 10% of the samples (rounded down) need to be non-zero for a glycan to be retained. Further, features for which one group only has zero values will now be imputed with 1e-5 to avoid erroneous homogenization of effects by `MissForest`
- Changed the “min_feature_variance” default from 0.01 to 0.02 in `variance_based_filtering` and now it also outputs the discarded rows as a second output
- Added `replace_outliers_winsorization` to cap outliers via Winsorization
- Fixed numpy random seed to 0
- Added `anosim` for ANOSIM (Analysis of similarities) for the beta-diversity calculation in `get_biodiversity`
- Added `alpha_biodiversity_stats` for performing an ANOVA on alpha diversity metrics, if groups > 2 in `get_biodiversity`
- Fixed a warning if the standard deviation of a paired sample in `cohen_d` was exactly zero
- Added `calculate_permanova_stat` and `permanova_with_permutation` for PERMANOVA (Permutational multivariate analysis of variance) for the beta-diversity calculation in `get_biodiversity`
- Added `alr_transformation`, `get_procrustes_scores`, and `get_additive_logratio_transformation` to find the ALR reference component and perform the ALR transformation for compositional data analysis
- Added `correct_multiple_testing` to centralize multiple testing correction and also add a warning if >90% of features are significant (in which case, Bonferroni correction will be applied to make results more conservative)
- Raised tolerance of `MissForest` from 1e-6 to 1e-5 (as it’s applied to the sum of differences, it’s still very conservative)
- Added `omega_squared` to calculate Omega squared, as an effect size for ANOVA-type analyses
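For readers unfamiliar with the center log ratio (CLR) transformation referenced above, a plain version can be sketched in a few lines. This is the textbook CLR only: the scale-uncertainty “gamma” parameter of `clr_transformation` is deliberately not modelled, and the function name `clr` is a placeholder.

```python
import math

def clr(composition):
    """Plain center log ratio transform; requires strictly positive values."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

relative = [0.5, 0.3, 0.2]   # made-up relative abundances
raw = [50.0, 30.0, 20.0]     # the same sample on a different scale
print([round(v, 4) for v in clr(relative)])
print([round(v, 4) for v in clr(raw)])
```

CLR values sum to zero and do not change when the whole sample is rescaled, which is why they suit relative abundance data. Zeros must be imputed beforehand, because the logarithm is undefined for them.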
motif
analysis
- Changed `get_differential_expression` to only call `TST_grouped_benjamini_hochberg` if “grouped_BH = True”, otherwise default to scipy two-stage Benjamini-Hochberg
- `get_differential_expression` now also outputs equivalence tests for all cases in which the uncorrected p-value is above 0.05
- `get_differential_expression`, `get_glycanova`, `get_time_series`, and `get_jtk` now will internally CLR- or ALR-transform input glycomics data to appropriately handle compositional data. These functions also newly accept a “gamma” keyword argument to tune the scale uncertainty for lowering the potential for false-positives
- `get_heatmap` will now automatically transpose the input dataframe if it has been provided in the wrong orientation
- Added the “transform” keyword argument to `get_heatmap`, to optionally CLR/ALR-transform the input data by setting ‘transform = “CLR”’ or ‘transform = “ALR”’
- The “transform” keyword argument also exists in most other analysis functions and accepts “ALR” and “CLR”, if users wish to override the automatically inferred type of transformation (“Nothing” is accepted for not transforming data at all but this is not recommended in most circumstances)
- Changed multiple testing correction to two-stage Benjamini-Hochberg, even if no grouped Benjamini-Hochberg test is being done
- Also changed the “min_samples” default to 0.1 in `get_differential_expression` and other functions
- Changed all analysis functions to use Winsorization (`glycan_data.stats.replace_outliers_winsorization`) instead of IQR capping (`glycan_data.stats.replace_outliers_with_IQR_bounds`) for outlier treatment
- Added `get_SparCC` to perform SparCC (Sparse Correlations for Compositional Data) to find pairwise associations between glycans sequences, or motifs, between two glycomics datasets, with the typical interface of `.analysis` functions (note that you can also use a glycomics dataset together with an, e.g., metagenomics dataset, even if “motifs=True” is set)
- Removed outlier treatment in `get_pvals_motifs` to avoid removing actual effects of effect-sparse glycan array data
- Added beta-diversity measures (via Euclidean distance on CLR/ALR-transformed data) to `get_biodiversity`. This function now operates on a shopping cart principle, similar to “feature_set” in the annotation functions. The “metrics” shopping cart currently has “alpha” and “beta” as options. Beta-diversity is tested via ANOSIM (e.g., differences in central tendencies) and PERMANOVA (e.g., variations in dispersions between groups)
- In `get_heatmap` a correct color mapping (ascending or contrastive) is now automatically chosen and applied depending on whether negative values are absent or present in the input data, respectively (transform=“CLR” will introduce negative values in the data and trigger contrastive coloring)
- Added the “custom_scale” keyword argument to `get_differential_expression`, `get_glycanova`, `get_biodiversity`, and `get_time_series`. Only use it if you know what you’re doing. Basically, if you know that the total amount of glycans goes up/down in your condition of interest (in the *condition*, not in the measurement), then provide the ratio of glycan signal as group2/group1 and that will be used for an informed scale model, as described in https://www.biorxiv.org/content/10.1101/2024.04.01.587602v1 . Alternatively, if you have more than two groups, “custom_scale” can be provided as a dictionary of type: group idx : mean(group)/min(mean(groups)). [In all these cases, “gamma” becomes a parameter describing experimental error in measuring this glycan signal]
- In `get_volcano` the default for “x_thresh” has been changed to 0 (post-hoc filtering of results by fold-change invalidates the FDR guarantee) and a new “n” keyword argument exists to provide the sample size for applying an alpha threshold calculated by `get_alphaN`
- Added `get_roc` to calculate ROC AUC scores for all features and, optionally, plot the ROC curve of the best feature. Also works in multi-group mode (i.e., best feature to distinguish class A from all other classes) and can use “custom_scale”
- Added `get_lectin_array` to analyze lectin array data to find out what kind of glycan motifs are increasing/decreasing between conditions
- Added an optional number of keyword arguments to `get_volcano` that get directly passed onto the seaborn scatterplot function (**kwargs)
- Added the “rarity_filter” keyword argument to `get_pca`, to support excluding extremely rare sequences/motifs from PCA calculation
- The `glycan_representation` file as a static embedding look-up for `plot_embeddings` has been removed from the package and is now downloaded at runtime from a hosted file
- Changed `get_differential_expression` and `get_glycanova` to re-append variance-based filtering discarded rows at the end, with a default p-value of 1.0
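The switch from IQR capping to Winsorization for outlier treatment can be illustrated side by side. The sketch below is a simplified stand-in, not the code of `replace_outliers_winsorization` or `replace_outliers_with_IQR_bounds`; the 5%/95% limits, the 1.5 x IQR fences, and all function names are assumptions.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def clip(values, low, high):
    """Cap every value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]

def winsorize(values, limit=0.05):
    """Cap values at the lower/upper percentile given by limit (assumed 5% / 95%)."""
    return clip(values, quantile(values, limit), quantile(values, 1 - limit))

def cap_with_iqr_bounds(values, k=1.5):
    """Cap values at the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    iqr = q3 - q1
    return clip(values, q1 - k * iqr, q3 + k * iqr)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up abundances with one outlier
winsorized = winsorize(data)
iqr_capped = cap_with_iqr_bounds(data)
print(max(winsorized), max(iqr_capped))
```

Both approaches keep the sample size intact and only pull extreme values inward. They differ in how the caps are derived: from percentiles of the data versus from the interquartile range.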
graph
- Deprecated “wildcards_ptm” keyword argument in `compare_glycans` and `subgraph_isomorphism`. Instead, this will be inferred internally and, if a monosaccharide with PTM uncertainty (e.g., “GalOS”) is present, then it will kick in and allow for matching to specified monosaccharides (e.g., “Gal6S”)
- Fixed an issue where `graph_to_string` sometimes returned incorrect brackets for multiple nested branches
processing
- Improved `canonicalize_iupac` by handling “*”, “Ga(“, and improperly formatted ambiguities (e.g., “Gal-GlcNAc”) in an otherwise properly formatted string. Also improved floating bit handling
- Fixed an issue in the `rescue_glycans` wrapper in which keyword arguments with empty list defaults could cause an indexing issue for wrapped functions
draw
- Added the “per_residue” keyword argument to `GlycoDraw`, which allows users to basically overlay a heatmap over the SNFG representation, where the “per_residue” values control the opacity (e.g., to visualize attention or any other kind of per-monosaccharide attribution)
- Added `process_per_residue` to match per-residue values to different levels of branching
- Added the “draw_method” keyword argument to `GlycoDraw`, which allows users to draw glycans on the atomic level (chemical depiction of monosaccharides, including steric information, outlined with the respective SNFG color) in 2D (“draw_method = chem2d”) as well as 3D (“draw_method = chem3d”). Note that this requires the glycowork[chem] optional installs
- Fixed an issue in `GlycoDraw` that incorrectly parsed global losses when drawing Domon-Costello fragments
- Fixed an issue in `GlycoDraw` where, if the filepath contained “svg” or “pdf”, that was sometimes read as the incorrect filepath
- Fixed an issue in `GlycoDraw` where “vertical = True” occasionally resulted in empty output files
annotate
- Added `load_lectin_lib`, `Lectin`, `create_lectin_and_motif_mappings`, and `lectin_motif_scoring` as helper functions for `analysis.get_lectin_array`
- `quantify_motifs` now also works with log2-transformed data
network
biosynthesis
- Added multiple testing correction (via two-stage Benjamini-Hochberg), `alphaN`, and significance column to `get_differential_biosynthesis`
- Fixed an issue in which no significant results in `get_differential_biosynthesis` could error out the function
ml
models
- The model weights of the trained `LectinOracle_flex`, `LectinOracle`, `SweetNet`, and `NSequonPred` models have been removed from the package and are now downloaded at runtime from a hosted file

1.2.0

- Added `glycoworkGUI.py` to build the .exe based GUI for important glycowork endpoint functions: `GlycoDraw`, `plot_glycans_excel`, and `get_differential_expression`
- Removed `python-louvain` as a required dependency for `glycowork`
glycan_data
loader
- Switched from `pkg_resources` to `importlib` for loading tabular data into the package
stats
- Fixed an issue in `TST_grouped_benjamini_hochberg` that caused errors if nothing was significantly different in the entire dataset or in any group
- `test_inter_vs_intra_group` is now robust to non-paired data and data with differing sample sizes per condition
- Added `replace_outliers_with_IQR_bounds` to support outlier treatment in `motif.analysis`
- Added `sequence_richness`, `shannon_diversity_index`, and `simpson_diversity_index` to calculate diversity indices of glycomics data
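The three diversity indices can be sketched from their textbook definitions. The function names below are placeholders rather than the glycowork functions, and the Simpson index is assumed to be the Gini-Simpson form (1 minus the sum of squared proportions); the library may use a different variant.

```python
import math

def richness(abundances):
    """Number of glycans with non-zero abundance."""
    return sum(1 for a in abundances if a > 0)

def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over non-zero proportions."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)

def simpson_index(abundances):
    """Gini-Simpson diversity 1 - sum(p^2) (assumed variant)."""
    total = sum(abundances)
    return 1.0 - sum((a / total) ** 2 for a in abundances)

even = [25, 25, 25, 25]    # made-up, perfectly even glycome
skewed = [97, 1, 1, 1]     # made-up, dominated by one glycan
print(richness(even), shannon_index(even), simpson_index(even))
print(richness(skewed), shannon_index(skewed), simpson_index(skewed))
```

Both samples have the same richness, but the even sample scores higher on the Shannon and Simpson indices because its abundance is spread across all sequences.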
motif
processing
- WURCS handling for universal input now encompasses more monosaccharides
- GlycoCT handling for universal input now is robust to the declaration of substituents not immediately following their monosaccharide in the GlycoCT string
- Added `equal_repeats` to check whether two repeating units of a polysaccharide are the same, just shifted
- Modified glycan nomenclature detection in `canonicalize_iupac` to be less prone to over-identifying Oxford nomenclature when the input consists only of numbers etc.
- Added “ß” to the typo detection in `canonicalize_iupac` and “(-)” as a variation of linkage uncertainty detection
- Made `canonicalize_iupac` robust to the variation of using {} instead of () for linkages
graph
- Removed the required usage of lib in `glycan_to_nxGraph`, `compare_glycans`, `subgraph_isomorphism`, and all downstream functions (lib only remains for stemification and deep learning model training/inference)
- The keyword argument “wildcards_ptm” now also works as intended when providing pre-calculated graphs as input to `compare_glycans` or `subgraph_isomorphism`
- Fixed a rare issue in which `subgraph_isomorphism`, when “count = False”, would sometimes erroneously output “False” because of a greedy approach to evaluating potential matches
tokenization
- Added `get_unique_topologies` to retrieve all base topologies for a given composition that have been observed for a given taxonomic subset
- Added the “obfuscate_ptm” keyword argument to `map_to_basic`, to allow for mapping Gal6S to Hex6S rather than the default HexOS, if that is required/advantageous
- Support mapping of phosphorylated glycans in `map_to_basic`
draw
- Fixed an issue where cross-ring fragments were not correctly rendered in `GlycoDraw`
- `plot_glycans_excel` can now also be used with filepaths to .xlsx files (in addition to .csv files)
- `plot_glycans_excel` now also supports compact glycan drawing with the “compact” keyword argument
- Improved drawing resolution in `plot_glycans_excel`
- `GlycoDraw` will now more strongly make use of nomenclature canonicalization in case of IUPAC dialects (still not 100%, if you suspect you use a dialect of IUPAC, pass your sequences through `canonicalize_iupac` first)
- If no filepath is specified, `GlycoDraw` will now also display drawn glycan structures in a non-Jupyter environment (as the classic matplotlib pop-up). Note that this functionality requires the cairosvg dependency (head to https://bojarlab.github.io/glycowork/examples.html#glycodraw-code-snippets if you’re unsure about that)
analysis
- Functions able to use .csv paths as input can now also deal with .xlsx paths as input
- The new “annotate_volcano” keyword argument now allows for the direct insertion of SNFG images within plots from `get_volcano` without having to subsequently run `draw.annotate_figure`
- `get_pvals_motifs`, `get_differential_expression`, `get_glycanova`, `get_time_series`, and `get_jtk` now use `glycan_data.stats.replace_outliers_with_IQR_bounds` to auto-smooth outliers
- Moved `hotellings_t2` to `glycan_data.stats`
- All functions compatible with motif-level analysis now accept the “custom_motifs” keyword argument to be passed to `annotate_dataset` or `quantify_motifs` if “custom” is included in “feature_set”
- Changed the “mode” keyword argument in `get_heatmap` to “motifs” as a Boolean argument, like in all other `motif.analysis` functions
- Added a call to `clean_up_heatmap` to `get_jtk` to avoid redundant motifs
- Added `get_biodiversity` to compare two groups of glycomics datasets with regard to the sequence diversity that is present (similar to comparable analyses for microbiome data)
regex
- Added `filter_dealbreakers` to allow for the exclusion of identified matches if they have illegal components beyond the identified match (e.g., the forbidden Fuc in "Fuc-([Gal|GalNAc])?-Gal-([!Fuc]){,1}-GlcNAc"). Before this, the sequence context *except* the Fuc was extracted and returned.
- Fixed an edge case in `filter_matches_by_location` in which internal locations sometimes had to handle triple-nested lists which led to errors
- `get_match` can now also use glycan graphs, such as derived from `glycan_to_nxGraph`, as input
- Added `get_match_batch` to process a whole list of glycans at once, with some performance improvements via first pre-compiling the pattern
- Fixed an edge case in `get_match` in which pattern components consisting of a single monosaccharide with a specified linkage (e.g., “Fuca3”) could sometimes erroneously output no matches
- Added `motif_to_regex` to convert glycan motifs (e.g., in IUPAC-condensed) into a regular expression suitable for `get_match`. Limited to simple queries for now.
annotate
- `get_terminal_structures` now has a “size” keyword argument with which users can control the size of the extracted terminal motifs
- `get_k_saccharides` now has a “terminal” keyword argument with which users can filter to only count motifs at non-reducing ends
- `annotate_dataset` and functions using it now can add the “terminal2” and “terminal3” option in “feature_set” to also annotate & analyze terminal motifs of size 2 (e.g., Neu5Ac(a2-3)Gal(b1-4)) or size 3 (e.g., Neu5Ac(a2-3)Gal(b1-4)GlcNAc)
network
biosynthesis
- Added the possibility of providing abundances to `construct_network` that are then stored as node attributes in the network
- Added `add_high_man_removal` as a post-processing step in `construct_network` to allow for the addition of reactions removing mannoses from high-Man N-glycans occurring during maturation
- Added `estimate_weights` and `get_edge_weight_by_abundance` to estimate reaction capacities from abundances + estimate missing abundances
- Added `get_maximum_flow`, `get_max_flow_path`, and `get_reaction_flow` to calculate maximum flow paths between network root and endpoints as well as aggregate the flow by reaction type
- Added `get_differential_biosynthesis` as a wrapper function to compare two groups of glycomes/networks with regard to their biosynthesis (differential flow paths or differential reaction flows)
- Fixed an issue in `construct_network` in which sometimes nodes with outgoing but no incoming connections were not detected as unconnected nodes, leading to incomplete networks
- Added the `rescue_glycans` decorator to `construct_network`, to allow for auto-fixing nomenclature variations
- Improved performance of `construct_network` by reducing wasteful computation
evolution
- Switched `get_communities` from using `python-louvain` to the Louvain implementation in `networkx`

1.1.0

Change Log

glycan_data
- Updated sugarbase database and all models
stats
- Newly added module to glycowork
- Moved all the statistics functions from `motif.processing` into this module: `cohen_d`, `mahalanobis_distance`, `mahalanobis_variance`, `variance_stabilization`, `MissForest`, `impute_and_normalize`, and `variance_based_filtering`
- Added `fast_two_sum`, `two_sum`, `expansion_sum`, `hlm`, `update_cf_for_m_n`, `jtkdist`, `jtkinit`, `jtkstat`, and `jtkx` helper functions for JTK test
- Added `get_BF` to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
- Added `get_alphaN` to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
- Added `pi0_tst` and `TST_grouped_benjamini_hochberg` to perform a Two-Stage adaptive Benjamini-Hochberg procedure based on groups (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175141/ or https://www.biorxiv.org/content/10.1101/2024.01.13.575531v1)
- Added `test_inter_vs_intra_group` to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise
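
As a rough illustration of the two-stage adaptive idea, the sketch below implements the ungrouped textbook procedure in plain Python: a first Benjamini-Hochberg pass at a reduced level estimates the fraction of true nulls, and a second pass uses that estimate to relax the threshold. It does not reproduce the group-wise weighting or glycowork's own code; the function names and example p-values are made up for this example.

```python
def benjamini_hochberg(pvals, alpha):
    """Return booleans marking which hypotheses the BH step-up procedure rejects."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest 1-based rank k with p_(k) <= k / m * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

def two_stage_bh(pvals, alpha=0.05):
    """Two-stage adaptive BH; returns (rejections, estimated fraction of true nulls)."""
    m = len(pvals)
    alpha_adj = alpha / (1 + alpha)
    # Stage 1: BH at the reduced level
    stage1 = benjamini_hochberg(pvals, alpha_adj)
    r1 = sum(stage1)
    pi0 = (m - r1) / m
    if r1 == 0 or r1 == m:
        return stage1, pi0  # nothing or everything rejected: stop here
    # Stage 2: BH again, with the level scaled by the estimated null fraction
    return benjamini_hochberg(pvals, alpha_adj / pi0), pi0

example_pvals = [0.001, 0.002, 0.003, 0.035, 0.5, 0.8]
rejected, pi0 = two_stage_bh(example_pvals, alpha=0.05)
plain = benjamini_hochberg(example_pvals, 0.05)
```

With these example values, the plain procedure rejects three hypotheses, while the two-stage variant also rejects the fourth because half of the hypotheses are estimated to be non-null.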
motif
regex
- Newly added module to glycowork
- Added the `get_match` function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.
processing
- Moved `cohen_d`, `mahalanobis_distance`, `mahalanobis_variance`, `variance_stabilization`, `MissForest`, `impute_and_normalize`, and `variance_based_filtering` into `glycan_data.stats` to re-focus `processing` on processing glycan sequences
- Extended `canonicalize_composition` to cases like ‘5_4_2_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
- GlycoCT and WURCS handling for universal input now encompasses more monosaccharides and more modifications
- Expanded `oxford_to_iupac` to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, and more custom linkage specifications
- `enforce_class` can now deal with free glycans regardless of whether they end in ‘-ol’ or not
annotate
- `annotate_dataset` and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
- `annotate_dataset` now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
- Added `group_glycans_core`, `group_glycans_sia_fuc`, and `group_glycans_N_glycan_type` to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
- Fixed a bug in `get_k_saccharides`, in which redundant columns were not always correctly removed
analysis
- Added `get_jtk` to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm, with the typical interface for motifs, imputation, etc., analogous to differential expression.
- `get_differential_expression`, `get_glycanova`, and `get_jtk` now use `get_alphaN` to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
- Added the “zscores” keyword argument to `get_pvals_motifs`; setting “zscores” to False makes the function z-score transform input data that are not yet z-score transformed
- For statistical calculations, `get_pvals_motifs` will now weigh the motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
- Added effect size calculations to `get_pvals_motifs`, reported in the output as Cohen’s d
- Changed `get_pvals_motifs` such that now both enrichments and depletions will be tested (with depletions resulting in negative effect sizes)
- Added `select_grouping` to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by `glycan_data.stats.test_inter_vs_intra_group`
- When “motifs = False” and “grouped_BH = True”, `get_differential_expression` now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]
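
For readers unfamiliar with the effect size mentioned above, this is a minimal sketch of the textbook Cohen's d with a pooled standard deviation, using only the standard library. It is an assumption that this form is close to what the package computes; the library may, for example, handle paired samples differently. `cohen_d_sketch` and the example values are hypothetical.

```python
import math
import statistics

def cohen_d_sketch(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

enriched = cohen_d_sketch([2, 4, 6], [1, 3, 5])  # first group higher: positive d
depleted = cohen_d_sketch([1, 3, 5], [2, 4, 6])  # first group lower: negative d
```

Swapping the groups only flips the sign, which is why a depletion shows up as a negative effect size.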
draw
- In `GlycoDraw`, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
- Added `plot_glycans_excel` to allow for the automated insertion of `GlycoDraw` SNFG pictures into an Excel file containing glycan sequences
graph
- `categorical_node_match_wildcard` now uses string ID for matching, instead of integer ID, which means even two graphs, generated with two different libs, can now be successfully compared via `compare_glycans` or `subgraph_isomorphism`
- `compare_glycans` or `subgraph_isomorphism` (and all functions using these functions) now support negation, by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is *highly* recommended to generate your own lib via `get_lib` if you use negation, as monosaccharides such as !Fuc are *not* within lib and will cause indexing errors.
- Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
- Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
- Fixed an issue in `graph_to_string` in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character
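
The negation semantics described above can be sketched on unbranched strings with a few lines of standard-library Python. This is only a token-by-token comparison of whole linear sequences, not the graph-based subgraph matching the package performs, and all function names here are hypothetical. It also assumes that "?" inside a linkage stands for exactly one character.

```python
import re
from fnmatch import fnmatchcase

def tokenize(linear_glycan):
    """Split an unbranched IUPAC-condensed string into monosaccharide and linkage tokens."""
    return [t for t in re.split(r"[()]", linear_glycan) if t]

def token_matches(pattern, token):
    """Compare one token; a leading '!' inverts the result."""
    if pattern.startswith("!"):
        return not fnmatchcase(token, pattern[1:])
    return fnmatchcase(token, pattern)  # '?' matches any single character

def motif_matches(motif, linear_glycan):
    """True if every token of the glycan matches the corresponding motif token."""
    p, g = tokenize(motif), tokenize(linear_glycan)
    return len(p) == len(g) and all(token_matches(a, b) for a, b in zip(p, g))

motif = "!Fuc(a1-?)Gal(b1-4)GlcNAc"
hit = motif_matches(motif, "Gal(a1-3)Gal(b1-4)GlcNAc")   # first residue is not Fuc
miss = motif_matches(motif, "Fuc(a1-2)Gal(b1-4)GlcNAc")  # first residue is Fuc
```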
network
- Updated pre-calculated biosynthetic networks for milk oligosaccharides
biosynthesis
- Refactored `find_diff` to make networks compatible with the automated, dynamic wildcards (i.e., ? behave as they should and don’t necessarily cause over-branching of the network)
- In `highlight_network`, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
ml
model_training
- In `training_setup`, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
- In `training_setup`, number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument
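
As a rough guide to what a PolyLoss with label smoothing looks like, the sketch below computes the Poly-1 variant for a single example in plain Python: cross-entropy against smoothed labels, plus a term proportional to one minus the probability mass on the smoothed target. The `epsilon` and `smoothing` defaults are placeholders, not the values used in `training_setup`, and `poly1_cross_entropy` is a hypothetical name.

```python
import math

def poly1_cross_entropy(logits, target, epsilon=1.0, smoothing=0.1):
    """Poly-1 loss with label smoothing for one example (list of logits, class index)."""
    n = len(logits)
    # Numerically stable softmax
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Label smoothing: move `smoothing` of the mass to a uniform distribution
    soft = [(1 - smoothing) * (1.0 if i == target else 0.0) + smoothing / n
            for i in range(n)]
    # Cross-entropy against the smoothed labels
    ce = -sum(s * math.log(p) for s, p in zip(soft, probs))
    # Probability assigned to the (smoothed) target
    pt = sum(s * p for s, p in zip(soft, probs))
    return ce + epsilon * (1 - pt)

loss = poly1_cross_entropy([2.0, 0.5, -1.0], target=0)
```

Setting `epsilon` to 0 recovers ordinary label-smoothed cross-entropy, which makes the extra term easy to isolate when experimenting.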

1.0.1

Change Log
motif
processing
- Slightly extended WURCS parsing in `wurcs_to_iupac`
- Fixed an issue in `choose_correct_isoform` in which errors would be caused if the input list contained only duplicate glycans
- Fixed an issue in `choose_correct_isoform` in which errors would be caused if the input list contained only glycans without branching
draw
- Adapted cairosvg imports so that, even without cairosvg dependencies, users can plot glycans inline and export as .svg files (only export as .pdf and export of `annotate_figure` still require cairosvg)
network
biosynthesis
- Fixed handling of empty outputs of `choose_correct_isoform` in `construct_network`
evolution
- Fixed dictionary handling in `get_communities`

1.0.0

Change Log
- Added a Zenodo badge, to have a release-specific doi for glycowork
glycan_data
- Updated sugarbase database; sugarbase is now pickled, so literal evaluations are no longer necessary
- Harmonized glycan column names across generated dataframes; all use ‘glycan’ now, ‘target’ has been deprecated
loader
- Updated `motif_list` to be compatible with new position encoding
- Added Internal_LewisX and Internal_LewisA to `motif_list` (renamed LewisX and LewisA to Terminal_LewisX and Terminal_LewisA, correspondingly)
- Made `df_species` static again to speed up package import
- Added `find_nth_reverse` helper function that finds the starting index of the nth occurrence of a substring from the end of the string
- Added `remove_unmatched_brackets` helper function to strip unmatched opening or closing brackets from glycan strings
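
The two string helpers above are simple enough to sketch with the standard library. The versions below are illustrative re-implementations written from the one-line descriptions, not the package's code: the `_sketch` names, the return value of -1 when there are fewer than n occurrences, the non-overlapping counting, and the choice of square brackets as the default pair are all assumptions.

```python
def find_nth_reverse_sketch(string, substring, n):
    """Start index of the nth (non-overlapping) occurrence of substring, counted from the end."""
    pos = len(string)
    for _ in range(n):
        pos = string.rfind(substring, 0, pos)
        if pos == -1:
            return -1  # fewer than n occurrences
    return pos

def remove_unmatched_brackets_sketch(string, opening="[", closing="]"):
    """Drop opening or closing brackets that have no partner."""
    keep = [True] * len(string)
    open_positions = []
    for i, char in enumerate(string):
        if char == opening:
            open_positions.append(i)
        elif char == closing:
            if open_positions:
                open_positions.pop()   # matched pair
            else:
                keep[i] = False        # closing bracket without an opener
    for i in open_positions:
        keep[i] = False                # openers that were never closed
    return "".join(char for char, k in zip(string, keep) if k)

glycan = "Gal(b1-4)GlcNAc(b1-3)Gal(b1-4)Glc"
second_last_linkage_start = find_nth_reverse_sketch(glycan, "(", 2)
cleaned = remove_unmatched_brackets_sketch("Gal(b1-4)[Fuc(a1-3)GlcNAc")
```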
motif
- Added more masses to mz_to_composition.csv / `mass_dict`: Acetonitrile, Formate, Cl-, HCO3-, and NH4+
processing
- Extended `canonicalize_iupac` to cases like "NeuGcα3Galβ3(NeuAcα6)GalNAcol" and even more modification formulations, e.g., “6S-GlcNAc”
- Added `canonicalize_composition` to convert compositions formatted either in the style of HexNAc2Hex1Fuc3Neu5Ac1 or N2H1F3A1 into dictionaries used by glycowork
- Added GalNAc4S to permitted reducing end monosaccharides for O-linked glycans in `enforce_class`
- `MissForest` now has a maximum number of iterations and will check for convergence each iteration (immediately finishing upon converging), yielding some speed-ups in most cases
- The output of `min_process_glycans` no longer contains empty strings for glycans ending in a linkage
- Updated `choose_correct_isoform` to be compatible with change in `min_process_glycans`
- Added `get_possible_linkages` to retrieve linkages matching a wildcarded linkage
- Added `get_possible_monosaccharides` to retrieve monosaccharides matching a monosaccharide type (HexNAc, etc.)
- Added decorators, `rescue_glycans` and `rescue_compositions`, to canonicalize them in case a decorated function errors out
- Added `linearcode_to_iupac` to support LinearCode as input format for glycowork (this will be called within `canonicalize_iupac` and the decorators); note that for now coverage may not be perfect yet
- Added `iupac_extended_to_condensed` to support IUPAC-extended as input format for glycowork (this will be called within `canonicalize_iupac` and the decorators); note that for now coverage may not be perfect yet
- Added `glycoct_to_iupac` to support GlycoCT as input format for glycowork (this will be called within `canonicalize_iupac` and the decorators); note that for now coverage may not be perfect yet
- Added `wurcs_to_iupac` to support WURCS as input format for glycowork (this will be called within `canonicalize_iupac` and the decorators); note that for now coverage may not be perfect yet
- Added `oxford_to_iupac` to support Oxford as input format for glycowork (this will be called within `canonicalize_iupac` and the decorators); note that for now coverage is limited
- `check_nomenclature` (formerly in `motif.tokenization`) now handles outputting warning messages for trying to use non-string, non-graph nomenclatures or SMILES with glycowork functions
- Expanded `find_isomorphs` to generate more isomorphic sequence variants and thereby increase the chances that `choose_correct_isoform` will have access to the canonical sequence
- Fixed a rare issue with `canonicalize_iupac` where sequences coming from `structure_to_basic` would sometimes be formatted incorrectly if they contained dHex
- Fixed an issue in `find_isomorphs` in which double branches were not always correctly swapped
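
To illustrate what converting a composition string into a dictionary involves, here is a small regex-based sketch covering the two styles named above (HexNAc2Hex1Fuc3Neu5Ac1 and N2H1F3A1). It is not `canonicalize_composition` itself: `parse_composition_sketch` is a hypothetical name, the letter-to-name mapping is an assumed reading of the shorthand, and the list of long names is deliberately short.

```python
import re

# Assumed reading of the one-letter shorthand (illustration only)
SHORTHAND = {"N": "HexNAc", "H": "Hex", "F": "Fuc", "A": "Neu5Ac"}
# Longer names first, so that 'HexNAc' is tried before 'Hex'
LONG_NAMES = ["HexNAc", "Neu5Ac", "Neu5Gc", "dHex", "Hex", "Fuc"]

def parse_composition_sketch(text):
    """Parse 'HexNAc2Hex1Fuc3Neu5Ac1' or 'N2H1F3A1' into a {name: count} dict."""
    if re.fullmatch(r"(?:[NHFA]\d+)+", text):
        # Shorthand style: single letters followed by counts
        pairs = [(SHORTHAND[k], int(v)) for k, v in re.findall(r"([NHFA])(\d+)", text)]
    else:
        # Long style: full monosaccharide names followed by counts
        pattern = "(" + "|".join(LONG_NAMES) + r")(\d+)"
        pairs = [(name, int(v)) for name, v in re.findall(pattern, text)]
    composition = {}
    for name, count in pairs:
        composition[name] = composition.get(name, 0) + count
    return composition

long_form = parse_composition_sketch("HexNAc2Hex1Fuc3Neu5Ac1")
short_form = parse_composition_sketch("N2H1F3A1")
```

Both calls yield the same dictionary, which is the point of canonicalizing: downstream functions only need to handle one representation.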
analysis
- `get_heatmap` no longer tries to convert data to relative abundances if negative values are detected in the input
- All functions using dataframes as inputs in `analysis` can now also be used by providing full filepaths to the .csv file instead
- Optimized some of the code for readability and speed (everything should be at least a bit faster now)
annotate
- `get_k_saccharides` is now allowed to generate new dynamic motifs with tokens outside of lib (via `expand_lib`)
- `annotate_glycan` and `annotate_dataset` now also support narrow wildcards
- Fixed an issue in `count_unique_subgraphs_of_size_k` in which branched motifs were not always correctly formatted (i.e., opening/closing brackets)
- `get_k_saccharides` now outputs dataframes with counts as default and can yield the old nested lists of motifs by setting the new keyword `just_motifs` to True
- Fixed an edge case in which `get_k_saccharides` sometimes overcounted individual monosaccharides if their strings overlapped
graph
- `subgraph_isomorphism` and `compare_glycans` now support using wildcards and position encoding at the same time. The `extra` keyword argument is now deprecated and the functions auto-detect whether anything has been specified in wildcards and/or termini_list
- `subgraph_isomorphism` and `compare_glycans` now support automatically inferred narrow wildcards to allow for (i) matching linkages like a1-? to only specified linkages within that group (e.g., a1-3 but not b1-3 etc.) and (ii) matching monosaccharide types like HexNAc to only specified monosaccharides of that type (e.g., GlcNAc but not Glc, etc.)
- The `wildcard_list` keyword argument in all graph & annotation functions is now deprecated as wildcards are inferred automatically via narrow wildcards and native full wildcards (?1-? and Monosaccharide)
- `subgraph_isomorphism` now behaves as expected for testing motifs ending in linkages on glycans ending in linkages
- `subgraph_isomorphism` can now return the matched subgraphs in the input glycan with the new `return_matches` keyword argument
- `glycan_to_nxGraph` is now decorated with the `rescue_glycans` decorator, which auto-canonicalizes IUPAC strings if they are not in the format preferred by glycowork
- Fixed mismatch of labels and string_labels in `categorical_node_match_wildcard`
- Fixed an issue in `subgraph_isomorphism` in which, when using positional encoding, sometimes the mirror image of a motif was incorrectly captured if the termini aligned
- `termini_list` within `subgraph_isomorphism` now only requires the specification of monosaccharide positions
- Added `expand_termini_list` helper function to facilitate the expansion of monosaccharide-only `termini_list` into full `termini_list` behind the scenes
- Added support for shorthand notation of position encoding, now either ‘terminal’ or ‘t’ will work
- Improved handling of complex branching in `graph_to_string`; should be fewer unexpected translations now
- Fixed an issue in `graph_to_string` in which induced subgraphs could cause errors due to unexpected or weirdly sorted node indices
- Fixed an edge case in which the reducing end could sometimes be calculated as ‘internal’ when termini=‘calc’ in `glycan_to_nxGraph`
- Deprecated a duplicate `character_to_label` and `string_to_labels`
- Deprecated `categorical_termini_match`; the functionality is now handled within `categorical_node_match_wildcard`
- Deprecated the `wildcards` keyword argument from `compare_glycans` as this will now be detected internally, if wildcards are provided via `wildcard_list`
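
The narrow wildcards described in this list can be pictured as lookups that expand a pattern into its concrete candidates. The sketch below shows that idea with the standard library only; the linkage set and type table are small hand-picked subsets, the function names are hypothetical, and full wildcards such as ?1-? and Monosaccharide are deliberately not modeled.

```python
from fnmatch import fnmatchcase

# Small, hand-picked subsets for illustration (not the package's lib)
LINKAGES = {"a1-2", "a1-3", "a1-4", "a1-6", "a2-3", "a2-6", "b1-2", "b1-3", "b1-4", "b1-6"}
TYPE_MEMBERS = {
    "HexNAc": {"GlcNAc", "GalNAc", "ManNAc"},
    "Hex": {"Glc", "Gal", "Man"},
}

def possible_linkages_sketch(wildcarded, linkages=LINKAGES):
    """Concrete linkages compatible with a pattern in which '?' is one unknown character."""
    return {linkage for linkage in linkages if fnmatchcase(linkage, wildcarded)}

def possible_monosaccharides_sketch(mono_type):
    """Concrete monosaccharides belonging to a type such as 'HexNAc'."""
    return TYPE_MEMBERS.get(mono_type, set())

alpha_one = possible_linkages_sketch("a1-?")
hexnacs = possible_monosaccharides_sketch("HexNAc")
```

Here `a1-?` expands to a1-2, a1-3, a1-4, and a1-6 but excludes b1-3, and `HexNAc` covers GlcNAc but not Glc, mirroring the two cases listed above.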
tokenization
- Composition functions (e.g., `composition_to_mass`) are now decorated with `rescue_compositions`, which means that they can be used with compositions like “H3N2” (basically anything that `canonicalize_composition` can handle)
- Deprecated `character_to_label` as it’s now handled within `string_to_labels`
- Moved `check_nomenclature` into motif.processing
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)
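
The "rescue" behavior mentioned for the composition functions follows a common decorator pattern: call the function as given, and only if it raises, canonicalize the first argument and try again. The sketch below shows that pattern with standard-library Python; `rescue_sketch`, `canonicalize_sketch`, and `count_monosaccharides` are hypothetical stand-ins, and the real decorators may differ in which errors they catch and which arguments they rewrite.

```python
import functools
import re

def canonicalize_sketch(composition):
    """Hypothetical fallback: turn 'H3N2'-style shorthand into a dict."""
    names = {"H": "Hex", "N": "HexNAc"}
    return {names[k]: int(v) for k, v in re.findall(r"([HN])(\d+)", composition)}

def rescue_sketch(func):
    """Retry the decorated function with a canonicalized first argument if it errors."""
    @functools.wraps(func)
    def wrapper(composition, *args, **kwargs):
        try:
            return func(composition, *args, **kwargs)
        except Exception:
            return func(canonicalize_sketch(composition), *args, **kwargs)
    return wrapper

@rescue_sketch
def count_monosaccharides(composition):
    """Expects a {name: count} dict; a plain string would raise without the decorator."""
    return sum(composition.values())

from_dict = count_monosaccharides({"Hex": 3, "HexNAc": 2})
from_shorthand = count_monosaccharides("H3N2")
```

Well-formed input pays no extra cost with this design, since canonicalization only runs on the failure path.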
draw
- Support motif highlighting in `GlycoDraw`: by providing the `highlight_motif` keyword argument, motifs can be highlighted (everything else will be set to low opacity). Works with IUPAC-condensed motifs and named motifs from `known`
- Support wildcards in motif highlighting with the `highlight_wildcard_list` keyword argument, for instance highlighting all `Gal(?1-?)GlcNAc` subunits (for Gal(b1-?)GlcNAc you don’t need `highlight_wildcard_list`, as narrow wildcards are handled automatically)
- Support positional encoding in motif highlighting with the `highlight_termini_list` keyword argument, for instance highlighting all terminal, non-reducing end `Gal(b1-?)GlcNAc` subunits (yes, you can use both wildcards and positional encoding at the same time😊)
- Support drawing of repeat structures (indicated by brackets and the number of repeats) via the new `repeat` keyword argument. Internal repeats can also be specified with the additional `repeat_range` keyword argument.
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)

network
biosynthesis
- Optimized some of the code for readability and speed (everything should be up to 2x faster now)
evolution
- Optimized some of the code for readability and speed (everything should be at least a bit faster now)

ml
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)



v0.8.1-zenodo
Literally no code changes at this point (0.9 is expected to come in December) but Zenodo requires a new release to mint a doi
