Features
Uncertainty Tools
Added tools for uncertainty quantification, calibration, and evaluation to the chemprop predict function. Uncertainty predictions are saved in the predictions file alongside the predicted values. Uncertainty estimation is enabled with the argument `--uncertainty_method {method}`.
Uncertainty outputs can be calibrated against a separate dataset (the evaluation set from training is often suitable) to give better uncertainty estimates on new predictions. Calibration is activated with `--calibration_method {method}` and `--calibration_path {path-to-csv}`. For the regression dataset type, a calibrated output can be reported either as a standard deviation or as a one-sided interval bound, as set with the options `--regression_calibrator_metric {stdev-or-interval}` and `--calibration_interval_percentile {int}`.
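To illustrate how the two regression output forms relate, the sketch below converts a standard deviation into an interval bound at a chosen percentile. It assumes Gaussian errors and reads the interval bound as the half-width of an interval centered on the prediction. The function name `stdev_to_interval_bound` is hypothetical, and this is not Chemprop's own code.

```python
from statistics import NormalDist


def stdev_to_interval_bound(stdev: float, percentile: int = 95) -> float:
    """Half-width of the centered interval covering `percentile` percent
    of a Gaussian with the given standard deviation (assumed model)."""
    if stdev < 0:
        raise ValueError(f"stdev must be non-negative, got {stdev}")
    if not 0 < percentile < 100:
        raise ValueError(f"percentile must be between 0 and 100, got {percentile}")
    # A centered interval covering p percent leaves (100 - p) / 2 percent in
    # each tail, so its upper edge sits at the 0.5 + p / 200 quantile.
    z_score = NormalDist().inv_cdf(0.5 + percentile / 200)
    return stdev * z_score


bound = stdev_to_interval_bound(1.0, percentile=95)
print(f"95% interval: prediction +/- {bound:.3f}")
```

Under these assumptions a standard deviation of 1.0 corresponds to a 95% bound of roughly 1.96, so the two metrics carry the same information at different scales.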
If the data file containing SMILES for the test path also contains target values, the quality of the uncertainty estimates can be evaluated using various metrics, activated with the option `--evaluation_methods {list-of-methods}`.
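The options above can be combined in a single prediction run. The sketch below only assembles them into a list of command-line tokens; it does not call Chemprop. The helper `build_uncertainty_args` is hypothetical, the method names and file path in the usage example are placeholders, and passing `--evaluation_methods` as space-separated values is an assumption.

```python
from typing import List, Optional, Sequence


def build_uncertainty_args(
    uncertainty_method: str,
    calibration_method: Optional[str] = None,
    calibration_path: Optional[str] = None,
    regression_calibrator_metric: Optional[str] = None,
    calibration_interval_percentile: Optional[int] = None,
    evaluation_methods: Sequence[str] = (),
) -> List[str]:
    """Assemble the uncertainty-related options as a list of CLI tokens."""
    args = ["--uncertainty_method", uncertainty_method]
    if calibration_method is not None:
        args += ["--calibration_method", calibration_method]
    if calibration_path is not None:
        args += ["--calibration_path", calibration_path]
    if regression_calibrator_metric is not None:
        args += ["--regression_calibrator_metric", regression_calibrator_metric]
    if calibration_interval_percentile is not None:
        args += ["--calibration_interval_percentile",
                 str(calibration_interval_percentile)]
    if evaluation_methods:
        # Assumption: list-valued options take space-separated values.
        args += ["--evaluation_methods", *evaluation_methods]
    return args


example = build_uncertainty_args(
    uncertainty_method="mve",                  # assumed to be a valid method name
    calibration_method="CALIBRATION_METHOD",   # placeholder, not a real method name
    calibration_path="calibration.csv",        # placeholder path
    regression_calibrator_metric="interval",
    calibration_interval_percentile=95,
    evaluation_methods=["EVAL_METHOD_1", "EVAL_METHOD_2"],  # placeholders
)
print(" ".join(example))
```

These tokens would be appended to the usual prediction command; the remaining prediction arguments are unchanged by this feature and are omitted here.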
Internally, this PR creates several classes for carrying out prediction tasks: UncertaintyEstimator, UncertaintyPredictor, UncertaintyCalibrator, and UncertaintyEvaluator. Two regression loss functions with auxiliary uncertainty outputs, `mve` and `evidential`, have also been added.
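For intuition about what an auxiliary uncertainty output means, the sketch below shows the standard Gaussian negative log-likelihood that mean-variance estimation is generally based on: the model predicts a variance alongside the mean, and the loss is lowest when that variance matches the actual squared error. The function name `gaussian_nll` is hypothetical and this is not Chemprop's implementation; the `evidential` loss is more involved and is not covered here.

```python
import math


def gaussian_nll(target: float, mean: float, variance: float) -> float:
    """Negative log-likelihood of `target` under N(mean, variance)."""
    if variance <= 0:
        raise ValueError(f"variance must be positive, got {variance}")
    squared_error = (target - mean) ** 2
    return 0.5 * math.log(2 * math.pi * variance) + squared_error / (2 * variance)


# With a squared error of 1.0, a predicted variance of 1.0 scores better than
# an overconfident (0.1) or underconfident (10.0) variance.
for variance in (0.1, 1.0, 10.0):
    print(variance, round(gaussian_nll(1.0, 0.0, variance), 4))
```

Because over- and under-confident variances are both penalized, the predicted variance can double as an uncertainty estimate at prediction time.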
PR 267
PR 269
Reaction-Solvent Option
Adds the option to train a chemprop model using one reaction and one molecule for each datapoint. Activated with the option `--reaction_solvent`. The solvent MPNN can be given different parameters from the reaction MPNN using `--bias_solvent`, `--hidden_size_solvent {int}`, and `--depth_solvent {int}`.
PR 246
Multimolecule Fingerprinting
Updated the fingerprint functions for models that take multiple molecules. Models trained with a "shared-mpn" between two molecules can return an MPN fingerprint when only one molecule is provided. Also, when multiple-molecule models are used for MPN fingerprint generation, the output indicates which molecule each element belongs to.
PR 242
Issue 236
Colab Notebook Examples
Created a Jupyter notebook that runs examples of Chemprop jobs, showing how the functions can be used from Python. A good resource for new users, demonstrations, or tutorials. Linked to Google Colab so that it can be run remotely without a local install of Chemprop.
PR 239
PR 273
Loss Function Options
Previously, loss functions were selected automatically based on the dataset type being used in model training. Now the loss function can be selected with `--loss_function {function}`. Some new specialty loss functions have been added with this capability.
* Matthews Correlation Coefficient (`mcc`) is a loss function for classification and multiclass datasets that accounts for true positives, true negatives, false positives, and false negatives separately, avoiding domination by one class and making it well suited to unbalanced training sets.
* Bounded Mean Squared Error (`bounded_mse`) is a regression loss function that allows for training targets expressed as inequalities, e.g. ">5.0". Intended for use with experimental data with delimited ranges.
* Mean Variance Estimation (`mve`) and `evidential` loss are regression loss functions that maximize the likelihood of the target under an estimated uncertainty distribution. Models trained with these loss functions produce outputs that can be used in uncertainty estimation.
Appropriate metrics have been added along with these loss functions.
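The ideas behind `mcc` and `bounded_mse` can be sketched in a few lines of plain Python. This is a simplified illustration for single values and the binary case, not Chemprop's code. The function names are hypothetical, the "soft" confusion-matrix counts built from predicted probabilities are one common way to make MCC differentiable, and returning a loss of 1.0 when MCC is undefined is an assumption.

```python
import math
from typing import Sequence, Tuple


def soft_mcc_loss(probs: Sequence[float], labels: Sequence[int]) -> float:
    """1 - MCC, with confusion-matrix counts accumulated from probabilities."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 1.0  # MCC undefined; treated as zero correlation (assumption)
    return 1.0 - (tp * tn - fp * fn) / denominator


def parse_bounded_target(raw: str) -> Tuple[float, str]:
    """Split a target such as '>5.0' into (5.0, '>'); plain values get '='."""
    raw = raw.strip()
    if raw.startswith((">", "<")):
        return float(raw[1:]), raw[0]
    return float(raw), "="


def bounded_squared_error(prediction: float, target: float, bound: str = "=") -> float:
    """Squared error that is zero when the prediction satisfies the inequality."""
    if bound == ">" and prediction > target:
        return 0.0
    if bound == "<" and prediction < target:
        return 0.0
    return (prediction - target) ** 2


value, bound = parse_bounded_target(">5.0")
print(bounded_squared_error(6.0, value, bound))  # satisfies ">5.0", no penalty
print(bounded_squared_error(4.0, value, bound))  # violates ">5.0", penalized
print(soft_mcc_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```

In this sketch a prediction that satisfies the inequality contributes nothing to the loss, while the MCC loss ranges from 0 for perfect agreement to 2 for fully inverted predictions.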
PR 238
PR 267
Development Environment
GitHub Addons
Added a `CONTRIBUTING.md` file with guidelines for how users can contribute to Chemprop. New templates are now available for issue submission that distinguish between different issue types: bug report, feature request, and question. New templates are also suggested for PRs. Templates are stored in the `.github` directory.
PR 241
Unit Testing
Part of an ongoing effort to include a more complete set of automated tests for Chemprop. Unit tests added for data utils, uncertainty-related loss functions, and the uncertainty evaluation metrics.
PR 232
PR 267
PR 269
Flake8 Formatting
Ongoing effort to standardize the formatting of incoming code. New PRs are now asked to be flake8 compliant in formatting. The utils module and the files most closely associated with the new uncertainty functions are flake8 compliant.
PR 241
PR 258
PR 267
Update Versioning
Changed the way that version numbers are stored and updated throughout the code.
PR 247
Remove Assertion Errors
Removed many of the assertion errors throughout Chemprop and replaced them with more easily interpretable error types and messages.
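As an illustration of the pattern, the sketch below replaces a bare `assert` with explicit exceptions that name the offending value. The function `check_split_sizes` and its checks are hypothetical and not taken from the Chemprop codebase. Explicit exceptions also have the advantage of still running under `python -O`, which strips `assert` statements.

```python
import math
from typing import Sequence, Tuple


def check_split_sizes(split_sizes: Sequence[float]) -> Tuple[float, ...]:
    """Validate train/validation/test fractions with descriptive errors."""
    # Before: assert len(split_sizes) == 3 and sum(split_sizes) == 1
    if len(split_sizes) != 3:
        raise ValueError(
            f"Expected 3 split sizes (train, val, test), got {len(split_sizes)}: "
            f"{list(split_sizes)}"
        )
    if any(size < 0 for size in split_sizes):
        raise ValueError(f"Split sizes must be non-negative, got {list(split_sizes)}")
    if not math.isclose(sum(split_sizes), 1.0):
        raise ValueError(f"Split sizes must sum to 1, got a sum of {sum(split_sizes)}")
    return tuple(split_sizes)


print(check_split_sizes([0.8, 0.1, 0.1]))
try:
    check_split_sizes([0.8, 0.3])
except ValueError as error:
    print(f"ValueError: {error}")
```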
PR 257
Bug Fixes
Hyperopt Version Fix
Changed the way that random seeds are passed into hyperopt during hyperparameter optimization. This avoids an error caused by hyperopt no longer supporting a previously accepted way of passing numpy seeds.
PR 245
Issue 243
Issue 254
Issue 264