Chemprop

Latest version: v2.1.1

Page 4 of 5

1.3.1

Features
Resume training on multiple folds if interrupted
As training progresses through the folds of a multiple-fold model, the results of each individual fold are stored in a JSON file. If training is interrupted and then restarted with the flag `--resume_experiment`, the completed fold results are read from the JSON file and training resumes on the first uncompleted fold.
PR 164
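
The resume behaviour described above can be sketched as follows. This is a minimal illustration, not Chemprop's implementation: the function `run_folds`, the `fold_<n>/test_scores.json` layout, and the `score` key are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def run_folds(save_dir: Path, num_folds: int,
              train_fold: Callable[[int], float],
              resume: bool = False) -> Dict[int, float]:
    """Train each fold once, reusing scores already stored on disk."""
    scores: Dict[int, float] = {}
    for fold in range(num_folds):
        # Hypothetical layout: one JSON file per fold subfolder.
        result_file = save_dir / f"fold_{fold}" / "test_scores.json"
        if resume and result_file.exists():
            # Completed fold: read the stored result instead of retraining.
            scores[fold] = json.loads(result_file.read_text())["score"]
            continue
        # Uncompleted fold: train it and store the result for later resumes.
        scores[fold] = train_fold(fold)
        result_file.parent.mkdir(parents=True, exist_ok=True)
        result_file.write_text(json.dumps({"score": scores[fold]}))
    return scores
```

Writing the result only after a fold finishes means an interrupted fold leaves no file behind and is simply retrained on the next run.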

Frozen layers for pre-training
Added functionality to freeze the MPN or FFN layers in a model being trained at the values of a previously trained model. Freezes MPN values using a model indicated with `--checkpoint_frzn <path>`. FFN layers will also be frozen if indicated with `--frzn_ffn_layers <number-of-layers>`. Models with multiple molecules can select to only freeze the first molecule MPN using `--freeze_first_only`.
PR 170

tSNE functionality
Added HDBScan clustering to the tSNE script.
PR 172

Weighted training by target and by datapoint
Added training weights for different targets and different datapoints, with normalization of weight values. Target weights indicated with the argument `--target_weights <list-of-values>`. Data weights supplied through an input file indicated with the argument `--data_weights_path <path>`.
PR 173, 175, 189
Issue 145
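
As a rough sketch of how target and datapoint weights might enter a loss, the example below computes a weighted mean squared error. The normalization used here (rescaling each weight list to a mean of 1) is an assumption made for illustration; the release note does not specify the scheme, and the function names are hypothetical.

```python
from typing import List


def normalize(weights: List[float]) -> List[float]:
    """Rescale weights so that their mean is 1 (assumed normalization)."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


def weighted_mse(preds: List[List[float]], targets: List[List[float]],
                 target_weights: List[float],
                 data_weights: List[float]) -> float:
    """Mean squared error with one weight per target and one per datapoint."""
    tw = normalize(target_weights)
    dw = normalize(data_weights)
    total, count = 0.0, 0
    for row_preds, row_targets, d in zip(preds, targets, dw):
        for p, t, w in zip(row_preds, row_targets, tw):
            # Each squared error is scaled by its datapoint and target weight.
            total += d * w * (p - t) ** 2
            count += 1
    return total / count
```

With this normalization, uniform weights of any magnitude reproduce the unweighted loss.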

Bug Fixes
MPNN input
Providing SMILES or RDKit molecules to the `MPN`'s `forward` function failed (only `BatchMolGraph` worked) following other changes. Now, SMILES and RDKit molecules can once again be used as input.
PR 164

Backwards compatibility with old checkpoints
Backwards compatibility for features scaling
PR 164
Issue 108

Updated readme
Added information to the readme and documentation on pre-training, the treatment of missing values in multitask models, and caching.
PR 165
Issue 156

Multiclass classification
Corrected error when using the metric `accuracy` with multiclass classification.
PR 169

RDKit Compatibility
Bugfix for compatibility issues of RDKit 2021.03.01 with the interpretation script.
PR 182
Issue 178

1.3.0

New Features

Custom atom/bond features

Enabled custom input of atom and bond features, either in addition to or instead of the default features.

PR: https://github.com/chemprop/chemprop/pull/137

Epistemic uncertainty

Introduced the argument `--ensemble_variance` which calculates the epistemic uncertainty of predictions via an ensemble of models.

PR: https://github.com/chemprop/chemprop/pull/140
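
The idea of estimating epistemic uncertainty from an ensemble can be illustrated with a short sketch: each model predicts every molecule, and the spread of those predictions serves as the uncertainty. The function name is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def ensemble_mean_and_variance(
        model_preds: List[List[float]]) -> List[Tuple[float, float]]:
    """model_preds[m][i] is the prediction of model m for molecule i.

    Returns one (mean, variance) pair per molecule.
    """
    # zip(*...) regroups the predictions by molecule instead of by model.
    per_molecule = zip(*model_preds)
    return [(mean(p), pvariance(p)) for p in per_molecule]
```

A variance of zero means all ensemble members agree; larger values indicate molecules on which the models disagree.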

Reaction option

Introduced the CGR (condensed graph of reaction) option: atom-mapped reaction SMILES can be given as input instead of molecules. This creates a pseudo-molecule of the graph transition state between reactants and products, and message passing is performed on this pseudo-molecule.

PR: https://github.com/chemprop/chemprop/pull/152

Latent representation

Added new functionality that saves the latent representation of a molecule (the MPNN output) to a file. It is used in a similar way to predicting with a given checkpoint file.

PR: https://github.com/chemprop/chemprop/pull/119

Preprocessing updates

Updates to the preprocessing, handling, and saving of SMILES strings. Removed redundant checks.

PR: https://github.com/chemprop/chemprop/pull/135

Resume experiments

Experiments with multiple folds can now be resumed using the `--resume_experiment` flag. Additionally, the test results of each fold are saved as a JSON file in the corresponding subfolder in `save_dir`.

PR: https://github.com/chemprop/chemprop/pull/164

Bug Fixes

Atom messages

Major bugfix for running Chemprop with the argument `--atom_messages`, where the wrong features were passed to the MPNN. This improves the performance of Chemprop in `atom_messages` mode, and causes backwards incompatibility with old checkpoint files if created in `atom_messages` mode. Since Chemprop is mainly used for directed message passing via bond messages, we hope not many users are affected.

Issue: https://github.com/chemprop/chemprop/issues/133
PR: https://github.com/chemprop/chemprop/pull/138

Backwards compatibility with old checkpoints

Backwards compatibility for correctly setting recently introduced training arguments for old models.

Issue: https://github.com/chemprop/chemprop/issues/148 and https://github.com/chemprop/chemprop/issues/108
PR: https://github.com/chemprop/chemprop/pull/149 and PR: https://github.com/chemprop/chemprop/pull/164

Sklearn scores

Bugfix in training sklearn models: Scores were not saved correctly previously.

PR: https://github.com/chemprop/chemprop/pull/162

Data split script

Bugfix in a standalone script to create data splits: Multi-molecule input had previously created incompatibilities with passing data to the scaffold split functionality. Update of docstring.

Issue: https://github.com/chemprop/chemprop/issues/158
PR: https://github.com/chemprop/chemprop/pull/159

MPNN sanity check

Bugfix for sanity checks for dimensions of batches within the MPNN forward pass: The introduction of multi-molecule input had previously caused an inconsistency in one of the checks.

Issue: https://github.com/chemprop/chemprop/issues/153
PR: https://github.com/chemprop/chemprop/pull/154

MPNN type annotations

Bugfix for type annotation in the MPNN forward pass + update of docstring.

PR: https://github.com/chemprop/chemprop/pull/151 and PR: https://github.com/chemprop/chemprop/pull/164

Tanimoto distance

Bugfix for calculating Tanimoto distances. The introduction of multi-molecule input had previously caused incompatibilities in the standalone script to find similar molecules in the training data.

Issue: https://github.com/chemprop/chemprop/issues/143
PR: https://github.com/chemprop/chemprop/pull/144
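
For reference, the Tanimoto distance between two fingerprints is one minus the ratio of shared "on" bits to the total number of distinct "on" bits. The sketch below represents each fingerprint as a set of bit indices; the function name and the convention for two empty fingerprints are choices made for this example, not taken from the Chemprop script.

```python
from typing import Set


def tanimoto_distance(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Return 1 - |A & B| / |A | B| for two sets of 'on' bit indices."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen here: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)
```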

README typos

Fixed typos for a few arguments in the README

PR: https://github.com/chemprop/chemprop/pull/139

Sanitize script

Bugfix in standalone script sanitize.py - open output file with write access.

RDKit molecule caching

Bugfix for creating RDKit molecules from smiles strings. Previously the molecules were recreated even though they were already cached.

PR: https://github.com/chemprop/chemprop/pull/152

Saving SMILES

Bugfix for error occurring when `--save_smiles_splits` is used in conjunction with `--separate_test_path`. Now, the data split csv files are still generated, but `split_indices.pkl` is not generated if there are multiple data points with the same SMILES or if some of the data comes from a separate data file.

Issue: https://github.com/chemprop/chemprop/issues/157
PR: https://github.com/chemprop/chemprop/pull/163

SMILES/mols as input to MPNN

Bugfix for SMILES or RDKit molecules as input to MPNN model instead of `BatchMolGraph`.

PR: https://github.com/chemprop/chemprop/pull/164

1.2.0

Features

New split type

The split type `--split_type cv` already existed to perform `k`-fold cross-validation (where `k` is set by `--num_folds`). In each fold, `1/k` of the data is put in the test set, `1/k` of the data is put in the validation set, and the remaining `(k-2)/k` of the data is put in the training set.

Now, a new split type `--split_type cv-no-test` exists which is essentially identical except that it assigns no data to the test set on each fold (https://github.com/chemprop/chemprop/commit/b56ca9866b303036eab61cab93188cccbaa24af2). Instead, `1/k` of the data is put in the validation set and `(k-1)/k` of the data is put in the training set with no test data. The purpose of this split type is to maximize the training data when training a model in cases where the test performance is already known (or is not important) and doesn't need to be determined. Note that the validation set is still necessary to perform early stopping.
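
The difference between the two split types can be sketched with plain index lists. This is an illustration only: the function name is hypothetical, the data is cut into contiguous pieces without shuffling, and the choice of which piece is used for validation and which for testing on a given fold is an assumption.

```python
from typing import List, Tuple


def cv_fold_indices(num_data: int, num_folds: int, fold: int,
                    no_test: bool = False
                    ) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) indices for one fold of k-fold CV."""
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    # Cut the index range into num_folds contiguous pieces.
    pieces = [
        list(range(i * num_data // num_folds, (i + 1) * num_data // num_folds))
        for i in range(num_folds)
    ]
    val_piece = fold % num_folds
    # With no_test=True, no piece is reserved for testing.
    test_piece = None if no_test else (fold + 1) % num_folds
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for i, piece in enumerate(pieces):
        if i == val_piece:
            val = piece
        elif i == test_piece:
            test = piece
        else:
            train.extend(piece)
    return train, val, test
```

With 10 datapoints and 5 folds, the first variant gives a 6/2/2 split on each fold and the second an 8/2/0 split.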

Dropping extra columns during prediction

Previously, when using `predict.py`, all the columns from the `test_path` file were copied to the `preds_path` file and then the predictions were added as additional columns at the end. Now there is an option called `--drop_extra_columns` which will not copy over these extraneous columns to `preds_path` (https://github.com/chemprop/chemprop/commit/83ea4c06dda4231902777ea6776da922aeba2ad3 and https://github.com/chemprop/chemprop/commit/061339568045863c30c9bd8c2a143b674a0082d8). When `--drop_extra_columns` is used, `preds_path` will only contain columns with the SMILES and with the prediction values.
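
The effect of the option on the output columns can be sketched with the standard `csv` module. The function below is a hypothetical stand-in, not the code in `predict.py`; it only illustrates which columns are copied in each case.

```python
import csv
import io
from typing import Dict, List


def write_predictions(test_rows: List[Dict[str, str]], smiles_column: str,
                      preds: Dict[str, List[float]],
                      drop_extra_columns: bool = False) -> str:
    """Return CSV text containing the input columns plus prediction columns."""
    input_columns = list(test_rows[0].keys())
    # Keep only the SMILES column when extra columns are dropped.
    kept = [smiles_column] if drop_extra_columns else input_columns
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept + list(preds))
    writer.writeheader()
    for i, row in enumerate(test_rows):
        record = {column: row[column] for column in kept}
        # Predictions are appended as additional columns at the end.
        record.update({task: values[i] for task, values in preds.items()})
        writer.writerow(record)
    return out.getvalue()
```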

Bug Fixes

Backward compatibility for `load_checkpoint`

Previously, newer versions of Chemprop incorrectly loaded checkpoints that were trained using older versions of Chemprop due to a change in the names of the parameters. Backward compatibility has now been added to allow this version of Chemprop to load checkpoints with either set of names (https://github.com/chemprop/chemprop/commit/5371b29e7c65e41fa8b83d9c76ba2bfdd400b139 and https://github.com/chemprop/chemprop/commit/206950c6ec92a3646800f95bc69ae6d8dc7ca646).

Saving SMILES splits

Due to new Chemprop features such as the ability to load multiple molecules, the feature `--save_smiles_splits`, which saves the SMILES corresponding to the train, validation, and test splits, had broken (https://github.com/chemprop/chemprop/issues/110). This was fixed in https://github.com/chemprop/chemprop/pull/117.

Fixing `interpret.py`

Similar to the issue with saving SMILES splits, `interpret.py` broke due to the Chemprop feature that enables multiple molecules to be used as input (https://github.com/chemprop/chemprop/issues/107 and https://github.com/chemprop/chemprop/issues/113). This was fixed in https://github.com/chemprop/chemprop/pull/128.

Updating Dockerfile

The Dockerfile has been updated to address https://github.com/chemprop/chemprop/issues/100 and https://github.com/chemprop/chemprop/issues/129. This was fixed in https://github.com/chemprop/chemprop/pull/131.

Fixing atom descriptors

The `atom_descriptors` feature did not work in `predict.py` (https://github.com/chemprop/chemprop/issues/120). This was fixed in https://github.com/chemprop/chemprop/pull/114.

Logging

Logging to the terminal and to files (`quiet.log` and `verbose.log` in the `save_dir`) broke on some operating systems (https://github.com/chemprop/chemprop/issues/106). This was fixed in https://github.com/chemprop/chemprop/pull/118.

README additions

Some of the relatively new features, like custom atomic features, were missing from the README (https://github.com/chemprop/chemprop/issues/121). This was fixed in https://github.com/chemprop/chemprop/pull/122.

Infrastructure Changes

Migrating from Travis CI to GitHub Actions

Chemprop previously used Travis CI to run automated tests upon pushing to master or creating a pull request, but Travis changed its pricing structure and no longer offers unlimited free testing. For this reason, Chemprop now uses GitHub Actions to run automated tests. The results of the test runs can be seen in the Actions tab of the repo.

1.1.0

Features

Multiple Input Molecules

[[PR](https://github.com/chemprop/chemprop/pull/76)] Use multiple molecules as an input to chemprop. The number of molecules is specified with the keyword `number_of_molecules`. Those molecules are embedded with a separate D-MPNN by default. The latent representations are concatenated prior to the FFN.

The keyword `mpn_shared` allows you to use a shared D-MPNN. Note that, since the latent representations are concatenated, the order of the input molecules is important. This method is not invariant to that order, and there are better ways to use multiple molecules with a shared D-MPNN, which will be implemented for the next release.

Custom Atom Features

[[PR](https://github.com/chemprop/chemprop/pull/82)] Implemented custom atomic features as a counterpart to the custom molecular features in ChemProp. The new feature allows users to provide additional atomic features for each node in a given molecule. To use it, pass the keyword `atom_descriptors`. The custom atom features can be employed in two modes. In the first mode, `--atom_descriptors feature`, custom features are used as normal node features and are concatenated to the default node vector before the D-MPNN block. In the second mode, `--atom_descriptors descriptor`, custom atom features do not participate in the model until the atom feature vector has been updated by the D-MPNN block. That is, the `--atom_descriptors descriptor` mode leaves the extra custom atom features largely undisturbed and preserves their information to the greatest possible extent.

The extra custom descriptors can be supplied to ChemProp as a pickle file (`.pkl`, `.pickle`, `.pckl`), a NumPy save file (`.npz`), or an `.sdf` file.

`.pkl` format

The `.pkl` file must store a Pandas DataFrame with SMILES as the index and descriptors as the columns. Each descriptor must be a 1D or 2D numpy array. For example:

One custom atomic feature for each atom, provided in a 1D array:

smiles                        descriptors
CCOc1ccc2nc(S(N)(=O)=O)sc2c1  [0.637781931055927, 0.7075571757878132, 0.7339...
CCN1C(=O)NC(c2ccccc2)C1=O     [0.09588231301387817, 0.6521911050735447, 0.45...

Multiple atomic features for each atom, provided in multiple 1D arrays:

smiles                        desc1                                        desc2
CCOc1ccc2nc(S(N)(=O)=O)sc2c1  [0.637781931055927, 0.7075571757878132...    [0.8266363223032338, 0.89641156703512 ...
CCN1C(=O)NC(c2ccccc2)C1=O     [0.09588231301387817, 0.6521911050735447...  [0.2847367042611851, 0.8410454963208516...

Note: mixing 1D and 2D arrays across columns is not allowed.

`.npz` file

Atomic descriptors for each molecule must be saved as one independent 2D numpy array ([number of atoms x number of descriptors]) in the `.npz` file, for example by:

```python
import numpy as np

np.savez('descriptors.npz', *descriptors)
```

where `descriptors` is a list of 2D arrays of atomic descriptors, in the order of the molecules in the training/prediction data file.
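
As a small round-trip illustration of this format, the sketch below saves two made-up descriptor arrays and reads them back. The array contents and the temporary path are placeholders; the only library behaviour relied on is that `np.savez` stores positional arrays under the keys `arr_0`, `arr_1`, and so on.

```python
import os
import tempfile

import numpy as np

# Two hypothetical molecules with 3 and 5 atoms, 2 descriptors per atom.
descriptors = [np.zeros((3, 2)), np.ones((5, 2))]

path = os.path.join(tempfile.mkdtemp(), "descriptors.npz")
np.savez(path, *descriptors)

# Positional arrays are stored under the keys arr_0, arr_1, ...
# so reading them back by index recovers the molecule order.
with np.load(path) as archive:
    loaded = [archive[f"arr_{i}"] for i in range(len(archive.files))]
```

Because the arrays are keyed by position, the order of `descriptors` has to match the order of the molecules in the data file.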

`.sdf` file

Each molecule is presented as a mol block in the `.sdf` file. Descriptors should be saved as entries for each mol block in the format of comma-separated values. Each molecule must have an entry named SMILES that stores the SMILES string. For example:

CHEMBL1308_loner5
RDKit 3D

6 6 0 0 1 0 0 0 0 0999 V2000
-0.7579 -0.5337 -2.8744 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2229 -1.3763 -1.7558 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0046 -1.0089 -0.4029 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4824 -2.0104 0.3280 N 0 0 0 0 0 0 0 0 0 0 0 0
0.5806 -3.0317 -0.5484 N 0 0 0 0 0 0 0 0 0 0 0 0
0.1735 -2.6999 -1.8031 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 6 2 0
2 3 1 0
3 4 2 0
4 5 1 0
5 6 1 0
M END
> <desc1> (1)
-8.568031e-05,0.0001865207,-0.0002012379,-5.054658e-05,0.0002148434,-0.0003503839,1.970448e-05,3.081137e-05,2.997883e-05,9.446278e-05,-7.194711e-05,0.0001527364

> <desc2> (1)
5.462954e-05,-2.415399e-06,0.0001044788,-2.274438e-05,0.0001698836,5.206409e-06,4.5825e-06,-8.882181e-07,-1.08787e-05,2.993307e-05,-4.069051e-06,1.338413e-05

> <SMILES> (1)
Cc1cn[nH]c1

$$$$

where the names of the descriptor entries (`desc1`, `desc2`) can be arbitrary.
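
To make the entry layout concrete, the sketch below pulls the data entries out of a single mol block using plain string handling. The function name is hypothetical and the parser is deliberately minimal: it assumes each entry's value sits on the single line after its `> <name>` header, as in the example above.

```python
from typing import Dict, List, Union


def parse_sdf_data_items(
        mol_block: str) -> Dict[str, Union[str, List[float]]]:
    """Collect the '> <name>' entries that follow the 'M END' line."""
    items: Dict[str, Union[str, List[float]]] = {}
    lines = mol_block.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("> <"):
            continue
        # The entry name sits between the first '<' and the following '>'.
        name = line[line.index("<") + 1:line.index(">", 1)]
        value = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if name == "SMILES":
            items[name] = value
        else:
            # Descriptor entries hold comma-separated values, one per atom.
            items[name] = [float(x) for x in value.split(",")]
    return items
```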

When using this feature, users are responsible for all atomic feature preprocessing, including feature normalization and expansion.

Note: This feature was developed for small-to-medium-sized training datasets, where extra QM descriptors have been shown to be powerful and to slow the decline in model performance.

Options for Aggregation Function

[[PR](https://github.com/chemprop/chemprop/pull/87)] By default, at the end of message passing, the D-MPNN aggregates atom hidden representations into a single hidden representation for the whole molecule by taking the mean of the atom representations. Now, this aggregation function can be changed by using `--aggregate <mode>`, which currently supports “mean” (the default), “sum”, and “norm” (which is equivalent to “sum” with normalization by the constant specified by `--aggregation_norm`).
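
The three aggregation modes can be written out for a single molecule as below. This is a plain-Python sketch rather than the tensor code used in the model; the function name is hypothetical and `aggregation_norm` is passed explicitly so that no default value is implied.

```python
from typing import List


def aggregate(atom_hiddens: List[List[float]], mode: str,
              aggregation_norm: float = 1.0) -> List[float]:
    """Combine per-atom hidden vectors into one molecule-level vector."""
    # Sum each hidden dimension over all atoms of the molecule.
    summed = [sum(column) for column in zip(*atom_hiddens)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(atom_hiddens) for s in summed]
    if mode == "norm":
        # Same as "sum", then divided by a fixed constant.
        return [s / aggregation_norm for s in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")
```

Unlike "mean", the "sum" and "norm" modes produce vectors whose magnitude grows with the number of atoms.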

Cross-Validation

[[commit](https://github.com/chemprop/chemprop/commit/8a2ad6ceec67f0ad2ee83e58c15ed743e824f77f)] The default split type (i.e., `--split_type random`) randomly samples data into the train, validation, and test sets on each of the `num_folds` folds independently. This means that the same molecule can end up in the test split on more than one fold. The advantage of this method is that it can be used easily with an arbitrary number of folds, but the downside is that it does not perform strict cross-validation.

The new split type cv (`--split_type cv`) performs true cross-validation. The data is broken down into `num_folds` pieces, each of size `len(data) / num_folds`, and each piece serves as the test split once, the validation split once, and part of the train split on all other folds. The benefit of this method is that it is true cross-validation, but the downside is that the size of each split is dependent on the number of folds, meaning less flexibility (e.g., `--num_folds 3` will result in train, validation, and test splits each with 33.3% of the data, which is perhaps too small for the train split and too large for the test split). `--num_folds 10` is recommended.

Saving Test Predictions

[[commit](https://github.com/chemprop/chemprop/commit/8d5d0c61833a658c7e329bcb72e599334820146c)] The `--save_preds` option will save predictions on the test split of each fold in a file called “test_preds.csv” in the `save_dir`.

Multiple Metrics

[[commit](https://github.com/chemprop/chemprop/commit/46b9f642fa38e9962b310bfadf1422a8e30c3457)] The `--metric` argument still works as before and this is still the metric that is used for early stopping (i.e., selecting the model which performs best on the validation split), but now there is an additional `--extra_metrics` argument where additional metrics can be specified and will be recorded. The metrics should be space separated (e.g., `--extra_metrics mae rmse r2`).
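
The relationship between the primary metric and the extra metrics can be sketched as follows. The metric formulas are the standard definitions; the `evaluate` function and the `METRICS` table are hypothetical names for this example.

```python
import math
from typing import Callable, Dict, List, Sequence


def mae(preds: List[float], targets: List[float]) -> float:
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def rmse(preds: List[float], targets: List[float]) -> float:
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


def r2(preds: List[float], targets: List[float]) -> float:
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


METRICS: Dict[str, Callable[[List[float], List[float]], float]] = {
    "mae": mae, "rmse": rmse, "r2": r2,
}


def evaluate(preds: List[float], targets: List[float], metric: str,
             extra_metrics: Sequence[str] = ()) -> Dict[str, float]:
    """Score with the primary metric first, then any extra metrics."""
    names = [metric] + [m for m in extra_metrics if m != metric]
    return {name: METRICS[name](preds, targets) for name in names}
```

Only the first entry would drive early stopping; the remaining entries are recorded alongside it.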

Saving Test Scores

[[commit](https://github.com/chemprop/chemprop/commit/a225ef0328f49c610b8480b8fa4af1acad42898f#diff-728bf686eee83c7034ef4a09fd7fd790b856e03af17adbf22c40e5366da58e16)] Scores on the test splits are now saved to file in the `save_dir` under the name “test_scores.csv”.

Fixes and Improvements

Undefined Rows

[[commit](https://github.com/chemprop/chemprop/commit/c5d354502084c53e34ca10ef964f407e8a7b2323)] Rows in the input data file with target values that are all undefined are now correctly skipped. This is especially relevant when the row may contain some defined target values, but none of those targets are included in `target_columns`.
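
The skipping rule can be illustrated with a short filter over rows read from a CSV file. The sketch assumes that an undefined target is an empty cell, and the function name is hypothetical.

```python
from typing import Dict, List


def keep_rows_with_targets(rows: List[Dict[str, str]],
                           target_columns: List[str]
                           ) -> List[Dict[str, str]]:
    """Drop rows in which every selected target value is undefined."""
    return [
        row for row in rows
        # Only the selected target columns count; other columns are ignored.
        if any(row.get(column, "").strip() != "" for column in target_columns)
    ]
```

A row with a value in some other target column is still dropped when none of the columns listed in `target_columns` is defined.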

Data Loading

[[commit](https://github.com/chemprop/chemprop/commit/befaabc9642d0979d1dd5cef2fe6cc7a4150fb81)] Data is now only loaded once to decrease training time.

Tests

[[tests](https://github.com/chemprop/chemprop/blob/master/tests/test_integration.py)] Added more comprehensive tests to ensure correct functionality.

Train Loss

[[commit](https://github.com/chemprop/chemprop/commit/e9a28dea80bc6d255e380dfd10b14549d8c687cc)] Fixed incorrect averaging of the train loss, which affects the train loss that is printed to screen and saved in tensorboard.

1.0.2

Since descriptastorus isn't on PyPI, it can't be installed automatically via `pip install chemprop`. Instead, it must be installed separately via `pip install git+https://github.com/bp-kelley/descriptastorus`.

1.0.1

Fixed an issue with PyPI installation and updated the relevant documentation.
