Features
Multiple Input Molecules
[[PR](https://github.com/chemprop/chemprop/pull/76)] Use multiple molecules as an input to chemprop. The number of molecules is specified with the keyword `number_of_molecules`. Those molecules are embedded with a separate D-MPNN by default. The latent representations are concatenated prior to the FFN.
The keyword `mpn_shared` allows you to use a shared D-MPNN. Note that, since the latent representations are concatenated, the order of the input molecules is important. This approach is not invariant to the order of the molecules; there are better ways to use multiple molecules with a shared D-MPNN, which will be implemented for the next release.
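To illustrate why the order of the input molecules matters, here is a minimal NumPy sketch. It does not use chemprop; the latent vectors and the single linear layer standing in for the FFN are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent vectors for two input molecules (stand-ins for D-MPNN outputs).
mol_a = rng.normal(size=4)
mol_b = rng.normal(size=4)

# Hypothetical weights of one linear layer standing in for the FFN.
ffn_weights = rng.normal(size=8)


def predict(first, second):
    """Concatenate the two latent vectors, then apply the stand-in FFN."""
    return float(np.concatenate([first, second]) @ ffn_weights)


pred_ab = predict(mol_a, mol_b)
pred_ba = predict(mol_b, mol_a)

# Swapping the molecules reorders the concatenated vector, so the outputs differ.
print(pred_ab, pred_ba)
```

Sharing the encoder does not change this: the concatenation step alone makes the prediction depend on the input order.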
Custom Atom Features
[[PR](https://github.com/chemprop/chemprop/pull/82)] Implemented custom atomic features as a counterpart of the custom molecular features in ChemProp. The new feature allows users to provide additional atomic features for each node in a given molecule. To use it, pass the keyword `atom_descriptors`. The custom atom features can be employed in two modes. In the first mode, `--atom_descriptors feature`, custom features are used as normal node features and are concatenated to the default node vector before the D-MPNN block. In the second mode, `--atom_descriptors descriptor`, custom atom features do not participate in the model until the atom feature vector has been updated by the D-MPNN block. That is, the `--atom_descriptors descriptor` mode leaves the extra custom atom features largely undisturbed and preserves their information to the greatest extent possible.
The extra custom descriptors can be passed to ChemProp through a pickle file (`.pkl`, `.pickle`, `.pckl`), a NumPy save file (`.npz`), or an `.sdf` file.
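The difference between the two modes can be pictured with a loose NumPy sketch. This is not chemprop's implementation: `toy_message_passing`, the placeholder connectivity, and all dimensions are assumptions chosen only to show where the concatenation happens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, n_default, n_custom, hidden = 5, 6, 2, 4

default_feats = rng.normal(size=(n_atoms, n_default))  # stand-in for default atom features
custom_feats = rng.normal(size=(n_atoms, n_custom))    # stand-in for user-supplied features
adjacency = np.eye(n_atoms)                            # placeholder connectivity (self-loops only)


def toy_message_passing(x, out_dim):
    """Hypothetical stand-in for the D-MPNN block: one linear map plus ReLU."""
    weights = rng.normal(size=(x.shape[1], out_dim))
    return np.maximum(adjacency @ x @ weights, 0.0)


# "feature" mode: custom features are concatenated BEFORE message passing.
feature_mode = toy_message_passing(np.hstack([default_feats, custom_feats]), hidden)

# "descriptor" mode: custom features are concatenated AFTER message passing,
# so they reach the downstream layers unchanged.
descriptor_mode = np.hstack([toy_message_passing(default_feats, hidden), custom_feats])

print(feature_mode.shape, descriptor_mode.shape)
```

In the second case the last `n_custom` columns are exactly the values the user supplied, which is the sense in which the information is preserved.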
`.pkl` format
The `.pkl` file must store a pandas DataFrame with SMILES as the index and descriptors as the columns. Each descriptor must be a 1D or 2D NumPy array. For example:
One custom atomic feature for each atom, provided in a 1D array:
smiles descriptors
CCOc1ccc2nc(S(N)(=O)=O)sc2c1 [0.637781931055927, 0.7075571757878132, 0.7339...
CCN1C(=O)NC(c2ccccc2)C1=O [0.09588231301387817, 0.6521911050735447, 0.45...
Multiple atomic features for each atom, provided in multiple 1D arrays:
smiles desc1 desc2
CCOc1ccc2nc(S(N)(=O)=O)sc2c1 [0.637781931055927, 0.7075571757878132... [0.8266363223032338, 0.89641156703512 ...
CCN1C(=O)NC(c2ccccc2)C1=O [0.09588231301387817, 0.6521911050735447... [0.2847367042611851, 0.8410454963208516...
Note: mixing 1D and 2D arrays across different columns is not allowed.
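A DataFrame with this layout could be built and pickled roughly as follows. This is a sketch with placeholder random values; the atom counts (16 and 15) are the heavy-atom counts of the two example SMILES and assume that hydrogens are not treated as separate atoms.

```python
import os
import tempfile

import numpy as np
import pandas as pd

smiles = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1", "CCN1C(=O)NC(c2ccccc2)C1=O"]
n_atoms = [16, 15]  # assumed heavy-atom counts for the two molecules above

rng = np.random.default_rng(0)

# One 1D array per molecule and per descriptor column (placeholder values).
df = pd.DataFrame(
    {
        "desc1": [rng.random(n) for n in n_atoms],
        "desc2": [rng.random(n) for n in n_atoms],
    },
    index=pd.Index(smiles, name="smiles"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "descriptors.pkl")
    df.to_pickle(path)
    loaded = pd.read_pickle(path)  # round trip to check the stored layout

print(loaded)
```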
`.npz` file
Atomic descriptors for each molecule must be saved as a separate 2D NumPy array of shape [number of atoms x number of descriptors] in the `.npz` file, for example by:
```python
import numpy as np
np.savez('descriptors.npz', *descriptors)
```
where `descriptors` is a list of 2D arrays of atomic descriptors, one per molecule, in the same order as the molecules in the training or prediction data file.
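Because the arrays are passed positionally, `np.savez` stores them under the keys `arr_0`, `arr_1`, and so on, which is how the molecule order is preserved. A small round-trip sketch with placeholder values (an in-memory buffer is used here instead of a file on disk):

```python
import io

import numpy as np

# Placeholder descriptors: two molecules with 16 and 15 atoms, 2 descriptors each.
descriptors = [np.zeros((16, 2)), np.ones((15, 2))]

buffer = io.BytesIO()
np.savez(buffer, *descriptors)  # positional arrays are named arr_0, arr_1, ...
buffer.seek(0)

with np.load(buffer) as archive:
    keys = list(archive.files)
    # Rebuild the list in molecule order from the positional keys.
    loaded = [archive[f"arr_{i}"] for i in range(len(keys))]

print(keys, [a.shape for a in loaded])
```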
`.sdf` file
Each molecule is represented as a mol block in the `.sdf` file. Descriptors should be saved as entries for each mol block in comma-separated format. Each molecule must have an entry named SMILES that stores the SMILES string. For example:
CHEMBL1308_loner5
RDKit 3D
6 6 0 0 1 0 0 0 0 0999 V2000
-0.7579 -0.5337 -2.8744 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2229 -1.3763 -1.7558 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0046 -1.0089 -0.4029 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4824 -2.0104 0.3280 N 0 0 0 0 0 0 0 0 0 0 0 0
0.5806 -3.0317 -0.5484 N 0 0 0 0 0 0 0 0 0 0 0 0
0.1735 -2.6999 -1.8031 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 6 2 0
2 3 1 0
3 4 2 0
4 5 1 0
5 6 1 0
M END
> <desc1> (1)
-8.568031e-05,0.0001865207,-0.0002012379,-5.054658e-05,0.0002148434,-0.0003503839,1.970448e-05,3.081137e-05,2.997883e-05,9.446278e-05,-7.194711e-05,0.0001527364
> <desc2> (1)
5.462954e-05,-2.415399e-06,0.0001044788,-2.274438e-05,0.0001698836,5.206409e-06,4.5825e-06,-8.882181e-07,-1.08787e-05,2.993307e-05,-4.069051e-06,1.338413e-05
> <SMILES> (1)
Cc1cn[nH]c1
$$$$
where the names of the descriptor entries (`desc1`, `desc2`) can be arbitrary.
When using this feature, users are responsible for all preprocessing of the atomic features, including feature normalization and expansion.
Note: This feature was developed for small-to-medium sized training datasets, where extra QM descriptors have been shown to be powerful and to slow the degradation of model performance.
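One possible way to handle the normalization step is z-scoring each descriptor over all atoms in the training set. The sketch below is only one option, uses placeholder values, and is not something chemprop does for you.

```python
import numpy as np

# Placeholder descriptors: one [n_atoms x n_descriptors] array per molecule.
descriptors = [
    np.array([[1.0, 10.0], [3.0, 30.0]]),
    np.array([[5.0, 50.0]]),
]

# Compute statistics over every atom of every training molecule.
stacked = np.vstack(descriptors)
mean = stacked.mean(axis=0)
std = stacked.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant descriptors

normalized = [(d - mean) / std for d in descriptors]
print(normalized)
```

The same `mean` and `std` would then be reused for the prediction data so that both sets share one scale.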
Options for Aggregation Function
[[PR](https://github.com/chemprop/chemprop/pull/87)] By default, at the end of message passing, the D-MPNN aggregates atom hidden representations into a single hidden representation for the whole molecule by taking the mean of the atom representations. Now, this aggregation function can be changed by using `--aggregate <mode>`, which currently supports “mean” (the default), “sum”, and “norm” (which is equivalent to “sum” with normalization by the constant specified by `--aggregation_norm`).
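The three modes can be written out in a few lines of NumPy. The function name `aggregate` and the constant 100 used here are illustrative choices, not chemprop's code or its default value.

```python
import numpy as np


def aggregate(atom_hiddens, mode="mean", aggregation_norm=100.0):
    """Pool per-atom hidden vectors [n_atoms x hidden] into one molecule vector."""
    if mode == "mean":
        return atom_hiddens.mean(axis=0)
    if mode == "sum":
        return atom_hiddens.sum(axis=0)
    if mode == "norm":
        # "sum" divided by a fixed constant instead of the number of atoms.
        return atom_hiddens.sum(axis=0) / aggregation_norm
    raise ValueError(f"unknown aggregation mode: {mode}")


atom_hiddens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mean_vec = aggregate(atom_hiddens, "mean")
sum_vec = aggregate(atom_hiddens, "sum")
norm_vec = aggregate(atom_hiddens, "norm", aggregation_norm=100.0)
print(mean_vec, sum_vec, norm_vec)
```

Unlike "mean", the "sum" and "norm" modes keep the molecule vector dependent on the number of atoms, which can matter for size-dependent properties.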
Cross-Validation
[[commit](https://github.com/chemprop/chemprop/commit/8a2ad6ceec67f0ad2ee83e58c15ed743e824f77f)] The default split type (i.e., `--split_type random`) randomly samples data into the train, validation, and test sets on each of the `num_folds` folds independently. This means that the same molecule can end up in the test split on more than one fold. The advantage of this method is that it can be used easily with an arbitrary number of folds, but the downside is that it does not perform strict cross-validation.
The new split type `cv` (`--split_type cv`) performs true cross-validation. The data is broken down into `num_folds` pieces, each of size `len(data) / num_folds`, and each piece serves as the test split once, the validation split once, and part of the train split on all other folds. The benefit of this method is that it is true cross-validation, but the downside is that the size of each split depends on the number of folds, meaning less flexibility (e.g., `--num_folds 3` will result in train, validation, and test splits each with 33.3% of the data, which is perhaps too small for the train split and too large for the test split). `--num_folds 10` is recommended.
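The rotation described above can be sketched as follows. The helper `cv_splits` is hypothetical, and the choice of which piece serves as the validation split on a given fold (here, the piece after the test piece) is an assumption; the actual assignment in chemprop may differ.

```python
import numpy as np


def cv_splits(n_items, num_folds, seed=0):
    """Yield (train, val, test) index arrays, rotating the pieces across folds."""
    if num_folds < 3:
        raise ValueError("need at least 3 folds for train, validation, and test")
    rng = np.random.default_rng(seed)
    pieces = np.array_split(rng.permutation(n_items), num_folds)
    for fold in range(num_folds):
        test_piece = fold
        val_piece = (fold + 1) % num_folds  # assumed rotation scheme
        train = np.concatenate(
            [p for i, p in enumerate(pieces) if i not in (test_piece, val_piece)]
        )
        yield train, pieces[val_piece], pieces[test_piece]


splits = list(cv_splits(n_items=30, num_folds=10))
for train, val, test in splits[:2]:
    print(len(train), len(val), len(test))
```

With 10 folds, each fold trains on 80% of the data and holds out 10% each for validation and testing, and every item appears in exactly one test split.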
Saving Test Predictions
[[commit](https://github.com/chemprop/chemprop/commit/8d5d0c61833a658c7e329bcb72e599334820146c)] The `--save_preds` option will save predictions on the test split of each fold in a file called “test_preds.csv” in the `save_dir`.
Multiple Metrics
[[commit](https://github.com/chemprop/chemprop/commit/46b9f642fa38e9962b310bfadf1422a8e30c3457)] The `--metric` argument still works as before and this is still the metric that is used for early stopping (i.e., selecting the model which performs best on the validation split), but now there is an additional `--extra_metrics` argument where additional metrics can be specified and will be recorded. The metrics should be space separated (e.g., `--extra_metrics mae rmse r2`).
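For reference, the three metrics named in the example can be computed as below. This is a plain NumPy sketch with made-up values, not chemprop's metric code; the `primary_metric` and `extra_metrics` variables only mimic the roles of the two arguments.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])


def mae(t, p):
    return float(np.mean(np.abs(t - p)))


def rmse(t, p):
    return float(np.sqrt(np.mean((t - p) ** 2)))


def r2(t, p):
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - np.mean(t)) ** 2)
    return float(1.0 - ss_res / ss_tot)


metric_funcs = {"mae": mae, "rmse": rmse, "r2": r2}

primary_metric = "rmse"        # role of --metric: would drive early stopping
extra_metrics = ["mae", "r2"]  # role of --extra_metrics: recorded only

scores = {name: metric_funcs[name](y_true, y_pred) for name in [primary_metric, *extra_metrics]}
print(scores)
```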
Saving Test Scores
[[commit](https://github.com/chemprop/chemprop/commit/a225ef0328f49c610b8480b8fa4af1acad42898f#diff-728bf686eee83c7034ef4a09fd7fd790b856e03af17adbf22c40e5366da58e16)] Scores on the test splits are now saved to file in the `save_dir` under the name “test_scores.csv”.
Fixes and Improvements
Undefined Rows
[[commit](https://github.com/chemprop/chemprop/commit/c5d354502084c53e34ca10ef964f407e8a7b2323)] Rows in the input data file with target values that are all undefined are now correctly skipped. This is especially relevant when the row may contain some defined target values, but none of those targets are included in `target_columns`.
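The situation described here can be reproduced with a small standard-library sketch. The column names and values are invented, and the filtering logic is an illustration of the behavior, not chemprop's data loader.

```python
import csv
import io

# Toy data file: "CCN" has a defined value, but only in a column that is not selected.
raw = "smiles,tox,sol\nCCO,1,\nCCN,,0.5\nCCC,,\n"
target_columns = ["tox"]

rows = list(csv.DictReader(io.StringIO(raw)))

# Keep a row only if at least one of the SELECTED targets is defined.
kept = [row for row in rows if any(row[col] != "" for col in target_columns)]

kept_smiles = [row["smiles"] for row in kept]
print(kept_smiles)
```

Only `CCO` survives: `CCN` is skipped because its single defined target (`sol`) is not in `target_columns`.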
Data Loading
[[commit](https://github.com/chemprop/chemprop/commit/befaabc9642d0979d1dd5cef2fe6cc7a4150fb81)] Data is now only loaded once to decrease training time.
Tests
[[tests](https://github.com/chemprop/chemprop/blob/master/tests/test_integration.py)] Added more comprehensive tests to ensure correct functionality.
Train Loss
[[commit](https://github.com/chemprop/chemprop/commit/e9a28dea80bc6d255e380dfd10b14549d8c687cc)] Fixed incorrect averaging of the train loss, which affects the train loss that is printed to screen and saved in tensorboard.