New feature: The `PapyrusDataset` class allows for object-oriented 'pandas-style' querying.

Changes:
- `reader.read_papyrus`: raises an error when trying to load the Papyrus++ set with stereochemistry
- `preprocess.keep_source`: argument `source` now uses regex matching
- `preprocess.keep_organism`: argument `organism` is now case-insensitive when `generic_regex=False`
- `download.download_papyrus`: now also downloads the README files

Additions:
- `preprocess.keep_not_match`: keep records whose specified column does not match the specified value
- `preprocess.keep_not_contains`: keep records whose specified column does not contain the specified value
- `preprocess.keep_dissimilar`: keep records whose molecules are not similar to the provided molecule
- `preprocess.keep_not_substructure`: keep records whose molecules are not substructures of the provided molecule

Bug fixes:
- ***keep_source*** now returns an empty dataframe for chunks in which the desired source does not appear
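
The regex matching of `keep_source` and the empty result for chunks without a match can be illustrated with a minimal, standard-library-only sketch. This is not the library's code: the helper `filter_chunk_by_source`, the record layout, and the sample source names are hypothetical and only mimic the behaviour described in these notes.

```python
import re


def filter_chunk_by_source(chunk, pattern):
    """Keep records whose 'source' field matches the regex `pattern`.

    Hypothetical helper working on plain dicts; it is not the
    papyrus_scripts implementation.
    """
    regex = re.compile(pattern)
    # A chunk without any matching record yields an empty list, never None.
    return [record for record in chunk if regex.search(record["source"])]


# Two illustrative chunks; the second one holds no ChEMBL record.
chunks = [
    [{"id": 1, "source": "ChEMBL30"}, {"id": 2, "source": "ExCAPE-DB"}],
    [{"id": 3, "source": "Sharma2016"}],
]

filtered = [filter_chunk_by_source(chunk, r"^ChEMBL\d+$") for chunk in chunks]
```

Returning an empty container for every chunk keeps downstream code that concatenates chunk results uniform, since no chunk needs special handling.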

New features:
- ***qsar*** and ***pcm***'s **split_by** argument now supports **'custom-cluster'** to split training and test sets according to a custom assignment that does not directly specify train/test membership (as is the case when its value is **'custom'**).
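
The idea behind a cluster-based custom split can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper `split_by_custom_cluster`, the `test_fraction` and `seed` parameters, and the random choice of which clusters go to the test set are all hypothetical. The only property taken from the notes above is that the user supplies group labels rather than explicit train/test labels.

```python
import random


def split_by_custom_cluster(cluster_of, test_fraction=0.25, seed=0):
    """Assign whole clusters to either the training or the test set.

    `cluster_of` maps an item identifier to a user-defined cluster label.
    Hypothetical helper; the selection of test clusters is an assumption.
    """
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    # At least one cluster always ends up in the test set.
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = sorted(k for k, c in cluster_of.items() if c not in test_clusters)
    test = sorted(k for k, c in cluster_of.items() if c in test_clusters)
    return train, test


# Illustrative assignment of six molecules to four custom clusters.
cluster_of = {
    "mol_a": 0, "mol_b": 0, "mol_c": 1,
    "mol_d": 2, "mol_e": 2, "mol_f": 3,
}
train, test = split_by_custom_cluster(cluster_of)
```

Because clusters are never divided, members of the same cluster cannot appear on both sides of the split.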
1.0.2
- Made the download disclaimer and the errors due to low disk space more evident
- `papyrus_scripts.utils.IO.process_data_version` now raises an exception stating "Papyrus data not available (did you download it first?)"
1.0.1
The Papyrus++ datasets contained duplicated data wrongly associated with multiple assay types (i.e. Ki, KD, EC50, IC50).
The datasets have been updated, and the links of this release and of the `db-links` branch have been updated accordingly.
1.0
Version 1.0 of the Papyrus-scripts library.
Allows one to:
- download the Papyrus dataset
- convert it between XZ and GZIP compression, in either direction
- match the data to structures of the Protein Data Bank
- create FPSubSim2 (extension of FPSim2) files for similarity and substructure searches
- filter the Papyrus data
- model it with QSAR and PCM models
- remove the data files
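
For readers curious about what the XZ/GZIP conversion amounts to, the sketch below recompresses a file with Python's standard `lzma` and `gzip` modules. It does not use the library's own conversion code; the function names `xz_to_gzip` and `gzip_to_xz`, the file names, and the sample payload are hypothetical.

```python
import gzip
import lzma
import shutil
import tempfile
from pathlib import Path


def xz_to_gzip(src, dst):
    """Stream-decompress an XZ file and recompress it as GZIP."""
    with lzma.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def gzip_to_xz(src, dst):
    """Stream-decompress a GZIP file and recompress it as XZ."""
    with gzip.open(src, "rb") as fin, lzma.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Round-trip demonstration on a small temporary file.
payload = b"id\tvalue\n1\t6.5\n"
with tempfile.TemporaryDirectory() as tmp:
    xz_path = Path(tmp) / "example.tsv.xz"
    gz_path = Path(tmp) / "example.tsv.gz"
    with lzma.open(xz_path, "wb") as handle:
        handle.write(payload)
    xz_to_gzip(xz_path, gz_path)
    with gzip.open(gz_path, "rb") as handle:
        roundtrip = handle.read()
```

Streaming with `shutil.copyfileobj` avoids loading the whole decompressed file into memory, which matters for large data files.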