Wilds

Latest version: v2.0.0

2.0

- Previously, the ID split was done uniformly at random, meaning that images from the same sequence (i.e., taken within a few seconds of each other by the same camera) could be found across all of the training / validation (ID) / test (ID) sets.
- In v2.0, we have redone the ID split so that all images taken on the same day by the same camera are in only one of the training, validation (ID), or test (ID) sets. In other words, these sets still comprise images from the same cameras, but taken on different days.
- In line with the new iWildCam 2021 challenge on Kaggle, we have also removed the following images:
  - images that include humans or pictures taken indoors.
  - images with non-animal categories such as `start` and `unidentifiable`.
  - images in categories such as `unknown`, `unknown raptor` and `unknown rat`.
- We added back in location 537 that was previously removed as we mistakenly believed those images were corrupted.
- We have re-split the data into training, validation (ID), test (ID), validation (OOD), and test (OOD) sets. This is a different random split from v1.0.
- Since we remove any classes that do not end up in the train split, removing those images and redoing the split gave us a different set of species. There are now 182 classes instead of 186. Specifically, the following classes have been removed: `['unknown', 'macaca fascicularis', 'proechimys sp', 'unidentifiable', 'turtur calcospilos', 'streptopilia senegalensis', 'equus africanus', 'macaca nemestrina', 'start', 'paleosuchus sp', 'unknown raptor', 'unknown rat', 'misfire', 'mustela lutreolina', 'canis latrans', 'myoprocta pratti', 'xerus rutilus', 'end', 'psophia crepitans', 'ictonyx striatus']`. The following classes have been added: `['praomys tullbergi', 'polyplectron chalcurum', 'ardeotis kori', 'phaetornis sp', 'mus minutoides', 'raphicerus campestris', 'tigrisoma mexicanum', 'leptailurus serval', 'malacomys longipes', 'oenomys hypoxanthus', 'turdus olivaceus', 'macaca sp', 'leiothrix argentauris', 'lophura sp', 'mazama temama', 'hippopotamus amphibius']`. For convenience, we have also added a `categories.csv` that maps from label IDs to species names.
- To speed up downloading and model training (by reducing the I/O bottleneck), we have also resized all images to have a height of 448px while keeping the original aspect ratio. All images are wider than they are tall, so their smaller dimension is now 448px. Note that as JPEG compression is lossy, this procedure gives different images from resizing the full-sized image in the code after loading it.
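
The day-and-camera grouping described in the first two items above can be illustrated with a small sketch. This is not the code used to build the dataset: the record keys (`id`, `camera`, `day`), the split fractions, and the function name are assumptions made for illustration. The point it shows is that whole (camera, day) groups are assigned to a split, so images from one sequence cannot end up in two different sets.

```python
import random
from collections import defaultdict


def split_by_camera_day(images, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole (camera, day) groups to train / val_id / test_id.

    `images` is a list of dicts with hypothetical keys "id", "camera", "day".
    `fractions` are illustrative group fractions, not the real dataset's.
    """
    # Collect image ids per (camera, day) group.
    groups = defaultdict(list)
    for img in images:
        groups[(img["camera"], img["day"])].append(img["id"])

    # Shuffle the groups (not the images) so each group stays together.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    names = ("train", "val_id", "test_id")
    splits = {name: [] for name in names}
    cut1 = int(fractions[0] * len(keys))
    cut2 = cut1 + int(fractions[1] * len(keys))
    for i, key in enumerate(keys):
        name = names[0] if i < cut1 else names[1] if i < cut2 else names[2]
        splits[name].extend(groups[key])
    return splits
```

Splitting uniformly over individual images, as in the previous version, corresponds to shuffling `images` directly, which is what allowed near-duplicate frames to leak across sets.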

Minor updates to existing datasets
We made two backwards-compatible changes to existing datasets. We encourage all users to update these datasets; these updates should leave results unchanged (modulo training randomness). In future versions of the WILDS package, we will deprecate the older versions of these datasets.

1.2.2

- Added a check to make sure that a group data loader is used whenever `n_groups_per_batch` or `distinct_groups` are passed in as arguments to `examples/run_expt.py`. (https://github.com/p-lambda/wilds/issues/79)
- Data augmentations now only transform `x` by default. Set `do_transform_y` when initializing the `WILDSSubset` to modify both `x` and `y`. (https://github.com/p-lambda/wilds/issues/77)
- For FasterRCNN, we now use the PyTorch implementation of `smooth_l1_loss` instead of the custom torchvision implementation, which was removed in torchvision v0.10.
- Updated the requirements to include torchvision, scipy, and scikit-learn. Previously, torchvision was only needed for the example scripts. However, it is now also used for computing metrics in the GlobalWheat-WILDS dataset, so we have moved it into the core set of requirements.
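
To make the `do_transform_y` behavior above concrete, here is a minimal sketch. `SubsetSketch` is a hypothetical stand-in written for this note, not the actual `WILDSSubset` class; only the `do_transform_y` flag name comes from the release notes, and the rest of the interface is assumed.

```python
class SubsetSketch:
    """Hypothetical stand-in for a dataset subset; not the WILDS class itself."""

    def __init__(self, examples, transform=None, do_transform_y=False):
        self.examples = examples          # list of (x, y, metadata) tuples
        self.transform = transform
        self.do_transform_y = do_transform_y

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        x, y, metadata = self.examples[idx]
        if self.transform is not None:
            if self.do_transform_y:
                # The transform receives and returns both x and y.
                x, y = self.transform(x, y)
            else:
                # Default: only the input is transformed; y is left as-is.
                x = self.transform(x)
        return x, y, metadata
```

With the default setting, a transform is a one-argument function of `x`; with `do_transform_y=True`, it must accept and return the pair `(x, y)`.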

1.2.1

This release also simplifies saving and evaluating predictions made across different replicates and datasets.

New datasets

1.1

- Previously, the images were stored in a single .npy file and read in using NumPy memmapping.
- Now, we have converted them (losslessly) into individual compressed .npz files. This should help with disk I/O and memory usage.
- We have correspondingly updated the default number of workers for the data loader from 1 to 4.

Default model updates
We have updated the default models for several datasets. Please take note of these changes if you are currently running experiments with these datasets.

Amazon and CivilComments
- To speed up model training, we have switched from BERT-base-uncased to DistilBERT-base-uncased. This obtains roughly similar accuracy but at twice the speed.
- For CivilComments, we have also increased the number of replicates from 3 to 5, to reduce variability in the reported performance.

Camelyon17
- Previously, we were upsizing each image to 224x224 before passing it into the model.
- We now leave the images at their original resolution of 96x96, which significantly speeds up model training.

iWildCam
- Previously, we were resizing each image to 224x224 before passing it into the model. However, this limited model accuracy, as the animals in the images can sometimes be quite small.
- We now resize each image to 448x448 before passing it into the model, which improves accuracy and macro F1 across the board.

FMoW
- For consistency with the other datasets, we have changed the early stopping validation criterion (`val_metric`) from `acc_avg` to `acc_worst_region`.
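
As a rough illustration of what switching the criterion means, the sketch below computes a worst-region accuracy and uses it to pick a checkpoint. The function names and the toy inputs are assumptions; the actual metric is computed by the dataset's own `eval` method.

```python
from collections import defaultdict


def acc_worst_region(y_pred, y_true, regions):
    """Return the lowest per-region accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, true, region in zip(y_pred, y_true, regions):
        total[region] += 1
        correct[region] += int(pred == true)
    return min(correct[r] / total[r] for r in total)


def select_epoch(val_metric_per_epoch):
    """Pick the epoch index with the highest validation metric (first on ties)."""
    return max(range(len(val_metric_per_epoch)),
               key=lambda epoch: val_metric_per_epoch[epoch])
```

Selecting on the worst region rather than the average favors checkpoints that do not sacrifice the hardest region for overall accuracy.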

PovertyMap
- For consistency with the other datasets, we have changed the early stopping validation criterion (`val_metric`) from `r_all` to `r_wg`.

Other changes
- We have uploaded an executable version of our paper to [CodaLab](https://worksheets.codalab.org/worksheets/0x52cea64d1d3f4fa89de326b4e31aa50a). This contains the exact commands, code, and data used for each experiment reported in our paper. The trained model weights for every experiment can also be found there.
- To ease downloading, we have added `wilds/download_datasets.py`, which allows users to download all (or a subset of) datasets at once. Please see the README for instructions.
- We have added a convenience function for getting the appropriate constructor for each dataset in `wilds/get_dataset.py`. This function allows you to specify a `version` argument. If this is not specified, it defaults to the latest available version for that dataset. If that version is not downloaded and the `download` argument is also set, then it will automatically download that version.
- The example script `examples/run_expt.py` now also takes in a `version` argument.
- We have added download sizes and expected training times to the README.
- We have updated the default inputs for `WILDSDatasets.eval` methods for various datasets. For example, `eval` for most classification datasets now takes in predicted labels by default, while the predictions were previously passed in as logits. The default inputs vary across datasets, and we document this in the docstring of each `eval` method.
- We made a few updates to the code in `examples/` to interface better with language modeling tasks (for Py150). None of these changes affect the results or the interface with algorithms.
- We updated the code in `examples/` to save model predictions in an appropriate format for submissions to the leaderboard.
- Finally, we have also updated our [paper](https://arxiv.org/abs/2012.07421) to streamline the writing and include these new numbers and datasets.
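
The `version` defaulting rule described above for the dataset getter can be sketched as follows. This is an illustration only, not the code in `wilds/get_dataset.py`: `resolve_version`, its arguments, and its return value are hypothetical, and raising an error when the version is missing and `download` is not set is an assumption that the notes above do not confirm.

```python
def version_key(version):
    """Turn a version string such as '1.10' into a sortable tuple (1, 10)."""
    return tuple(int(part) for part in version.split("."))


def resolve_version(available, downloaded, version=None, download=False):
    """Return (version, needs_download) for a dataset.

    available:  version strings that exist for the dataset
    downloaded: version strings already present on disk
    """
    if version is None:
        # No version requested: default to the latest available one.
        version = max(available, key=version_key)
    elif version not in available:
        raise ValueError(f"Unknown dataset version: {version}")

    if version in downloaded:
        return version, False
    if download:
        return version, True
    # Assumed behavior: refuse to proceed rather than fall back silently.
    raise FileNotFoundError(f"Version {version} is not downloaded")
```

Comparing versions as integer tuples rather than strings keeps `1.10` ordered after `1.9`.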

1.1.0

The v1.1.0 release contains a new Py150 benchmark dataset for code completion, as well as updates to several existing datasets and default models to make them significantly faster and easier to use.

Some of these changes are breaking changes that will impact users who are currently running experiments with WILDS. We sincerely apologize for the inconvenience. We ask all users to update their package to v1.1.0, which will automatically update your datasets. In addition, please update your default models, for example by using the latest example scripts in this repo. These changes were primarily made to accelerate model training, which was a bottleneck for many users; at this time, we do not expect to have to make further changes to the existing datasets or default models.

New datasets

New benchmark dataset: Py150
- The Py150-WILDS dataset is a code completion dataset, where the distribution shift is over code from different GitHub repositories.
- We focus on accuracy on the subpopulation of class and method tokens, as prior work has shown that those are the most frequent queries in real-world code completion settings.
- It is a variant of the Py150 dataset from [Raychev et al., 2016](https://dl.acm.org/doi/10.1145/2983990.2984041).
- See our [paper](https://arxiv.org/abs/2012.07421) for more details.
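
The subpopulation accuracy mentioned above can be sketched as accuracy restricted to tokens of certain types. The function name, the token-type labels, and the input format below are assumptions for illustration; the dataset's own `eval` method defines the actual metric.

```python
def subpopulation_accuracy(pred_tokens, true_tokens, token_types,
                           keep=("class", "method")):
    """Accuracy computed only over tokens whose type is in `keep`."""
    hits = 0
    total = 0
    for pred, true, kind in zip(pred_tokens, true_tokens, token_types):
        if kind in keep:
            total += 1
            hits += int(pred == true)
    if total == 0:
        raise ValueError("No tokens of the requested types were found")
    return hits / total
```

Tokens of other types still appear in the model's input and output; they are simply excluded from the score.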

Additional dataset: SQF
- The SQF dataset is based on the stop-question-and-frisk dataset released by the New York Police Department. We adapt the version processed by [Goel et al., 2016](http://projecteuclid.org/euclid.aoas/1458909920). The task is to predict criminal possession of a weapon.
- We use this dataset to study distribution shifts in an algorithmic fairness context. Specifically, we consider subpopulation shifts across locations and race groups. However, while there are large performance gaps, we did not find that they were caused by the distribution shift. We therefore did not include this dataset as part of the official benchmark.

Major updates to existing datasets
Note that datasets are versioned separately from the main WILDS version. We have two major updates (i.e., breaking, non-backwards-compatible changes) to datasets.

1.0

- The RxRx1 dataset comprises images of genetically-perturbed cells taken with fluorescent microscopy and collected across 51 experimental batches. The task is to classify the identity of the genetic perturbation applied to each cell, and the distribution shift is over different experimental batches.
- Model performance is measured by average classification accuracy. The example script implements a ResNet-50 baseline.
- This dataset is adapted from the [RxRx1](https://www.rxrx.ai/rxrx1) dataset released by Recursion.

Additional dataset: ENCODE
- The ENCODE dataset is based on the [ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge](https://www.synapse.org/#!Synapse:syn6131484/wiki/402026). The task is to classify if a given genomic location will be bound by a particular transcription factor, and the distribution shift is over different cell types.
- We did not include this dataset in the official benchmark as we were unable to learn a model that could generalize across all the cell types simultaneously, even in an in-distribution setting, which suggested that the model family and/or feature set might not be rich enough.

Other changes

Saving and evaluating predictions
To ease evaluation and leaderboard submission, we have made the following changes:
- Predictions are now automatically saved in the format described in our [submission guidelines](https://wilds.stanford.edu/submit/).
- We have added an evaluation script that evaluates these saved predictions across multiple replicates and datasets. See the updated README and `examples/evaluate.py` for more details.

Code changes to support detection tasks
To support detection tasks, we have modified the example scripts as well as made slight changes to the WILDS data loaders. All interfaces should be backwards-compatible.
- The labels `y` and the model outputs no longer need to be a `Tensor`. For example, for detection tasks, a model might return a dictionary containing bounding box coordinates as well as class predictions for each bounding box. Accordingly, several helper functions have been rewritten to be more flexible.
- Models can now optionally take in `y` in the forward call. For example, during training, a model might use ground truth bounding boxes to train a bounding box classifier.
- Data transforms can now transform both `x` and `y`. We have also merged `train_transform` and `eval_transform` functions into a single function that takes an `is_training` parameter.
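
A horizontal flip shows why a detection transform needs to touch both `x` and `y`: if the image is mirrored, the bounding boxes must be mirrored too. The sketch below uses a toy image format (a list of pixel rows) and a hypothetical function signature. Only the idea of a single transform with an `is_training` switch comes from the notes above, and a real augmentation would flip at random rather than always.

```python
def flip_transform(x, y, is_training):
    """Horizontally flip an image and its boxes during training only.

    x: image as a list of pixel rows (toy format for illustration).
    y: dict with "boxes" as (x1, y1, x2, y2) tuples plus any other labels.
    """
    if not is_training:
        return x, y  # evaluation: leave inputs and labels unchanged

    width = len(x[0])
    flipped_x = [row[::-1] for row in x]
    # Mirror each box around the vertical center line of the image.
    flipped_boxes = [
        (width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in y["boxes"]
    ]
    return flipped_x, {**y, "boxes": flipped_boxes}
```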

Miscellaneous changes
- We have changed the names of the in-distribution `split_scheme` values to match the terminology in Section 5 of the updated [paper](https://arxiv.org/abs/2012.07421).
- The FMoW-WILDS and PovertyMap-WILDS constructors no longer use the `oracle_training_set` parameter to select an in-distribution split. This is now controlled through `split_scheme` to be consistent with the other datasets.
- We fixed a minor bug in the PovertyMap-WILDS in-distribution baseline. The Val (ID) and Test (ID) splits are slightly changed.
- The FMoW-WILDS constructor now sets `use_ood_val=True` by default. This change has no effect for users using the example scripts, as `use_ood_val` is already set in `config/datasets.py`.
- Users who are only using the data loaders and not the evaluation metrics or example scripts will no longer need to install `torch_scatter` (thanks Ke Alexander Wang).
- The Waterbirds dataset now computes the adjusted average accuracy on the validation and test sets, as described in Appendix C.1 of the corresponding [paper](https://arxiv.org/pdf/1911.08731.pdf).
- The behavior of `algorithm.eval()` is now consistent with `algorithm.model.eval()` in that both preserve the `grad_fn` attribute (thanks Divya Shanmugam). See https://github.com/p-lambda/wilds/issues/45.
- The dataset name for OGB-MolPCBA has been changed from `ogbg-molpcba` to `ogb-molpcba` for consistency.
- We have updated the OGB-MolPCBA data loader to be compatible with v1.7 of the `pytorch_geometric` dependency (thanks arnaudvl). See https://github.com/p-lambda/wilds/issues/52.
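
For the Waterbirds item above, the adjusted average accuracy can be sketched as a weighted average of per-group accuracies. The weighting by each group's share of the training set is our reading of the cited appendix and should be treated as an assumption; the function name and input format are hypothetical.

```python
def adjusted_average_accuracy(group_acc, train_group_counts):
    """Weight each group's accuracy by its share of the training set.

    group_acc:          accuracy per group on the validation or test set
    train_group_counts: number of training examples per group (assumed weights)
    """
    total = sum(train_group_counts[g] for g in group_acc)
    return sum(group_acc[g] * train_group_counts[g] / total for g in group_acc)
```

The adjustment matters when the validation and test sets have different group proportions from the training set, since a plain average over those sets would then weight the groups differently.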
