Train pipeline improvements
The train pipeline now lets the user specify whether all pipelined batches should be executed after the dataloader iterator is exhausted. Previously, when StopIteration was raised, the train pipeline would halt with the last two pipelined batches still unexecuted. A minimal usage sketch follows.
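A minimal sketch, assuming the `TrainPipelineSparseDist` API with an `execute_all_batches` flag; the flag name and import path may differ across TorchRec versions, and `model`, `optimizer`, `device`, and `dataloader` are placeholders:

```python
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

# model, optimizer, device, dataloader are assumed to be set up already.
pipeline = TrainPipelineSparseDist(
    model,
    optimizer,
    device,
    execute_all_batches=True,  # drain in-flight batches once the iterator is exhausted
)

epoch_iter = iter(dataloader)
try:
    while True:
        pipeline.progress(epoch_iter)  # overlapped memcpy / data_dist / train step
except StopIteration:
    pass  # with execute_all_batches=True, no pipelined batch is left unexecuted
```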
Core train pipeline logic has been refactored for better readability and maintainability.
The memcpy and data_dist streams are now created with high priority. We had observed kernel launches on these streams being scheduled late even when nothing on the GPU was blocking them, which stalled the CPU unnecessarily; raising the stream priority removes these stalls and yields measurable performance gains.
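For context, stream priority in PyTorch is requested at stream construction time. This is an illustrative snippet, not TorchRec's internal code:

```python
import torch

# Lower values mean higher priority; -1 selects the high-priority pool.
high_priority_stream = torch.cuda.Stream(priority=-1)

with torch.cuda.stream(high_priority_stream):
    # Kernels enqueued here are scheduled ahead of default-priority work.
    pass
```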
FX + Script Inference Module
Sharded quantized EmbeddingBagCollection and EmbeddingCollection are now torch.fx-traceable and TorchScript-scriptable (via `torch.jit.script(torch.fx.symbolic_trace(module))`), so they can be run under TorchScript for inference.
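A hedged sketch of that trace-then-script recipe; `quant_ebc` stands in for a sharded quantized EmbeddingBagCollection or EmbeddingCollection instance, and TorchRec also ships its own `torchrec.fx.symbolic_trace` for modules the stock tracer cannot handle:

```python
import torch
import torch.fx

# quant_ebc: a sharded quantized EmbeddingBagCollection/EmbeddingCollection.
graph_module = torch.fx.symbolic_trace(quant_ebc)  # capture the module as an FX graph
scripted = torch.jit.script(graph_module)          # compile the traced graph with TorchScript
scripted.save("quant_ebc_inference.pt")
```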
RecMetrics
Added an `include_logloss` option to the NE metric, which also returns log loss (cross-entropy loss) alongside NE.
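A hedged instantiation sketch mirroring the AUC example below; the import path and constructor arguments are assumptions:

```python
from torchrec.metrics.ne import NEMetric  # import path is an assumption

# include_logloss=True adds log loss to the metric output alongside NE.
ne = NEMetric(world_size=4, my_rank=0, batch_size=64, tasks=["t1"], include_logloss=True)
```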
Added a grouped AUC metric option. To use it, pass `grouped_auc=True` when instantiating the AUC metric, and provide an additional `grouping_keys` tensor to the `update` method specifying the `group_id` of each element along the batch dimension. Grouped AUC then computes an AUC per group and returns their average. A fuller end-to-end sketch follows the two examples below.
**Enable the grouped_auc during metric instantiation**
```python
auc = AUCMetric(world_size=4, my_rank=0, batch_size=64, tasks=["t1"], grouped_auc=True)
```
**Provide grouping keys during update**
```python
auc.update(predictions=predictions, labels=labels, weights=weights, grouping_keys=grouping_keys)
```
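Putting the two calls together, a hedged end-to-end sketch; the import path, the task-keyed dict shapes, and the `compute()` output format are assumptions, so check the RecMetrics docs for your version:

```python
import torch
from torchrec.metrics.auc import AUCMetric  # import path is an assumption

auc = AUCMetric(world_size=1, my_rank=0, batch_size=8, tasks=["t1"], grouped_auc=True)

# One batch of 8 examples split into two groups (ids 0 and 1).
predictions = {"t1": torch.rand(8)}
labels = {"t1": torch.randint(0, 2, (8,)).float()}
weights = {"t1": torch.ones(8)}
grouping_keys = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # group id per batch element

auc.update(predictions=predictions, labels=labels, weights=weights, grouping_keys=grouping_keys)
metrics = auc.compute()  # per-group AUCs averaged into a single value
```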
**Full Changelog**: https://github.com/pytorch/torchrec/commits/v0.4.0