TorchRec

Latest version: v1.0.0


0.5.0rc2

Install fbgemm via nova

0.5.0rc1

0.4.0

Train pipeline improvements
The train pipeline now allows the user to specify whether all pipelined batches should be executed after the dataloader iterator is exhausted. Previously, when StopIteration was raised, the train pipeline would halt with the last two pipelined batches still unexecuted.
Core train pipeline logic has been refactored for better readability and maintainability.
The memcpy and data_dist streams have been set to high priority. We had seen kernel launches get delayed in scheduling even with nothing on the GPU blocking them, which blocked the CPU unnecessarily; after this change we see performance gains.

FX + Script Inference Module
Sharded quantized EmbeddingBagCollection and EmbeddingCollection are now torch.fx-traceable and TorchScript-able (FX-trace the module, then compile it with torch.jit.script), so they can be run under TorchScript for inference.
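As a rough sketch of the trace-then-script recipe (using a trivial stand-in module here rather than a real sharded quantized EmbeddingBagCollection, which needs a full sharding setup):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a sharded, quantized EmbeddingBagCollection;
# the same trace-then-script recipe applies to the real inference module.
class TinyModule(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

module = TinyModule()
graph_module = torch.fx.symbolic_trace(module)  # 1. capture a graph with torch.fx
scripted = torch.jit.script(graph_module)       # 2. compile the graph with TorchScript
print(scripted(torch.ones(3)))
```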

RecMetrics
Added the `include_logloss` option to the NE metric, to return the log (cross-entropy) loss in addition to NE.
Added a grouped AUC metric option. To use it, set `grouped_auc=True` when instantiating the AUC metric, and pass an additional `grouping_keys` tensor to the `update` method that specifies the `group_id` of each element along the batch dimension. Grouped AUC then computes an AUC per group and returns their average.

**Enable grouped_auc during metric instantiation**
```python
auc = AUCMetric(world_size=4, my_rank=0, batch_size=64, tasks=["t1"], grouped_auc=True)
```

**Provide grouping keys during update**
```python
auc.update(predictions=predictions, labels=labels, weights=weights, grouping_keys=grouping_keys)
```

**Full Changelog**: https://github.com/pytorch/torchrec/commits/v0.4.0

0.3.2

KeyedJaggedTensor
We observed a performance regression due to a bottleneck in sparse data distribution for models that have multiple large KJTs to redistribute.

To address this, we altered the comms pattern to transport only the minimum data required in the initial collective to support the collective calls for the actual KJT tensor data. Sending this data, the 'splits', in the initial collective means more data is transmitted over the comms stream overall, but the CPU is blocked for a significantly shorter time, leading to better overall QPS.

Furthermore, we altered the TorchRec train pipeline to group the initial collective calls for the splits together before launching the more expensive KJT tensor collective calls. This pseudo-'fusing' minimizes the CPU-blocked time, since launching each subsequent input distribution no longer depends on the previous input distribution.

We no longer pass in a variable batch size to the sharder.
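For reference, here is a minimal KeyedJaggedTensor, the structure whose redistribution is optimized above; the small per-key lengths (the 'splits') are what the cheap initial collective now carries, ahead of the larger values payload:

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Two features ("f1", "f2") over a batch of two examples. `lengths` records,
# per key and per example, how many values belong to that slot; this small
# metadata is what gets exchanged before the larger `values` collective.
kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1", "f2"],
    values=torch.tensor([1, 2, 3, 4, 5, 6]),
    lengths=torch.tensor([2, 1, 1, 2]),
)
print(kjt.to_dict()["f1"].values())  # feature "f1" owns values [1, 2, 3]
```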

Planner

On the planner side, we introduced a new "early stopping" feature in GreedyProposer. This brings a 4X speedup to the planner when there are many proposals (>1000) to iterate over. To use the feature, add `threshold=10` to GreedyProposer (10 is the suggested value; it means GreedyProposer stops proposing after seeing 10 consecutive bad proposals). Secondly, we refactored the "deepcopy" logic in the planner code, which brings an 8X speedup in overall planning time. See [PR 665](https://github.com/pytorch/torchrec/pull/665) for details.
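A minimal sketch of enabling early stopping as described above (only the `threshold` argument from the note is shown; the rest of the planner wiring is unchanged):

```python
from torchrec.distributed.planner.proposers import GreedyProposer

# Stop proposing after 10 consecutive proposals that fail to improve on the
# best plan seen so far (10 is the value suggested in the release notes).
proposer = GreedyProposer(threshold=10)
```

The resulting proposer is then handed to the embedding sharding planner as usual.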

Pinning requirements
We are also pinning requirements to provide more stability for TorchRec users.

0.3.0

[Prototype] Simplified Optimizer Fusion APIs

We’ve provided a simplified and more intuitive API for setting fused optimizer settings via apply_optimizer_in_backward. This new approach enables specifying optimizer settings on a per-parameter basis, and sharded modules will configure [FBGEMM’s TableBatchedEmbedding modules accordingly](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops.py#L181). Additionally, this lets TorchRec’s planner account for optimizer memory usage, which should alleviate reports of sharding jobs OOMing when using Adam with a plan generated by the planner.
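A hedged sketch of the per-parameter API (the import path and positional arguments below are our reading of this release and may differ slightly):

```python
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.optim.apply_optimizer_in_backward import apply_optimizer_in_backward

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1", embedding_dim=16, num_embeddings=1000, feature_names=["f1"]
        )
    ]
)

# Fuse an SGD update into the backward pass of just these embedding parameters;
# when the module is later sharded, the planner can account for the fused
# optimizer's memory in the plan it generates.
apply_optimizer_in_backward(torch.optim.SGD, ebc.parameters(), {"lr": 0.02})
```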

[Prototype] Simplified Sharding APIs

We’re introducing the shard API, which allows you to shard only the embedding modules within a model and provides an alternative to the current main entry point, DistributedModelParallel. This gives you finer-grained control over the rest of the model, which can be useful for customized parallelization logic and for inference use cases (which may not require any parallelization of the dense layers). We’re also introducing construct_module_sharding_plan, which provides a simpler interface to the TorchRec sharder.

[Beta] Integration with FBGEMM's Quantized Comms Library

Applying [quantization or mixed precision](https://dlp-kdd.github.io/assets/pdf/a11-yang.pdf) to tensors in a collective call during model parallel training greatly improves training efficiency, with little to no effect on model quality. TorchRec now integrates with the [quantized comms library provided by FBGEMM GPU](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/fbgemm_gpu/quantize_comm.py) and provides an interface to construct encoders and decoders (codecs) that surround the all_to_all and reduce_scatter collective calls in the output_dist of a sharded module. You can also construct your own codecs to apply to your sharded module. The codecs provided by FBGEMM allow FP16, BF16, FP8, and INT8 compression, and you may use different quantizations for the forward pass and backward pass.
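As a sketch of how this is typically wired up (the module path and config names below are our reading of the FBGEMM/TorchRec integration; treat them as an assumption), you build a codec registry and hand it to the sharder:

```python
from torchrec.distributed.fbgemm_qcomm_codec import (
    CommType,
    QCommsConfig,
    get_qcomm_codecs_registry,
)

# FP16 quantization for the forward output_dist collectives, BF16 for the
# backward pass; the registry is then passed to the sharder (for example an
# EmbeddingBagCollectionSharder) when building the sharding plan.
qcomm_codecs_registry = get_qcomm_codecs_registry(
    qcomms_config=QCommsConfig(
        forward_precision=CommType.FP16,
        backward_precision=CommType.BF16,
    )
)
```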

Planner

* We removed several unnecessary copies inside the planner, which drastically decreases its runtime.
* Cleaned up the Topology interface (no longer takes in unrelated information like batch size).

0.2.0

Changelog

PyPi Installation
The recommended install location is now PyPI. Additionally, TorchRec's binary will no longer contain fbgemm_gpu; instead, fbgemm_gpu will be installed as a dependency. See the README for details.

Planner Improvements

We added additional features and fixed some bugs:
* Variable batch size per feature to support request-only features
* Better calculations for quant UVM caching
* Bug fix for shard storage fitting on device

Single process Batched + Fused Embeddings

Previously TorchRec’s abstractions (EmbeddingBagCollection/EmbeddingCollection) over FBGEMM kernels, which provide benefits such as table batching, optimizer fusion, and UVM placement, could only be used in conjunction with DistributedModelParallel. We’ve decoupled these notions from sharding, and introduced the [FusedEmbeddingBagCollection](https://github.com/pytorch/torchrec/blob/eb1247d8a2d16edc4952e5c2617e69acfe5477a5/torchrec/modules/fused_embedding_modules.py#L271), which can be used as a standalone module, with all of the above features, and can also be sharded.
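A minimal sketch of standalone use (the constructor arguments below are our best reading of the linked module and may differ slightly):

```python
import torch
from torchrec import EmbeddingBagConfig, KeyedJaggedTensor
from torchrec.modules.fused_embedding_modules import FusedEmbeddingBagCollection

# A single batched + fused table, usable without DistributedModelParallel.
fused_ebc = FusedEmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1", embedding_dim=8, num_embeddings=100, feature_names=["f1"]
        )
    ],
    optimizer_type=torch.optim.SGD,   # optimizer fused into the backward pass
    optimizer_kwargs={"lr": 0.02},
    device=torch.device("cpu"),
)

features = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1"],
    values=torch.tensor([1, 2, 3]),
    lengths=torch.tensor([2, 1]),
)
pooled = fused_ebc(features)  # KeyedTensor of pooled embeddings per feature
```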

Sharder
We enabled embedding sharding support for variable batch sizes across GPUs.

Benchmarking and Examples
We introduce:
* A set of [benchmarking tests](https://github.com/pytorch/torchrec/tree/eb1247d8a2d16edc4952e5c2617e69acfe5477a5/benchmarks), showing performance characteristics of TorchRec's base modules and research models built out of TorchRec.
* An example demonstrating training of a distributed TwoTower (i.e. User-Item) Retrieval model that is sharded using TorchRec. The projected item embeddings are added to an IVFPQ FAISS index for candidate generation. The retrieval model and KNN lookup are bundled in a PyTorch model for efficient end-to-end retrieval.
* An inference example with Torch Deploy for both single and multi GPU.

Integrations
We demonstrate that TorchRec works out of the box with many components commonly used alongside PyTorch models in production-like systems, such as:
* [Training](https://github.com/pytorch/torchrec/tree/main/examples/ray) a TorchRec model on Ray Clusters utilizing the Torchx Ray scheduler
* [Preprocessing](https://github.com/pytorch/torchrec/tree/main/torchrec/datasets/scripts/nvt) and DataLoading with NVTabular on DLRM
* [Training](https://github.com/pytorch/torchrec/tree/main/examples/torcharrow) a TorchRec model with on-the-fly preprocessing with TorchArrow showcasing RecSys domain UDFs.

Scriptable Unsharded Modules
The unsharded embedding modules (EmbeddingBagCollection/EmbeddingCollection and variants) are now TorchScript-able.

EmbeddingCollection Column Wise Sharding
We now support column wise sharding for EmbeddingCollection, enabling sequence embeddings to be sharded column wise.

JaggedTensor
Boosted the performance of the `to_padded_dense` function by implementing it with FBGEMM.
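For instance (a small illustrative sketch; the call goes through the FBGEMM-backed path):

```python
import torch
from torchrec.sparse.jagged_tensor import JaggedTensor

jt = JaggedTensor(
    values=torch.tensor([1.0, 2.0, 3.0]),
    lengths=torch.tensor([2, 1]),  # row 0 has two values, row 1 has one
)

# Pads each row to the longest length (zero-filled), producing a dense tensor.
dense = jt.to_padded_dense()  # tensor([[1., 2.], [3., 0.]])
```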

Linting
Add lintrunner to allow contributors to lint and format their changes quickly, matching our internal formatter.
