Nanotron

Latest version: v0.4


0.4

How to use

![cmd](https://github.com/huggingface/nanotron/assets/47445085/726d6cd8-9373-4874-a2ae-a1fb7b7b7ccb)

What's Changed
* [Fix] Assert the wrong tolerance of FA2's Layer Norm kernel by xrsrke in https://github.com/huggingface/nanotron/pull/81
* [DoReMi] Small refactors by xrsrke in https://github.com/huggingface/nanotron/pull/95
* Add Mamba PR by 3outeille in https://github.com/huggingface/nanotron/pull/83
* Bump v0.4 + Quick refactors by NouamaneTazi in https://github.com/huggingface/nanotron/pull/96


**Full Changelog**: https://github.com/huggingface/nanotron/compare/v0.3...v0.4

0.3

You might think that the key ways to speed up pretraining are finding more high-quality data, spending more FLOPs, or changing the model architecture, but these are not the only ways. DoReMi shows that, given the same sources of training data, a model trained with an optimized data mixture can outperform a counterpart trained with random sampling on at least 70% of domains, or on all domains, as well as on downstream evaluations, without any knowledge of the downstream evaluation tasks.
> DoReMi Blog: https://crfm.stanford.edu/2023/09/14/doremi

Using DoReMi in Nanotron:
(Thanks to xrsrke)
- Step 0: Preprocessing data

- Step 1: Train a small reference model using uniform sampling from each domain: for a given global batch size, you draw the same number `x` of samples from every domain. If some domains have far fewer samples than others, they run out of samples early; in that case you can enable automatic domain weights based on the token count.

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 examples/doremi/train_reference.py --config-file examples/doremi/configs/config_280m_llama.yaml
```
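
To make the two sampling options in step 1 concrete, here is a minimal sketch of uniform weights versus weights derived from token counts. It assumes that "automatic domain weights based on the token count" means weights proportional to each domain's token count; the function names, domain names, and counts are hypothetical and are not taken from Nanotron's code.

```python
def uniform_domain_weights(domain_names):
    """Return equal sampling weights, one per domain."""
    return {name: 1.0 / len(domain_names) for name in domain_names}


def token_count_domain_weights(token_counts):
    """Return sampling weights proportional to each domain's token count.

    Assumption: proportional weighting; Nanotron's own rule may differ.
    """
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return {name: count / total for name, count in token_counts.items()}


# Hypothetical token counts, for illustration only.
token_counts = {"web": 6_000_000, "code": 3_000_000, "books": 1_000_000}

uniform = uniform_domain_weights(list(token_counts))
by_tokens = token_count_domain_weights(token_counts)

print("uniform:  ", uniform)
print("by tokens:", by_tokens)
```

With uniform weights the small `books` domain is sampled as often as `web` and is exhausted first; proportional weights drain all domains at roughly the same rate.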


- Step 2: Use the trained reference model from step 1 to train an identical model, and use its performance to dynamically tune the domain weights during training.

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 examples/doremi/train_doremi.py --config-file examples/doremi/configs/config_280m_llama_proxy.yaml
```
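
The notes above do not spell out how the domain weights are tuned in step 2. As a rough sketch, assuming the multiplicative update described in the DoReMi paper (upweight domains where the proxy model's loss exceeds the reference model's loss, renormalize, then mix with a uniform distribution), one update could look like the following. The function name, `step_size`, `smoothing`, and the loss values are illustrative assumptions, not Nanotron's API.

```python
import math


def update_domain_weights(weights, proxy_losses, reference_losses,
                          step_size=1.0, smoothing=1e-3):
    """One DoReMi-style update of the domain weights (illustrative sketch).

    weights:          current domain weights, summing to 1
    proxy_losses:     per-domain loss of the model being trained
    reference_losses: per-domain loss of the reference model from step 1
    """
    # Excess loss per domain, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, reference_losses)]
    # Multiplicative (exponentiated-gradient) step, then renormalize.
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    normalized = [s / total for s in scaled]
    # Mix with the uniform distribution so no weight collapses to zero.
    uniform = 1.0 / len(weights)
    return [(1.0 - smoothing) * n + smoothing * uniform for n in normalized]


# Hypothetical per-domain losses for four domains.
weights = [0.25, 0.25, 0.25, 0.25]
proxy_losses = [2.9, 3.4, 3.1, 2.5]
reference_losses = [2.8, 3.0, 3.1, 2.7]

new_weights = update_domain_weights(weights, proxy_losses, reference_losses)
print(new_weights)
```

The domain with the largest excess loss (the second one here) gains the most weight, while domains where the proxy already matches or beats the reference lose weight after renormalization.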


- Step 3: Nanotron saves the domain weights in the model checkpoint. Now, calculate the optimal domain weights by averaging the domain weights across all training steps from step 2: $\bar{\alpha}=\frac{1}{T} \sum_{t=1}^T \alpha_t$.


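```python
import torch

# Load the per-step domain weights saved during the proxy run.
domain_weights = torch.load("checkpoints/doremi/proxy-280m-llama/doremi_domain_weights_100000.pt")

# Average the weights over all recorded training steps.
total_weights = sum(d["domain_weights"] for d in domain_weights)
avg_weights = total_weights / len(domain_weights)
```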


Then, set these `avg_weights` in the config of the larger run in the `doremi` section.

- Step 4: Use the optimized domain weights from step 3 to train a larger model (could be 10x to 30x larger).

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 examples/doremi/train_reference.py --config-file examples/doremi/configs/config_2.8b_llama_with_tuned_weights.yaml
```


- Step 5: Profit 🤑

0.2

How to use nanotron's MoEs
To use nanotron's 3D parallel implementation of MoEs, add `dMoE` to your model definition:
```python
self.block_sparse_moe = dMoE(
    config,
    expert_parallel_group=parallel_context.expert_pg,
    tp_pg=parallel_context.tp_pg,
    parallel_config=parallel_config,
)
```

See the example in [examples/moe/llamoe.py](https://github.com/huggingface/nanotron/blob/main/examples/moe/llamoe.py#L551-L556).
You can control the **expert parallelism degree** by setting `parallelism.expert_parallel_size`; the **weight parallelism degree** is the same as the tensor parallel degree.
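
To illustrate what the expert parallelism degree controls, the sketch below splits a layer's experts across expert-parallel ranks. It assumes the experts divide evenly and are assigned in contiguous blocks; that layout, the function name, and the example sizes are assumptions for illustration, not taken from Nanotron's implementation.

```python
def experts_for_rank(num_experts, expert_parallel_size, ep_rank):
    """Return the expert indices held by one expert-parallel rank.

    Assumption: contiguous, evenly sized blocks of experts per rank.
    """
    if num_experts % expert_parallel_size != 0:
        raise ValueError("num_experts must be divisible by expert_parallel_size")
    if not 0 <= ep_rank < expert_parallel_size:
        raise ValueError("ep_rank must be in [0, expert_parallel_size)")
    per_rank = num_experts // expert_parallel_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


# Hypothetical setup: 8 experts with an expert parallelism degree of 4.
layout = [experts_for_rank(8, 4, rank) for rank in range(4)]
for rank, experts in enumerate(layout):
    print(f"expert-parallel rank {rank}: experts {experts}")
```

Raising the expert parallelism degree lowers the number of experts each rank stores; under this sketch, each expert's weights would then be further sharded across the tensor parallel group.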

What's Changed
* Make tests pass by NouamaneTazi in https://github.com/huggingface/nanotron/pull/52
* Refactoring tying mechanism + small fixes by NouamaneTazi in https://github.com/huggingface/nanotron/pull/62
* [`Docs`] Fix typos by standardAI in https://github.com/huggingface/nanotron/pull/63
* quick fix train steps assertion by NouamaneTazi in https://github.com/huggingface/nanotron/pull/66
* fix configs by NouamaneTazi in https://github.com/huggingface/nanotron/pull/67
* [FP8 Training] A single forward and backward pass for a linear in FP8 by xrsrke in https://github.com/huggingface/nanotron/pull/56
* Update bench script by NouamaneTazi in https://github.com/huggingface/nanotron/pull/64
* Add CI/CD for unit tests by xrsrke in https://github.com/huggingface/nanotron/pull/41
* Refactor `ParallelContext` and some process groups creation by NouamaneTazi in https://github.com/huggingface/nanotron/pull/69
* Support Expert Parallelism by NouamaneTazi in https://github.com/huggingface/nanotron/pull/72
* Add MoEs support by NouamaneTazi in https://github.com/huggingface/nanotron/pull/73
* Implement pipeline parallel size-agnostic optimizer state loading by nopperl in https://github.com/huggingface/nanotron/pull/71

New Contributors
* standardAI made their first contribution in https://github.com/huggingface/nanotron/pull/63
* nopperl made their first contribution in https://github.com/huggingface/nanotron/pull/71

**Full Changelog**: https://github.com/huggingface/nanotron/compare/v0.1...v0.2

0.1

Initial release of the nanotron library
