Existing code normalized as: `norm = sqrt(batch_size / total_iterations)`, where `total_iterations` = (number of fits per epoch) * (number of epochs in restart). However, the number of fits per epoch = `total_samples / batch_size` --> `norm = batch_size * sqrt(1 / (total_samples * epochs))`, making `norm` scale _linearly_ with `batch_size`, which differs from the authors' sqrt scaling.
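A minimal sketch of the difference, using hypothetical standalone variables (not the library's internals):

```python
from math import sqrt

total_samples = 3200  # samples per epoch (hypothetical)
epochs = 10           # epochs in current restart (hypothetical)

for batch_size in (16, 32, 64):
    total_iterations = (total_samples / batch_size) * epochs
    norm_old = sqrt(batch_size / total_iterations)           # buggy:  equals b / sqrt(B*T)
    norm_new = sqrt(batch_size / (total_samples * epochs))   # paper:  sqrt(b / (B*T))
    print(batch_size, round(norm_old, 4), round(norm_new, 4))
# doubling batch_size doubles norm_old (linear), but scales norm_new by sqrt(2)
```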
Users who never changed `batch_size` throughout training are unaffected. (λ = λ_norm * sqrt(b / BT), with b = `batch_size` and BT = total samples per epoch × epochs; λ_norm is what we pick, our "guess". The point of normalization is that if our guess works well for `batch_size=32`, it'll also work well for `batch_size=16` - but if `batch_size` is never changed, then performance depends only on the guess.)
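As a quick numeric check (same hypothetical numbers as above): with `batch_size` fixed for the whole run, both formulas reduce to constants, so λ = λ_norm × constant either way.

```python
from math import sqrt

batch_size, total_samples, epochs = 32, 3200, 10  # hypothetical; fixed all run
norm_old = batch_size / sqrt(total_samples * epochs)    # ~0.1789, constant every fit
norm_new = sqrt(batch_size / (total_samples * epochs))  # ~0.0316, constant every fit
# either way, lambda = lambda_norm * <constant>, so results within a
# fixed-batch_size run depend only on the lambda_norm we picked
```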
Main change [here](https://github.com/OverLordGoldDragon/keras-adamw/pull/53/files#diff-220519926b87c12115d2f727803fbe6bR19), closing #52.
**Updating existing code**: for a choice of λ_norm that previously worked well, apply `*= sqrt(batch_size)`. Ex: with `batch_size=32`, `Dense(bias_regularizer=l2(1e-4))` --> `Dense(bias_regularizer=l2(1e-4 * sqrt(32)))`.
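Sketched with `tf.keras` imports (the layer choice and the `1e-4` value are illustrative):

```python
from math import sqrt
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

batch_size = 32
# previously tuned: Dense(64, bias_regularizer=l2(1e-4))
# after this fix, rescale the old lambda_norm by sqrt(batch_size):
layer = Dense(64, bias_regularizer=l2(1e-4 * sqrt(batch_size)))
```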