it uses RMSNorm in place of layer normalization (proven out by DeepMind's Gopher and RETRO)
as well as ALiBi in place of rotary embeddings (proven out at scale by BLOOM, the Hugging Face-led BigScience model)
both are simpler and slightly cheaper to compute
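to make the normalization swap concrete, here is a minimal PyTorch-style sketch of RMSNorm, assuming a PyTorch codebase; the class name and `eps` default are illustrative, not this repo's actual API. unlike LayerNorm, it skips the mean-centering and the bias term, which is where the small compute saving comes from

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """root-mean-square normalization (Zhang & Sennrich, 2019)

    compared to LayerNorm it drops the mean subtraction and the bias,
    normalizing by the RMS of the features alone: fewer ops per call
    and one fewer parameter tensor
    """
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize by the RMS over the feature dimension; no centering
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.gain
```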
ALiBi has the added benefit that a model trained on shorter sequences can extrapolate to longer ones at inference time, since the bias is defined for any sequence length
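a minimal sketch of the ALiBi bias in the same assumed PyTorch setting (the function name is hypothetical, and the closed-form slopes assume a power-of-two head count, as in the paper): each head gets a fixed slope, and attention logits are penalized linearly with query-key distance, so nothing positional is learned and longer sequences just extend the same bias

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi attention bias (Press et al., 2021)

    a per-head linear penalty on query-key distance used in place of
    positional embeddings; because it is defined for any length, a
    model trained at one seq_len can be run at longer ones
    """
    # geometric slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (this closed form assumes num_heads is a power of two)
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]         # key index minus query index
    return slopes[:, None, None] * rel[None]  # (num_heads, seq_len, seq_len)

# usage: add to the raw attention logits before the causal mask and softmax,
# e.g. logits = logits + alibi_bias(heads, seq_len)  (broadcasts over batch)
```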