Along with the README updates, I changed the default calculation of the inner head dimension. Instead of requiring the value to be passed explicitly, it now defaults to the convention from the "Attention Is All You Need" paper: the number of channels is divided by the number of heads, and each attention head operates on that per-head dimension. A sketch of the idea is below.
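
Here is a minimal sketch of what that default looks like; the names (`channels`, `heads`, `dim_head`, `Attention`) are illustrative and may not match the repo's actual API:

```python
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, channels: int, heads: int = 8, dim_head: int | None = None):
        super().__init__()
        if dim_head is None:
            # Default per "Attention Is All You Need": d_k = d_model / h,
            # so each head gets an equal slice of the channel dimension.
            assert channels % heads == 0, "channels must be divisible by heads"
            dim_head = channels // heads

        self.heads = heads
        self.dim_head = dim_head
        inner_dim = heads * dim_head

        # Project to queries, keys, values, then back out to the channel dimension.
        self.to_qkv = nn.Linear(channels, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, channels)
```

With this default, `Attention(channels=512, heads=8)` would use a per-head dimension of 64, while passing `dim_head` explicitly still overrides the calculation.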