- docs: remove useless comments
- fix(Transformer): too many things to list
- feat: further increase floating-point precision for metrics and loss
- refactor: special tokens now passed via __init__ for Transformer
- feat: enhance beam search and token prediction mechanisms
- docs: update readme
- fix(Transformer): vanishing gradient fix
- fix(Transformer): still on it (wip)
- fix(Transformer): another fix
- fix(Transformer): special token indices
- fix(Transformer): normalization IS the issue
- docs: update readme
- fix(Transformer): cross attention weights
- fix: LearningRateScheduler
- fix: LearningRateScheduler
- fix: normalization in data preparation
- fix: different vocab size for different tokenizations
- fix(PositionalEncoding): scaling
- fix(AddNorm): better normalization
- fix(TransformerEncoderLayer): huge improvements
- perf(SequenceCrossEntropy): add vectorization
- fix(Tokenizer+Transformer): tokenization alignment for special tokens
- fix(Transformer): investigate and address gradient instability and explosion
- fix(sce): label smoothing
- refactor: gradient clipping
- fix(Transformer): gradient explosion
- fix(Transformer): token padding and max sequence length
- test: try a better dataset
- fix(sce): y_pred treated as logits instead of probs
- fix(TransformerEncoderLayer): remove arbitrary scaling
- fix(Transformer): sce won't ignore sos and eos tokens
- fix: sce extending lossfunction
- fix(sce): softmax not necessary
- feat: add BLEU, ROUGE-L and ROUGE-N scores
- fix: validation data in fit method and shuffle in train_test_split
- docs: modify example to use validation split and BLEU score
- fix(PositionalEncoding): better positional scaling
- ci: bump version to 3.3.7