Added
- Parallelized data reading with e.g. `--data-threads 8`
- Top-k sampling during decoding with e.g. `--output-sampling topk 10`
- Improved mixed precision training with `--fp16`
- Set FFN width in decoder independently from encoder with e.g. `--transformer-dim-ffn 4096 --transformer-decoder-dim-ffn 2048`
- Adds option `--add-lsh` to marian-conv which allows the LSH to be memory-mapped.
- Early stopping based on first, all, or any validation metrics via `--early-stopping-on`
- Compute 8.6 support if using CUDA>=11.1
- Support for RMSNorm as drop-in replacement for LayerNorm from `Biao Zhang; Rico Sennrich (2019). Root Mean Square Layer Normalization`. Enabled in Transformer model via `--transformer-postprocess dar` instead of `dan`.
- Extend suppression of unwanted output symbols, specifically "\n" from default vocabulary if generated by SentencePiece with byte-fallback. Can be deactivated with --allow-special
- Allow for fine-grained CPU intrinsics overrides when BUILD_ARCH != native e.g. -DBUILD_ARCH=x86-64 -DCOMPILE_AVX512=off
- Adds custom bias epilogue kernel.
- Adds support for fusing ReLU and bias addition into GEMMs when using CUDA 11.
- Better suppression of unwanted output symbols, specifically "\n" from SentencePiece with byte-fallback. Can be deactivated with --allow-special
- Display decoder time statistics with marian-decoder --stat-freq 10 ...
- Support for MS-internal binary shortlist
- Local/global sharding with MPI training via `--sharding local`
- fp16 support for factors.
- Correct training with fp16 via `--fp16`.
- Dynamic cost-scaling with `--cost-scaling`.
- Dynamic gradient-scaling with `--dynamic-gradient-scaling`.
- Add unit tests for binary files.
- Fix compilation with OMP
- Added `--model-mmap` option to enable mmap loading for CPU-based translation
- Compute aligned memory sizes using exact sizing
- Support for loading lexical shortlist from a binary blob
- Integrate a shortlist converter (which can convert a text lexical shortlist to a binary shortlist) into marian-conv with --shortlist option
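
For readers unfamiliar with the RMSNorm entry above, the following is a minimal pure-Python sketch of how RMSNorm differs from LayerNorm, following the formulation in the cited Zhang and Sennrich paper. It is an illustration only and not Marian's implementation: the function names, the `eps` default, and its placement inside the square root are assumptions made for this example.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: center by the mean, divide by the standard deviation,
    then apply a per-element gain and bias."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    denom = math.sqrt(var + eps)  # eps placement is an assumption
    return [g * (v - mean) / denom + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square only.
    There is no mean subtraction and no bias term."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)  # eps placement is an assumption
    return [g * v / rms for v, g in zip(x, gain)]

if __name__ == "__main__":
    x = [3.0, 4.0]
    ones = [1.0, 1.0]
    zeros = [0.0, 0.0]
    print("layer_norm:", layer_norm(x, ones, zeros))
    print("rms_norm:  ", rms_norm(x, ones))
```

Because RMSNorm skips the centering step, the two functions agree (up to `eps`) only when the input already has zero mean and the LayerNorm bias is zero.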
Fixed
- Fix AVX2 and AVX512 detection on MacOS
- Add GCC11 support into FBGEMM
- Added pragma to ignore unused-private-field error on elementType_ on macOS
- Do not set guided alignments for case augmented data if vocab is not factored
- Various fixes to enable LSH in Quicksand
- Added support to MPIWrapper::bcast (and similar) for count of type size_t
- Adding new validation metrics when training is restarted and --reset-valid-stalled is used
- Missing depth-scaling in transformer FFN
- Fixed an issue when loading intgemm16 models from unaligned memory.
- Fix building marian with gcc 9.3+ and FBGEMM
- Find MKL installed under Ubuntu 20.04 via apt-get
- Support for CUDA 11.
- General improvements and fixes for MPI handling, which was essentially non-functional before (syncing, random seeds, deadlocks during saving, validation etc.)
- Allow compiling with -DUSE_MPI=on and -DUSE_STATIC_LIBS=on, although MPI is still linked dynamically since it has so many dependencies.
- Fix building server with Boost 1.75
- Missing implementation for cos/tan expression operator
- Fixed loading binary models on architectures where `size_t` != `uint64_t`.
- Missing float template specialisation for elem::Plus
- Broken links to MNIST data sets
- Enforce validation for the task alias in training mode.
Changed
- MacOS marian uses Apple Accelerate framework by default, as opposed to openblas/mkl.
- Optimize LSH for speed by treating it as a shortlist generator. No option changes in decoder
- Set REQUIRED_BIAS_ALIGNMENT = 16 in tensors/gpu/prod.cpp to avoid memory-misalignment on certain Ampere GPUs.
- For BUILD_ARCH != native, enable all intrinsics types by default; individual types can be disabled, e.g. -DCOMPILE_AVX512=off
- Moved FBGEMM pointer to commit c258054 for gcc 9.3+ fix
- Change compile options a la -DCOMPILE_CUDA_SM35 to -DCOMPILE_KEPLER, -DCOMPILE_MAXWELL, -DCOMPILE_PASCAL, -DCOMPILE_VOLTA, -DCOMPILE_TURING and -DCOMPILE_AMPERE
- Disable -DCOMPILE_KEPLER, -DCOMPILE_MAXWELL by default.
- Dropped support for legacy graph groups.
- Developer documentation framework based on Sphinx+Doxygen+Breathe+Exhale
- Expression graph documentation (788)
- Graph operators documentation (801)
- Remove unused variable from expression graph
- Factor groups and concatenation: doc/factors.md