Add m2m100 as the new default model to support 100 languages
Added
- `dlt.lang.m2m100` module: Provides variables for over 100 languages, ready for auto-complete. Example: `dlt.lang.m2m100.ENGLISH`.
- `dlt.utils.available_languages`, `dlt.utils.available_codes`: Now support the argument "m2m100"
- Available languages for each model family
- Script and template to generate available languages
Changed
- [BREAKING] `dlt.TranslationModel`: New `model_family` parameter in the initialization function, either "mbart50" or "m2m100". By default, it is inferred from `model_or_path`. It must be set explicitly if `model_or_path` is a path.
- [BREAKING] Default model changed to m2m100
- Docs and readme about mbart50 were reframed to take into account the new model
- `dlt.TranslationModel.translate`: Improved docstring to be more general.
- Tests pertaining to `m2m100`
- `scripts/generate_langs.py`: Renamed; the mechanism now loads from JSON files
- `docs/index.md`: Expand the "Usage" and "Advanced" sections
- `README.md`: Add acknowledgement about m2m100, significantly trim "Advanced" section, make "Usage" more concise
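
To illustrate the `model_family` change above, here is a minimal sketch of how a family could be resolved from `model_or_path`. The helper `infer_model_family` and the constant `KNOWN_FAMILIES` are hypothetical names made up for this example, and the exact-match rule is an assumption; the library's actual inference logic may differ.

```python
from typing import Optional

# Hypothetical constant for illustration: the two families named in this release.
KNOWN_FAMILIES = ("mbart50", "m2m100")


def infer_model_family(model_or_path: str, model_family: Optional[str] = None) -> str:
    """Hypothetical helper: resolve the model family for a given model name or path."""
    # An explicit value always wins, but it must be a known family.
    if model_family is not None:
        if model_family not in KNOWN_FAMILIES:
            raise ValueError(
                f"model_family must be one of {KNOWN_FAMILIES}, got {model_family!r}"
            )
        return model_family

    # Assumption: only the short family names can be inferred automatically.
    if model_or_path in KNOWN_FAMILIES:
        return model_or_path

    # Anything else (for example a local path) needs an explicit family.
    raise ValueError(
        f"Cannot infer model_family from {model_or_path!r}; pass model_family explicitly."
    )
```

Under these assumptions, a local checkpoint path only resolves when `model_family` is passed, which mirrors the requirement stated in the breaking change.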
Fixed
- `dlt.TranslationModel.available_codes()` was returning the languages instead of the codes. It now correctly returns the codes.
Removed
- Output type hints for `TranslationModel.get_transformers_model` and `TranslationModel.get_tokenizer`
- [BREAKING] `dlt.TranslationModel.bart_model` and `dlt.TranslationModel.tokenizer` are no longer available to be used directly. Please use `dlt.TranslationModel.get_transformers_model` and `dlt.TranslationModel.get_tokenizer` instead.
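
For code that relied on the removed attributes, the migration might look like the sketch below. The wrapper `get_model_and_tokenizer` is a hypothetical helper written for this example; only the two getter methods it calls come from the notes above.

```python
def get_model_and_tokenizer(mt):
    """Hypothetical helper: fetch the underlying model and tokenizer via the getters.

    `mt` is expected to be a `dlt.TranslationModel` instance, or any object
    exposing the same two getter methods.
    """
    # Previously: mt.bart_model and mt.tokenizer (no longer available directly).
    model = mt.get_transformers_model()
    tokenizer = mt.get_tokenizer()
    return model, tokenizer
```

Any `dlt.TranslationModel` instance can be passed as `mt`; the helper only depends on the two getters being present.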