mtdata

Latest version: v0.4.3


0.3.6

- Fixes and additions for WMT22
- Fixed KECL-JParaCrawl
- Added ParaCrawl bonus for ukr-eng
- Added the Yandex rus-eng corpus
- Added Yakut sah-eng
- Updated the recipe for the WMT22 constrained evaluation

0.3.5

- Parallel download support via the `-j/--n-jobs` argument (default: `4`)
- Added a histogram to the web search interface (thanks, sgowdaks)
- Updated the OPUS index; all OPUS datasets are now downloaded via the OPUS API
- Many new datasets added
- WARNING: some OPUS IDs are not backward compatible (version-number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
  - jpn-eng: ParaCrawl v3 and WMT19 TED
  - Backtranslation datasets for en2ru and ru2en
- Option to set the `MTDATA_RECIPES` directory (default: `$PWD`); all files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
- Added WMT22 recipes
- JW300 is disabled ([#77](https://github.com/thammegowda/mtdata/issues/77))
- A `references.bib` file is automatically created based on the datasets selected
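
The recipe-file discovery described above can be sketched in a few lines. This is only an illustration of the documented glob behaviour, not mtdata's actual code; the function name `find_recipe_files` is hypothetical:

```python
import glob
import os

def find_recipe_files(recipes_dir=None):
    """Illustrative sketch: load every file matching
    ${MTDATA_RECIPES}/mtdata.recipes*.yml, where MTDATA_RECIPES
    defaults to the current working directory."""
    if recipes_dir is None:
        recipes_dir = os.environ.get("MTDATA_RECIPES", os.getcwd())
    return sorted(glob.glob(os.path.join(recipes_dir, "mtdata.recipes*.yml")))
```

For example, a directory containing `mtdata.recipes.yml` and `mtdata.recipes.wmt22.yml` would yield both files, while other `.yml` files are ignored.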

0.3.4

- Updated ELRC datasets
- Added docs, with a separate copy for each version (GitHub Pages)
- Dataset search via the web interface, with support for regex matching
- Added two new Masakhane fon-fra datasets
- Improved BCP-47 language ID matching for TMX files: compatibility match instead of exact match
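
A compatibility match of the kind described above can be illustrated as follows. This is an assumption about the general idea, not mtdata's implementation: the shorter tag is accepted when it is a prefix of the longer one at subtag boundaries.

```python
def tags_compatible(tag1: str, tag2: str) -> bool:
    """Illustrative BCP-47 compatibility check (not mtdata's code):
    'pt' is compatible with 'pt-BR' and 'zh-Hans' with 'zh',
    but 'pt' is not compatible with 'ps'."""
    a = tag1.lower().split("-")
    b = tag2.lower().split("-")
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    # Compare whole subtags, so 'pt' never matches the unrelated 'ps'
    return longer[:len(shorter)] == shorter
```

An exact match would instead require the two tag strings to be identical, which rejects pairs like `pt` / `pt-BR` that usually refer to compatible data.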

0.3.3

- Bug fix: XML reading inside tar archives (ElementTree complained about `TarPath`)
- `mtdata list` has `-g/--groups` and `-ng/--not-groups` as include/exclude filters on group name (91)
- `mtdata list` has an `-id/--id` flag to print only dataset IDs (91)
- Added WMT21 tests (90)
- Added CCAligned WMT21 datasets (89)
- Added ParIce datasets (88)
- Added WMT21 en-ha (87)
- Added WMT21 Wikititles v3 (86)
- Added train and test sets from the StanfordNLP NMT page (large: en-cs; medium: en-de; small: en-vi) (84)
- Added support for two URLs for a single dataset (i.e., datasets without zip/tar files)
- Fix: buggy language matching (`y1==y1`)
- Fix: the `get` command now ensures that train/dev/test datasets are compatible with the languages specified in `--langs`
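
The include/exclude group filters above can be sketched like this. The data shape (`(dataset_id, group)` pairs) and the function name are assumptions for illustration; the real `mtdata list` filters entries of its own index:

```python
def filter_by_groups(entries, groups=None, not_groups=None):
    """Sketch of -g/--groups (include) and -ng/--not-groups (exclude)
    filtering over (dataset_id, group) pairs."""
    selected = []
    for dataset_id, group in entries:
        if groups is not None and group not in groups:
            continue  # include filter: keep only the listed groups
        if not_groups is not None and group in not_groups:
            continue  # exclude filter: drop the listed groups
        selected.append(dataset_id)
    return selected
```

With both filters unset, everything passes through, matching the default behaviour of an unfiltered `mtdata list`.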

0.3.2

- Fix: `recipes.yml` was missing from the pip-installed package
- Added Project Anuvaad: 196 datasets for Indian languages
- CLI: `mtdata get` now has `--fail/--no-fail` arguments to control whether to crash on errors
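
The `--fail/--no-fail` behaviour amounts to a standard error-handling pattern, sketched below. The names (`get_all`, callables as fetchers) are assumptions for illustration, not mtdata's API:

```python
def get_all(fetchers, fail=True):
    """Illustrative --fail/--no-fail semantics: with fail=True the first
    error propagates and aborts the run; with fail=False errors are
    collected and returned so the remaining downloads still proceed."""
    errors = []
    for name, fetch in fetchers:
        try:
            fetch()
        except Exception as exc:
            if fail:
                raise
            errors.append((name, exc))
    return errors
```

The no-fail mode is useful for large multi-dataset pulls where one flaky mirror should not abort hours of downloading.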

0.3.1

- Added support for recipes; `list-recipe` and `get-recipe` subcommands added
- Added support for viewing dataset statistics: words, characters, segments
- Fix: URL for the UN dev and test sets (the source was updated, so we updated too)
- Multilingual experiment support: the ISO 639-3 code `mul` implies multilingual, e.g. `mul-eng` or `eng-mul`
- `--dev` accepts multiple datasets and merges them (useful for multilingual experiments)
- Tar files are extracted before reading (performance improvement)
- `setup.py`: version and description accessed via regex
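
Reading the version out of source text with a regex, as the last item describes, typically looks like the sketch below. The exact pattern mtdata uses is an assumption; `read_version` is a hypothetical name:

```python
import re

def read_version(source_text: str) -> str:
    """Sketch: extract a quoted version string assigned to __version__,
    e.g. from __version__ = '0.3.1' in a module's source."""
    match = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", source_text)
    if not match:
        raise ValueError("no __version__ found")
    return match.group(1)
```

This avoids importing the package from `setup.py`, which would fail before its dependencies are installed.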

---

© 2025 Safety CLI Cybersecurity Inc. All Rights Reserved.