Mtdata

Latest version: v0.4.3

0.4.3

* Add preliminary support for Hugging Face datasets; currently wmt24++ is the only supported dataset
* Migrate setup.py -> pyproject.toml; Hugging Face datasets is an optional dependency
* Add `mtdata index` subcommand; deprecate `mtdata --reindex <cmd>`
* Add a field named `meta` of type dictionary to the Entry class; it stores arbitrary key-value pairs which may be useful for downloading and parsing datasets
* Support for document ID (currently one of the many meta fields) in `.meta.jsonl.gz`
* OPUS index updated
* `mtdata score` subcommand added; supports QE scoring via pymarian
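
The `.meta.jsonl.gz` sidecar mentioned above can be inspected with the standard library alone. The following is a minimal sketch, assuming the file is gzip-compressed JSON Lines with one JSON object per line; the function name `read_meta` and the `doc_id` key in the usage note are hypothetical and not confirmed names from mtdata.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterator


def read_meta(path: Path) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a gzipped JSON Lines file.

    Assumption: each line holds a single JSON object of meta key-value pairs.
    """
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

For example, `[rec.get("doc_id") for rec in read_meta(Path("train.meta.jsonl.gz"))]` would collect document IDs, if the records carry a key by that (assumed) name.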

0.4.2

- minor fixes

0.4.1

* Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
* `mtdata cache` added. Improves concurrency by supporting multiple recipes
* Added WMT general test 2022 and 2023
* mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
* mtdata-bcp47 : --script {suppress-default,suppress-all,express}
* Uses `pigz` to read and write gzip files by default when pigz is in PATH. Export `USE_PIGZ=0` to disable
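
The pigz behavior described above (use pigz when it is on PATH, unless `USE_PIGZ=0`) can be illustrated roughly as follows. This is a sketch of the general technique, not mtdata's actual implementation; the helper names `pigz_path` and `read_gzip_text` are hypothetical.

```python
import gzip
import os
import shutil
import subprocess
from typing import Optional


def pigz_path() -> Optional[str]:
    """Return the pigz executable path, or None if disabled or not installed."""
    if os.environ.get("USE_PIGZ", "1") == "0":  # opt-out switch from the changelog
        return None
    return shutil.which("pigz")


def read_gzip_text(path: str) -> str:
    """Read a gzip file as UTF-8 text, via pigz when available, else the gzip module."""
    pigz = pigz_path()
    if pigz:
        # -d: decompress, -c: write the result to stdout
        out = subprocess.run([pigz, "-dc", path], check=True, capture_output=True)
        return out.stdout.decode("utf-8")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()
```

Either path returns the same text; the pigz branch only changes which program does the decompression.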

0.4.0

* Fix: allenai_nllb.json is now included in MANIFEST.in [137](https://github.com/thammegowda/mtdata/issues/137). Also fixed CI: Travis -> GitHub Actions
* Update ELRC datasets [138](https://github.com/thammegowda/mtdata/pull/138). Thanks [AlexUmnov](https://github.com/AlexUmnov)
* Add Jparacrawl Chinese-Japanese subset [143](https://github.com/thammegowda/mtdata/pull/143). Thanks [BrightXiaoHan](https://github.com/BrightXiaoHan)
* Add Flores200 dev and devtests [145](https://github.com/thammegowda/mtdata/pull/145). Thanks [ZenBel](https://github.com/ZenBel)
* Add support for `mtdata echo <ID>`
* Dataset entries store only BibTeX keys, not the full citation text
* Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
* Simplified index loading
* Simplified compression format handlers. Added support for opening .bz2 files without creating temp files
* All resources are moved to the `mtdata/resource` dir, and any new additions to that dir are automatically included in the Python package (fail-proof against future issues like 137)

**New and exciting features:**
* Support for adding new datasets at runtime (`mtdata*.py` from run dir). Note: you have to reindex by calling `mtdata -ri list`
* Monolingual datasets support in progress (currently testing)
* Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
* `mtdata list` is updated. `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
* Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
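
The ID scheme above lends itself to a simple split on hyphens. The sketch below assumes that no individual field contains a hyphen, which the changelog does not state; the `DatasetId` type and `parse_dataset_id` function are hypothetical names for illustration and are not part of mtdata's API.

```python
from typing import NamedTuple, Tuple


class DatasetId(NamedTuple):
    group: str
    name: str
    version: str
    langs: Tuple[str, ...]  # two codes for bitext, one for monolingual


def parse_dataset_id(did: str) -> DatasetId:
    """Split `Group-name-version-lang1[-lang2]` into its fields.

    Assumption: fields themselves never contain a hyphen.
    """
    parts = did.split("-")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 hyphen-separated fields, got {len(parts)}: {did!r}")
    group, name, version, *langs = parts
    return DatasetId(group, name, version, tuple(langs))
```

For instance, `parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")` yields group `Statmt`, name `commoncrawl`, version `wmt19`, and languages `("fra", "deu")`; an ID with a single language code is treated as monolingual.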


> _skipped 0.3.9 because the changes are significant_

0.3.8

* CLI arg `--log-level` with default set to `WARNING`
* Progress bar can be disabled from the CLI with `--no-pbar`; the default is enabled (`--pbar`)
* `mtdata stats --quick` sends an HTTP HEAD request and shows the content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
* `python -m mtdata.scripts.recipe_stats` to read stats from output directory
* Security fix for tar extraction | Thanks TrellixVulnTeam
* Added NLLB datasets prepared by AllenAI | Thanks AlexUmnov
* Opus and ELRC datasets update | Thanks ZenBel
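
The idea behind a quick size check, asking the server for headers only and reading `Content-Length`, can be sketched with `urllib`. This illustrates the general technique rather than mtdata's own code; `remote_size` and its `opener` parameter are hypothetical, the latter added only so the function can be exercised without network access.

```python
import urllib.request
from typing import Callable, Optional


def remote_size(url: str, opener: Callable = urllib.request.urlopen) -> Optional[int]:
    """Send an HTTP HEAD request and return Content-Length in bytes, if reported."""
    req = urllib.request.Request(url, method="HEAD")  # headers only, no body download
    with opener(req) as resp:
        length = resp.headers.get("Content-Length")
    # Servers may omit the header (e.g. chunked responses), so None is possible
    return int(length) if length is not None else None
```

Because only headers are transferred, this stays cheap even for multi-gigabyte corpora, at the cost of trusting whatever length the server reports.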

0.3.7

- Update ELRC data including EU acts which is used for wmt22 (thanks kpu)
