Mtdata

Latest version: v0.4.2

0.4.1

* Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
* `mtdata cache` added. Improves concurrency by supporting multiple recipes
* Added WMT general test 2022 and 2023
* `mtdata-bcp47`: `-p/--pipe` to map codes from stdin to stdout
* `mtdata-bcp47`: `--script {suppress-default,suppress-all,express}`
* Uses `pigz` to read and write gzip files by default when `pigz` is in PATH. Export `USE_PIGZ=0` to disable
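
The `pigz` opt-out described above can be sketched as a small check. This is only an illustration, not mtdata's actual code: the function name `should_use_pigz` is hypothetical, and it assumes that only the literal value `0` for `USE_PIGZ` disables `pigz`.

```python
import os
import shutil


def should_use_pigz(environ=None, which=shutil.which):
    """Return True when gzip I/O could be delegated to pigz.

    Hypothetical helper. Assumption: only USE_PIGZ=0 disables pigz;
    any other value (or an unset variable) leaves it enabled.
    """
    environ = os.environ if environ is None else environ
    if environ.get("USE_PIGZ", "1") == "0":
        return False
    # shutil.which returns None when the executable is not on PATH
    return which("pigz") is not None


if __name__ == "__main__":
    print("pigz enabled:", should_use_pigz())
```

The `environ` and `which` parameters exist only so the decision can be exercised without changing the real environment.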

0.4.0

* Fix: `allenai_nllb.json` is now included in `MANIFEST.in` [137](https://github.com/thammegowda/mtdata/issues/137). Also fixed CI: Travis -> GitHub Actions
* Update ELRC datasets [138](https://github.com/thammegowda/mtdata/pull/138). Thanks [AlexUmnov](https://github.com/AlexUmnov)
* Add Jparacrawl Chinese-Japanese subset [143](https://github.com/thammegowda/mtdata/pull/143). Thanks [BrightXiaoHan](https://github.com/BrightXiaoHan)
* Add Flores200 dev and devtests [145](https://github.com/thammegowda/mtdata/pull/145). Thanks [ZenBel](https://github.com/ZenBel)
* Add support for `mtdata echo <ID>`
* Dataset entries only store BibTeX keys, not the full citation text
* Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
* Simplified index loading
* Simplified compression format handlers. Added support for opening .bz2 files without creating temp files
* All resources are moved to the `mtdata/resource` dir, and any new additions to that dir are automatically included in the Python package (fail-proof against future issues like 137)
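
As an illustration of reading `.bz2` data without a temp file, the standard-library `bz2.open` decompresses as the stream is consumed. This is a minimal sketch and not necessarily how mtdata's handlers are written; the helper name `read_bz2_lines` is hypothetical.

```python
import bz2
import io


def read_bz2_lines(source):
    """Yield decoded lines from a .bz2 path or a binary file object.

    bz2.open decompresses incrementally while iterating, so no
    temporary decompressed copy is written to disk.
    """
    with bz2.open(source, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # In-memory demo; a filesystem path works the same way
    payload = bz2.compress("hello\nworld\n".encode("utf-8"))
    for text in read_bz2_lines(io.BytesIO(payload)):
        print(text)
```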

**New and exciting features:**
* Support for adding new datasets at runtime (`mtdata*.py` from run dir). Note: you have to reindex by calling `mtdata -ri list`
* Monolingual datasets support in progress (currently testing)
* Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
* `mtdata list` is updated. `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
* Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
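
The ID scheme above can be illustrated with a small parser. This is a sketch under one assumption that the changelog does not state: that no individual field contains a hyphen, so a plain split is enough. `DatasetId` and `parse_dataset_id` are hypothetical names, not part of mtdata's API.

```python
from typing import NamedTuple, Tuple


class DatasetId(NamedTuple):
    group: str
    name: str
    version: str
    langs: Tuple[str, ...]  # two codes for bitext, one for monolingual


def parse_dataset_id(text: str) -> DatasetId:
    """Split Group-name-version-lang1[-lang2] into its fields.

    Assumption: fields never contain '-', so 5 parts means bitext
    and 4 parts means monolingual.
    """
    parts = text.split("-")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 hyphen-separated fields: {text!r}")
    group, name, version, *langs = parts
    return DatasetId(group, name, version, tuple(langs))


if __name__ == "__main__":
    bitext = parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")
    print(bitext.langs)  # ('fra', 'deu')
```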


> _skipped 0.3.9 because the changes are significant_

0.3.8

* CLI arg `--log-level` with default set to `WARNING`
* Progress bar can be disabled from the CLI with `--no-pbar`; it is enabled by default (`--pbar`)
* `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
* `python -m mtdata.scripts.recipe_stats` to read stats from output directory
* Security fix with tar extract | Thanks TrellixVulnTeam
* Added NLLB datasets prepared by AllenAI | Thanks AlexUmnov
* Opus and ELRC datasets update | Thanks ZenBel
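
The entry on the tar extraction fix does not say what was changed. One common mitigation for unsafe tar extraction is to reject archive members whose paths would land outside the destination directory; the sketch below shows that idea only, with hypothetical helper names, and is not taken from mtdata's source.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True when target resolves to a path inside directory."""
    directory = os.path.abspath(directory)
    target = os.path.abspath(target)
    return os.path.commonpath([directory, target]) == directory


def safe_extract(archive: tarfile.TarFile, dest: str) -> None:
    """Extract archive into dest, refusing members that escape dest.

    Only member names are checked here; symlink members are not
    inspected, so this is a partial safeguard.
    """
    for member in archive.getmembers():
        if not is_within_directory(dest, os.path.join(dest, member.name)):
            raise ValueError(f"blocked path traversal in tar member: {member.name!r}")
    archive.extractall(dest)
```

Recent Python versions also offer extraction filters on `tarfile` that cover more cases than this name check.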

0.3.7

- Update ELRC data including EU acts which is used for wmt22 (thanks kpu)

0.3.6

- fixes and additions for wmt22
- Fixed KECL-JParaCrawl
- added Paracrawl bonus for ukr-eng
- added Yandex rus-eng corpus
- added Yakut sah-eng
- update recipe for wmt22 constrained eval

0.3.5

- Parallel download support `-j/--n-jobs` argument (with default `4`)
- Add histogram to web search interface (Thanks, sgowdaks)
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets are added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
  - jpn-eng: add paracrawl v3, and wmt19 TED
  - backtranslation datasets for en2ru ru2en en2ru
- Option to set `MTDATA_RECIPES` dir (default is $PWD). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
- WMT22 recipes added
- JW300 is disabled [77](https://github.com/thammegowda/mtdata/issues/77)
- Automatically create references.bib file based on datasets selected
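
The `MTDATA_RECIPES` lookup described above can be sketched as follows. The function name `find_recipe_files` is hypothetical and the code is an illustration of the stated glob rule, not mtdata's implementation; it assumes an empty `MTDATA_RECIPES` value should be treated like an unset one.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_recipe_files(environ: Optional[Mapping[str, str]] = None) -> List[Path]:
    """List files matching ${MTDATA_RECIPES}/mtdata.recipes*.yml.

    Falls back to the current working directory ($PWD) when the
    variable is unset or empty (the empty case is an assumption).
    """
    environ = os.environ if environ is None else environ
    recipes_dir = Path(environ.get("MTDATA_RECIPES") or Path.cwd())
    return sorted(recipes_dir.glob("mtdata.recipes*.yml"))


if __name__ == "__main__":
    for path in find_recipe_files():
        print(path)
```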
