- New datasets
  - WMT20 Tests
  - Paracrawl_v5_1 for Pashto-English and Khmer-English
  - NunavutHansard_v3 for Inuktitut-English
  - paracrawl_v8 and paracrawl_bonus datasets ([29][i29])
  - ELRC-Share contributed by [kpu](https://github.com/kpu) ([#32][p32])
  - AI4Bharat Samanantar v0.2
- New features
  - `mtdata -b` for short outputs and crash on error input
- Fixes and improvements:
  - ISO 639-1 -> ISO 639-3 mapping bug fix, e.g. `nb` ([24][i24])
  - Consistent docs for the default behavior of `--merge` ([26][i26])
  - Broken pipe error from `mtdata list | head` is now handled
- Paracrawl v7 and v7.1 -- 29 new datasets
- Fix swapping issue with TMX format (TILDE corpus); add a testcase for TMX entry
- Add `mtdata-iso` shell command
- Add `mtdata report` sub command to summarize datasets by language and names
----
0.2.7
- Add OPUS 100 corpus
----
0.2.6
- Add all pairs of neulab_tedtalksv1 - train,test,dev -- 4,455 of them
- Add support for cleaning noise. `Entry.is_noise(seg1, seg2)`
  - some basic noise is removed by default from training
- Add `__slots__` to Entry class (takes less memory and faster attrib lookup)
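
For readers unfamiliar with the `__slots__` change noted above, here is a minimal sketch of the general Python mechanism. The class names and fields (`PlainEntry`, `SlottedEntry`, `did`, `url`) are hypothetical and chosen only for illustration; they are not the actual definition of mtdata's `Entry` class.

```python
# Illustrative only: hypothetical classes, not mtdata's real Entry.

class PlainEntry:
    """Regular class: every instance carries a per-instance __dict__."""

    def __init__(self, did, url):
        self.did = did
        self.url = url


class SlottedEntry:
    """Same fields, but declared in __slots__, so no per-instance __dict__."""

    __slots__ = ('did', 'url')

    def __init__(self, did, url):
        self.did = did
        self.url = url


plain = PlainEntry('demo-1', 'https://example.com/a.tsv')
slotted = SlottedEntry('demo-1', 'https://example.com/a.tsv')

# The plain instance has a __dict__; the slotted one does not.
has_dict_plain = hasattr(plain, '__dict__')
has_dict_slotted = hasattr(slotted, '__dict__')

# A slotted instance rejects attributes that were not declared up front.
try:
    slotted.extra = 1
    rejected = False
except AttributeError:
    rejected = True
```

Dropping the per-instance `__dict__` is where the memory saving comes from, which matters when an index holds many thousands of entries. The trade-off is that attributes can no longer be added to instances ad hoc.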
----
0.2.5
- Add all pairs of Wikimatrix -- 1,617 of them
- Add support for specifying `cols` of `.tsv` file
- Add all Europarlv7 sets
- Remove hin-eng `dict` from JoshuaIndianParallelCorpus
- Remove Wikimatrix1 from statmt -- they are moved to separate file
----
0.2.4
- File locking using portalocker to deal with race conditions when multiple `mtdata get` are invoked in parallel
- Remove language name from local file name -- the same tar file can be used by many languages, and they don't need separate copies
- CLI to print version name
- Added KFTT Japanese-English set