- 2020-01-15
- Important changes
- Distribute `ginza` and `ja_ginza` from PyPI
- Simple installation: run `pip install ginza`, then run `ginza`
- The model package, `ja_ginza`, is also available from PyPI.
- Model improvements
- Change the NER training dataset to GSK2014-A (2019) BCCWJ edition
- Improved accuracy of NER
- `token.ent_type_` values now follow [Sekine's Extended Named Entity Hierarchy](http://liat-aip.sakura.ne.jp/ene/ene8/definition_jp/html/enedetail.html)
- Add `ENE7` attribute to the last field of the output of `ginza`
- Move the [OntoNotes5](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)-based label to `token._.ne`
- We extended the OntoNotes5 named entity labels with `PHONE`, `EMAIL`, `URL`, and `PET_NAME`
- Overall accuracy is improved by executing `spacy pretrain` over 100 epochs
- Multi-task learning in `spacy train` works effectively on UD Japanese BCCWJ
- Update to the newest `SudachiDict_core-20191224`
- `ginzame`
- Run `sudachipy` through `multiprocessing.Pool` and output results in a `mecab`-like format
- The `sudachipy` command now requires a SudachiDict package to be installed separately
- Breaking API Changes
- commands
- `ginza` (`ginza.command_line.main_ginza`)
- change option `mode` to `sudachipy_mode`
- drop options: `disable_pipes` and `recreate_corrector`
- add options: `hash_comment`, `parallel`, `files`
- add `mecab` to the choices for the `-f` option
- add `parallel NUM_PROCESS` option (EXPERIMENTAL)
- add `ENE7` attribute to the CoNLL-U miscellaneous field
- `ginza.ent_type_mapping.ENE_NE_MAPPING` is used to convert `ENE7` labels to `NE` labels
- add `ginzame` (`ginza.command_line.main_ginzame`)
- a multi-process tokenizer providing a `mecab`-like output format
- spaCy field extensions
- add `token._.ne` for the NER label
- `ginza/sudachipy_tokenizer.py`
- change `SudachiTokenizer` to `SudachipyTokenizer`
- use `SUDACHI_DEFAULT_SPLIT_MODE` instead of `SUDACHI_DEFAULT_SPLITMODE` or `SUDACHI_DEFAULT_MODE`
- Dependencies
- upgrade `spacy` to v2.2.3
- upgrade `sudachipy` to v0.4.2
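
As a rough illustration of the two label sets described in this release, the sketch below collects, for each entity token, the `token.ent_type_` value (Sekine's Extended Named Entity label) alongside the OntoNotes5-based `token._.ne` value. The helper names `entity_labels` and `fake_token` are hypothetical, and the stand-in tokens and their label values only mimic the attribute layout so the snippet runs without the model. With GiNZA installed, a `doc` produced by `spacy.load("ja_ginza")` would presumably be passed instead.

```python
from types import SimpleNamespace


def entity_labels(doc):
    """Collect (text, ent_type_, _.ne) for every token that carries an entity type.

    `doc` is any iterable of token-like objects exposing `text`, `ent_type_`
    and the `_.ne` extension named in this changelog.
    """
    return [(t.text, t.ent_type_, t._.ne) for t in doc if t.ent_type_]


def fake_token(text, ene="", ne=""):
    # Stand-in for a spaCy token; the label values used below are
    # illustrative assumptions, not output produced by the model.
    return SimpleNamespace(text=text, ent_type_=ene, _=SimpleNamespace(ne=ne))


demo_doc = [fake_token("銀座", "City", "GPE"), fake_token("で")]
demo_labels = entity_labels(demo_doc)
print(demo_labels)

# With GiNZA and the model installed (not executed here):
#   import spacy
#   nlp = spacy.load("ja_ginza")
#   print(entity_labels(nlp("銀座でランチをご一緒しましょう。")))
```

Code that previously read OntoNotes-style labels from `token.ent_type_` would, under this reading of the changelog, switch to `token._.ne`.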
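
The `ginzame` idea of tokenizing through `multiprocessing.Pool` and printing a `mecab`-like format can be sketched as follows. This is not GiNZA's implementation: `toy_tokenize` is a hypothetical whitespace splitter standing in for `sudachipy`, the single `*` feature is a placeholder, and only the general `surface<TAB>features` plus `EOS` layout of `mecab` output is assumed.

```python
from multiprocessing import Pool


def toy_tokenize(line):
    # Hypothetical stand-in for a real tokenizer such as sudachipy:
    # splits on whitespace and attaches a placeholder feature string.
    return [(surface, "*") for surface in line.split()]


def to_mecab_like(tokens):
    # One "surface<TAB>features" row per token, terminated by an EOS line.
    rows = [f"{surface}\t{features}" for surface, features in tokens]
    return "\n".join(rows + ["EOS"])


def analyze_line(line):
    # Top-level function so that worker processes can pickle it.
    return to_mecab_like(toy_tokenize(line))


def analyze_parallel(lines, processes=2):
    # Pool.map preserves input order, so results line up with the input.
    with Pool(processes) as pool:
        return pool.map(analyze_line, lines)


if __name__ == "__main__":
    for block in analyze_parallel(["銀座 で ランチ", "ご一緒 しましょう"]):
        print(block)
```

Splitting the work per input line keeps each task independent, which is what makes a process pool a natural fit for this kind of tokenization.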