We are excited to announce the release of PyThaiNLP 5.0! PyThaiNLP is a Python library for Thai natural language processing (NLP).
With PyThaiNLP 5.0, you can expect improved performance and accuracy for NLP tasks in Thai. We have also added new functions to make your NLP tasks even easier and more efficient.
Install: `pip install pythainlp`
Upgrade: `pip install -U pythainlp`
- Documentation: https://pythainlp.github.io/docs/5.0
- Report bug: https://github.com/PyThaiNLP/pythainlp/issues
See PyThaiNLP 5.0 Change Log: https://github.com/PyThaiNLP/pythainlp/issues/788.
What is new?
License information
- Use [SPDX license identifier](https://spdx.org/licenses/) at the header of source code #876
Deprecation and other API changes
- Change default NER to thainer-v2 https://github.com/PyThaiNLP/pythainlp/commit/5e97e7c4ebcf68bca64e4f942c8dfe3a5ab2ebc5
- Move `pythainlp.util.is_native_thai` to `pythainlp.morpheme.is_native_thai` https://github.com/PyThaiNLP/pythainlp/commit/524759ac1926fb9837bb9464f0a40cd984af2608
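
For code that has to run against both 4.x and 5.0, the move of `is_native_thai` can be absorbed with a small import shim. This is a minimal sketch, not an official migration recipe: the helper name `load_is_native_thai` is hypothetical, and it only assumes the two module paths named in the entry above.

```python
import importlib


def load_is_native_thai():
    """Return is_native_thai from whichever module provides it, or None.

    Hypothetical helper for illustration; not part of PyThaiNLP.
    """
    # Try the 5.0 location first, then fall back to the pre-5.0 location.
    for module_name in ("pythainlp.morpheme", "pythainlp.util"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or PyThaiNLP itself) is not available; try the next one.
            continue
        func = getattr(module, "is_native_thai", None)
        if func is not None:
            return func
    return None


is_native_thai = load_is_native_thai()
if is_native_thai is None:
    print("is_native_thai not found; is PyThaiNLP installed?")
else:
    print("Using is_native_thai from", is_native_thai.__module__)
```

Once 4.x support is no longer needed, a plain `from pythainlp.morpheme import is_native_thai` should be enough.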
Dependency
- Add tzdata as a dependency on Windows by BLKSerene in 841
New API
- Add `pythainlp.coref` for Thai coreference resolution 802
- Add `wtpsplit` to sentence segmentation & paragraph segmentation 804 and add `paragraph_threshold` into `paragraph_tokenize()` function 806
- Add word approximation to `pythainlp.soundex.sound` 809 by wannaphong
- Add `pythainlp.wsd` for Thai word sense disambiguation 818 by wannaphong
- Add `pythainlp.chat` and `WangChanGLM` to `pythainlp.generate` 819 by wannaphong
- Add `pythainlp.cls`, a param-free classification model 821 by c4n
- Add `pythainlp.el` entity linking 822 by wannaphong
- Add `pythainlp.ancient` by wannaphong in 833
- Add `pythainlp.util.rhyme` by wannaphong in 849
- Add `remove_trailing_repeat_consonants` by konbraphat51 in 862
- Add `pythainlp.util.to_idn` by wannaphong in 875
- Add `pythainlp.corpus.find_synonyms` by wannaphong in 890
- Add `pythainlp.util.morse` by wannaphong in 891
- Add `pythainlp.morpheme` by wannaphong in 896
Improve
- Update code comments and clean up code by BLKSerene in 845
- Improve the documentation by fixing typos and adding necessary details, explanations of the code, and missing information about models and examples, by Saharshjain78 in 850
- Fix tests of khavee functions by BLKSerene in 854
- Update GitHub Actions versions by bact in 878
- Fix ruff args in workflow by bact in 880
- Revise ruff args in workflow by bact in 881
- Fix coref return type and add fallback by bact in 883
- Fix wrong/incompatible types, code readability by bact in 884
- Bump protobuf from 3.20 to 3.20.2 in 885
- Add license info to /tests and README_TH.md by bact in 886
- phayathaibert, khavee, parse: Code clean up by bact in 889
- ruff: docstring-code-format = true by bact in 892
Tokenizer
- Add wtpsplit engine to sentence_tokenize 804
- New `paragraph_tokenize` function to split Thai text into paragraphs 804
- Add `paragraph_threshold` into `paragraph_tokenize()` function 806 by pavaris-pm
- Add 🪿 Han-solo by wannaphong in 830
- Fix `newmm` to better handle non-Thai characters in tokens 856 by konbraphat51
- Fix incorrect passing of flags to re.split by hauntsaninja in 832
- Add syllable_tokenize by wannaphong in 834
- Add wanchanberta_thai_grammarly by wannaphong in 836
- Add extra segmentation style for paragraph_tokenize function by pavaris-pm in 844
- Improve: [newmm tokenizer] Change regular expression of "non-thai-characters" by konbraphat51 in 856
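
The tokenizer changes above can be tried with a short script. This is a rough sketch under stated assumptions: the wrapper `split_paragraphs` and the sample sentence are made up for illustration, `paragraph_tokenize` is assumed to need the optional `wtpsplit` dependency (and may download a model on first use), so it is wrapped in a function and not called here. The `newmm` call falls back to returning the text unsplit when PyThaiNLP is not installed.

```python
try:
    from pythainlp.tokenize import word_tokenize
except ImportError:  # PyThaiNLP is not installed
    word_tokenize = None


def split_paragraphs(text, threshold=0.5):
    """Split Thai text into paragraphs, each a list of sentences.

    Hypothetical wrapper for illustration. The import is lazy because
    paragraph_tokenize is assumed to require PyThaiNLP 5.0 and wtpsplit.
    """
    from pythainlp.tokenize import paragraph_tokenize

    # paragraph_threshold is the parameter added in this release.
    return paragraph_tokenize(text, paragraph_threshold=threshold)


# Mixed Thai and non-Thai text, the case the newmm fix targets.
mixed_text = "ฉันใช้ PyThaiNLP 5.0 ทุกวัน"
if word_tokenize is not None:
    tokens = word_tokenize(mixed_text, engine="newmm")
else:
    tokens = [mixed_text]  # fallback so the script still runs
print(tokens)
```

With `wtpsplit` installed, `split_paragraphs(long_text, threshold=0.3)` would be the way to experiment with the threshold. How the value affects the split depends on the underlying model.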
Tag
- Add function for pos tag with transformers by MpolaarbearM in 857
- Update pos_tag_transformers function by pavaris-pm in 865
- Add PhayaThaiBERT engine with new features by pavaris-pm in 873
Chat
- Fix bug 828
Translate
- Add small100 to `pythainlp.translate` 815 by wannaphong
Transliterate
- Fix duplicate keys in ISO 11940 and IPA-RTGS phoneme mapping 851 852 by BLKSerene and bact
- Fix duplicate key in IPA to RTGS phoneme mapping by BLKSerene in 852
Corpus
- Add `pythainlp.corpus.thai_orst_words()` Thai word list from Royal Society of Thailand (ORST) 810 by wannaphong
- Add `pythainlp.corpus.thai_wikipedia_titles()` Thai word list (noun and noun phrases) from Thai Wikipedia titles 869 by konbraphat51
- Add `pythainlp.corpus.thai_volubilis_words()` Thai word list from Volubilis dictionary 870 by konbraphat51
- Add `pythainlp.corpus.thai_icu_words()` Thai word list from ICU BreakIterator dictionary 879 by pavaris-pm
- Rename Volubilis/Wikipedia corpus function names for consistency / Fix types by bact in 882
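
The new word lists can be loaded and compared side by side. The following is a sketch only: `load_word_lists` is a hypothetical helper, and it looks the functions up by the names given in the entries above, skipping any that the installed version does not provide.

```python
import importlib

# Function names as listed in the release notes above.
WORD_LIST_FUNCS = (
    "thai_orst_words",
    "thai_wikipedia_titles",
    "thai_volubilis_words",
    "thai_icu_words",
)


def load_word_lists():
    """Return {function name: word collection} for the lists that are available.

    Hypothetical helper for illustration; not part of PyThaiNLP.
    """
    try:
        corpus = importlib.import_module("pythainlp.corpus")
    except ImportError:
        return {}  # PyThaiNLP is not installed
    word_lists = {}
    for name in WORD_LIST_FUNCS:
        loader = getattr(corpus, name, None)
        if loader is not None:
            word_lists[name] = loader()
    return word_lists


word_lists = load_word_lists()
for name, words in word_lists.items():
    print(name, len(words))
```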
Util
- Add `pythainlp.util.encoding` 813 by wannaphong
- Add `pythainlp.util.spell_words` 817 by wannaphong
- Add `pythainlp.util.remove_trailing_repeat_consonants()` 862 by konbraphat51
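
As a quick illustration of `remove_trailing_repeat_consonants()`, the sketch below passes it a word with an exaggerated trailing consonant, as is common in informal Thai writing. The sample word is an assumption chosen for the demo, and the import is guarded so the script still runs on older versions or without PyThaiNLP installed.

```python
try:
    from pythainlp.util import remove_trailing_repeat_consonants
except ImportError:  # PyThaiNLP < 5.0, or not installed
    remove_trailing_repeat_consonants = None

# Informal spelling with a repeated final consonant.
text = "เริ่ดดดดดดดด"

if remove_trailing_repeat_consonants is not None:
    cleaned = remove_trailing_repeat_consonants(text)
else:
    cleaned = text  # fallback: leave the text unchanged

print(cleaned)
```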
New Contributors
- pavaris-pm made their first contribution in 806
- hauntsaninja made their first contribution in 832
- Saharshjain78 made their first contribution in 850
- konbraphat51 made their first contribution in 856
- MpolaarbearM made their first contribution in 857
**Full Changelog**: https://github.com/PyThaiNLP/pythainlp/compare/v4.0.2...v5.0.0
Contributors
<a href="https://github.com/PyThaiNLP/pythainlp/graphs/contributors">
<img src="https://contributors-img.firebaseapp.com/image?repo=PyThaiNLP/pythainlp" />
</a>
Thanks to all the [contributors](https://github.com/PyThaiNLP/pythainlp/graphs/contributors). (Image made with [contributors-img](https://contributors-img.firebaseapp.com))