Thaixtransformers

Latest version: v0.1.0


Fill-mask output (top predictions):

[{'score': 0.05261131376028061,
  'token': 6052,
  'token_str': 'อินเทอร์เน็ต',
  'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
 {'score': 0.03980604186654091,
  'token': 11893,
  'token_str': 'อ่านหนังสือ',
  'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
 ...]
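Each entry in a fill-mask result is a dict carrying a `score`, a `token` id, the decoded `token_str`, and the filled-in `sequence`. As a minimal sketch, the two predictions shown above can be ranked in plain Python. The helper name `best_prediction` is hypothetical and is not part of thaixtransformers or transformers.

```python
# Two predictions copied from the sample output; in real use this list
# would come from calling the fill-mask pipeline.
predictions = [
    {
        "score": 0.05261131376028061,
        "token": 6052,
        "token_str": "อินเทอร์เน็ต",
        "sequence": "ผมชอบอินเทอร์เน็ตมากๆ",
    },
    {
        "score": 0.03980604186654091,
        "token": 11893,
        "token_str": "อ่านหนังสือ",
        "sequence": "ผมชอบอ่านหนังสือมากๆ",
    },
]


def best_prediction(results):
    """Return the candidate with the highest score (hypothetical helper)."""
    return max(results, key=lambda item: item["score"])


top = best_prediction(predictions)
print(top["token_str"], round(top["score"], 4))
```

Because the selection is done by score rather than by position, it does not rely on the pipeline returning candidates in any particular order.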

Preprocess

To preprocess data before training a model, use `process_transformers` from the `preprocess` module.

> from thaixtransformers.preprocess import process_transformers

> process_transformers(str) -> str

**Example**

python
from thaixtransformers.preprocess import process_transformers

print(process_transformers("สวัสดี :D"))
# output: 'สวัสดี<_>:d'
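When a whole corpus needs the same treatment before training, the function can be mapped over every text. The sketch below assumes only what the example shows, namely that `process_transformers` takes a string and returns a string. `preprocess_corpus` is a hypothetical helper, and a stand-in normalizer is used so the snippet runs without the package installed.

```python
from typing import Callable, Iterable, List


def preprocess_corpus(texts: Iterable[str], fn: Callable[[str], str]) -> List[str]:
    """Apply a str -> str preprocessing function to each non-empty text."""
    cleaned = []
    for text in texts:
        text = text.strip()
        if text:  # skip blank entries so they never reach the tokenizer
            cleaned.append(fn(text))
    return cleaned


# Stand-in for illustration only; it is NOT the library's behaviour.
def demo_normalizer(text: str) -> str:
    return text.lower()


corpus = ["Hello World", "   ", "Thai NLP :D"]
result = preprocess_corpus(corpus, demo_normalizer)
print(result)

# With the package installed, the real function would be passed instead:
#   from thaixtransformers.preprocess import process_transformers
#   result = preprocess_corpus(corpus, process_transformers)
```

Passing the preprocessing function as an argument keeps the helper independent of the library, so it can be exercised with any string-to-string callable.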

BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

0.1.0

**Full Changelog**: https://github.com/PyThaiNLP/thaixtransformers/commits/v0.1.0

ThaiXtransformers

<a target="_blank" href="https://colab.research.google.com/github/PyThaiNLP/thaixtransformers/blob/main/notebooks/wangchanberta_getting_started_aireseach.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Use pretrained RoBERTa-based Thai language models from VISTEC-depa AI Research Institute of Thailand.**

Forked from [vistec-AI/thai2transformers](https://github.com/vistec-AI/thai2transformers).

This project provides the tokenizer and data preprocessing for the RoBERTa models from VISTEC-depa AI Research Institute of Thailand.

Paper: [WangchanBERTa: Pretraining transformer-based Thai Language Models](https://arxiv.org/abs/2101.09635)

Install

> pip install thaixtransformers
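To confirm that the install succeeded, the installed version can be read back from the package metadata. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). `installed_version` is a hypothetical helper name, and the sketch assumes the distribution name matches the `pip install` name above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("thaixtransformers")
if found is None:
    print("thaixtransformers is not installed")
else:
    print(f"thaixtransformers {found} is installed")
```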

Usage

Tokenizer

> from thaixtransformers import Tokenizer

To use a model, load its tokenizer by model name.

> Tokenizer(model_name) -> Tokenizer

**Example**

python
from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")

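classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# The input sentence means "I like <mask> very much".
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output: a list of dicts, each with 'score', 'token', 'token_str', and 'sequence' keys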
