Hi everyone, I'm happy to share the first minor release of calamanCy!
This release adds our first `tl_calamancy` models in varying sizes to suit different performance and accuracy requirements. The table below shows more information about these pipelines.
Models
The models are [hosted on Hugging Face](https://huggingface.co/ljvmiranda921), but you can also use the `calamancy` library to download and access them.
| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| [tl_calamancy_md](https://huggingface.co/ljvmiranda921/tl_calamancy_md) (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys) |
| [tl_calamancy_lg](https://huggingface.co/ljvmiranda921/tl_calamancy_lg) (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k keys) |
| [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
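
To give a rough idea of how choosing and loading one of these pipelines could look, here is a minimal sketch. The sizes come from the table above. `pick_model` and `load_pipeline` are hypothetical helpers, not part of the library, and the "largest model that fits" rule is only an assumption about what you might want. The `calamancy.load("<name>-<version>")` call and the `0.1.0` version suffix are assumptions too, so please check the README for the exact model names.

```python
# Model sizes in MB, copied from the table above.
MODEL_SIZES_MB = {
    "tl_calamancy_md": 73.7,
    "tl_calamancy_lg": 431.9,
    "tl_calamancy_trf": 775.6,
}


def pick_model(has_gpu: bool, max_size_mb: float = float("inf")) -> str:
    """Hypothetical helper: return the largest model that fits the constraints.

    The transformer model is skipped when no GPU is available, since the
    table describes it as GPU-optimized.
    """
    candidates = [
        name
        for name, size in MODEL_SIZES_MB.items()
        if size <= max_size_mb and (has_gpu or not name.endswith("_trf"))
    ]
    if not candidates:
        raise ValueError("No model fits the given constraints")
    return max(candidates, key=MODEL_SIZES_MB.get)


def load_pipeline(name: str, version: str = "0.1.0"):
    """Load a pipeline through calamancy.

    Assumes `pip install calamancy` and network access for the first
    download. The version suffix is an assumption, not taken from the table.
    """
    import calamancy  # imported lazily so pick_model works without it

    return calamancy.load(f"{name}-{version}")


# Example usage (needs calamancy installed, so it is left commented out):
# nlp = load_pipeline(pick_model(has_gpu=False, max_size_mb=100))
# doc = nlp("Ako si Juan de la Cruz")
# print([(ent.text, ent.label_) for ent in doc.ents])
```

The returned object should behave like a regular spaCy pipeline, so the usual `doc.ents` and token attributes apply.
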
Data sources
The table below shows the data sources used to train the pipelines. Note that the Ugnayan treebank is not licensed for commercial use, while TLUnified is under GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application. I'd definitely like to gain access to commercial-friendly datasets (or develop my own). If you have any leads or just wanna help out, feel free to contact me by e-mail ([ljvmiranda at gmail dot com](mailto:ljvmiranda@gmail.com))!
| Source | Authors | License |
|----------------------------------------------------------------------------------------|--------------------------------------------------|-----------------|
| [TLUnified Dataset](https://aclanthology.org/2022.lrec-1.703/) | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |