Main changes
- Add a Wasm demo https://github.com/daac-tools/vibrato/pull/115
- Handle locale on the Wasm demo https://github.com/daac-tools/vibrato/pull/119
- Add bi-gram feature info generator for MeCab models https://github.com/daac-tools/vibrato/pull/121
- Embed a magic number into a model https://github.com/daac-tools/vibrato/pull/129
Precompiled model files
We provide precompiled models for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release. The licenses are contained in each file.
All models were compiled and modified as described in [compile.md](https://github.com/daac-tools/vibrato/blob/main/docs/compile.md) and [map.md](https://github.com/daac-tools/vibrato/blob/main/docs/map.md). We trained the mappings of connection IDs using the CORE data in [BCCWJ v1.1](https://clrd.ninjal.ac.jp/bccwj/) (except the PN category).
Note that all models are compressed in [zstd](https://github.com/facebook/zstd) format. The Vibrato CLIs accept the compressed files directly, but the `vibrato` APIs do not decompress them, so you need to decompress the files yourself before passing them to the APIs (see [README](https://github.com/daac-tools/vibrato/blob/main/README.md)).
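As a minimal sketch of the API-side caveat above, the following standard-library-only helper checks whether a file still begins with the Zstandard frame magic number (`0xFD2FB528`, stored little-endian), which indicates that it has not been decompressed yet. The name `starts_with_zstd_magic` is hypothetical and not part of the `vibrato` crate, and the check covers only the standard frame magic, not skippable frames.

```rust
use std::io::{self, Read};

/// First four bytes of a Zstandard frame: magic number 0xFD2FB528 in little-endian order.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Returns `true` if the stream begins with the Zstandard frame magic.
/// Hypothetical helper for illustration; it is not provided by `vibrato`.
pub fn starts_with_zstd_magic<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut head = [0u8; 4];
    match reader.read_exact(&mut head) {
        Ok(()) => Ok(head == ZSTD_MAGIC),
        // Fewer than four bytes available, so this cannot be a zstd frame.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

Calling this on `std::fs::File::open(path)?` before loading a dictionary would tell you whether the file still needs to go through a zstd decoder, as described in the README. The helper consumes up to four bytes from the reader, so reopen the file before loading it.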
Models trained using Vibrato
The following five variants were trained using [BCCWJ v1.1](https://clrd.ninjal.ac.jp/bccwj/) (except the PN category) and [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/).
- `bccwj-suw+unidic-cwj-3_1_1`: A standard version.
- `bccwj-suw+unidic-cwj-3_1_1+compact`: A smaller (but slower) version that compresses the connection matrix in the manner described in [small-dic.md](https://github.com/daac-tools/vibrato/blob/main/docs/small-dic.md).
- `bccwj-suw+unidic-cwj-3_1_1+compact-dual`: An intermediate version between the above two, built with the `dual-connector` technique.
- `bccwj-suw+unidic-cwj-3_1_1-extracted+compact`: An even smaller version that contains only POS and pronunciation features.
- `bccwj-suw+unidic-cwj-3_1_1-extracted+compact-dual`: The `dual-connector` version of the `-extracted+compact` model.
These models were trained with L1-regularization.
Models converted from publicly-available resources
- `ipadic-mecab-2_7_0` from [IPADIC v2.7.0](https://taku910.github.io/mecab/)
- `jumandic-mecab-7_0` from [mecab-jumandic-utf8 v7.0](https://packages.ubuntu.com/ja/bionic/mecab-jumandic-utf8)
- `naist-jdic-mecab-0_6_3b` from [NAIST Japanese Dictionary v0.6.3b](https://ja.osdn.net/projects/naist-jdic/)
- `unidic-mecab-2_1_2` from [UniDic v2.1.2](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1+compact` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/), whose connection matrix is compressed in the manner of [mecab_smalldic](https://github.com/daac-tools/vibrato/tree/main/examples/mecab_smalldic).
- `unidic-cwj-3_1_1+compact-dual` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/), which is the `dual-connector` version.
Statistics for compressed UniDic models
The following table shows UniDic model sizes in three versions: standard, `+compact`, and `+compact-dual`. The sizes are for the uncompressed model files, not the zstd-compressed ones.
| Models | Standard | Compact | Compact-dual |
|---|---:|---:|---:|
| bccwj-suw+unidic-cwj-3_1_1 | 618 MB | 248 MB | 275 MB |
| unidic-cwj-3_1_1 | 717 MB | 252 MB | 300 MB |
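
To read the table as relative sizes, the short sketch below expresses a variant's size as a percentage of the standard model. The function name `percent_of_standard` is illustrative only, and the inputs are taken directly from the table above.

```rust
/// Size of a model variant as a percentage of the standard model's size.
/// Both arguments are sizes in MB, as listed in the table.
pub fn percent_of_standard(standard_mb: f64, variant_mb: f64) -> f64 {
    variant_mb / standard_mb * 100.0
}
```

With the table's figures, `percent_of_standard(618.0, 248.0)` is about 40.1 and `percent_of_standard(618.0, 275.0)` is about 44.5 for `bccwj-suw+unidic-cwj-3_1_1`; for `unidic-cwj-3_1_1`, the compact and compact-dual variants come to about 35.1 and 41.8 percent of the standard size.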