Vibrato

Latest version: v0.2.0


0.5.0

Main changes

- Add a Wasm demo https://github.com/daac-tools/vibrato/pull/115
- Handle locale on the Wasm demo https://github.com/daac-tools/vibrato/pull/119
- Add bi-gram feature info generator for MeCab models https://github.com/daac-tools/vibrato/pull/121
- Embed a magic number into a model https://github.com/daac-tools/vibrato/pull/129

Precompiled model files

We provide precompiled models for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release. The licenses are contained in each file.

All models were compiled and modified as described in [compile.md](https://github.com/daac-tools/vibrato/blob/main/docs/compile.md) and [map.md](https://github.com/daac-tools/vibrato/blob/main/docs/map.md). We trained the connection-id mappings on the CORE data in [BCCWJ v1.1](https://clrd.ninjal.ac.jp/bccwj/) (excluding the PN category).

Note that all models are compressed in [zstd](https://github.com/facebook/zstd) format. The Vibrato CLIs accept them directly, but the `vibrato` APIs do not decompress them, so you need to decompress the files yourself before passing them to the APIs (see [README](https://github.com/daac-tools/vibrato/blob/main/README.md)).
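Since compressed and uncompressed files differ in their leading bytes, a caller can check whether a file still needs decompression before handing it to the APIs. The following is a minimal sketch and not part of Vibrato: the helper name `is_zstd_compressed` is hypothetical, and the check relies only on the magic number that opens a standard zstd frame (`0xFD2FB528`, stored little-endian). It uses only the Python standard library and does not perform the decompression itself.

```python
import io

# A standard zstd frame starts with the magic number 0xFD2FB528,
# stored little-endian, i.e. the bytes 28 B5 2F FD.
ZSTD_MAGIC = bytes([0x28, 0xB5, 0x2F, 0xFD])


def is_zstd_compressed(stream):
    """Return True if the seekable binary stream starts with the zstd magic.

    Hypothetical helper for illustration; it is not a Vibrato API.
    """
    pos = stream.tell()
    head = stream.read(len(ZSTD_MAGIC))
    stream.seek(pos)  # restore the position so the caller can read from the start
    return head == ZSTD_MAGIC


# Usage with in-memory data standing in for real model files.
compressed = io.BytesIO(ZSTD_MAGIC + b"\x00rest-of-frame")
plain = io.BytesIO(b"not a zstd frame")
print(is_zstd_compressed(compressed))  # True
print(is_zstd_compressed(plain))       # False
```

With a real file, the same function can be called on an object opened with `open(path, "rb")`; if it returns `True`, decompress the data with a zstd library of your choice before loading it.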

Models trained using Vibrato

The following variants were trained using [BCCWJ v1.1](https://clrd.ninjal.ac.jp/bccwj/) (except the PN category) and [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/).

- `bccwj-suw+unidic-cwj-3_1_1`: A standard version.
- `bccwj-suw+unidic-cwj-3_1_1+compact`: A smaller (but slower) version that compresses the connection matrix in the manner described in [small-dic.md](https://github.com/daac-tools/vibrato/blob/main/docs/small-dic.md).
- `bccwj-suw+unidic-cwj-3_1_1+compact-dual`: A version intermediate between the above two, built with the `dual-connector` technique.
- `bccwj-suw+unidic-cwj-3_1_1-extracted+compact`: A further smaller version that contains only POS and pronunciation features.
- `bccwj-suw+unidic-cwj-3_1_1-extracted+compact-dual`: The `dual-connector` version.

These models were trained with L1-regularization.

Models converted from publicly-available resources

- `ipadic-mecab-2_7_0` from [IPADIC v2.7.0](https://taku910.github.io/mecab/)
- `jumandic-mecab-7_0` from [mecab-jumandic-utf8 v7.0](https://packages.ubuntu.com/ja/bionic/mecab-jumandic-utf8)
- `naist-jdic-mecab-0_6_3b` from [NAIST Japanese Dictionary v0.6.3b](https://ja.osdn.net/projects/naist-jdic/)
- `unidic-mecab-2_1_2` from [UniDic v2.1.2](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1+compact` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/), whose connection matrix is compressed in the manner of [mecab_smalldic](https://github.com/daac-tools/vibrato/tree/main/examples/mecab_smalldic).
- `unidic-cwj-3_1_1+compact-dual` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/), which is the `dual-connector` version.

Statistics for compressed UniDic models

The following table shows UniDic model sizes in three versions: standard, `+compact`, and `+compact-dual`. The sizes are for the uncompressed files (not in zstd format).

| Models | Standard | Compact | Compact-dual |
|---|---:|---:|---:|
| bccwj-suw+unidic-cwj-3_1_1 | 618 MB | 248 MB | 275 MB |
| unidic-cwj-3_1_1 | 717 MB | 252 MB | 300 MB |
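To make the savings easier to compare, the sizes in the table can be turned into relative reductions. The snippet below is a small illustrative calculation using only the numbers listed above; the names `SIZES_MB` and `reduction_percent` are introduced here for the example.

```python
# Sizes in MB, copied from the table above.
SIZES_MB = {
    "bccwj-suw+unidic-cwj-3_1_1": {"standard": 618, "compact": 248, "compact-dual": 275},
    "unidic-cwj-3_1_1": {"standard": 717, "compact": 252, "compact-dual": 300},
}


def reduction_percent(standard, variant):
    """Return how much smaller `variant` is than `standard`, in percent."""
    return round(100 * (1 - variant / standard), 1)


for model, sizes in SIZES_MB.items():
    for variant in ("compact", "compact-dual"):
        pct = reduction_percent(sizes["standard"], sizes[variant])
        print(f"{model} +{variant}: {pct}% smaller than standard")
```

For these figures, the compact models come out roughly 60% to 65% smaller than the standard ones, and the compact-dual models roughly 55% to 58% smaller.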

0.4.0

Main changes

- Handle zstd-compressed dictionaries in all CLIs https://github.com/daac-tools/vibrato/pull/112

Precompiled dictionary files

We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.

The following variants are distributed:

- `ipadic-mecab-2_7_0/system.dic.zst` from [IPADIC v2.7.0](https://taku910.github.io/mecab/)
- `ipadic-mecab-2_7_0-small/system.dic.zst` from [IPADIC v2.7.0](https://taku910.github.io/mecab/)
  - A smaller version that contains only the features `品詞-品詞細分類1` and `発音`.
- `jumandic-mecab-7_0/system.dic.zst` from [mecab-jumandic-utf8 v7.0](https://packages.ubuntu.com/ja/bionic/mecab-jumandic-utf8)
- `naist-jdic-mecab-0_6_3b/system.dic.zst` from [NAIST Japanese Dictionary v0.6.3b](https://ja.osdn.net/projects/naist-jdic/)
- `unidic-mecab-2_1_2/system.dic.zst` from [UniDic v2.1.2](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1/system.dic.zst` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/)

These system dictionaries were compiled and modified as described in [compile.md](https://github.com/daac-tools/vibrato/blob/main/docs/compile.md) and [map.md](https://github.com/daac-tools/vibrato/blob/main/docs/map.md). We trained the connection-id mappings on license-expired data obtained from [Aozora Bunko](https://www.aozora.gr.jp/), following the [guideline](https://www.aozora.gr.jp/guide/kijyunn.html).

The licenses are contained in each file.

0.3.3

Main changes

- Publish members of WordIdx https://github.com/daac-tools/vibrato/pull/104
- Add const variable VERSION https://github.com/daac-tools/vibrato/pull/105

Precompiled dictionary files

You can use those distributed in [Release v0.3.1](https://github.com/daac-tools/vibrato/releases/tag/v0.3.1).

0.3.2

Main changes

- Add train feature flag https://github.com/daac-tools/vibrato/pull/93
- Publish WordIdx and Dictionary::word_feature() https://github.com/daac-tools/vibrato/pull/101
- Separate lifetime parameter in Worker and Tokenizer https://github.com/daac-tools/vibrato/pull/102

Precompiled dictionary files

You can use those distributed in [Release v0.3.1](https://github.com/daac-tools/vibrato/releases/tag/v0.3.1).

0.3.1

Main changes

- Remove preparation scripts and distribute precompiled binaries https://github.com/daac-tools/vibrato/pull/87
- Add DualConnector, a faster and smaller dictionary format https://github.com/daac-tools/vibrato/pull/86

Precompiled dictionary files

We provide precompiled dictionaries for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release.

The following variants are distributed:

- `ipadic-mecab-2_7_0/system.dic` from [IPADIC v2.7.0](https://taku910.github.io/mecab/)
- `jumandic-mecab-7_0/system.dic` from [mecab-jumandic-utf8 v7.0](https://packages.ubuntu.com/ja/bionic/mecab-jumandic-utf8)
- `naist-jdic-mecab-0_6_3b/system.dic` from [NAIST Japanese Dictionary v0.6.3b](https://ja.osdn.net/projects/naist-jdic/)
- `unidic-mecab-2_1_2/system.dic` from [UniDic v2.1.2](https://clrd.ninjal.ac.jp/unidic/)
- `unidic-cwj-3_1_1/system.dic` from [UniDic v3.1.1](https://clrd.ninjal.ac.jp/unidic/)

These system dictionaries were compiled and modified as described in [compile.md](https://github.com/daac-tools/vibrato/blob/main/docs/compile.md) and [map.md](https://github.com/daac-tools/vibrato/blob/main/docs/map.md). We trained the connection-id mappings on license-expired data obtained from [Aozora Bunko](https://www.aozora.gr.jp/), following the [guideline](https://www.aozora.gr.jp/guide/kijyunn.html).

The licenses are contained in each file.

0.3.0

Main changes

- Reorganize builder modules https://github.com/daac-tools/vibrato/pull/74, https://github.com/daac-tools/vibrato/pull/77
- Reorganize workspaces https://github.com/daac-tools/vibrato/pull/80 and their docs https://github.com/daac-tools/vibrato/pull/71
- Add accuracy evaluator https://github.com/daac-tools/vibrato/pull/57
- Add smaller dictionary option https://github.com/daac-tools/vibrato/pull/63
- Support longer input sentences https://github.com/daac-tools/vibrato/pull/72
- Speed up the `tokenize` command when stdout is not TTY https://github.com/daac-tools/vibrato/pull/59
