🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection
- [🤖 New models: StyleTTS 2, Grounding DINO](#new-models)
    - [**StyleTTS 2**: High-quality speech synthesis](#style_text_to_speech_2)
    - [**Grounding DINO**: Zero-shot object detection](#grounding-dino)
- [🛠️ Other improvements](#other-improvements)
- [🤗 New contributors](#new-contributors)
<h2 id="new-models">🤖 New models: StyleTTS 2, Grounding DINO</h2>
<h3 id="style_text_to_speech_2">StyleTTS 2 for high-quality speech synthesis</h3>
See https://github.com/huggingface/transformers.js/pull/1148 for more information and [here](https://huggingface.co/models?other=style_text_to_speech_2&library=transformers.js) for the list of supported models.
First, install the `kokoro-js` library, which uses Transformers.js, from [NPM](https://npmjs.com/package/kokoro-js) using:
```bash
npm i kokoro-js
```
You can then generate speech as follows:
```js
import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");
```
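As a rough sketch of how the generated audio could be inspected before saving, the snippet below computes the duration and peak amplitude of a clip. It does not call `kokoro-js`; it uses a synthetic stand-in object, and it assumes the returned audio exposes an `audio` array of float samples and a `sampling_rate` field. Those field names, the 24 kHz rate, and the `describeAudio` helper are assumptions for illustration only.

```js
// Stand-in for the object returned by `tts.generate()`.
// ASSUMPTION: the real object exposes `audio` (Float32Array of samples)
// and `sampling_rate` (Hz); these names are not confirmed by these notes.
const sampling_rate = 24000; // assumed sampling rate, for illustration only
const fakeAudio = {
  // Two seconds of a 440 Hz sine wave at half amplitude
  audio: new Float32Array(sampling_rate * 2).map(
    (_, i) => 0.5 * Math.sin((2 * Math.PI * 440 * i) / sampling_rate),
  ),
  sampling_rate,
};

// Hypothetical helper: duration in seconds and peak absolute amplitude
function describeAudio({ audio, sampling_rate }) {
  let peak = 0;
  for (const sample of audio) {
    peak = Math.max(peak, Math.abs(sample));
  }
  return { seconds: audio.length / sampling_rate, peak };
}

const info = describeAudio(fakeAudio);
console.log(info);
```

If the assumption about the field names holds, passing the real `audio` object to `describeAudio` would give a quick sanity check (for example, a peak of 0 would indicate silence).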
<h3 id="grounding-dino">Grounding DINO for zero-shot object detection</h3>
See https://github.com/huggingface/transformers.js/pull/1137 for more information and [here](https://huggingface.co/models?other=grounding-dino&library=transformers.js) for the list of supported models.
**Example:** Zero-shot object detection with `onnx-community/grounding-dino-tiny-ONNX` using the `pipeline` API.
```js
import { pipeline } from "@huggingface/transformers";

const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");

const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
  threshold: 0.3,
});
```
<details>
<summary>See example output</summary>

```js
[
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]
```

</details>
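One way to post-process detections of the shape shown in the example output is sketched below: keep only results above a stricter score cutoff and convert each corner-format box to `x`/`y`/`width`/`height`. The helpers `toXYWH` and `summarize` and the `0.4` cutoff are hypothetical, chosen for illustration; they are not part of Transformers.js.

```js
// Detections copied from the example output above.
const detections = [
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
];

// Hypothetical helper: corner format -> top-left corner plus size
function toXYWH({ xmin, ymin, xmax, ymax }) {
  return { x: xmin, y: ymin, width: xmax - xmin, height: ymax - ymin };
}

// Hypothetical helper: keep detections scoring at least `minScore`,
// highest score first, with boxes converted to x/y/width/height
function summarize(items, minScore = 0.4) {
  return items
    .filter((d) => d.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .map((d) => ({ label: d.label, score: d.score, ...toXYWH(d.box) }));
}

const result = summarize(detections);
console.log(result);
```

With the default cutoff only the first detection survives; calling `summarize(detections, 0.3)` keeps both, matching the `threshold: 0.3` used in the pipeline call.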
<h2 id="other-improvements">🛠️ Other improvements</h2>
* Add the RawAudio class by Th3G33k in https://github.com/huggingface/transformers.js/pull/682
* Update React guide for v3 by sroussey in https://github.com/huggingface/transformers.js/pull/1128
* Add option to skip special tokens in TextStreamer by sroussey in https://github.com/huggingface/transformers.js/pull/1139
<h2 id="new-contributors">🤗 New contributors</h2>
* sroussey made their first contribution in https://github.com/huggingface/transformers.js/pull/1128
**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.2.4...3.3.0