## What's new?

### 🎄 7 new architectures!
This release adds support for many new multimodal architectures, bringing the total number of supported architectures to [80](https://huggingface.co/docs/transformers.js/index#models)! 🤯
1. [VITS](https://huggingface.co/docs/transformers/main/en/model_doc/vits) for multilingual text-to-speech in over 1,000 languages! (https://github.com/xenova/transformers.js/pull/466)
```js
import { pipeline } from '@xenova/transformers';

// Create English text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');

// Generate speech
const output = await synthesizer('I love transformers');
// {
//   audio: Float32Array(26112) [...],
//   sampling_rate: 16000
// }
```
https://github.com/xenova/transformers.js/assets/26504141/63c1a315-1ad6-44a2-9a2f-6689e2d9d14e
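If you're running in the browser, one simple way to play the result is with the standard Web Audio API. This is just a minimal sketch that only relies on the `audio` and `sampling_rate` fields shown above:

```js
// Minimal sketch (browser only): play the generated speech with the Web Audio API.
const audioContext = new AudioContext({ sampleRate: output.sampling_rate });

// Copy the Float32Array samples into a single-channel AudioBuffer
const buffer = audioContext.createBuffer(1, output.audio.length, output.sampling_rate);
buffer.copyToChannel(output.audio, 0);

// Play it back through the default output device
const source = audioContext.createBufferSource();
source.buffer = buffer;
source.connect(audioContext.destination);
source.start();
```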
See [here](https://huggingface.co/models?library=transformers.js&other=vits&sort=trending) for the list of available models. To start, we've converted 12 of the [~1140](https://huggingface.co/models?other=mms,vits&sort=trending&search=facebook) models on the Hugging Face Hub. If we haven't added the one you wish to use, you can make it _web-ready_ using our [conversion script](https://huggingface.co/docs/transformers.js/custom_usage#convert-your-models-to-onnx).
2. [CLIPSeg](https://huggingface.co/docs/transformers/main/en/model_doc/clipseg) for zero-shot image segmentation. (https://github.com/xenova/transformers.js/pull/478)
```js
import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');

// Run tokenization
const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const { logits } = await model({ ...text_inputs, ...image_inputs });
// logits: Tensor {
//   dims: [4, 352, 352],
//   type: 'float32',
//   data: Float32Array(495616) [ ... ],
//   size: 495616
// }
```
You can visualize the predictions as follows:
```js
const preds = logits
  .unsqueeze_(1)
  .sigmoid_()
  .mul_(255)
  .round_()
  .to('uint8');

for (let i = 0; i < preds.dims[0]; ++i) {
  const img = RawImage.fromTensor(preds[i]);
  img.save(`prediction_${i}.png`);
}
```
| Original | `"a glass"` | `"something to fill"` | `"wood"` | `"a jar"` |
|--------|--------|--------|--------|--------|
| ![image](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/B4wAIseP3SokRd7Flu1Y9.png) | ![prediction_0](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/s3WBtlA9CyZmm9F5lrOG3.png) | ![prediction_1](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/v4_3JqhAZSfOg60v5x1C2.png) | ![prediction_2](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/MjZLENI9RMaMCGyk6G6V1.png) | ![prediction_3](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/dIHO76NAPTMt9-677yNkg.png) |
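Note that the predictions come out at the model's fixed 352×352 resolution (see the `dims` above). If you want masks that line up with the original image, one option is to resize each mask before saving it. A small sketch, assuming `RawImage`'s async `resize` helper:

```js
// Sketch: upscale each 352x352 mask to the original image size before saving.
for (let i = 0; i < preds.dims[0]; ++i) {
  const mask = RawImage.fromTensor(preds[i]);
  const resized = await mask.resize(image.width, image.height);
  resized.save(`prediction_${i}_fullsize.png`);
}
```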
See [here](https://huggingface.co/models?library=transformers.js&other=clipseg&sort=trending) for the list of available models.
3. [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer) for semantic segmentation and image classification. (https://github.com/xenova/transformers.js/pull/480)
```js
import { pipeline } from '@xenova/transformers';

// Create an image segmentation pipeline
const segmenter = await pipeline('image-segmentation', 'Xenova/segformer_b2_clothes');

// Segment an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const output = await segmenter(url);
```
![image](https://github.com/xenova/transformers.js/assets/26504141/30c9a07f-d6c2-4107-b393-a4ba100c94d3)
<details>
<summary>See output</summary>

```js
[
  {
    score: null,
    label: 'Background',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Hair',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Upper-clothes',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Pants',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Face',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  }
]
```

</details>
See [here](https://huggingface.co/models?library=transformers.js&other=segformer&sort=trending) for the list of available models.
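Since SegFormer also supports image classification, the same `pipeline` API works for classification checkpoints. This is a sketch only: the model ID below is a placeholder, so substitute any converted SegFormer image-classification checkpoint from the list above.

```js
// Sketch: image classification with a SegFormer (MiT) backbone.
// 'Xenova/mit-b0' is a placeholder ID; swap in a converted classification checkpoint.
const classifier = await pipeline('image-classification', 'Xenova/mit-b0');
const results = await classifier(url);
// e.g. [{ label: '...', score: ... }]
```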
4. [Table Transformer](https://huggingface.co/docs/transformers/main/en/model_doc/table-transformer) for table extraction from unstructured documents. (https://github.com/xenova/transformers.js/pull/477)
```js
import { pipeline } from '@xenova/transformers';

// Create an object detection pipeline
const detector = await pipeline('object-detection', 'Xenova/table-transformer-detection', { quantized: false });

// Detect tables in an image
const img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice-with-table.png';
const output = await detector(img);
// [{ score: 0.9967531561851501, label: 'table', box: { xmin: 52, ymin: 322, xmax: 546, ymax: 525 } }]
```
<details>
<summary>Show example output</summary>

![image](https://github.com/xenova/transformers.js/assets/26504141/6ca5eea0-928c-4c13-9ccf-16ed62108054)

</details>
See [here](https://huggingface.co/models?library=transformers.js&other=table-transformer&sort=trending) for the list of available models.
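Table Transformer also has a companion structure-recognition variant for finding rows, columns, and cells inside a detected table. As a sketch, assuming the `Xenova/table-transformer-structure-recognition` conversion is available on the Hub:

```js
// Sketch: detect rows, columns, and cells inside a (cropped) table image.
// Assumes 'Xenova/table-transformer-structure-recognition' has been converted.
const structure_detector = await pipeline('object-detection', 'Xenova/table-transformer-structure-recognition', { quantized: false });
const cells = await structure_detector(img); // ideally pass a crop of the detected table
// e.g. [{ score: ..., label: 'table row', box: { ... } }, ...]
```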
5. [DiT](https://huggingface.co/docs/transformers/main/en/model_doc/dit) for document image classification. (https://github.com/xenova/transformers.js/pull/474)
```js
import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dit-base-finetuned-rvlcdip');

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/coca_cola_advertisement.png';
const output = await classifier(url);
```
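If you want more than the single best label, the image-classification pipeline also accepts a `topk` option. A small variation on the call above (assuming the `topk` parameter name used by the pipeline):

```js
// Return the 5 highest-scoring document classes instead of just the top prediction.
const top5 = await classifier(url, { topk: 5 });
```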