<h3 id="llava_onevision">LLaVA-OneVision for Image-Text-to-Text</h3>
LLaVA-OneVision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone.
**Example:** Multi-round conversations w/ past key value (PKV) caching
```js
import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
  dtype: {
    embed_tokens: 'fp16', // or 'fp32' or 'q8'
    vision_encoder: 'fp16', // or 'fp32' or 'q8'
    decoder_model_merged: 'q4', // or 'q8'
  },
  // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
  { role: 'system', content: 'Answer the question.' },
  { role: 'user', content: `<image>\n${prompt}` },
];
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
  ...text_inputs,
  ...vision_inputs,
  do_sample: false,
  max_new_tokens: 64,
  return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
  sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
  { skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.

const new_messages = [
  ...messages,
  { role: 'assistant', content: answer },
  { role: 'user', content: 'How does the text correlate to the context of the image?' },
];
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response, reusing the cached past key values
const output = await model.generate({
  ...new_text_inputs,
  past_key_values,
  do_sample: false,
  max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
  output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
  { skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.
```
<h3 id="vitpose">ViTPose for pose-estimation</h3>
A state-of-the-art pose estimation model that employs a plain, non-hierarchical vision transformer as a backbone for keypoint estimation, combined with a simple decoder head to predict heatmaps from a given image.
**Example:** Pose estimation w/ `onnx-community/vitpose-base-simple`.
```js
import { AutoModel, AutoImageProcessor, RawImage } from '@huggingface/transformers';

// Load model and processor
const model_id = 'onnx-community/vitpose-base-simple';
const model = await AutoModel.from_pretrained(model_id);
const processor = await AutoImageProcessor.from_pretrained(model_id);

// Load image and prepare inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/ryan-gosling.jpg';
const image = await RawImage.read(url);
const inputs = await processor(image);

// Predict heatmaps
const { heatmaps } = await model(inputs);

// Post-process heatmaps to get keypoints and scores
const boxes = [[[0, 0, image.width, image.height]]];
const results = processor.post_process_pose_estimation(heatmaps, boxes)[0][0];
console.log(results);
```
<details>
<summary>Optionally, visualize the outputs (Node.js usage shown here, using the node-canvas library):</summary>
```js
import { createCanvas, createImageData } from 'canvas';

// Create canvas and draw image
const canvas = createCanvas(image.width, image.height);
const ctx = canvas.getContext('2d');
const imageData = createImageData(image.rgba().data, image.width, image.height);
ctx.putImageData(imageData, 0, 0);

// Draw edges between keypoints
const points = results.keypoints;
ctx.lineWidth = 4;
ctx.strokeStyle = 'blue';
for (const [i, j] of model.config.edges) {
  const [x1, y1] = points[i];
  const [x2, y2] = points[j];
  ctx.beginPath();
  ctx.moveTo(x1, y1);
  ctx.lineTo(x2, y2);
  ctx.stroke();
}

// Draw circle at each keypoint
ctx.fillStyle = 'red';
for (const [x, y] of points) {
  ctx.beginPath();
  ctx.arc(x, y, 8, 0, 2 * Math.PI);
  ctx.fill();
}

// Save image to file
import fs from 'fs';
const out = fs.createWriteStream('pose.png');
const stream = canvas.createPNGStream();
stream.pipe(out);
out.on('finish', () => console.log('The PNG file was created.'));
```
</details>
| Input image | Output image |
| :----------:|:------------:|
| ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/QpXlLNyLDKZUxXjokbUyy.jpeg) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/xj0jaKo9aAOux-NSU8U7S.png) |
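If you only want to draw confident predictions, you can threshold the per-keypoint scores before visualizing. A minimal sketch, assuming `results.scores` holds one confidence value per entry in `results.keypoints` (an assumption based on the post-processing step above):

```js
// Keep only keypoints whose confidence exceeds a threshold (0.3 chosen arbitrarily)
const THRESHOLD = 0.3;
const confident = results.keypoints.filter((_, i) => results.scores[i] > THRESHOLD);
console.log(`Kept ${confident.length} of ${results.keypoints.length} keypoints`);
```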
<h3 id="mgp-str">MGP-STR for Optical Character Recognition (OCR)</h3>
A simple yet powerful vision scene text recognition model, built upon the vision transformer (ViT).
**Example:** Optical Character Recognition (OCR) w/ `onnx-community/mgp-str-base`
```js
import { MgpstrForSceneTextRecognition, MgpstrProcessor, RawImage } from '@huggingface/transformers';

// Load model and processor
const model_id = 'onnx-community/mgp-str-base';
const model = await MgpstrForSceneTextRecognition.from_pretrained(model_id);
const processor = await MgpstrProcessor.from_pretrained(model_id);

// Load image from the IIIT-5k dataset
const url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png";
const image = await RawImage.read(url);

// Preprocess the image
const result = await processor(image);

// Perform inference
const outputs = await model(result);

// Decode the model outputs
const generated_text = processor.batch_decode(outputs.logits).generated_text;
console.log(generated_text); // [ 'ticket' ]
```
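When recognizing many crops, the same steps can be wrapped into a small helper. The function name below is illustrative and simply reuses the calls shown above:

```js
// Hypothetical helper: run OCR on a single image URL and return the recognized string
async function recognizeText(url) {
  const image = await RawImage.read(url);
  const inputs = await processor(image);
  const outputs = await model(inputs);
  return processor.batch_decode(outputs.logits).generated_text[0];
}

console.log(await recognizeText('https://i.postimg.cc/ZKwLg2Gw/367-14.png')); // 'ticket'
```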
<h3 id="patchtst-and-patchtsmixer">PatchTST and PatchTSMixer for time series forecasting.</h3>
PatchTST and PatchTSMixer can be used for multivariate time series forecasting.
**Example:** Time series forecasting w/ `onnx-community/granite-timeseries-patchtst`
```js
import { PatchTSTForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtst";
const model = await PatchTSTForPrediction.from_pretrained(model_id, { dtype: "fp32" });

// Dummy input of shape [batch_size, context_length, num_input_channels]
const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);
```
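The dummy tensor above just fills the expected `[batch_size, context_length, num_input_channels]` layout with synthetic values. Below is a sketch of packing your own measurements into that same layout, assuming row-major flattening consistent with the `dims` used above and a context length/channel count matching this checkpoint (512 and 7); the helper name is illustrative.

```js
// `series` is assumed to be an array of 512 time steps, each an array of 7 channel readings
function toPastValues(series, num_channels = 7) {
  const context_length = series.length;
  const data = new Float32Array(context_length * num_channels);
  for (let t = 0; t < context_length; ++t) {
    for (let c = 0; c < num_channels; ++c) {
      data[t * num_channels + c] = series[t][c];
    }
  }
  // Batch size of 1
  return new Tensor('float32', data, [1, context_length, num_channels]);
}
```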
**Example:** Time series forecasting w/ `onnx-community/granite-timeseries-patchtsmixer`
```js
import { PatchTSMixerForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtsmixer";
const model = await PatchTSMixerForPrediction.from_pretrained(model_id, { dtype: "fp32" });

// Dummy input of shape [batch_size, context_length, num_input_channels]
const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);
```
<h2 id="bug-fixes">🐛 Bug fixes</h2>
* Fix image dimensions being stretched when padding, by BritishWerewolf in https://github.com/huggingface/transformers.js/pull/1015
* fix(scale): add missing scale element by tosinamuda in https://github.com/huggingface/transformers.js/pull/1017
<h2 id="documentation-improvements">📝 Documentation improvements</h2>
* Updated link to sentence similarity models. by uzyn in https://github.com/huggingface/transformers.js/pull/893
* fix(docs): fixed a broken link to quantization guide by ThomasWT in https://github.com/huggingface/transformers.js/pull/1014
* fix(docs): Fixed Typos in README and docs/snippets/6_supported-models.snippet by hitchhiker3010 in https://github.com/huggingface/transformers.js/pull/1030
<h2 id="other-improvements">🛠️ Other improvements</h2>
* Add option to maintain aspect ratio on resize by BritishWerewolf in https://github.com/huggingface/transformers.js/pull/971
* Add functionality to split RawImage into channels; Update slice documentation and tests by BritishWerewolf in https://github.com/huggingface/transformers.js/pull/978
* Avoid resizing images when they already have the desired size by nemphys in https://github.com/huggingface/transformers.js/pull/1027
* Add support for Split pretokenizer w/ `behavior=removed` & `invert=false` by xenova in https://github.com/huggingface/transformers.js/pull/1033
* Add type declaration for `progress_callback` by ocavue in https://github.com/huggingface/transformers.js/pull/1034
* Add support for op_block_list by pdufour in https://github.com/huggingface/transformers.js/pull/1036
<h2 id="new-contributors">🤗 New contributors</h2>
* uzyn made their first contribution in https://github.com/huggingface/transformers.js/pull/893
* ThomasWT made their first contribution in https://github.com/huggingface/transformers.js/pull/1014
* tosinamuda made their first contribution in https://github.com/huggingface/transformers.js/pull/1017
* nemphys made their first contribution in https://github.com/huggingface/transformers.js/pull/1027
* hitchhiker3010 made their first contribution in https://github.com/huggingface/transformers.js/pull/1030
* pdufour made their first contribution in https://github.com/huggingface/transformers.js/pull/1036
**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.0.2...3.1.0