# Transformers-js-py


## 3.2.1

### What's new?
* Add support for ModernBert in https://github.com/huggingface/transformers.js/pull/1104. Check out the [blog post](https://huggingface.co/blog/modernbert) for more information!

**Example:**

```js
import { pipeline } from '@huggingface/transformers';

const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');
const answer = await pipe('The capital of France is [MASK].');
console.log(answer);
```

![image](https://github.com/user-attachments/assets/2d360994-9e5c-4734-a1c0-af4a037671dc)


**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.2.0...3.2.1

## 3.2.0

🔥 Transformers.js v3.2 — Moonshine for real-time speech recognition, Phi-3.5 Vision for multi-frame image understanding and reasoning, and more!

Table of contents:
- [🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE](#new-models)
  - [**Moonshine**: Real-time speech recognition](#moonshine)
  - [**Phi-3.5 Vision**: Multi-frame image understanding and reasoning](#phi3_v)
  - [**EXAONE**: Bilingual (English and Korean) text generation](#exaone)
- [🐛 Bug fixes](#bug-fixes)
- [🛠️ Other improvements](#other-improvements)


<h2 id="new-models">🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE</h2>

<h3 id="moonshine">Moonshine for real-time speech recognition</h3>

Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. They are well-suited to real-time, on-device applications like live transcription and voice command recognition, and are perfect for in-browser usage (check out the online [demo](https://huggingface.co/spaces/webml-community/moonshine-web)). See https://github.com/huggingface/transformers.js/pull/1099 for more information and [here](https://huggingface.co/models?library=transformers.js&other=moonshine) for the list of supported models.

**Example:** Automatic speech recognition w/ Moonshine tiny.
```js
import { pipeline } from "@huggingface/transformers";

const transcriber = await pipeline("automatic-speech-recognition", "onnx-community/moonshine-tiny-ONNX");
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
// { text: 'And so my fellow Americans ask not what your country can do for you as what you can do for your country.' }
```

<details>

<summary>See example using the MoonshineForConditionalGeneration API</summary>

```js
import { MoonshineForConditionalGeneration, AutoProcessor, read_audio } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/moonshine-tiny-ONNX";
const model = await MoonshineForConditionalGeneration.from_pretrained(model_id, {
  dtype: "q4",
});
const processor = await AutoProcessor.from_pretrained(model_id);

// Load audio and prepare inputs
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000);
const inputs = await processor(audio);

// Generate outputs
const outputs = await model.generate({ ...inputs, max_new_tokens: 100 });

// Decode outputs
const decoded = processor.batch_decode(outputs, { skip_special_tokens: true });
console.log(decoded[0]);
// And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.
```

</details>

<h3 id="phi3_v">Phi-3.5 Vision for multi-frame image understanding and reasoning</h3>

Phi-3.5 Vision is a lightweight, state-of-the-art, open multimodal model that can be used for multi-frame image understanding and reasoning. See https://github.com/huggingface/transformers.js/pull/1094 for more information and [here](https://huggingface.co/models?library=transformers.js&other=phi3_v) for the list of supported models.

**Examples:**

<table>
<tr>
<th style="width: 40%;">Input</th>
<th style="width: 60%;">Output</th>
</tr>
<tr>
<td>
"What's funny about this image?"
<img src="https://github.com/user-attachments/assets/631df9f7-6076-47f6-ba8a-8fd34eb34439" width="400" />
</td>
<td>The humor in this image stems from the exaggerated depiction of human evolution, using the Shiba Inu dog breed to represent both ancient and modern humans. The left side shows a muscular, hunter-like figure labeled as 'Humans 100,000 years ago' with the caption 'me hungry me hunt mammoth,' suggesting a time when humans were physically robust and actively hunting. The right side contrasts this with a modern, slim Shiba Inu labeled as 'Humans today' with the caption 'why food delivery slow,' humorously commenting on the modern human's reliance on convenience and technology, such as food delivery services, rather than hunting for sustenance. The use of a dog, which is often associated with loyalty and companionship, adds a layer of irony and humor as it portrays humans in a more diminished, dependent state.</td>
</tr>
<tr>
<td>
"Summarize the deck of slides."
<img src="https://github.com/user-attachments/assets/78f45879-fa75-4636-9a69-57fc33747a6e" width="400" />
<img src="https://github.com/user-attachments/assets/3616cc53-ac1c-4f7e-8fba-55542354fd9a" width="400" />
<img src="https://github.com/user-attachments/assets/a6897365-c502-4815-9160-1092bda679cf" width="400" />
</td>
<td>

<p>To summarize, the slides are composed of these sections:</p>

<ul>
<li>
<strong>Introduction to Azure:</strong>
<p>The presentation introduces Microsoft Azure, a cloud computing platform. It highlights Azure's three service tiers: Hyper-scale, Enterprise, and Hybrid. The presenter is Dinesh Kumar Wickramasinghe, a Senior Software Engineer from CMS Private Limited in Sri Lanka.</p>
</li>
<li>
<strong>Azure Overview:</strong>
<p>Azure is described as Microsoft's cloud computing platform, continuously expanding to meet current and future business challenges. It offers freedom to build, manage, and deploy applications on a global network using preferred tools and frameworks.</p>
</li>
<li>
<strong>Cloud Computing Services:</strong>
<p>The presentation outlines three types of cloud computing services provided by Azure: Infrastructure-as-a-Service (IaaS) with a 'host' component, Platform-as-a-Service (PaaS) with a 'build' component, and Software-as-a-Service (SaaS) with a 'consume' component.</p>
</li>
</ul>

</td>
</tr>
</table>

<details>

<summary>See example code</summary>

**Example:** Single-frame (critique an image)
```js
import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load image
const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/meme.png");

// Prepare inputs
const messages = [
  { role: "user", content: "<|image_1|>What's funny about this image?" },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, image, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});
```

Or, decode the output at the end:
```js
// Decode and display the answer
const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(generated_ids, {
  skip_special_tokens: true,
});
console.log(answer[0]);
```

---

**Example:** Multi-frame (summarize slides)
```js
import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load images
const urls = [
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-1-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-2-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-3-2048.jpg",
];
const images = await Promise.all(urls.map(load_image));

// Prepare inputs
const placeholder = images.map((_, i) => `<|image_${i + 1}|>\n`).join("");
const messages = [
  { role: "user", content: placeholder + "Summarize the deck of slides." },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, images, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});
```


</details>


<h3 id="exaone">EXAONE 3.5 for bilingual (English and Korean) text generation</h3>

EXAONE 3.5 is a collection of instruction-tuned bilingual (English and Korean) generative models, developed and released by LG AI Research. See https://github.com/huggingface/transformers.js/pull/1084 for more information and [here](https://huggingface.co/models?library=transformers.js&other=exaone) for the list of supported models.


**Example:** Text-generation w/ `EXAONE-3.5-2.4B-Instruct`:

```js
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/EXAONE-3.5-2.4B-Instruct",
  { dtype: "q4f16" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```

<details>

<summary>See example output</summary>


Sure! Here's a light joke for you:

Why don't scientists trust atoms?

Because they make up everything!

I hope you found that amusing! If you want another one, feel free to ask!


</details>

<h2 id="bug-fixes">🐛 Bug fixes</h2>

* Fix pyannote processor `post_process_speaker_diarization` in https://github.com/huggingface/transformers.js/pull/1082. Thanks to patrick-ve for reporting the issue!
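
For context, here is a minimal sketch of the code path this fix touches, adapted from the Transformers.js pyannote example; the `onnx-community/pyannote-segmentation-3.0` checkpoint and the sample audio URL are assumptions for illustration.

```js
import { AutoProcessor, AutoModelForAudioFrameClassification, read_audio } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/pyannote-segmentation-3.0";
const model = await AutoModelForAudioFrameClassification.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);

// Read and preprocess audio at the processor's expected sampling rate
const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav";
const audio = await read_audio(url, processor.feature_extractor.config.sampling_rate);
const inputs = await processor(audio);

// Run the model, then post-process the frame-level logits into speaker segments
const { logits } = await model(inputs);
const segments = processor.post_process_speaker_diarization(logits, audio.length);
console.log(segments); // e.g. [[{ id, start, end, confidence }, ...]]
```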

<h2 id="other-improvements">🛠️ Other improvements</h2>

* Improve unit testing framework in https://github.com/huggingface/transformers.js/pull/1083 and https://github.com/huggingface/transformers.js/pull/1095, bringing coverage up to 91% (from 84%).


**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.1.2...3.2.0

## 3.1.2

### 🤖 New models

* Add support for PaliGemma (& PaliGemma2) in https://github.com/huggingface/transformers.js/pull/1074

**Example:** Image captioning with `onnx-community/paligemma2-3b-ft-docci-448`.
```js
import { AutoProcessor, PaliGemmaForConditionalGeneration, load_image } from '@huggingface/transformers';

// Load processor and model
const model_id = 'onnx-community/paligemma2-3b-ft-docci-448';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id, {
  dtype: {
    embed_tokens: 'fp16', // or 'q8'
    vision_encoder: 'fp16', // or 'q4', 'q8'
    decoder_model_merged: 'q4', // or 'q4f16'
  },
});

// Prepare inputs
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg';
const raw_image = await load_image(url);
const prompt = '<image>caption en'; // Caption the image in English
const inputs = await processor(raw_image, prompt);

// Generate a response
const output = await model.generate({
  ...inputs,
  max_new_tokens: 100,
});

const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(
  generated_ids,
  { skip_special_tokens: true },
);
console.log(answer[0]);
// A side view of a light blue 1970s Volkswagen Beetle parked on a gray cement road. It is facing to the right. It has a reflection on the side of it. Behind it is a yellow building with a brown double door on the right. It has a white frame around it. Part of a gray cement wall is visible on the far left.
```


List of supported models: https://huggingface.co/models?library=transformers.js&other=paligemma

* Add support for I-JEPA in https://github.com/huggingface/transformers.js/pull/1073

**Example:** Image feature extraction with `onnx-community/ijepa_vith14_1k`.

```js
import { pipeline, cos_sim } from "@huggingface/transformers";

// Create an image feature extraction pipeline
const extractor = await pipeline(
  "image-feature-extraction",
  "onnx-community/ijepa_vith14_1k",
  { dtype: "q8" },
);

// Compute image embeddings
const url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg";
const url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg";
const output = await extractor([url_1, url_2]);
const pooled_output = output.mean(1); // Apply mean pooling

// Compute cosine similarity
const similarity = cos_sim(pooled_output[0].data, pooled_output[1].data);
console.log(similarity); // 0.5168613045518973
```

List of supported models: https://huggingface.co/models?library=transformers.js&other=ijepa

* Add support for OLMo2 in https://github.com/huggingface/transformers.js/pull/1076. List of supported models: https://huggingface.co/models?library=transformers.js&other=olmo2
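
No snippet accompanied the OLMo2 addition, so here is a minimal text-generation sketch in the same style as the other examples in these notes. The model id below is a placeholder; pick a real checkpoint from the supported-models link above.

```js
import { pipeline } from "@huggingface/transformers";

// NOTE: placeholder model id; substitute an actual OLMo2 checkpoint from
// https://huggingface.co/models?library=transformers.js&other=olmo2
const generator = await pipeline(
  "text-generation",
  "onnx-community/<olmo2-checkpoint>", // hypothetical
  { dtype: "q4" },
);

const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Summarize what OLMo2 is in one sentence." },
];

const output = await generator(messages, { max_new_tokens: 64 });
console.log(output[0].generated_text.at(-1).content);
```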

### 🐛 Bug fixes
* Fix whisper timestamp extraction for tokenizers with added tokens by aravindMahadevan in https://github.com/huggingface/transformers.js/pull/804 (see the sketch after this list)
* Add missing 'ready' status in the ProgressInfo type by ocavue in https://github.com/huggingface/transformers.js/pull/1070
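
To illustrate both fixes, here is a hedged sketch combining timestamp extraction with a `progress_callback` that observes the loading statuses (including `'ready'`). The `Xenova/whisper-tiny.en` checkpoint and the exact set of status strings are assumptions.

```js
import { pipeline } from "@huggingface/transformers";

// Create an ASR pipeline, logging load progress as it happens.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en", // assumed checkpoint, for illustration only
  {
    progress_callback: (info) => {
      // `info.status` moves through states such as 'initiate', 'progress',
      // 'done', and finally 'ready' once the pipeline is usable.
      if (info.status === "ready") console.log("Pipeline ready");
    },
  },
);

// Request chunk-level timestamps alongside the transcription.
const output = await transcriber(
  "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav",
  { return_timestamps: true },
);
console.log(output.chunks); // [{ timestamp: [start, end], text: '...' }, ...]
```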

### 🛠️ Other improvements
* Add function to apply mask to RawImage by BritishWerewolf in https://github.com/huggingface/transformers.js/pull/1020
* Bump versions + webpack improvements in https://github.com/huggingface/transformers.js/pull/1075

### 🤗 New contributors
* aravindMahadevan made their first contribution in https://github.com/huggingface/transformers.js/pull/804

**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.1.1...3.1.2

## 3.1.1

### 🤖 New models
* Add support for Idefics3 (SmolVLM) in https://github.com/huggingface/transformers.js/pull/1059

```js
import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'
```


### 🐛 Bug fixes
* Fix repetition penalty logits processor in https://github.com/huggingface/transformers.js/pull/1062 (a usage sketch follows this list)
* Fix optional chaining for batch size calculation in PreTrainedModel by emojiiii in https://github.com/huggingface/transformers.js/pull/1063
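
As a usage note (not part of the release itself), the fixed logits processor is exercised through the standard `repetition_penalty` generation option. A minimal sketch, reusing the OLMo checkpoint shown later in these notes purely for illustration:

```js
import { pipeline } from "@huggingface/transformers";

// Reusing a checkpoint from the 3.0.2 notes; any text-generation model works here.
const generator = await pipeline(
  "text-generation",
  "onnx-community/AMD-OLMo-1B-SFT-DPO",
  { dtype: "q4" },
);

const messages = [
  { role: "user", content: "List three uses for a paperclip." },
];

// Values above 1 discourage the model from repeating tokens it has already
// generated; ~1.1-1.3 is a common starting range.
const output = await generator(messages, {
  max_new_tokens: 128,
  repetition_penalty: 1.2,
});
console.log(output[0].generated_text.at(-1).content);
```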


### 📝 Documentation improvements
* Add an example and type enhancement for TextStreamer by seonglae in https://github.com/huggingface/transformers.js/pull/1066 (see the sketch after this list)
* The smallest typo fix for webgpu.md by JoramMillenaar in https://github.com/huggingface/transformers.js/pull/1068
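
For readers looking for the TextStreamer pattern referenced above, here is a minimal sketch mirroring the streamer usage in the Phi-3.5 Vision examples earlier in these notes; the checkpoint choice is illustrative only.

```js
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Any text-generation checkpoint works; this one appears elsewhere in these notes.
const generator = await pipeline(
  "text-generation",
  "onnx-community/AMD-OLMo-1B-SFT-DPO",
  { dtype: "q4" },
);

// Print tokens to the console as they are generated, rather than waiting
// for the full completion.
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

const messages = [{ role: "user", content: "Write a haiku about the sea." }];
await generator(messages, { max_new_tokens: 64, streamer });
```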


### 🛠️ Other improvements
* Only log warning if type not explicitly set to "custom" in https://github.com/huggingface/transformers.js/pull/1061
* Improve browser vs. webworker detection in https://github.com/huggingface/transformers.js/pull/1067


### 🤗 New contributors
* emojiiii made their first contribution in https://github.com/huggingface/transformers.js/pull/1063
* seonglae made their first contribution in https://github.com/huggingface/transformers.js/pull/1066
* JoramMillenaar made their first contribution in https://github.com/huggingface/transformers.js/pull/1068

**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.1.0...3.1.1

## 3.1.0

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

Table of contents:

- [🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer](#new-models)
  - [**Janus**: Any-to-Any generation](#janus)
  - [**Qwen2-VL**: Image-Text-to-Text](#qwen2vl)
  - [**JinaCLIP**: Multimodal embeddings](#jina_clip)
  - [**LLaVA-OneVision**: Image-Text-to-Text](#llava_onevision)
  - [**ViTPose**: Pose-estimation](#vitpose)
  - [**MGP-STR**: Optical Character Recognition (OCR)](#mgp-str)
  - [**PatchTST and PatchTSMixer**: Time series forecasting](#patchtst-and-patchtsmixer)
- [🐛 Bug fixes](#bug-fixes)
- [📝 Documentation improvements](#documentation-improvements)
- [🛠️ Other improvements](#other-improvements)
- [🤗 New contributors](#new-contributors)

<h2 id="new-models">🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.</h2>

<h3 id="janus">Janus for Any-to-Any generation (e.g., image-to-text and text-to-image)</h3>

First of all, this release adds support for Janus, a novel autoregressive framework that unifies multimodal understanding and generation. The most popular model, [deepseek-ai/Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B), is tagged as an "any-to-any" model, and has specifically been trained for image-text-to-text and text-to-image generation, as shown in the examples below.

**Example:** Image-Text-to-Text

```js
import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "<image_placeholder>\nConvert the formula into latex code.",
    images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
  },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 150,
  do_sample: false,
});

// Decode output
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded[0]);
```

Sample output:

```
Sure, here is the LaTeX code for the given formula:

x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}

This code represents the mathematical expression for the variable \( x \).
```

**Example:** Text-to-Image

```js
import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
  },
];
const inputs = await processor(conversation, { chat_template: "text_to_image" });

// Generate response
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
  ...inputs,
  min_new_tokens: num_image_tokens,
  max_new_tokens: num_image_tokens,
  do_sample: true,
});

// Save the generated image
await outputs[0].save("test.png");
```

Sample outputs:

| ![fox_1](https://github.com/user-attachments/assets/c8a4f588-655f-440e-bd55-79d19505edae) | ![fox_2](https://github.com/user-attachments/assets/88b5003a-82de-4ef9-8315-6cb59aee607d) | ![fox_3](https://github.com/user-attachments/assets/f92ed498-4a32-4757-86de-cac37bc8fbf6) | ![fox_4](https://github.com/user-attachments/assets/51b9d0a6-c737-499d-983e-d89ff023282d) |
|---|---|---|---|
| ![fox_5](https://github.com/user-attachments/assets/8876ebb0-fea2-4443-b458-fdd6c035a69f) | ![fox_6](https://github.com/user-attachments/assets/1989f128-5fd4-4b0c-83b4-dc5f33b388c2) | ![fox_7](https://github.com/user-attachments/assets/1fa9ac58-ca14-4ee3-84ca-47e69de2589c) | ![fox_8](https://github.com/user-attachments/assets/20a20642-a336-4277-9056-f45d7ddb3bbe) |

Want to play around with the model? Check out our [online WebGPU demo](https://huggingface.co/spaces/webml-community/Janus-1.3B-WebGPU)! 👇

https://github.com/user-attachments/assets/513b3119-ba8c-4a2d-b5fe-6869be47abfa

<h3 id="qwen2vl">Qwen2-VL for Image-Text-to-Text</h3>

Next, we added support for Qwen2-VL, the multimodal large language model series developed by the Qwen team at Alibaba Cloud. It introduces the Naive Dynamic Resolution mechanism, which allows the model to process images of varying resolutions, leading to more efficient and accurate visual representations.

**Example:** Image-Text-to-Text

```js
import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Prepare inputs
const url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg";
const image = await (await RawImage.read(url)).resize(448, 448);
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const text = processor.apply_chat_template(conversation, { add_generation_prompt: true });
const inputs = await processor(text, image);

// Perform inference
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
// The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt, and appears to be engaged in a playful interaction with the dog. The dog, which is a large breed, is sitting on its hind legs and appears to be reaching out to the woman, possibly to give her a high-five or a paw. The background shows the ocean with gentle waves, and the sky is clear, suggesting it might be either sunrise or sunset. The overall atmosphere is calm and relaxed, capturing a moment of connection between the woman and the dog.
```

<h3 id="jina_clip">JinaCLIP for multimodal embeddings</h3>

JinaCLIP is a series of general-purpose multilingual multimodal embedding models for text & images, created by Jina AI.

**Example:** Compute text and/or image embeddings with `jinaai/jina-clip-v2`:
```js
import { AutoModel, AutoProcessor, RawImage, matmul } from "@huggingface/transformers";

// Load processor and model
const model_id = "jinaai/jina-clip-v2";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id, { dtype: "q4" /* e.g., "fp16", "q8", or "q4" */ });

// Prepare inputs
const urls = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const sentences = [
  "غروب جميل على الشاطئ", // Arabic
  "海滩上美丽的日落", // Chinese
  "Un beau coucher de soleil sur la plage", // French
  "Ein wunderschöner Sonnenuntergang am Strand", // German
  "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", // Greek
  "समुद्र तट पर एक खूबसूरत सूर्यास्त", // Hindi
  "Un bellissimo tramonto sulla spiaggia", // Italian
  "浜辺に沈む美しい夕日", // Japanese
  "해변 위로 아름다운 일몰", // Korean
];

// Encode text and images
const inputs = await processor(sentences, images, { padding: true, truncation: true });
const { l2norm_text_embeddings, l2norm_image_embeddings } = await model(inputs);

// Encode query (text-only)
const query_prefix = "Represent the query for retrieving evidence documents: ";
const query_inputs = await processor(query_prefix + "beautiful sunset over the beach");
const { l2norm_text_embeddings: query_embeddings } = await model(query_inputs);

// Compute text-image similarity scores
const text_to_image_scores = await matmul(query_embeddings, l2norm_image_embeddings.transpose(1, 0));
```

## 3.0.2

### What's new?
* Add support for MobileLLM in https://github.com/huggingface/transformers.js/pull/1003

**Example:** Text generation with `onnx-community/MobileLLM-125M`.

```js
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-125M",
  { dtype: "fp32" },
);

// Define the prompt
const text = "Q: What is the capital of France?\nA: Paris\nQ: What is the capital of England?\nA:";

// Generate a response
const output = await generator(text, { max_new_tokens: 30 });
console.log(output[0].generated_text);
```

<details>

<summary>Example output</summary>


```
Q: What is the capital of France?
A: Paris
Q: What is the capital of England?
A: London
Q: What is the capital of Scotland?
A: Edinburgh
Q: What is the capital of Wales?
A: Cardiff
```

</details>



* Add support for OLMo in https://github.com/huggingface/transformers.js/pull/1011

**Example:** Text generation with `onnx-community/AMD-OLMo-1B-SFT-DPO`.

```js
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/AMD-OLMo-1B-SFT-DPO",
  { dtype: "q4" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```


<details>

<summary>Example output</summary>


Why don't scientists trust atoms?

Because they make up everything!

</details>

* Fix CommonJS bundling in https://github.com/huggingface/transformers.js/pull/1012. Thanks jens-ghc for reporting!
* Doc fixes by roschler in https://github.com/huggingface/transformers.js/pull/1002
* Remove duplicate `gemma` value from `NO_PER_CHANNEL_REDUCE_RANGE_MODEL` by bekzod in https://github.com/huggingface/transformers.js/pull/1005

### 🤗 New contributors
* roschler made their first contribution in https://github.com/huggingface/transformers.js/pull/1002
* bekzod made their first contribution in https://github.com/huggingface/transformers.js/pull/1005

**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.0.1...3.0.2
