Transformers-js-py

Latest version: v0.19.4

20.166624069213867

// ...
// ],
// size: 201028
// },
// past_key_values: { ... }
// }

Examples for computing perplexity: https://github.com/xenova/transformers.js/issues/137#issuecomment-1595496161
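
As a rough illustration of what such a computation involves, perplexity is the exponential of the average negative log-likelihood of the target tokens. The sketch below is not the code from the linked comment: the helper names `logSoftmax` and `perplexity` and the toy logits are assumptions made for illustration, and plain arrays stand in for model outputs.

```js
// Numerically stable log-softmax over one row of logits.
function logSoftmax(logits) {
  const max = logits.reduce((m, x) => Math.max(m, x), -Infinity);
  const logSumExp = max + Math.log(logits.reduce((s, x) => s + Math.exp(x - max), 0));
  return logits.map((x) => x - logSumExp);
}

// Perplexity = exp(mean negative log-likelihood), using natural logs.
// `tokenLogProbs` (hypothetical input) holds one log-probability per target token.
function perplexity(tokenLogProbs) {
  if (tokenLogProbs.length === 0) {
    throw new Error("Need at least one token log-probability");
  }
  const meanNll = -tokenLogProbs.reduce((sum, lp) => sum + lp, 0) / tokenLogProbs.length;
  return Math.exp(meanNll);
}

// Toy data (made up): two positions over a 3-token vocabulary.
const toyLogits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]];
const toyTargets = [0, 1]; // index of the correct token at each position
const toyLogProbs = toyLogits.map((row, i) => logSoftmax(row)[toyTargets[i]]);
console.log(perplexity(toyLogProbs));
```

With a real model, the per-position logits would come from a forward pass, shifted by one position so that each row predicts the next token.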

More accurate quantization parameters for whisper models

We've updated the quantization parameters used for the pre-converted whisper models on the [hub](https://huggingface.co/models?library=transformers.js&other=whisper). You can test them out with [whisper web](https://huggingface.co/spaces/Xenova/whisper-web)! Thanks to jozefchutka for [reporting](https://github.com/xenova/transformers.js/issues/156) this issue.

![image](https://github.com/xenova/transformers.js/assets/26504141/d5ab1372-2589-46c7-8179-0cc289f663b0)

Misc bug fixes and improvements
* Do not use spread operator to concatenate large arrays (https://github.com/xenova/transformers.js/pull/154)
* Set chunk timestamp to rounded time by PushpenderSaini0 (https://github.com/xenova/transformers.js/pull/160)
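
For context on the first item: spreading an array into a call (for example `target.push(...source)`) passes every element as a separate argument, which can throw a `RangeError` once the array grows beyond the engine's argument limit. The sketch below shows one common alternative, copying chunks into a preallocated typed array. It is an illustration only, not the change made in the linked PR, and `concatFloat32` is a hypothetical helper name.

```js
// Concatenate Float32Array chunks without spreading them into function arguments.
function concatFloat32(chunks) {
  // Allocate the output once, sized to hold every chunk.
  const total = chunks.reduce((n, chunk) => n + chunk.length, 0);
  const out = new Float32Array(total);

  // Copy each chunk to its offset in the output.
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Two large chunks, e.g. audio samples.
const first = new Float32Array(300000).fill(1);
const second = new Float32Array(300000).fill(2);
const merged = concatFloat32([first, second]);
console.log(merged.length); // 600000
```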

13.5

// ]
// }

*Note:* For now, you need to choose the `output_attentions` revision (see above). In the future, we may merge these models into the main branch. We also do not yet have exports for the medium and large models, simply because I don't have enough RAM to do the export myself (>25GB needed) 😅. If you would like to use our [conversion script](https://huggingface.co/docs/transformers.js/custom_usage#convert-your-models-to-onnx) to do the conversion yourself, please make a PR on the hub with these new models (under a new `output_attentions` branch)!

From our testing, the JS implementation exactly matches the output produced by the Python implementation (when using the same model of course)! 🥳

![image](https://github.com/xenova/transformers.js/assets/26504141/5389443f-3d6a-4edd-99f4-8440120ad97d)

Python (left) vs. JavaScript (right)

<details>
<summary>surprise me</summary>
<br>

![image](https://github.com/xenova/transformers.js/assets/26504141/8ec87dcc-303d-461d-838c-adef920d446a)

</details>

I'm excited to see what you all build with this! Please tag me on [twitter](https://twitter.com/xenovacom) if you use it in your project - I'd love to see! I'm also planning on adding this as an option to [whisper-web](https://github.com/xenova/whisper-web), so stay tuned! 🚀

Misc bug fixes and improvements
* Fix loading of grayscale images in node.js (178)

3.1.0

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

Table of contents:

- [🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.](#new-models)
  - [**Janus**: Any-to-Any generation](#janus)
  - [**Qwen2-VL**: Image-Text-to-Text](#qwen2vl)
  - [**JinaCLIP**: Multimodal embeddings](#jina_clip)
  - [**LLaVA-OneVision**: Image-Text-to-Text](#llava_onevision)
  - [**ViTPose**: Pose-estimation](#vitpose)
  - [**MGP-STR**: Optical Character Recognition (OCR)](#mgp-str)
  - [**PatchTST and PatchTSMixer**: Time series forecasting.](#patchtst-and-patchtsmixer)
- [🐛 Bug fixes](#bug-fixes)
- [📝 Documentation improvements](#documentation-improvements)
- [🛠️ Other improvements](#other-improvements)
- [🤗 New contributors](#new-contributors)

<h2 id="new-models">🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.</h2>

<h3 id="janus">Janus for Any-to-Any generation (e.g., image-to-text and text-to-image)</h3>

First of all, this release adds support for Janus, a novel autoregressive framework that unifies multimodal understanding and generation. The most popular model, [deepseek-ai/Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B), is tagged as an "any-to-any" model, and has specifically been trained for the following tasks:

**Example:** Image-Text-to-Text

```js
import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "<image_placeholder>\nConvert the formula into latex code.",
    images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
  },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 150,
  do_sample: false,
});

// Decode output (skip the prompt tokens and keep only the newly generated ones)
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded[0]);
```

Sample output:

```
Sure, here is the LaTeX code for the given formula:

x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}

This code represents the mathematical expression for the variable \( x \).
```

**Example:** Text-to-Image

```js
import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
  },
];
const inputs = await processor(conversation, { chat_template: "text_to_image" });

// Generate images (min and max are both set to the number of image tokens)
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
  ...inputs,
  min_new_tokens: num_image_tokens,
  max_new_tokens: num_image_tokens,
  do_sample: true,
});

// Save the generated image
await outputs[0].save("test.png");
```

Sample outputs:

| ![fox_1](https://github.com/user-attachments/assets/c8a4f588-655f-440e-bd55-79d19505edae) | ![fox_2](https://github.com/user-attachments/assets/88b5003a-82de-4ef9-8315-6cb59aee607d) | ![fox_3](https://github.com/user-attachments/assets/f92ed498-4a32-4757-86de-cac37bc8fbf6) | ![fox_4](https://github.com/user-attachments/assets/51b9d0a6-c737-499d-983e-d89ff023282d) |
|---|---|---|---|
| ![fox_5](https://github.com/user-attachments/assets/8876ebb0-fea2-4443-b458-fdd6c035a69f) | ![fox_6](https://github.com/user-attachments/assets/1989f128-5fd4-4b0c-83b4-dc5f33b388c2) | ![fox_7](https://github.com/user-attachments/assets/1fa9ac58-ca14-4ee3-84ca-47e69de2589c) | ![fox_8](https://github.com/user-attachments/assets/20a20642-a336-4277-9056-f45d7ddb3bbe) |

Want to play around with the model? Check out our [online WebGPU demo](https://huggingface.co/spaces/webml-community/Janus-1.3B-WebGPU)! 👇

https://github.com/user-attachments/assets/513b3119-ba8c-4a2d-b5fe-6869be47abfa

<h3 id="qwen2vl">Qwen2-VL for Image-Text-to-Text</h3>

Next, we added support for Qwen2-VL, the multimodal large language model series developed by the Qwen team at Alibaba Cloud. It introduces the Naive Dynamic Resolution mechanism, which lets the model process images of varying resolutions and leads to more efficient and accurate visual representations.

**Example:** Image-Text-to-Text

```js
import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Prepare inputs
const url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg";
const image = await (await RawImage.read(url)).resize(448, 448);
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const text = processor.apply_chat_template(conversation, { add_generation_prompt: true });
const inputs = await processor(text, image);

// Perform inference
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode output (skip the prompt tokens and keep only the newly generated ones)
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
// The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt, and appears to be engaged in a playful interaction with the dog. The dog, which is a large breed, is sitting on its hind legs and appears to be reaching out to the woman, possibly to give her a high-five or a paw. The background shows the ocean with gentle waves, and the sky is clear, suggesting it might be either sunrise or sunset. The overall atmosphere is calm and relaxed, capturing a moment of connection between the woman and the dog.
```
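
The decoding step in the examples above calls `outputs.slice(null, [inputs.input_ids.dims.at(-1), null])`, which appears to keep every row of the batch while dropping the leading prompt tokens, so that only newly generated tokens are decoded. The sketch below mirrors that idea with plain nested arrays instead of tensors. `stripPrompt` and the token ids are made up for illustration.

```js
// Drop the first `promptLength` token ids from every sequence in the batch.
// `sequences` is a hypothetical stand-in for a [batch, sequence] tensor of token ids.
function stripPrompt(sequences, promptLength) {
  return sequences.map((seq) => seq.slice(promptLength));
}

// Each row is prompt (3 tokens) followed by generated tokens (3 tokens).
const promptLength = 3;
const generated = [
  [101, 7, 8, 42, 43, 44],
  [101, 9, 10, 50, 51, 52],
];
console.log(stripPrompt(generated, promptLength)); // [ [ 42, 43, 44 ], [ 50, 51, 52 ] ]
```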

<h3 id="jina_clip">JinaCLIP for multimodal embeddings</h3>

JinaCLIP is a series of general-purpose multilingual multimodal embedding models for text & images, created by Jina AI.

**Example:** Compute text and/or image embeddings with `jinaai/jina-clip-v2`:
```js
import { AutoModel, AutoProcessor, RawImage, matmul } from "@huggingface/transformers";

// Load processor and model
const model_id = "jinaai/jina-clip-v2";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id, { dtype: "q4" /* e.g., "fp16", "q8", or "q4" */ });

// Prepare inputs
const urls = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const sentences = [
  "غروب جميل على الشاطئ", // Arabic
  "海滩上美丽的日落", // Chinese
  "Un beau coucher de soleil sur la plage", // French
  "Ein wunderschöner Sonnenuntergang am Strand", // German
  "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", // Greek
  "समुद्र तट पर एक खूबसूरत सूर्यास्त", // Hindi
  "Un bellissimo tramonto sulla spiaggia", // Italian
  "浜辺に沈む美しい夕日", // Japanese
  "해변 위로 아름다운 일몰", // Korean
];

// Encode text and images
const inputs = await processor(sentences, images, { padding: true, truncation: true });
const { l2norm_text_embeddings, l2norm_image_embeddings } = await model(inputs);

// Encode query (text-only)
const query_prefix = "Represent the query for retrieving evidence documents: ";
const query_inputs = await processor(query_prefix + "beautiful sunset over the beach");
const { l2norm_text_embeddings: query_embeddings } = await model(query_inputs);

// Compute text-image similarity scores
const text_to_image_scores = await matmul(query_embeddings, l2norm_image_embeddings.transpose(1, 0));
```
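
Because the embeddings are already L2-normalized, multiplying the query matrix by the transposed image matrix yields cosine similarities. The plain-array sketch below illustrates that relationship on toy vectors. `l2Normalize` and `similarityScores` are hypothetical helpers, not part of the library.

```js
// Scale a vector to unit length.
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return vec.map((x) => x / norm);
}

// queries: [numQueries][dim], items: [numItems][dim] -> scores: [numQueries][numItems].
// Equivalent to queries x transpose(items); with unit-length rows, each score is a cosine similarity.
function similarityScores(queries, items) {
  return queries.map((query) =>
    items.map((item) => query.reduce((sum, x, i) => sum + x * item[i], 0)),
  );
}

// Toy 2-D "embeddings" (made-up values).
const queryVectors = [l2Normalize([1, 0])];
const imageVectors = [l2Normalize([2, 0]), l2Normalize([0, 3])];
console.log(similarityScores(queryVectors, imageVectors)); // [ [ 1, 0 ] ]
```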

3.0.2

What's new?
* Add support for MobileLLM in https://github.com/huggingface/transformers.js/pull/1003

**Example:** Text generation with `onnx-community/MobileLLM-125M`.

```js
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-125M",
  { dtype: "fp32" },
);

// Define the few-shot prompt
const text = "Q: What is the capital of France?\nA: Paris\nQ: What is the capital of England?\nA:";

// Generate a response
const output = await generator(text, { max_new_tokens: 30 });
console.log(output[0].generated_text);
```

<details>

<summary>Example output</summary>

Q: What is the capital of France?
A: Paris
Q: What is the capital of England?
A: London
Q: What is the capital of Scotland?
A: Edinburgh
Q: What is the capital of Wales?
A: Cardiff

</details>

* Add support for OLMo in https://github.com/huggingface/transformers.js/pull/1011

**Example:** Text generation with `onnx-community/AMD-OLMo-1B-SFT-DPO`.

```js
import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/AMD-OLMo-1B-SFT-DPO",
  { dtype: "q4" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a joke." },
];

// Generate a response (the last message in `generated_text` is the model's reply)
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```

<details>

<summary>Example output</summary>

Why don't scientists trust atoms?

Because they make up everything!

</details>

* Fix CommonJS bundling in https://github.com/huggingface/transformers.js/pull/1012. Thanks jens-ghc for reporting!
* Doc fixes by roschler in https://github.com/huggingface/transformers.js/pull/1002
* Remove duplicate `gemma` value from `NO_PER_CHANNEL_REDUCE_RANGE_MODEL` by bekzod in https://github.com/huggingface/transformers.js/pull/1005
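
In the OLMo chat example above, the reply is read with `output[0].generated_text.at(-1).content`, which suggests that `generated_text` holds the whole conversation with the model's reply appended last. The sketch below exercises that access pattern on mocked data. The `mockOutput` object, the `"assistant"` role label, and the `lastReply` helper are assumptions for illustration, not output captured from the library.

```js
// Mocked result (hypothetical), shaped like the chat output read in the example above.
const mockOutput = [
  {
    generated_text: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Tell me a joke." },
      { role: "assistant", content: "Why don't scientists trust atoms?\n\nBecause they make up everything!" },
    ],
  },
];

// Return the content of the last message, or null if there is none.
function lastReply(output) {
  const messagesOut = output[0]?.generated_text ?? [];
  const last = messagesOut.at(-1);
  return last ? last.content : null;
}

console.log(lastReply(mockOutput));
```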

🤗 New contributors
* roschler made their first contribution in https://github.com/huggingface/transformers.js/pull/1002
* bekzod made their first contribution in https://github.com/huggingface/transformers.js/pull/1005

**Full Changelog**: https://github.com/huggingface/transformers.js/compare/3.0.1...3.0.2
