## What's new?
### 💬 Chat templates!
This release adds support for **chat templates**, a highly-requested feature that enables users to convert conversations (represented as a list of chat objects) into a single tokenizable string in the format that the model expects. As you may know, chat templates can vary greatly across model types, so it was important to design a system that: (1) supports complex chat templates; (2) is generalizable; and (3) is easy to use. So, how did we do it? 🤔
This is made possible with [`@huggingface/jinja`](https://www.npmjs.com/package/@huggingface/jinja), a minimalistic JavaScript implementation of the Jinja templating engine that we created to align with how [transformers](https://github.com/huggingface/transformers) handles templating. Although it was originally designed for parsing and rendering ChatML templates, we decided to separate out the templating logic into an external (optional) library due to its usefulness in other types of applications. Special thanks to [@tlaceby](https://github.com/tlaceby) for his amazing ["Guide to Interpreters"](https://github.com/tlaceby/guide-to-interpreters-series) series, which provided the basis for our implementation. 🤗
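To give a feel for the engine on its own, here's a minimal sketch that uses [`@huggingface/jinja`](https://www.npmjs.com/package/@huggingface/jinja) directly. The `Template` class and its `render` method are the package's core API; the template string itself is just an illustration of ours, not a real model's chat template:

```js
import { Template } from "@huggingface/jinja";

// A toy chat-style template (illustrative only, not a real model's template)
const template = new Template(
  "{% for message in messages %}{{ message.role }}: {{ message.content }}\n{% endfor %}"
);

const rendered = template.render({
  messages: [
    { role: "user", content: "Hello!" },
    { role: "assistant", content: "Hi! How can I help?" },
  ],
});
// "user: Hello!\nassistant: Hi! How can I help?\n"
```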
Anyway, let's take a look at an example:
```js
import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
];

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
```
Notice how the entire chat is condensed into a single string. If you would instead like to return the tokenized version (i.e., a list of token IDs), you can use the following:
```js
const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]
```
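As a quick sanity check (our own example, not part of the release), you can pass those IDs to the tokenizer's `decode` method to recover the templated string:

```js
// Decode the token IDs back into text (keeping special tokens like <s> and [INST])
const decoded = tokenizer.decode(input_ids, { skip_special_tokens: false });
// The result should closely match the templated string shown above
```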
For more information about chat templates, check out the [transformers documentation](https://huggingface.co/docs/transformers/main/en/chat_templating).
### 🐛 Bug fixes
- Fixed incorrect encoding/decoding of whitespace around special characters with Fast Llama tokenizers (these bugs will also soon be fixed in the transformers library). For backwards compatibility, if a tokenizer was exported with the legacy behaviour, it will continue to act the same way unless explicitly set otherwise; newer exports are unaffected. If you wish to override this default, either to keep the legacy behaviour or to upgrade to the fixed version, you can do so as shown below:
```js
// Use the default behaviour (specified in tokenizer_config.json, which in this case is `{ legacy: false }`).
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer');
const { input_ids } = tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
console.log(input_ids); // [1, 13]
```

```js
// Use the legacy behaviour
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer', { legacy: true });
const { input_ids } = tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
console.log(input_ids); // [1, 29871, 13]
```
- Strip whitespace around special tokens for Wav2Vec2 tokenizers.
### 🔨 Improvements
- More comprehensive tokenizer test suite, including both static and dynamic tokenizer tests for encoding, decoding, and chat templates.
**Full Changelog**: https://github.com/xenova/transformers.js/compare/2.11.0...2.12.0