Hi,
We are very pleased to announce the [0.12.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.12.0) version of TensorRT-LLM. This update includes:
## Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for the NVIDIA Ada Lovelace architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class (a usage sketch follows this list).
- Supported FP8 out-of-the-box (OOTB) MoE.
- Supported StarCoder2 SmoothQuant. (#1886)
- Supported ReDrafter Speculative Decoding, see “ReDrafter” section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from Altair-Alpha in #1834.
- Added in-flight batching support for the GLM-10B model.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from ttim in #1897.
- Added the `chunk_length` parameter to Whisper, thanks to the contribution from MahmoudAshraf97 in #1909.
- Added a `concurrency` argument to `gptManagerBenchmark`.
- The Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths` (a second sketch follows this list).
- Added the flag `--fast_build` to the `trtllm-build` command (experimental).
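
As a quick illustration of the `LLM` class item above, here is a minimal sketch. The model name, prompts, and sampling values are placeholders, and the exact import path has shifted between releases, so please treat the LLM API documentation and the `examples/` directory as authoritative.

```python
# Minimal sketch (placeholders, not an official example): serving one of the
# newly supported model families through the high-level LLM class.
from tensorrt_llm import LLM, SamplingParams

# Build or load an engine directly from a Hugging Face checkpoint
# (the model name here is only an example).
llm = LLM(model="Qwen/Qwen2-7B-Instruct")

prompts = ["What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns one output object per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```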
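
Similarly, here is a rougher sketch of per-request beam widths through the Python executor bindings. The class and parameter names below are recalled from memory and may not match this release exactly; the executor documentation linked above and `examples/bindings` are the source of truth.

```python
# Rough sketch (names may differ by release): two requests with different
# beam widths enqueued on the same executor instance.
import tensorrt_llm.bindings.executor as trtllm

# The engine must be built with a max beam width at least as large as the
# largest per-request beam width used below.
executor = trtllm.Executor(
    "/path/to/engine_dir",               # placeholder engine directory
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(4),            # executor-wide max beam width
)

# Each request carries its own beam width via its sampling config.
req_greedy = trtllm.Request([1, 2, 3, 4], 16,
                            sampling_config=trtllm.SamplingConfig(beam_width=1))
req_beam = trtllm.Request([5, 6, 7, 8], 16,
                          sampling_config=trtllm.SamplingConfig(beam_width=4))

for rid in (executor.enqueue_request(req_greedy), executor.enqueue_request(req_beam)):
    for response in executor.await_responses(rid):
        print(response.result.output_token_ids)
```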
## API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; to limit the sequence length at the engine build stage, specify `max_seq_len` instead (see the build-config sketch after this list).
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from the build stage (`trtllm-build` command and builder API) to the runtime.
- [BREAKING CHANGE] The build-time argument `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- [BREAKING CHANGE] The arguments `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file will be generated.
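
One note on the `max_seq_len` change above: when building through the Python APIs instead of the `trtllm-build` CLI, the same cap is expressed on the build configuration. The sketch below assumes the top-level `BuildConfig` export and the `build_config` argument of the `LLM` class; double-check the field names against the builder documentation for your version.

```python
# Sketch only: max_seq_len bounds the total sequence length (prompt plus
# generated tokens); it replaces the removed max_output_len knob.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_batch_size=8,
    max_input_len=1024,
    max_seq_len=2048,   # cap on input + output tokens per sequence
)

# The LLM class forwards the build configuration to the engine builder
# (the model name is a placeholder).
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", build_config=build_config)
```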
## Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see “LLaVA, LLaVa-NeXT and VILA” section in `examples/multimodal/README.md`.
## Fixed Issues
- Fixed the wrong pad token for the CodeQwen models. (#1953)
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from saeyoonoh in #1987.
- Removed duplicated flags in the command at `docs/source/reference/troubleshooting.md`, thanks to the contribution from hattizai in #1937.
- Fixed a segmentation fault in the TopP sampling layer, thanks to the contribution from akhoroshev in #2039. (#2040)
- Fixed the failure when converting the checkpoint for the Mistral Nemo model. (#1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from fjosw in #2056.
- Fixed incorrect links in the README, thanks to the contribution from Tayef-Shah in #2028.
- Fixed some typos in the documentation, thanks to the contribution from lfz941 in #1939.
- Fixed the engine build failure when the deduced `max_seq_len` is not an integer. (#2018)
## Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.
## Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See [Installing on Windows](https://nvidia.github.io/TensorRT-LLM/installation/windows.html) for workarounds.
Currently, there are two key branches in the project:
* The [rel](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
* The [main](https://github.com/NVIDIA/TensorRT-LLM/tree/main) branch is the dev branch. It is more experimental.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team