TensorRT-LLM

Latest version: v0.14.0


0.14.0

Hi,

We are very pleased to announce the [0.14.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.14.0) version of TensorRT-LLM. This update includes:

Key Features and Enhancements
- Enhanced the `LLM` class in the [LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/index.html) (see the usage sketch after this list).
- Added support for calibration with offline dataset.
- Added support for Mamba2.
- Added support for `finish_reason` and `stop_reason`.
- Added FP8 support for CodeLlama.
- Added `__repr__` methods for class `Module`, thanks to the contribution from 1ytic in 2191.
- Added BFloat16 support for fused gated MLP.
- Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- Draft model now can copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
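
A minimal usage sketch tying the `LLM` API enhancements to the new `finish_reason`/`stop_reason` fields is shown below. The model name is illustrative, and the exact attribute names on the per-sequence outputs are an assumption based on these notes, so check the LLM API reference for the authoritative spelling:

```python
from tensorrt_llm import LLM, SamplingParams

# Build (or load) an engine for a Hugging Face model and run generation.
# The model name and sampling values are illustrative only.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["Explain KV-cache reuse in one sentence."], params):
    completion = output.outputs[0]
    print(completion.text)
    # New in 0.14: why generation stopped ("stop", "length", ...) and which
    # stop condition (if any) triggered it.
    print(completion.finish_reason, completion.stop_reason)
```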

API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command.
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added `isParticipant` method to the C++ `Executor` API to check if the current process is a participant in the executor instance.

Model Updates
- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md`.

Fixed Issues
- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from wangkuiyi in 2152.
- Fixed duplicated import module in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from lkm2835 in 2182.
- Enabled `share_embedding` for the models that have no `lm_head` in legacy checkpoint conversion path, thanks to the contribution from lkm2835 in 2232.
- Fixed `kv_cache_type` issue in the Python benchmark, thanks to the contribution from qingquansong in 2219.
- Fixed an issue with SmoothQuant calibration with custom datasets. Thanks to the contribution by Bhuvanesh09 in 2243.
- Fixed an issue surrounding `trtllm-build --fast-build` with fake or random weights. Thanks to ZJLi2013 for flagging it in 2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from dict, thanks for the fix from ethnzhng in 2081.
- Fixed lookahead batch layout for `numNewTokensCumSum`. (2263)

Infrastructure Changes
- The dependent ModelOpt version is updated to v0.17.

Documentation
- Sherlock113 added a [tech blog](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml) to the latest news in #2169, thanks for the contribution.

Known Issues
- Replit Code is not supported with transformers 4.45 and later.


We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

0.13.0

Hi,

We are very pleased to announce the [0.13.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0) version of TensorRT-LLM. This update includes:

Key Features and Enhancements
- Supported lookahead decoding (experimental), see `docs/source/speculative_decoding.md`.
- Added some enhancements to the `ModelWeightsLoader` (a unified checkpoint converter, see `docs/source/architecture/model-weights-loader.md`).
  - Supported Qwen models.
  - Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
  - Improved loading performance for `*.bin` and `*.pth` checkpoints.
- Supported OpenAI Whisper in C++ runtime.
- Added some enhancements to the `LLM` class.
  - Supported LoRA.
  - Supported engine building using dummy weights.
  - Supported `trust_remote_code` for customized models and tokenizers downloaded from the Hugging Face Hub (see the sketch after this list).
- Supported beam search for streaming mode.
- Supported tensor parallelism for Mamba2.
- Supported returning generation logits for streaming mode.
- Added `curand` and `bfloat16` support for `ReDrafter`.
- Added sparse mixer normalization mode for MoE models.
- Added support for QKV scaling in FP8 FMHA.
- Supported FP8 for MoE LoRA.
- Supported KV cache reuse for P-Tuning and LoRA.
- Supported in-flight batching for CogVLM models.
- Supported LoRA for the `ModelRunnerCpp` class.
- Supported `head_size=48` cases for FMHA kernels.
- Added FP8 examples for DiT models, see `examples/dit/README.md`.
- Supported decoder with encoder input features for the C++ `executor` API.
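
For models whose Hugging Face repositories ship custom modeling or tokenizer code, the new flag can be forwarded when constructing the `LLM` object. A minimal sketch, assuming the keyword follows the Hugging Face `transformers` naming (`trust_remote_code`) and using a hypothetical repository name:

```python
from tensorrt_llm import LLM

# Hypothetical Hub repository that ships custom modeling/tokenizer code; the
# flag is forwarded when the model and tokenizer are downloaded, mirroring
# the `transformers` convention.
llm = LLM(model="some-org/model-with-custom-code", trust_remote_code=True)
print(llm.generate(["Hello, world"])[0].outputs[0].text)
```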

API Changes
- [BREAKING CHANGE] Set `use_fused_mlp` to `True` by default.
- [BREAKING CHANGE] Enabled `multi_block_mode` by default.
- [BREAKING CHANGE] Enabled `strongly_typed` by default in `builder` API.
- [BREAKING CHANGE] Renamed `maxNewTokens`, `randomSeed` and `minLength` to `maxTokens`, `seed` and `minTokens`, following the OpenAI style (see the sketch after this list).
- The `LLM` class
  - [BREAKING CHANGE] Updated `LLM.generate` arguments to include `PromptInputs` and `tqdm`.
- The C++ `executor` API
  - [BREAKING CHANGE] Added `LogitsPostProcessorConfig`.
  - Added `FinishReason` to `Result`.
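
On the Python side, the OpenAI-style names surface in `SamplingParams`. The sketch below is a minimal illustration, assuming the top-level `tensorrt_llm` imports and an illustrative model name:

```python
from tensorrt_llm import LLM, SamplingParams

# OpenAI-style names: `max_tokens` and `seed` replace the older
# maxNewTokens/randomSeed spellings (minLength likewise becomes minTokens in
# the C++ executor API).
params = SamplingParams(max_tokens=32, seed=42)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```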

Model Updates
- Supported Gemma 2, see "Run Gemma 2" section in `examples/gemma/README.md`.

Fixed Issues
- Fixed an accuracy issue when enabling remove padding for cross attention. (1999)
- Fixed the failure in converting qwen2-0.5b-instruct when using `smoothquant`. (2087)
- Matched the `exclude_modules` pattern in `convert_utils.py` to the changes in `quantize.py`. (2113)
- Fixed build engine error when `FORCE_NCCL_ALL_REDUCE_STRATEGY` is set.
- Fixed unexpected truncation in the quant mode of `gpt_attention`.
- Fixed the hang caused by race condition when canceling requests.
- Fixed the default factory for `LoraConfig`. (1323)

Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.4.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.

We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

0.12.0

Hi,

We are very pleased to announce the [0.12.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.12.0) version of TensorRT-LLM. This update includes:

Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for NVIDIA Ada Lovelace Architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class.
- Supported FP8 OOTB MoE.
- Supported Starcoder2 SmoothQuant. (1886)
- Supported ReDrafter Speculative Decoding, see “ReDrafter” section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from Altair-Alpha in 1834.
- Added in-flight batching support for GLM 10B model.
- Supported `gelu_pytorch_tanh` activation function, thanks to the contribution from ttim in 1897.
- Added `chunk_length` parameter to Whisper, thanks to the contribution from MahmoudAshraf97 in 1909.
- Added `concurrency` argument for `gptManagerBenchmark`.
- Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added the flag `--fast_build` to `trtllm-build` command (experimental).

API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; to limit the sequence length at engine build time, specify `max_seq_len` instead (see the sketch after this list).
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from build stage (`trtllm-build` and builder API) to the runtime.
- [BREAKING CHANGE] The build time argument `context_fmha_fp32_acc` is moved to runtime for decoder models.
- [BREAKING CHANGE] The arguments `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file is now generated.
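
The same limit is available programmatically through the LLM API. A minimal sketch, assuming `tensorrt_llm.BuildConfig` exposes `max_seq_len`/`max_batch_size` and that the `LLM` constructor accepts a `build_config` keyword; the model name and engine path are illustrative:

```python
from tensorrt_llm import LLM, BuildConfig

# Cap the total sequence length (prompt + generated tokens) at build time;
# `max_output_len` is gone and `max_seq_len` is the single knob for this.
build_config = BuildConfig(max_seq_len=4096, max_batch_size=64)

llm = LLM(model="meta-llama/Llama-2-7b-hf", build_config=build_config)
llm.save("./llama2-7b-engine")  # persist the engine built with these limits
```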

Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see “LLaVA, LLaVa-NeXT and VILA” section in `examples/multimodal/README.md`.

Fixed Issues
- Fixed wrong pad token for the CodeQwen models. (1953)
- Fixed typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from saeyoonoh in 1987.
- Removed duplicated flags in the command at `docs/source/reference/troubleshooting.md`, thanks to the contribution from hattizai in 1937.
- Fixed segmentation fault in TopP sampling layer, thanks to the contribution from akhoroshev in 2039. (2040)
- Fixed the failure when converting the checkpoint for Mistral Nemo model. (1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from fjosw in 2056.
- Fixed wrong links in README, thanks to the contribution from Tayef-Shah in 2028.
- Fixed some typos in the documentation, thanks to the contribution from lfz941 in 1939.
- Fixed the engine build failure when deduced `max_seq_len` is not an integer. (2018)

Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.

Known Issues

- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See [Installing on Windows](https://nvidia.github.io/TensorRT-LLM/installation/windows.html) for workarounds.


Currently, there are two key branches in the project:
* The [rel](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
* The [main](https://github.com/NVIDIA/TensorRT-LLM/tree/main) branch is the dev branch. It is more experimental.

We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

0.11.0

Hi,

We are very pleased to announce the [0.11.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.11.0) version of TensorRT-LLM. This update includes:

Key Features and Enhancements
- Supported very long context for LLaMA (see “Long context evaluation” section in `examples/llama/README.md`).
- Low latency optimization
  - Added a reduce-norm feature that fuses the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; enabling it is recommended when the batch size is small and the generation phase dominates.
  - Added FP8 support to the GEMM plugin, which benefits cases where the batch size is smaller than 4.
  - Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
- LoRA enhancements
  - Supported running FP8 LLaMA with FP16 LoRA checkpoints.
  - Added support for quantized base models with FP16/BF16 LoRA.
    - SQ OOTB (INT8 A/W) + FP16/BF16/FP32 LoRA
    - INT8/INT4 weight-only (INT8/INT4 W) + FP16/BF16/FP32 LoRA
    - Weight-only group-wise + FP16/BF16/FP32 LoRA
  - Added LoRA support to Qwen2, see the “Run models with LoRA” section in `examples/qwen/README.md`.
  - Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see the “Run Phi-3 with LoRA” section in `examples/phi/README.md`.
  - Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see the “Run StarCoder2 with LoRA” section in `examples/gpt/README.md`.
- Encoder-decoder models C++ runtime enhancements
  - Supported paged KV cache and inflight batching. (800)
  - Supported tensor parallelism.
- Supported INT8 quantization with embedding layer excluded.
- Updated default model for Whisper to `distil-whisper/distil-large-v3`, thanks to the contribution from IbrahimAmin1 in 1337.
- Supported automatic download of HuggingFace models for the Python high-level API.
- Supported explicit draft tokens for in-flight batching.
- Supported local custom calibration datasets, thanks to the contribution from DreamGenX in 1762.
- Added batched logits post processor.
- Added Hopper qgmma kernel to XQA JIT codepath.
- Supported tensor parallelism and expert parallelism enabled together for MoE.
- Supported the pipeline parallelism cases when the number of layers cannot be divided by PP size.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API.
- Added a HuggingFace model zoo from the community, thanks to the contribution from matichon-vultureprime in 1674.

API Changes
- [BREAKING CHANGE] `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see `examples/whisper/README.md`.
  - `max_batch_size` in the `trtllm-build` command now defaults to 256.
  - `max_num_tokens` in the `trtllm-build` command now defaults to 8192.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the unnecessary `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` is now read from the HuggingFace model config.
- C++ runtime
  - [BREAKING CHANGE] Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
  - [BREAKING CHANGE] Refactored the `GptManager` API
    - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
    - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
  - Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- [BREAKING CHANGE] Python high-level API
  - Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
  - Refactored the `LLM` class, please refer to `examples/high-level-api/README.md`.
    - Moved the most commonly used options into the explicit argument list and hid the expert options in the kwargs.
    - Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
    - Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
    - Supported a build cache to reuse built TensorRT-LLM engines by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or passing `enable_build_cache=True` to the `LLM` class (see the sketch after this list).
    - Exposed low-level options including `BuildConfig`, `SchedulerConfig` and so on in the kwargs, so that details of the build and runtime phases can be configured.
  - Refactored the `LLM.generate()` and `LLM.generate_async()` APIs.
    - Removed `SamplingConfig`.
    - Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`. The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
    - Refactored the `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
  - Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; please refer to `examples/apps/README.md` for details.
    - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
    - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- [BREAKING CHANGE] Speculative decoding configurations unification
  - Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
  - Introduced `SpeculativeDecodingModule.h` as the base class for speculative decoding techniques.
  - Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - [BREAKING CHANGE] The `api` argument of the `gptManagerBenchmark` command now defaults to `executor`.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- [BREAKING CHANGE] Added a `bias` argument to the `LayerNorm` module, and supports non-bias layer normalization.
- [BREAKING CHANGE] Removed `GptSession` Python bindings.
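
The build cache mentioned above can be enabled either through the environment variable or the constructor flag. A minimal sketch of the environment-variable route, with an illustrative model name:

```python
import os
from tensorrt_llm import LLM

# Opt into the engine build cache before constructing the LLM; a later run
# with the same model and build options should reuse the cached engine
# instead of rebuilding it from scratch.
os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model name
print(llm.generate(["Hello, world"])[0].outputs[0].text)
```

Per the notes above, passing `enable_build_cache=True` to the `LLM` constructor is the equivalent in-code switch.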

Model Updates
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported VILA 1.5.
- Supported Video NeVA, see the `Video NeVA` section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- Supported Qwen1.5-110B with FP8 PTQ.
- Supported Phi-3 small model with block sparse attention.
- Supported InternLM2 7B/20B, thanks to the contribution from RunningLeon in 1392.
- Supported Phi-3-medium models, see `examples/phi/README.md`.
- Supported Qwen1.5 MoE A2.7B.
- Supported Phi-3 vision multimodal.

Fixed Issues
- Fixed broken outputs for cases where the batch size is larger than 1. (1539)
- Fixed `top_k` type in `executor.py`, thanks to the contribution from vonjackustc in 1329.
- Fixed stop and bad word list pointer offset in Python runtime, thanks to the contribution from fjosw in 1486.
- Fixed some typos for Whisper model, thanks to the contribution from Pzzzzz5142 in 1328.
- Fixed export failure with CUDA driver < 526 and pynvml >= 11.5.0, thanks to the contribution from CoderHam in 1537.
- Fixed an issue in NMT weight conversion, thanks to the contribution from Pzzzzz5142 in 1660.
- Fixed LLaMA Smooth Quant conversion, thanks to the contribution from lopuhin in 1650.
- Fixed `qkv_bias` shape issue for Qwen1.5-32B (1589), thanks to the contribution from Tlntin in 1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from JamesTheZ in 1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from ngoanpv in 1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from TheCodeWrangler in 1669.
- Fixed a Qwen1.5 checkpoint conversion failure. (1675)
- Fixed Medusa safetensors and AWQ conversion, thanks to the contribution from Tushar-ml in 1535.
- Fixed `convert_hf_mpt_legacy` call failure when the function is called in other than global scope, thanks to the contribution from bloodeagle40234 in 1534.
- Fixed `use_fp8_context_fmha` broken outputs (1539).
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from Pzzzzz5142 in 1723.
- Fixed random seed initialization issue, thanks to the contribution from pathorn in 1742.
- Fixed stop words and bad words in python bindings. (1642)
- Fixed an issue when converting the checkpoint for Mistral 7B v0.3, thanks to the contribution from Ace-RR in 1732.
- Fixed broken in-flight batching for FP8 Llama and Mixtral, thanks to the contribution from bprus in 1738.
- Fixed a failure when `quantize.py` exports data to `config.json`, thanks to the contribution from janpetrov in 1676.
- Raised an error when auto parallel detects an unsupported quantization plugin. (1626)
- Fixed an issue where `shared_embedding_table` was not set when loading Gemma (1799), thanks to the contribution from mfuntowicz.
- Fixed the stop and bad words lists not being contiguous for `ModelRunner` (1815), thanks to the contribution from Marks101.
- Fixed a missing comment for `FAST_BUILD`, thanks to the support from lkm2835 in 1851.
- Fixed an issue where Top-P sampling occasionally produced invalid tokens. (1590)
- Fixed 1424.
- Fixed 1529.
- Fixed `benchmarks/cpp/README.md` for 1562 and 1552.
- Fixed dead link, thanks to the help from DefTruth, buvnswrn and sunjiabin17 in: https://github.com/triton-inference-server/tensorrtllm_backend/pull/478, https://github.com/triton-inference-server/tensorrtllm_backend/pull/482 and https://github.com/triton-inference-server/tensorrtllm_backend/pull/449.

Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.05-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.05-py3`.
- The dependent TensorRT version is updated to 10.1.0.
- The dependent CUDA version is updated to 12.4.1.
- The dependent PyTorch version is updated to 2.3.1.
- The dependent ModelOpt version is updated to v0.13.0.

Known Issues

- In a conda environment on Windows, installation of TensorRT-LLM may succeed. However, when importing the library in Python, you may receive an error message of `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.


Currently, there are two key branches in the project:
* The [rel](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
* The [main](https://github.com/NVIDIA/TensorRT-LLM/tree/main) branch is the dev branch. It is more experimental.

We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

0.10.0

Hi,

We are very pleased to announce the [0.10.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.10.0) version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Key Features and Enhancements
- The Python high-level API
  - Added embedding parallel, embedding sharing, and fused MLP support.
  - Enabled the usage of the `executor` API.
- Added a weight-stripping feature with a new `trtllm-refit` command. For more information, refer to `examples/sample_weight_stripping/README.md`.
- Added a weight-streaming feature. For more information, refer to `docs/source/advanced/weight-streaming.md`.
- Enhanced the multiple profiles feature; `--multiple_profiles` argument in `trtllm-build` command builds more optimization profiles now for better performance.
- Added FP8 quantization support for Mixtral.
- Added support for pipeline parallelism for GPT.
- Optimized `applyBiasRopeUpdateKVCache` kernel by avoiding re-computation.
- Reduced overheads between `enqueue` calls of TensorRT engines.
- Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
- Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
- Added debug options (`--visualize_network` and `--dry_run`) to the `trtllm-build` command to visualize the TensorRT network before engine build.
- Integrated the new NVIDIA Hopper XQA kernels for LLaMA 2 70B model.
- Improved the performance of pipeline parallelism when enabling in-flight batching.
- Supported quantization for Nemotron models.
- Added LoRA support for Mixtral and Qwen.
- Added in-flight batching support for ChatGLM models.
- Added support to `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models (see the sketch after this list).
- Enhanced the custom `AllReduce` by adding a heuristic: it falls back to the native NCCL kernel when the hardware requirements are not satisfied, to get the best performance.
- Optimized the performance of checkpoint conversion process for LLaMA.
- Benchmark
  - [BREAKING CHANGE] Moved the request rate generation arguments and logic from the prepare dataset script to `gptManagerBenchmark`.
  - Enabled streaming and supported Time To First Token (TTFT) latency and Inter-Token Latency (ITL) metrics for `gptManagerBenchmark`.
  - Added the `--max_attention_window` option to `gptManagerBenchmark`.
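
A minimal sketch of driving an engine through `ModelRunnerCpp` is shown below. It assumes an engine that was built beforehand with `trtllm-build`; the engine directory, tokenizer name and generation arguments are illustrative, and the exact `generate` keywords follow the `ModelRunner` convention and may differ between releases:

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

# Engine directory and tokenizer are illustrative; the engine is assumed to
# have been built with `trtllm-build` for an IFB-compatible model.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
runner = ModelRunnerCpp.from_dir(engine_dir="./tinyllama-engine")

input_ids = [torch.tensor(tokenizer.encode("Hello, my name is"), dtype=torch.int32)]
output_ids = runner.generate(
    input_ids,
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# output_ids is shaped [batch, beams, tokens]; decode the first beam.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```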

API Changes
- [BREAKING CHANGE] Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- [BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
- [BREAKING CHANGE] Renamed `GptModelConfig` to `ModelConfig`.
- [BREAKING CHANGE] Added speculative decoding mode to the builder API.
- [BREAKING CHANGE] Refactored scheduling configurations
  - Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
  - Expanded the existing configuration scheduling strategy from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy`.
- [BREAKING CHANGE] The input prompt was removed from the generation output in the `generate()` and `generate_async()` APIs. For example, given the prompt `A B`, the original generation result could be `<s>A B C D E`, where only `C D E` is the actual output; now the result is just `C D E` (see the sketch after this list).
- [BREAKING CHANGE] Switched default `add_special_token` in the TensorRT-LLM backend to `True`.
- Deprecated `GptSession` and `TrtGptModelV1`.
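
The behavioral change to the `generate()` output is easiest to see in code. A minimal sketch, written against the high-level `LLM` API with an illustrative model name; the exact output-object layout may differ slightly between releases:

```python
from tensorrt_llm import LLM

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model

prompt = "A B"
result = llm.generate([prompt])[0]

# Before this release the returned text could look like "<s>A B C D E";
# now only the newly generated part ("C D E") comes back.
print(repr(result.outputs[0].text))
```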

Model Updates
- Support DBRX
- Support Qwen2
- Support CogVLM
- Support ByT5
- Support LLaMA 3
- Support Arctic (w/ FP8)
- Support Fuyu
- Support Persimmon
- Support Deplot
- Support Phi-3-Mini with long Rope
- Support Neva
- Support Kosmos-2
- Support RecurrentGemma

Fixed Issues
- Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
- Fixed segmentation fault with pipeline parallelism and `gather_all_token_logits`. (1284)
- Removed the unnecessary check in XQA to fix code Llama 70b Triton crashes. (1256)
- Fixed an unsupported ScalarType issue for BF16 LoRA. (https://github.com/triton-inference-server/tensorrtllm_backend/issues/403)
- Eliminated the load and save of prompt table in multimodal. (https://github.com/NVIDIA/TensorRT-LLM/discussions/1436)
- Fixed an error when converting the model weights of Qwen 72B INT4-GPTQ. (1344)
- Fixed early stopping and failures on in-flight batching cases of Medusa. (1449)
- Added support for more NVLink versions for auto parallelism. (1467)
- Fixed the assert failure caused by default values of sampling config. (1447)
- Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (1446)
- Fixed MMHA relative position calculation error in `gpt_attention_plugin` for enc-dec models. (1343)


Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.03-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.03-py3`.
- The dependent TensorRT version is updated to 10.0.1.
- The dependent CUDA version is updated to 12.4.0.
- The dependent PyTorch version is updated to 2.2.2.


Currently, there are two key branches in the project:
* The [rel](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
* The [main](https://github.com/NVIDIA/TensorRT-LLM/tree/main) branch is the dev branch. It is more experimental.

We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

0.9.0

Hi,

We are very pleased to announce the [0.9.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.9.0) version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

* Model Support
- Support distil-whisper, thanks to the contribution from Bhuvanesh09 in PR 1061
- Support HuggingFace StarCoder2
- Support VILA
- Support Smaug-72B-v0.1
- Migrate BLIP-2 examples to `examples/multimodal`
* Features
- **[BREAKING CHANGE]** TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
- **[BREAKING CHANGE]** Support embedding sharing for Gemma
- Add support to context chunking to work with KV cache reuse
- Enable different rewind tokens per sequence for Medusa
- BART LoRA support (limited to the Python runtime)
- Enable multi-LoRA for BART LoRA
- Support `early_stopping=False` in beam search for C++ Runtime
- Add logits post processor to the batch manager (see `docs/source/batch_manager.md#logits-post-processor-optional`)
- Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from mfuntowicz in 1147
- Support loading Gemma from HuggingFace
- Support auto parallelism planner for high-level API and unified builder workflow
- Support running `GptSession` without OpenMPI 1220
- Medusa IFB support
- **[Experimental]** Support FP8 FMHA, note that the performance is not optimal, and we will keep optimizing it
- More head sizes support for LLaMA-like models
  - Ampere (sm80, sm86), Ada (sm89) and Hopper (sm90) all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256] now
- OOTB functionality support
  - T5
  - Mixtral 8x7B
* API
- C++ `executor` API
  - Add Python bindings, see documentation and examples in `examples/bindings`
  - Add advanced and multi-GPU examples for the Python binding of the `executor` C++ API, see `examples/bindings/README.md`
  - Add documents for the C++ `executor` API, see `docs/source/executor.md`
- High-level API (refer to `examples/high-level-api/README.md` for guidance)
  - **[BREAKING CHANGE]** Reuse the `QuantConfig` used in the `trtllm-build` tool to support broader quantization features
  - Support accepting engines built by the `trtllm-build` command in the `LLM()` API
  - Add support for the TensorRT-LLM checkpoint as model input
  - Refine `SamplingConfig` used in the `LLM.generate` or `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features
  - Add support for the StreamingLLM feature, enable it by setting `LLM(streaming_llm=...)` (see the sketch after this list)
- Migrate Mixtral to high level API and unified builder workflow
- **[BREAKING CHANGE]** Refactored Qwen model to the unified build workflow, see `examples/qwen/README.md` for the latest commands
- **[BREAKING CHANGE]** Move LLaMA convert checkpoint script from examples directory into the core library
- **[BREAKING CHANGE]** Refactor GPT with unified building workflow, see `examples/gpt/README.md` for the latest commands
- **[BREAKING CHANGE]** Removed all the LoRA-related flags from the `convert_checkpoint.py` script and the checkpoint content; they are now passed to the `trtllm-build` command, to generalize the feature better to more models
- **[BREAKING CHANGE]** Removed the `use_prompt_tuning` flag and options from the `convert_checkpoint.py` script and the checkpoint content, to generalize the feature better to more models. Use `trtllm-build --max_prompt_embedding_table_size` instead.
- **[BREAKING CHANGE]** Changed the `trtllm-build --world_size` flag to `--auto_parallel`; the option is used for the auto parallel planner only.
- **[BREAKING CHANGE]** `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both with explicit `mpirun` launches at the application level and with an MPI communicator created by `mpi4py`
- **[BREAKING CHANGE]** `examples/server` is removed; see `examples/app` instead.
- **[BREAKING CHANGE]** Remove LoRA related parameters from convert checkpoint scripts
- **[BREAKING CHANGE]** Simplify Qwen convert checkpoint script
- **[BREAKING CHANGE]** Remove `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`
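
The StreamingLLM switch above is a constructor option. The sketch below is an assumption-laden illustration: it uses the later top-level `LLM` import spelling for brevity, and since the notes only show `LLM(streaming_llm=...)`, the accepted value type (a boolean vs. a small config object) should be checked against the high-level API examples:

```python
from tensorrt_llm import LLM

# StreamingLLM keeps a few "sink" tokens plus a sliding attention window so
# that very long sessions do not exhaust the KV cache.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
    streaming_llm=True,                          # assumed enable value
)
print(llm.generate(["Summarize the StreamingLLM idea."])[0].outputs[0].text)
```
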
* Bug fixes
- Fix a weight-only quant bug for Whisper to make sure that the `encoder_input_len_range` is not 0, thanks to the contribution from Eddie-Wang1120 in 992
- Fix the issue that log probabilities in Python runtime are not returned 983
- Multi-GPU fixes for multimodal examples 1003
- Fix wrong `end_id` issue for Qwen 987
- Fix a non-stopping generation issue (1118, 1123)
- Fix wrong link in examples/mixtral/README.md 1181
- Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled 967
- Fix wrong `head_size` when importing Gemma model from HuggingFace Hub, thanks for the contribution from mfuntowicz in 1148
- Fix ChatGLM2-6B building failure on INT8 1239
- Fix wrong relative path in Baichuan documentation 1242
- Fix wrong `SamplingConfig` tensors in `ModelRunnerCpp` 1183
- Fix error when converting SmoothQuant LLaMA 1267
- Fix the issue that `examples/run.py` only loads one line from `--input_file`
- Fix the issue that `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly 1183
* Benchmark
- Add emulated static batching in `gptManagerBenchmark`
- Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in `benchmarks/cpp/README.md`
- Add percentile latency report to `gptManagerBenchmark`
* Performance
- Optimize `gptDecoderBatch` to support batched sampling
- Enable FMHA for models in BART, Whisper and NMT family
- Remove router tensor parallelism to improve performance for MoE models, thanks to the contribution from megha95 in 1091
- Improve custom all-reduce kernel
* Infra
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.02-py3`
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
- The dependent TensorRT version is updated to 9.3
- The dependent PyTorch version is updated to 2.2
- The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)

Currently, there are two key branches in the project:

* The [rel](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
* The [main](https://github.com/NVIDIA/TensorRT-LLM/tree/main) branch is the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,

The TensorRT-LLM Engineering Team
