An open-source library from NVIDIA for optimizing large language model (LLM) inference, delivering state-of-the-art GPU performance through TensorRT technology.
Detailed Introduction to the TensorRT-LLM Project
Project Overview
TensorRT-LLM is an open-source library developed by NVIDIA, specifically designed to optimize the inference performance of large language models (LLMs) on NVIDIA GPUs. It provides an easy-to-use Python API for defining LLMs and supports state-of-the-art optimization techniques for efficient inference execution on NVIDIA GPUs.
Core Features
1. Advanced Optimization Techniques
TensorRT-LLM offers a variety of advanced optimization features, including:
- Custom Attention Kernels: Specially optimized implementations of attention mechanisms
- In-flight Batching (also called dynamic or continuous batching): Schedules new requests as soon as earlier ones complete, so input sequences of varying length keep the GPU busy
- Paged KV Cache: Block-based key-value cache management that avoids reserving memory for the maximum context length up front (a conceptual sketch follows this list)
- Speculative Decoding: Accelerates generation by drafting several candidate tokens per step and verifying them with the target model
- Multiple Quantization Support: FP8, FP4, INT4 AWQ, INT8 SmoothQuant, etc.
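To make the paged KV cache idea concrete, here is a purely conceptual Python sketch, not TensorRT-LLM internals: fixed-size cache blocks are handed out to sequences on demand and returned to a free pool when a sequence finishes, so no memory has to be reserved for the maximum context length up front. The class name, block size, and bookkeeping are illustrative assumptions.

# Conceptual illustration of paged KV-cache bookkeeping (not TensorRT-LLM internals).
class PagedKVCacheAllocator:
    def __init__(self, num_blocks: int, tokens_per_block: int = 64):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_table = {}                        # sequence id -> list of block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        """Grab a new block only when the sequence has filled its current blocks."""
        blocks = self.block_table.setdefault(seq_id, [])
        if num_tokens_so_far > len(blocks) * self.tokens_per_block:
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait or be preempted")
            blocks.append(self.free_blocks.pop())

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))


allocator = PagedKVCacheAllocator(num_blocks=1024)
allocator.append_token(seq_id=0, num_tokens_so_far=1)   # first token triggers one block
allocator.release(seq_id=0)                             # freed blocks are reusable immediately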
2. Detailed Quantization Techniques
TensorRT-LLM provides an industry-leading unified quantization toolkit that significantly accelerates the deployment of deep learning/generative AI on NVIDIA hardware while maintaining model accuracy.
Key Quantization Methods:
- FP8: Typically offers the best performance and accuracy in large-batch inference scenarios, suitable for batch sizes ≥ 16.
- INT8 SmoothQuant: 8-bit quantization of weights and activations; activation outliers are smoothed into the weights, with per-channel weight scales and per-tensor scaling of activation ranges.
- INT4 AWQ: Weight re-scaling and block-wise quantization to INT4, recommended for small-batch inference scenarios (batch sizes ≤ 4); a numeric sketch of the block-wise arithmetic follows this list.
- W4A8 AWQ: Weights quantized to INT4 with 8-bit activations.
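As a numeric illustration of what block-wise INT4 quantization means, the NumPy sketch below quantizes a single block of weights with one per-block scale. It is a simplification of AWQ (the activation-aware re-scaling of salient channels is omitted), and the block size of 64 simply mirrors the --awq_block_size value used in the quantization commands later in this document.

import numpy as np

# Simplified per-block symmetric INT4 quantization of one weight block.
# Real AWQ additionally re-scales salient channels before quantizing; omitted here.
block = np.random.randn(64).astype(np.float32)                 # one block of 64 weights

scale = np.abs(block).max() / 7.0                              # symmetric INT4 range is [-8, 7]
q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)    # 4-bit codes (stored packed in practice)
dequant = q.astype(np.float32) * scale                         # what the kernel reconstructs at runtime

print("max abs error:", np.abs(block - dequant).max())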
Performance Improvements:
According to benchmark tests, quantization techniques can bring significant performance improvements:
- FP8 Quantization: Llama 3 8B model achieves 1.45x acceleration, and the 70B model achieves 1.81x acceleration compared to the FP16 baseline.
- INT4 AWQ: In scenarios with a batch size of 1, the 70B model can achieve up to 2.66x performance improvement.
- Memory Optimization: All quantized versions of the Llama 3 70B model can run on a single NVIDIA H100 GPU, whereas FP16 precision requires at least two GPUs.
3. Multi-GPU and Multi-Node Support
TensorRT-LLM ships the pre- and post-processing steps and the multi-GPU, multi-node communication primitives (tensor and pipeline parallelism built on NCCL/MPI) needed to scale inference beyond a single GPU, all exposed through a simple, open-source model definition API.
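As a sketch of how multi-GPU execution is requested through the LLM API (shown in full in the usage example later in this document), the snippet below asks for 2-way tensor parallelism. It assumes the tensor_parallel_size argument of the LLM constructor and two visible GPUs, so treat it as illustrative rather than a verified configuration.

from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

# Shard the model across 2 GPUs with tensor parallelism (assumes 2 visible GPUs).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          tensor_parallel_size=2)

for output in llm.generate(["The capital of France is"],
                           SamplingParams(temperature=0.8, top_p=0.95)):
    print(output.outputs[0].text)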
4. Extensive Hardware Support
TensorRT-LLM supports GPUs based on NVIDIA Hopper, NVIDIA Ada Lovelace, and NVIDIA Ampere architectures. Specifically:
- H100 GPU: Supports automatic conversion to FP8 format and optimized FP8 kernels (a quick FP8-capability check is sketched after this list).
- H200 GPU: Can achieve nearly 12,000 tokens/second performance on Llama2-13B.
- RTX Series: Supports large model inference on consumer-grade GPUs.
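Whether FP8 quantization is an option can be checked from the GPU's CUDA compute capability: Ada Lovelace reports 8.9 and Hopper 9.0, and native FP8 needs at least 8.9. The PyTorch-based check below is a convenience sketch, not something TensorRT-LLM requires you to run.

import torch

# Assumes a CUDA-capable GPU is visible.
# Compute capability 8.9 (Ada Lovelace) or 9.0+ (Hopper) is needed for native FP8.
major, minor = torch.cuda.get_device_capability(0)
supports_fp8 = (major, minor) >= (8, 9)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}, "
      f"FP8 supported: {supports_fp8}")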
Installation and Usage
Docker Installation (Recommended)
# Run the pre-built Docker container
docker run --ipc host --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
LLM API Usage Example
from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM
def main():
    build_config = BuildConfig()
    build_config.max_batch_size = 256
    build_config.max_num_tokens = 1024

    # Supports HuggingFace model names, local HF model paths, or TensorRT Model Optimizer quantization checkpoints
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              build_config=build_config)

    # Example prompts
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create sampling parameters
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
Online Service Deployment
# Start an OpenAI-compatible server
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000
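Because the server speaks the OpenAI completions protocol, it can be queried with any OpenAI-style client. The sketch below uses the plain requests library against the /v1/completions route on localhost:8000; the payload fields follow the OpenAI convention and are assumptions about this particular deployment rather than TensorRT-LLM-specific settings.

import requests

# Query the OpenAI-compatible completions endpoint started by trtllm-serve.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])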
Quantization Workflow
Basic Quantization Commands
# FP8 Quantization
python quantize.py --model_dir $MODEL_PATH --qformat fp8 --kv_cache_dtype fp8 --output_dir $OUTPUT_PATH
# INT4 AWQ Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int4_awq --awq_block_size 64 --tp_size 4 --output_dir $OUTPUT_PATH
# INT8 SmoothQuant Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH
# Auto Quantization (combination of multiple methods)
python quantize.py --model_dir $MODEL_PATH --autoq_format fp8,int4_awq,w4a8_awq --output_dir $OUTPUT_PATH --auto_quantize_bits 5 --tp_size 2
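As the comment in the LLM API example above notes, the model argument also accepts TensorRT Model Optimizer quantization checkpoints, so the output directory produced by these commands can often be loaded directly. The path below is a placeholder for the $OUTPUT_PATH of the FP8 command, and whether an additional engine-build step is required depends on the model and TensorRT-LLM version.

from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

# Point the LLM API at the quantized checkpoint produced above
# (placeholder path; substitute the $OUTPUT_PATH used in the FP8 command).
llm = LLM(model="/path/to/quantized_fp8_checkpoint")

for output in llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.8, top_p=0.95)):
    print(output.outputs[0].text)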
Supported Models
TensorRT-LLM supports a wide range of popular LLM architectures, including but not limited to:
- Llama Series: Llama 2, Llama 3, Llama 3.1, Llama 3.3
- Falcon Series: Including Falcon-180B
- GPT Series: GPT-style architectures such as GPT-2, GPT-J, and GPT-NeoX
- Gemma Series: Google's open-source models
- Mixtral Series: Mixture-of-Experts models
- DeepSeek Series: Including DeepSeek R1
- Code Llama: Models specialized for code generation
Ecosystem Integration
NVIDIA Ecosystem
- NVIDIA NeMo: An end-to-end framework for building, customizing, and deploying generative AI applications.
- Triton Inference Server: A production-grade inference server.
- NVIDIA Dynamo: A data center-scale distributed inference serving framework.
Third-Party Integrations
- HuggingFace Hub: Provides pre-quantized models.
- LlamaIndex: For RAG application development.
- SageMaker LMI: AWS managed inference.
Performance Benchmarks
Examples of performance improvements:
- Compared to CPU platforms: Inference speed increased by up to 36x.
- Compared to unoptimized RTX: LLM speed increased by up to 4x on Windows RTX platforms.
- Falcon-180B: Achieves inference using INT4 AWQ on a single H200 GPU.
- Llama-70B: Achieves 6.7x speed improvement compared to A100.
Best Practices and Recommendations
Quantization Method Selection
Choose the appropriate quantization method for the scenario (a simple selection helper is sketched after these guidelines):
Small-batch inference (batch size ≤ 4):
- Recommended to use weight-only quantization methods (e.g., INT4 AWQ).
- Performance at these batch sizes is dominated by weight memory bandwidth, so weight-only compression gives the largest gains.
Large-batch inference (batch size ≥ 16):
- Prioritize FP8 quantization, which typically offers the best performance and accuracy.
- If results are not satisfactory, try INT8 SmoothQuant, then AWQ and/or GPTQ.
Domain-specific applications:
- For highly specialized applications like code completion, it is recommended to use domain-specific datasets for calibration.
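These rules of thumb can be condensed into a small decision helper. The function below simply encodes the guidance above; the returned strings match the --qformat values from the quantization commands earlier, and the batch-size thresholds are the heuristics stated here, not hard limits.

def pick_quant_format(batch_size: int, fp8_supported: bool = True) -> str:
    """Encode the rule-of-thumb quantization choices described above."""
    if batch_size <= 4:
        # Small batches are weight-memory-bandwidth bound: use weight-only quantization.
        return "int4_awq"
    if batch_size >= 16 and fp8_supported:
        # Large batches: FP8 usually gives the best speed/accuracy trade-off.
        return "fp8"
    # Middle ground or no FP8 hardware: try INT8 SmoothQuant first.
    return "int8_sq"


print(pick_quant_format(batch_size=1))    # -> int4_awq
print(pick_quant_format(batch_size=32))   # -> fp8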
Technical Advantages
- Ease of Use: Provides high-level Python APIs, simplifying the LLM definition and optimization process.
- Performance: Includes all mainstream optimization techniques, such as kernel fusion, quantization, runtime optimization, etc.
- Scalability: Supports various deployment scenarios, from single-GPU to multi-node.
- Compatibility: Deeply integrated with PyTorch, supporting major inference ecosystems.
- Open Source: Fully open-source, with community-driven continuous development.
Future Development
TensorRT-LLM improves ease of use and extensibility through its open-source, modular model definition API, which is used to define, optimize, and execute new architectures and features, making the library easy to customize as LLMs evolve.
The project's continuous development directions include:
- More model architecture support
- More advanced quantization techniques
- Better multi-node scalability
- Tighter ecosystem integration