An open-source library from NVIDIA for optimizing large language model (LLM) inference, delivering state-of-the-art GPU inference performance through TensorRT technology.

License: Apache-2.0 | Language: C++ | Repository: NVIDIA/TensorRT-LLM | Stars: 11.6k | Last Updated: September 12, 2025

Detailed Introduction to the TensorRT-LLM Project

Project Overview

TensorRT-LLM is an open-source library developed by NVIDIA, specifically designed to optimize the inference performance of large language models (LLMs) on NVIDIA GPUs. It provides an easy-to-use Python API for defining LLMs and supports state-of-the-art optimization techniques for efficient inference execution on NVIDIA GPUs.

Core Features

1. Advanced Optimization Techniques

TensorRT-LLM offers a variety of advanced optimization features, including:

  • Custom Attention Kernels: Hand-optimized implementations of the attention mechanism
  • Dynamic Batching (In-Flight Batching): New requests join the batch and finished requests leave it during generation, so sequences of varying lengths are processed without waiting for the whole batch to complete
  • Paged KV Cache: Block-based key-value cache management for efficient GPU memory use (a toy sketch follows this list)
  • Speculative Decoding: Drafts multiple tokens cheaply and verifies them with the target model to accelerate generation
  • Multiple Quantization Formats: FP8, FP4, INT4 AWQ, INT8 SmoothQuant, etc.
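
To make the Paged KV Cache idea concrete: the cache is carved into fixed-size blocks, and each sequence keeps a table of the blocks it currently occupies, much like paged virtual memory. Below is a minimal, purely illustrative Python sketch of block-based allocation; all class and method names are hypothetical and do not reflect TensorRT-LLM's internal implementation.

# Illustrative toy model of a paged KV cache. NOT TensorRT-LLM code; all names are hypothetical.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                  # tokens stored per block
        self.free_blocks = list(range(num_blocks))    # pool of free block ids
        self.block_tables = {}                        # seq_id -> list of block ids
        self.seq_lens = {}                            # seq_id -> number of tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; allocate a fresh block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:             # current block is full (or this is the first token)
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait or be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return self.block_tables[seq_id][-1]          # block that holds the new token

    def free_sequence(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                                   # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))                     # -> 3
cache.free_sequence(0)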

2. Detailed Quantization Techniques

TensorRT-LLM provides an industry-leading unified quantization toolkit that significantly accelerates the deployment of deep learning/generative AI on NVIDIA hardware while maintaining model accuracy.

Key Quantization Methods:

  • FP8: Typically offers the best performance and accuracy in large-batch inference scenarios, suitable for batch sizes ≥ 16.
  • INT8 SmoothQuant: Smooths activation outliers into the weights, then quantizes both weights and activations to 8 bits (per-channel scales for weights, per-tensor scales for activations).
  • INT4 AWQ: Activation-aware weight re-scaling followed by block-wise quantization to INT4, recommended for small-batch inference scenarios (batch sizes ≤ 4); the block-wise idea is sketched after this list.
  • W4A8 AWQ: Weight quantization to INT4, activation quantization to INT8.
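
To illustrate the weight-only idea behind INT4 AWQ, the NumPy sketch below performs plain symmetric block-wise INT4 quantization of a single weight row. Real AWQ additionally applies activation-aware re-scaling of salient channels before quantizing, which this toy deliberately omits.

import numpy as np

def quantize_int4_blockwise(w: np.ndarray, block_size: int = 64):
    """Toy symmetric block-wise INT4 quantization of a 1-D weight row."""
    blocks = w.reshape(-1, block_size)                        # split the row into blocks
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map each block onto the INT4 range [-7, 7]
    scales = np.maximum(scales, 1e-8)                         # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)                  # one weight row
q, scales = quantize_int4_blockwise(w, block_size=64)         # 64 matches the --awq_block_size used later
print("mean abs quantization error:", float(np.abs(w - dequantize(q, scales)).mean()))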

Performance Improvements:

According to benchmark tests, quantization techniques can bring significant performance improvements:

  • FP8 Quantization: Llama 3 8B model achieves 1.45x acceleration, and the 70B model achieves 1.81x acceleration compared to the FP16 baseline.
  • INT4 AWQ: In scenarios with a batch size of 1, the 70B model can achieve up to 2.66x performance improvement.
  • Memory Optimization: All quantized versions of the Llama 3 70B model can run on a single NVIDIA H100 GPU, whereas FP16 precision requires at least two GPUs, as the quick weight-size arithmetic below illustrates.
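
The memory claim can be sanity-checked from the weight sizes alone; the short calculation below ignores the KV cache, activations, and runtime buffers, so the figures are lower bounds.

# Rough weight-only memory footprint of a 70B-parameter model.
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8 / INT8": 1.0, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>10}: ~{params * nbytes / 1024**3:.0f} GiB of weights")

# FP16      : ~130 GiB -> exceeds a single 80 GB H100, hence at least two GPUs
# FP8 / INT8: ~ 65 GiB -> fits on one 80 GB H100
# INT4      : ~ 33 GiB -> fits with ample headroom left for the KV cache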

3. Multi-GPU and Multi-Node Support

TensorRT-LLM bundles the required pre- and post-processing steps together with multi-GPU/multi-node communication primitives (tensor and pipeline parallelism), all exposed through a simple, open-source model definition API, enabling breakthrough LLM inference performance at scale.
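
As a deliberately minimal sketch, the snippet below requests two-way tensor parallelism through the LLM API. It assumes the `tensor_parallel_size` argument and reuses the import path from the LLM API example later on this page; treat the exact names, and the illustrative model choice, as assumptions to verify against your installed version.

from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM   # same import path as the LLM API example below

def main():
    # Assumption: tensor_parallel_size shards each layer's weights across GPUs.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",   # illustrative model choice
              tensor_parallel_size=2)                         # run across 2 GPUs
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    for output in llm.generate(["The future of AI is"], sampling_params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()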

4. Extensive Hardware Support

TensorRT-LLM supports GPUs based on NVIDIA Hopper, NVIDIA Ada Lovelace, and NVIDIA Ampere architectures. Specifically:

  • H100 GPU: Supports automatic conversion to FP8 format and optimized kernels.
  • H200 GPU: Can achieve nearly 12,000 tokens/second performance on Llama2-13B.
  • RTX Series: Supports large model inference on consumer-grade GPUs.

Installation and Usage

Docker Installation (Recommended)

# Run the pre-built Docker container
docker run --ipc host --gpus all -it nvcr.io/nvidia/tensorrt-llm/release

LLM API Usage Example

from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

def main():
    build_config = BuildConfig()
    build_config.max_batch_size = 256
    build_config.max_num_tokens = 1024
    
    # Supports HuggingFace model names, local HF model paths, or TensorRT model optimizer quantization checkpoints
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 
              build_config=build_config)
    
    # Example prompts
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    
    # Create sampling parameters
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()

Online Service Deployment

# Start an OpenAI-compatible server
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000
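
Once the server is running, it can be queried like any OpenAI-compatible endpoint. The sketch below uses the `requests` library against the standard `/v1/chat/completions` route; the route and payload follow the OpenAI API convention, so adjust the host, port, and model name to whatever you actually served.

import requests

# Query the OpenAI-compatible endpoint exposed by trtllm-serve
# (assumed to be listening on localhost:8000, as started above).
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])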

Quantization Workflow

Basic Quantization Commands

# FP8 Quantization
python quantize.py --model_dir $MODEL_PATH --qformat fp8 --kv_cache_dtype fp8 --output_dir $OUTPUT_PATH

# INT4 AWQ Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int4_awq --awq_block_size 64 --tp_size 4 --output_dir $OUTPUT_PATH

# INT8 SmoothQuant Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH

# Auto Quantization (combination of multiple methods)
python quantize.py --model_dir $MODEL_PATH --autoq_format fp8,int4_awq,w4a8_awq --output_dir $OUTPUT_PATH --auto_quantize_bits 5 --tp_size 2
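
Once quantization finishes, the resulting checkpoint directory can be passed to the LLM API in place of a Hugging Face model name, as the comment in the earlier example notes. A minimal sketch, with a placeholder path standing in for $OUTPUT_PATH:

from tensorrt_llm._tensorrt_engine import LLM   # same import path as the LLM API example above

# Point the LLM API at the directory written by quantize.py; the path is a placeholder.
llm = LLM(model="/path/to/quantized_checkpoint")
for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)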

Supported Models

TensorRT-LLM supports a wide range of popular LLM architectures, including but not limited to:

  • Llama Series: Llama 2, Llama 3, Llama 3.1, Llama 3.3
  • Falcon Series: Including Falcon-180B
  • GPT Series: GPT-2, GPT-J, GPT-NeoX, and related architectures
  • Gemma Series: Google's open-source models
  • Mixtral Series: Mixture-of-Experts models
  • DeepSeek Series: Including DeepSeek R1
  • CodeLlama: Code generation specific models

Ecosystem Integration

NVIDIA Ecosystem

  • NVIDIA NeMo: An end-to-end framework for building, customizing, and deploying generative AI applications.
  • Triton Inference Server: A production-grade inference server.
  • NVIDIA Dynamo: A data center-scale distributed inference serving framework.

Third-Party Integrations

  • HuggingFace Hub: Provides pre-quantized models.
  • LlamaIndex: For RAG application development.
  • SageMaker LMI: AWS managed inference.

Performance Benchmarks

Examples of performance improvements:

  • Compared to CPU-only platforms: Up to 36x faster inference.
  • Windows RTX platforms: Up to 4x faster LLM inference than the unoptimized baseline.
  • Falcon-180B: Runs with INT4 AWQ quantization on a single H200 GPU.
  • Llama-70B: Up to a 6.7x speedup compared to A100.

Best Practices and Recommendations

Quantization Method Selection

Choose the appropriate quantization method for the scenario at hand; the helper sketch after this list encodes these rules of thumb:

  1. Small-batch inference (batch size ≤ 4):
    • Recommended: weight-only quantization methods (e.g., INT4 AWQ).
    • This regime is primarily limited by memory bandwidth, so shrinking the weights matters most.
  2. Large-batch inference (batch size ≥ 16):
    • Prioritize FP8 quantization, which typically offers the best performance and accuracy.
    • If results are not satisfactory, try INT8 SmoothQuant, then AWQ and/or GPTQ.
  3. Domain-specific applications:
    • For highly specialized workloads such as code completion, calibrate with a domain-specific dataset.
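
These rules of thumb can be condensed into a small helper. The function below simply mirrors the recommendations above, using the --qformat strings from the quantization commands earlier on this page; it is an informal sketch, not part of any official API.

def recommend_qformat(batch_size: int, domain_specific: bool = False) -> list:
    """Informal helper mirroring the rules of thumb above (not an official API)."""
    if batch_size <= 4:
        candidates = ["int4_awq", "w4a8_awq"]         # weight-only: this regime is memory-bandwidth bound
    elif batch_size >= 16:
        candidates = ["fp8", "int8_sq", "int4_awq"]   # FP8 first, then fall back as suggested above
    else:
        candidates = ["fp8", "int4_awq"]              # not covered by the guidance above: evaluate both
    if domain_specific:
        # e.g. code completion: calibrate with a domain-specific dataset, whichever format is chosen
        candidates = [c + " (domain-calibrated)" for c in candidates]
    return candidates

print(recommend_qformat(batch_size=1))    # ['int4_awq', 'w4a8_awq']
print(recommend_qformat(batch_size=32))   # ['fp8', 'int8_sq', 'int4_awq']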

Technical Advantages

  1. Ease of Use: Provides high-level Python APIs, simplifying the LLM definition and optimization process.
  2. Performance: Includes all mainstream optimization techniques, such as kernel fusion, quantization, and runtime optimizations.
  3. Scalability: Supports various deployment scenarios, from single-GPU to multi-node.
  4. Compatibility: Deeply integrated with PyTorch, supporting major inference ecosystems.
  5. Open Source: Fully open-source, with community-driven continuous development.

Future Development

TensorRT-LLM is built for ease of use and extensibility: its open-source, modular model definition API makes it straightforward to define, optimize, and execute new architectures and features, so the library can be customized as LLMs evolve.

The project's continuous development directions include:

  • More model architecture support
  • More advanced quantization techniques
  • Better multi-node scalability
  • Tighter ecosystem integration
