An open-source library from NVIDIA for optimizing large language model (LLM) inference, delivering state-of-the-art GPU performance through TensorRT technology.
Detailed Introduction to the TensorRT-LLM Project
Project Overview
TensorRT-LLM is an open-source library developed by NVIDIA, specifically designed to optimize the inference performance of large language models (LLMs) on NVIDIA GPUs. It provides an easy-to-use Python API for defining LLMs and supports state-of-the-art optimization techniques for efficient inference execution on NVIDIA GPUs.
Core Features
1. Advanced Optimization Techniques
TensorRT-LLM offers a variety of advanced optimization features, including:
- Custom Attention Kernels: Specially optimized implementations of attention mechanisms
- In-flight Batching (also called dynamic or continuous batching): Schedules new requests as soon as earlier ones complete, so input sequences of varying length keep the GPU busy
- Paged KV Cache: Block-based key-value cache management that avoids reserving memory for the maximum context length up front (a conceptual sketch follows this list)
- Speculative Decoding: Accelerates generation by drafting several candidate tokens per step and verifying them with the target model
- Multiple Quantization Support: FP8, FP4, INT4 AWQ, INT8 SmoothQuant, etc.
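To make the paged KV cache idea concrete, here is a purely conceptual Python sketch, not TensorRT-LLM internals: fixed-size cache blocks are handed out to sequences on demand and returned to a free pool when a sequence finishes, so no memory has to be reserved for the maximum context length up front. The class name, block size, and bookkeeping are illustrative assumptions.

# Conceptual illustration of paged KV-cache bookkeeping (not TensorRT-LLM internals).
class PagedKVCacheAllocator:
    def __init__(self, num_blocks: int, tokens_per_block: int = 64):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_table = {}                        # sequence id -> list of block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        """Grab a new block only when the sequence has filled its current blocks."""
        blocks = self.block_table.setdefault(seq_id, [])
        if num_tokens_so_far > len(blocks) * self.tokens_per_block:
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait or be preempted")
            blocks.append(self.free_blocks.pop())

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))


allocator = PagedKVCacheAllocator(num_blocks=1024)
allocator.append_token(seq_id=0, num_tokens_so_far=1)   # first token triggers one block
allocator.release(seq_id=0)                             # freed blocks are reusable immediately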
2. Detailed Quantization Techniques
TensorRT-LLM provides an industry-leading unified quantization toolkit that significantly accelerates the deployment of deep learning/generative AI on NVIDIA hardware while maintaining model accuracy.
Key Quantization Methods:
- FP8: Typically offers the best performance and accuracy in large-batch inference scenarios, suitable for batch sizes ≥ 16.
- INT8 SmoothQuant: 8-bit quantization of weights and activations; activation outliers are smoothed into the weights, with per-channel weight scales and per-tensor scaling of activation ranges.
- INT4 AWQ: Weight re-scaling and block-wise quantization to INT4, recommended for small-batch inference scenarios (batch sizes ≤ 4); a numeric sketch of the block-wise arithmetic follows this list.
- W4A8 AWQ: Weights quantized to INT4 with 8-bit activations.
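As a numeric illustration of what block-wise INT4 quantization means, the NumPy sketch below quantizes a single block of weights with one per-block scale. It is a simplification of AWQ (the activation-aware re-scaling of salient channels is omitted), and the block size of 64 simply mirrors the --awq_block_size value used in the quantization commands later in this document.

import numpy as np

# Simplified per-block symmetric INT4 quantization of one weight block.
# Real AWQ additionally re-scales salient channels before quantizing; omitted here.
block = np.random.randn(64).astype(np.float32)                 # one block of 64 weights

scale = np.abs(block).max() / 7.0                              # symmetric INT4 range is [-8, 7]
q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)    # 4-bit codes (stored packed in practice)
dequant = q.astype(np.float32) * scale                         # what the kernel reconstructs at runtime

print("max abs error:", np.abs(block - dequant).max())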
Performance Improvements:
According to benchmark tests, quantization techniques can bring significant performance improvements:
- FP8 Quantization: Llama 3 8B model achieves 1.45x acceleration, and the 70B model achieves 1.81x acceleration compared to the FP16 baseline.
- INT4 AWQ: In scenarios with a batch size of 1, the 70B model can achieve up to 2.66x performance improvement.
- Memory Optimization: All quantized versions of the Llama 3 70B model can run on a single NVIDIA H100 GPU, whereas FP16 precision requires at least two GPUs.
3. Multi-GPU and Multi-Node Support
TensorRT-LLM ships the pre- and post-processing steps and the multi-GPU, multi-node communication primitives (tensor and pipeline parallelism built on NCCL/MPI) needed to scale inference beyond a single GPU, all exposed through a simple, open-source model definition API.
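As a sketch of how multi-GPU execution is requested through the LLM API (shown in full in the usage example later in this document), the snippet below asks for 2-way tensor parallelism. It assumes the tensor_parallel_size argument of the LLM constructor and two visible GPUs, so treat it as illustrative rather than a verified configuration.

from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

# Shard the model across 2 GPUs with tensor parallelism (assumes 2 visible GPUs).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          tensor_parallel_size=2)

for output in llm.generate(["The capital of France is"],
                           SamplingParams(temperature=0.8, top_p=0.95)):
    print(output.outputs[0].text)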
4. Extensive Hardware Support
TensorRT-LLM supports GPUs based on NVIDIA Hopper, NVIDIA Ada Lovelace, and NVIDIA Ampere architectures. Specifically:
- H100 GPU: Supports automatic conversion to FP8 format and optimized FP8 kernels (a quick FP8-capability check is sketched after this list).
- H200 GPU: Can achieve nearly 12,000 tokens/second performance on Llama2-13B.
- RTX Series: Supports large model inference on consumer-grade GPUs.
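Whether FP8 quantization is an option can be checked from the GPU's CUDA compute capability: Ada Lovelace reports 8.9 and Hopper 9.0, and native FP8 needs at least 8.9. The PyTorch-based check below is a convenience sketch, not something TensorRT-LLM requires you to run.

import torch

# Assumes a CUDA-capable GPU is visible.
# Compute capability 8.9 (Ada Lovelace) or 9.0+ (Hopper) is needed for native FP8.
major, minor = torch.cuda.get_device_capability(0)
supports_fp8 = (major, minor) >= (8, 9)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}, "
      f"FP8 supported: {supports_fp8}")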
Installation and Usage
Docker Installation (Recommended)
# Run the pre-built Docker container
docker run --ipc host --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
LLM API Usage Example
from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM
def main():
    build_config = BuildConfig()
    build_config.max_batch_size = 256
    build_config.max_num_tokens = 1024

    # Supports HuggingFace model names, local HF model paths, or TensorRT Model Optimizer quantization checkpoints
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              build_config=build_config)

    # Example prompts
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create sampling parameters
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
Online Service Deployment
# Start an OpenAI-compatible server
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000
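Because the server speaks the OpenAI completions protocol, it can be queried with any OpenAI-style client. The sketch below uses the plain requests library against the /v1/completions route on localhost:8000; the payload fields follow the OpenAI convention and are assumptions about this particular deployment rather than TensorRT-LLM-specific settings.

import requests

# Query the OpenAI-compatible completions endpoint started by trtllm-serve.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])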
Quantization Workflow
Basic Quantization Commands
# FP8 Quantization
python quantize.py --model_dir $MODEL_PATH --qformat fp8 --kv_cache_dtype fp8 --output_dir $OUTPUT_PATH
# INT4 AWQ Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int4_awq --awq_block_size 64 --tp_size 4 --output_dir $OUTPUT_PATH
# INT8 SmoothQuant Quantization
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH
# Auto Quantization (combination of multiple methods)
python quantize.py --model_dir $MODEL_PATH --autoq_format fp8,int4_awq,w4a8_awq --output_dir $OUTPUT_PATH --auto_quantize_bits 5 --tp_size 2
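As the comment in the LLM API example above notes, the model argument also accepts TensorRT Model Optimizer quantization checkpoints, so the output directory produced by these commands can often be loaded directly. The path below is a placeholder for the $OUTPUT_PATH of the FP8 command, and whether an additional engine-build step is required depends on the model and TensorRT-LLM version.

from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

# Point the LLM API at the quantized checkpoint produced above
# (placeholder path; substitute the $OUTPUT_PATH used in the FP8 command).
llm = LLM(model="/path/to/quantized_fp8_checkpoint")

for output in llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.8, top_p=0.95)):
    print(output.outputs[0].text)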
Supported Models
TensorRT-LLM supports a wide range of popular LLM architectures, including but not limited to:
- Llama Series: Llama 2, Llama 3, Llama 3.1, Llama 3.3
- Falcon Series: Including Falcon-180B
- GPT Series: GPT-style architectures such as GPT-2, GPT-J, and GPT-NeoX
- Gemma Series: Google's open-source models
- Mixtral Series: Mixture-of-Experts models
- DeepSeek Series: Including DeepSeek R1
- Code Llama: Models specialized for code generation
Ecosystem Integration
NVIDIA Ecosystem
- NVIDIA NeMo: An end-to-end framework for building, customizing, and deploying generative AI applications.
- Triton Inference Server: A production-grade inference server.
- NVIDIA Dynamo: A data center-scale distributed inference serving framework.
Third-Party Integrations
- HuggingFace Hub: Provides pre-quantized models.
- LlamaIndex: For RAG application development.
- SageMaker LMI: AWS managed inference.
Performance Benchmarks
Examples of performance improvements:
- Compared to CPU platforms: Inference speed increased by up to 36x.
- Compared to unoptimized RTX: LLM speed increased by up to 4x on Windows RTX platforms.
- Falcon-180B: Achieves inference using INT4 AWQ on a single H200 GPU.
- Llama-70B: Achieves 6.7x speed improvement compared to A100.
Best Practices and Recommendations
Quantization Method Selection
Choose the appropriate quantization method for the scenario (a simple selection helper is sketched after these guidelines):
Small-batch inference (batch size ≤ 4):
- Recommended to use weight-only quantization methods (e.g., INT4 AWQ).
- Performance at these batch sizes is dominated by weight memory bandwidth, so weight-only compression gives the largest gains.
Large-batch inference (batch size ≥ 16):
- Prioritize FP8 quantization, which typically offers the best performance and accuracy.
- If results are not satisfactory, try INT8 SmoothQuant, then AWQ and/or GPTQ.
Domain-specific applications:
- For highly specialized applications like code completion, it is recommended to use domain-specific datasets for calibration.
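These rules of thumb can be condensed into a small decision helper. The function below simply encodes the guidance above; the returned strings match the --qformat values from the quantization commands earlier, and the batch-size thresholds are the heuristics stated here, not hard limits.

def pick_quant_format(batch_size: int, fp8_supported: bool = True) -> str:
    """Encode the rule-of-thumb quantization choices described above."""
    if batch_size <= 4:
        # Small batches are weight-memory-bandwidth bound: use weight-only quantization.
        return "int4_awq"
    if batch_size >= 16 and fp8_supported:
        # Large batches: FP8 usually gives the best speed/accuracy trade-off.
        return "fp8"
    # Middle ground or no FP8 hardware: try INT8 SmoothQuant first.
    return "int8_sq"


print(pick_quant_format(batch_size=1))    # -> int4_awq
print(pick_quant_format(batch_size=32))   # -> fp8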
Technical Advantages
- Ease of Use: Provides high-level Python APIs, simplifying the LLM definition and optimization process.
- Performance: Includes all mainstream optimization techniques, such as kernel fusion, quantization, runtime optimization, etc.
- Scalability: Supports various deployment scenarios, from single-GPU to multi-node.
- Compatibility: Deeply integrated with PyTorch, supporting major inference ecosystems.
- Open Source: Fully open-source, with community-driven continuous development.
Future Development
TensorRT-LLM improves ease of use and extensibility through its open-source, modular model definition API, which is used to define, optimize, and execute new architectures and features, making the library easy to customize as LLMs evolve.
The project's continuous development directions include:
- More model architecture support
- More advanced quantization techniques
- Better multi-node scalability
- Tighter ecosystem integration