An efficient collection of Triton kernels from LinkedIn, purpose-built for large language model training, that improves multi-GPU training throughput by 20% and cuts memory usage by 60%.
Liger-Kernel Project Details
Project Overview
Liger-Kernel is a collection of Triton kernels developed by LinkedIn, designed specifically for training large language models (LLMs). It improves multi-GPU training throughput by 20% and reduces memory usage by 60%. The name "Liger" stands for "LinkedIn GPU Efficient Runtime," reflecting the project's focus on efficient GPU execution.
Core Features
Performance Advantages
- Training Speed Improvement: Through kernel fusion, in-place replacement, and chunking techniques, multi-GPU training throughput is increased by 20%.
- Memory Efficiency: Memory usage is reduced by 60%, supporting longer context lengths, larger batch sizes, and massive vocabularies.
- Post-Training Optimization: Post-training kernels can save up to 80% of memory for alignment and distillation tasks.
Technical Implementation
- Precise Calculation: No approximate calculations, with rigorous unit tests for both forward and backward propagation.
- Lightweight Dependencies: Only requires Torch and Triton, with no additional library dependencies.
- Strong Compatibility: Ready to use out of the box, compatible with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed.
Supported Models and Operations
Supported Model Architectures
The project supports various mainstream large language model architectures, including:
- LLaMA Series: LLaMA 2, LLaMA 3, LLaMA 3.2-Vision
- Mistral Series: Mistral, Mixtral
- Gemma Series: Gemma1, Gemma2, Gemma3
- Qwen Series: Qwen2, Qwen2.5, Qwen2-VL, Qwen3, etc.
- Other Models: Phi3, Granite, OLMo2, GLM-4, etc.
Core Kernel Operations
The project implements various optimized kernel operations:
Basic Kernels
- LigerRMSNorm: RMS Normalization
- LigerLayerNorm: Layer Normalization
- liger_rotary_pos_emb: Rotary Position Embedding (RoPE)
- LigerSwiGLUMLP: SwiGLU Activation Function
- LigerGEGLUMLP: GeGLU Activation Function
- LigerCrossEntropyLoss: Cross-Entropy Loss
- LigerFusedLinearCrossEntropyLoss: Fused Linear Cross-Entropy Loss
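The module-style kernels are drop-in replacements for their standard PyTorch counterparts. A minimal sketch (assuming LigerRMSNorm takes the hidden size as its first argument, as RMSNorm modules typically do):
import torch
from liger_kernel.transformers import LigerRMSNorm

norm = LigerRMSNorm(4096).cuda()                   # stands in for a model's RMSNorm layer
y = norm(torch.randn(2, 16, 4096, device="cuda"))  # (batch, seq_len, hidden)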
Post-Training Kernels
Supports various alignment and preference optimization loss functions:
- LigerFusedLinearDPOLoss: DPO Loss
- LigerFusedLinearORPOLoss: ORPO Loss
- LigerFusedLinearCPOLoss: CPO Loss
- LigerFusedLinearSimPOLoss: SimPO Loss
- LigerFusedLinearKTOLoss: KTO Loss
Usage Methods
1. Automatic Integration
from liger_kernel.transformers import AutoLigerKernelForCausalLM
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
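AutoLigerKernelForCausalLM works as a drop-in replacement for Hugging Face's AutoModelForCausalLM, automatically applying the matching Liger kernels when the loaded architecture is supported.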
2. Manual Patching
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Option 1: apply the default set of Liger kernels
apply_liger_kernel_to_llama()

# Option 2: selectively enable or disable individual kernels
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=True,
    fused_linear_cross_entropy=False,
    rms_norm=False
)

# Patch before instantiating the model so the patched modules take effect
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")
3. Low-Level API
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss
import torch.nn as nn
import torch

# A stand-in LM head projecting 128 hidden dimensions to a 256-token vocabulary
model = nn.Linear(128, 256).cuda()
loss_fn = LigerFusedLinearCrossEntropyLoss()
input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4,), device="cuda")
loss = loss_fn(model.weight, input, target)
loss.backward()
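Note that the loss takes the projection weight and the hidden states rather than precomputed logits; fusing the projection into the loss is what lets the kernel avoid materializing the full (batch, vocab) logits tensor.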
4. Post-Training Loss Example
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

# lm_head, x, and target are assumed to come from the surrounding training
# loop; for ORPO, x stacks chosen and rejected examples along the batch dimension
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
Installation Methods
Stable Version Installation
pip install liger-kernel
Development Version Installation
pip install liger-kernel-nightly
Installation from Source
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel
pip install -e .
Development Environment Installation
pip install -e ".[dev]"
System Requirements
NVIDIA GPU Environment
torch >= 2.1.2
triton >= 2.3.0
AMD GPU Environment
torch >= 2.5.0
triton >= 3.0.0
Other Dependencies
transformers >= 4.x: required only when using the transformers model patching APIs
Performance Benchmarks
Benchmark Conditions:
- Model: LLaMA 3-8B
- Batch Size: 8
- Data Type: bf16
- Optimizer: AdamW
- Gradient Checkpointing: Enabled
- Distributed Strategy: FSDP1, 8 A100 GPUs
Test results show:
- Hugging Face models start to experience out-of-memory errors at a 4K context length, while Hugging Face + Liger Kernel can scale to 16K.
- Training throughput increased by over 20%.
- Memory usage reduced by 60%.
Framework Integration
Liger-Kernel has been integrated into several mainstream training frameworks:
- Axolotl
- LLaMA-Factory
- SFTTrainer
- Hugging Face Trainer
- SWIFT
- oumi
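With Hugging Face Trainer, for instance, enabling Liger comes down to a single flag in TrainingArguments (available in recent transformers releases, with liger-kernel installed):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    use_liger_kernel=True,  # patches supported model architectures with Liger kernels
)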
Technical Principles
Kernel Fusion Technology
By fusing multiple operations into a single kernel, the number of GPU memory accesses is reduced, improving computational efficiency.
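A toy Triton sketch of the idea (illustrative only, not an actual Liger kernel): the elementwise multiply and add below execute in a single kernel, so the intermediate product never round-trips through global memory:
import torch
import triton
import triton.language as tl

@triton.jit
def fused_mul_add_kernel(x_ptr, w_ptr, b_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    w = tl.load(w_ptr + offs, mask=mask)
    b = tl.load(b_ptr + offs, mask=mask)
    # x * w + b in one pass; an unfused version would write x * w to
    # global memory and read it back for the add
    tl.store(out_ptr + offs, x * w + b, mask=mask)

x, w, b = (torch.randn(4096, device="cuda") for _ in range(3))
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_mul_add_kernel[grid](x, w, b, out, x.numel(), BLOCK=1024)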
Chunked Computation
For memory-intensive operations, a chunked processing technique is used to break down large calculations into smaller chunks, reducing peak memory usage.
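A minimal sketch of the idea behind the fused linear losses (illustrative names, not Liger's internals): computing the projection and the loss one slice of rows at a time means only one chunk of logits is ever resident in memory:
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(weight, hidden, target, chunk_size=1024):
    # hidden: (N, H), weight: (V, H), target: (N,)
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        rows = slice(start, start + chunk_size)
        logits = hidden[rows] @ weight.t()  # only (chunk_size, V) logits in memory
        total = total + F.cross_entropy(logits, target[rows], reduction="sum")
    return total / hidden.shape[0]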
In-Place Operations
Whenever possible, in-place operations are used to avoid additional memory allocation, further optimizing memory efficiency.
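For instance (generic PyTorch, not Liger code), mutating an existing buffer sidesteps the allocation a functional version would make:
import torch

x = torch.randn(8192, 8192, device="cuda")  # 256 MB in fp32
x.mul_(0.5).add_(1.0)  # in-place: reuses x's storage
# the functional form y = x * 0.5 + 1.0 would allocate a second 256 MB tensor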
Summary
Liger-Kernel represents a significant advancement in large language model training optimization. With carefully designed Triton kernels, memory optimization techniques, and broad model support, it provides researchers and engineers with a powerful and easy-to-use tool that can significantly improve training efficiency and reduce computational costs. The project's open-source nature and active community support make it an important resource in the LLM training field.