An efficient collection of Triton kernels developed by LinkedIn, specifically optimized for large language model training, capable of improving training speed by 20% and reducing memory usage by 60%.

BSD-2-Clause · Python · 5.2k · linkedin · Last Updated: 2025-06-20

Liger-Kernel Project Details

Project Overview

Liger-Kernel is a collection of Triton kernels developed by LinkedIn specifically for training large language models (LLMs). It increases multi-GPU training throughput by 20% and reduces memory usage by 60%. The name "Liger" stands for "LinkedIn GPU Efficient Runtime," reflecting the project's focus on an efficient GPU runtime.

Core Features

Performance Advantages

  • Training Speed Improvement: Through kernel fusion, in-place replacement, and chunking techniques, multi-GPU training throughput is increased by 20%.
  • Memory Efficiency: Memory usage is reduced by 60%, supporting longer context lengths, larger batch sizes, and massive vocabularies.
  • Post-Training Optimization: Post-training kernels can save up to 80% of memory for alignment and distillation tasks.

Technical Implementation

  • Exact Computation: Kernels compute exact results with no approximations, and both the forward and backward passes are covered by rigorous unit tests.
  • Lightweight Dependencies: Only requires PyTorch and Triton, with no additional library dependencies.
  • Strong Compatibility: Ready to use out of the box, compatible with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed.

Supported Models and Operations

Supported Model Architectures

The project supports various mainstream large language model architectures, including:

  • LLaMA Series: LLaMA 2, LLaMA 3, LLaMA 3.2-Vision
  • Mistral Series: Mistral, Mixtral
  • Gemma Series: Gemma1, Gemma2, Gemma3
  • Qwen Series: Qwen2, Qwen2.5, Qwen2-VL, Qwen3, etc.
  • Other Models: Phi3, Granite, OLMo2, GLM-4, etc.

Core Kernel Operations

The project implements various optimized kernel operations:

Basic Kernels

  • LigerRMSNorm: RMS Normalization
  • LigerLayerNorm: Layer Normalization
  • liger_rotary_pos_emb: Rotary Position Embedding (RoPE)
  • LigerSwiGLUMLP: MLP with SwiGLU activation
  • LigerGEGLUMLP: MLP with GeGLU activation
  • LigerCrossEntropyLoss: Cross-Entropy Loss
  • LigerFusedLinearCrossEntropyLoss: Fused Linear Cross-Entropy Loss
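
These kernels can also be used as drop-in modules outside the patching APIs. Below is a minimal sketch using LigerRMSNorm; the constructor arguments shown (hidden_size, eps) are assumptions based on a typical RMSNorm interface, so check the signature in your installed version:

import torch
from liger_kernel.transformers import LigerRMSNorm

# Triton-backed RMSNorm as a drop-in nn.Module.
# hidden_size and eps are illustrative values, not the definitive signature.
norm = LigerRMSNorm(hidden_size=4096, eps=1e-6).cuda()

hidden_states = torch.randn(2, 128, 4096, device="cuda")
normed = norm(hidden_states)  # same shape as the input: (2, 128, 4096)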

Post-Training Kernels

The post-training kernels support various alignment and preference-optimization loss functions (see the usage example under "4. Post-Training Loss Example" below):

  • LigerFusedLinearDPOLoss: DPO Loss
  • LigerFusedLinearORPOLoss: ORPO Loss
  • LigerFusedLinearCPOLoss: CPO Loss
  • LigerFusedLinearSimPOLoss: SimPO Loss
  • LigerFusedLinearKTOLoss: KTO Loss

Usage Methods

1. Automatic Integration

from liger_kernel.transformers import AutoLigerKernelForCausalLM

# The AutoModel wrapper automatically monkey-patches the model with the
# optimized Liger kernels if the architecture is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")

2. Manual Patching

import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Option A: apply all available Liger kernels for LLaMA with default settings.
apply_liger_kernel_to_llama()

# Option B: selectively enable or disable individual kernels.
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=True,
    fused_linear_cross_entropy=False,
    rms_norm=False
)

# Instantiate the model after patching so the patched classes are used.
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")

3. Low-Level API

from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss
import torch.nn as nn
import torch

# A linear projection standing in for a model's lm_head (hidden size 128, vocabulary 256).
model = nn.Linear(128, 256).cuda()
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4, ), device="cuda")

# The loss takes the projection weight and the hidden states directly,
# so the full logits matrix never has to be materialized at once.
loss = loss_fn(model.weight, input, target)
loss.backward()

4. Post-Training Loss Example

from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

# lm_head (the model's output projection), x (hidden states for the chosen and
# rejected responses) and target (token labels) are assumed to be defined elsewhere.
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)

Installation Methods

Stable Version Installation

pip install liger-kernel

Development Version Installation

pip install liger-kernel-nightly

Installation from Source

git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel
pip install -e .

Development Environment Installation

pip install -e ".[dev]"

System Requirements

NVIDIA GPU Environment

  • torch >= 2.1.2
  • triton >= 2.3.0

AMD GPU Environment

  • torch >= 2.5.0
  • triton >= 3.0.0

Other Dependencies

  • transformers >= 4.x: required only when using the transformers model patching APIs (e.g., AutoLigerKernelForCausalLM or apply_liger_kernel_to_llama)

Performance Benchmarks

Benchmark Conditions:

  • Model: LLaMA 3-8B
  • Batch Size: 8
  • Data Type: bf16
  • Optimizer: AdamW
  • Gradient Checkpointing: Enabled
  • Distributed Strategy: FSDP1, 8 A100 GPUs

Test results show:

  • Hugging Face models start to experience out-of-memory errors at a 4K context length, while Hugging Face + Liger Kernel can scale to 16K.
  • Training throughput increased by over 20%.
  • Memory usage reduced by 60%.

Framework Integration

Liger-Kernel has been integrated into several mainstream training frameworks:

  • Axolotl
  • LLaMA-Factory
  • SFTTrainer
  • Hugging Face Trainer (see the sketch after this list)
  • SWIFT
  • oumi
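
For the Hugging Face Trainer path, a minimal sketch follows; it assumes a transformers version that exposes the use_liger_kernel flag in TrainingArguments (roughly 4.45 or later) and a hypothetical prepared dataset train_ds:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=8,
    bf16=True,
    gradient_checkpointing=True,
    use_liger_kernel=True,  # asks the Trainer to apply the Liger patches
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)  # train_ds: hypothetical dataset
trainer.train()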

Technical Principles

Kernel Fusion Technology

By fusing multiple operations into a single kernel, intermediate results stay in registers or shared memory instead of round-tripping through GPU global memory, reducing memory traffic and improving computational efficiency.
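
As a generic illustration of the idea (not one of Liger's actual kernels), the Triton sketch below fuses an element-wise add and a ReLU into one kernel, so the intermediate sum is never written back to global memory:

import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One read of x and y, one write of the result; the intermediate (x + y)
    # stays on-chip instead of being stored and reloaded by a second kernel.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_relu_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)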

Chunked Computation

For memory-intensive operations, a chunked processing technique is used to break down large calculations into smaller chunks, reducing peak memory usage.
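
A plain-PyTorch sketch of the idea is shown below (illustrative only; the chunk size and shapes are arbitrary, and Liger's chunked-loss kernels additionally compute gradients chunk by chunk so per-chunk logits are not retained for the backward pass):

import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, target, chunk_size=1024):
    # Never materialize the full (num_tokens, vocab_size) logits matrix;
    # only one (chunk_size, vocab_size) block exists at a time.
    total = hidden.new_zeros(())
    for h, t in zip(hidden.split(chunk_size), target.split(chunk_size)):
        logits = h @ weight.T                           # (chunk, vocab)
        total = total + F.cross_entropy(logits, t, reduction="sum")
    return total / target.numel()

hidden = torch.randn(8192, 4096, device="cuda")    # token hidden states
weight = torch.randn(32000, 4096, device="cuda")   # lm_head weight (vocab x hidden)
target = torch.randint(32000, (8192,), device="cuda")
loss = chunked_linear_cross_entropy(hidden, weight, target)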

In-Place Operations

Whenever possible, in-place operations are used to avoid additional memory allocation, further optimizing memory efficiency.
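
A small generic PyTorch illustration of the difference (not Liger-specific; in-place updates must be used carefully when autograd still needs the original values):

import torch

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

z = x + y    # out-of-place: allocates a new 64 MiB tensor for the result
x.add_(y)    # in-place: writes into x's existing storage, no new allocation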

Summary

Liger-Kernel represents a significant advancement in large language model training optimization. With carefully designed Triton kernels, memory optimization techniques, and broad model support, it provides researchers and engineers with a powerful and easy-to-use tool that can significantly improve training efficiency and reduce computational costs. The project's open-source nature and active community support make it an important resource in the LLM training field.