A practical video diffusion model that achieves constant memory usage through frame context compression, allowing for the generation of high-quality videos up to 60 seconds long with only 6GB of VRAM.

Apache-2.0 · Python · lllyasviel/FramePack · 16.2k stars · Last Updated: October 16, 2025

FramePack - A Practical Video Diffusion Model

Project Overview

FramePack is a breakthrough next-frame prediction neural network architecture specifically designed for practical video generation. Developed by research teams at Stanford University and MIT, this project aims to make video diffusion models as lightweight and easy to use as image diffusion models.


Core Features

1. Constant VRAM Footprint (O(1) Memory Complexity)

FramePack's greatest innovation lies in compressing the input frame context to a constant length, making the generation workload independent of video length. This means:

  • Only 6GB VRAM is needed to generate 60 seconds (1800 frames, 30fps) of video.
  • Generating a 1-second video consumes the same VRAM as generating a 1-minute video.
  • Supports running 13B parameter models on laptop GPUs (e.g., RTX 3060/3070Ti).
  • Training batch size can reach 64 (on a single 8×A100/H100 node), comparable to image diffusion training.

2. Frame Context Compression Technology

FramePack tokenizes each historical frame using a variable patch size, allocating different context lengths based on frame importance:

  • Temporal Proximity Weight: Frames closer to the current frame receive longer context.
  • Feature Similarity Weight: Frames relevant to the current content retain more detail.
  • Hybrid Metric: Combines the above two strategies to optimize compression effectiveness.

Example: In HunyuanVideo, a 480p frame typically produces 1536 tokens using a (1, 2, 2) patch kernel.
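
To see why the context stays constant, here is a small illustrative calculation (a sketch only, not FramePack's actual schedule): the most recent history frame keeps the full 1536 tokens at kernel (1, 2, 2), and each older frame is assumed to get a patch kernel twice as coarse in each spatial dimension, so its token count drops by 4×. The total then converges to roughly 1536 × 4/3 ≈ 2048 tokens no matter how long the history is.

# Illustrative sketch only -- the real FramePack schedule differs, but the
# geometric idea is the same: older frames get coarser patchify kernels.
BASE_TOKENS = 1536  # tokens of the newest history frame at kernel (1, 2, 2)

def tokens_for_frame(age: int) -> int:
    # Hypothetical schedule: each step back in time doubles the spatial patch
    # kernel in both dimensions, dividing that frame's token count by 4.
    return BASE_TOKENS // (4 ** age)

for num_history_frames in (1, 4, 16, 64, 256):
    total = sum(tokens_for_frame(age) for age in range(num_history_frames))
    print(f"{num_history_frames:4d} history frames -> {total} context tokens")
# The total approaches ~2047 tokens and then stops growing: the O(1) context
# behaviour that keeps VRAM constant regardless of video length.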

3. Anti-Drifting Technology

FramePack addresses the error accumulation problem in autoregressive video generation by proposing multiple anti-drifting methods:

FramePack-F1 (Forward Generation Version)

  • Forward-only (unidirectional) next-frame prediction.
  • Suitable for real-time streaming scenarios.
  • Prevents error accumulation through new anti-drifting regularization.

FramePack-P1 (Planned Generation Version)

Includes two core designs:

a) Planned Anti-Drifting

  • Generates distant keyframe endpoints first.
  • Then fills in the intermediate segments.
  • Ensures frames do not drift between planned endpoints.

b) History Discretization

  • Converts all history frames into discretized tokens (a K-Means codebook fitted over the entire dataset; see the sketch below).
  • Reduces historical representation differences between training and inference.
  • Prevents the endpoints themselves from drifting.
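
A minimal sketch of the history-discretization step described above, assuming per-token latent vectors and an off-the-shelf K-Means codebook; the token dimensionality, codebook size, and random stand-in data are placeholders, not FramePack's actual values.

# Hedged sketch: snap each history token to the nearest centroid of a K-Means
# codebook fitted over the whole dataset, so the history seen at inference time
# matches the discretized history seen during training.
import numpy as np
from sklearn.cluster import KMeans

TOKEN_DIM = 16       # placeholder latent-token dimensionality
CODEBOOK_SIZE = 256  # placeholder number of centroids

# Offline: fit the codebook on tokens sampled from the training set
# (random stand-in data here).
dataset_tokens = np.random.randn(20_000, TOKEN_DIM).astype(np.float32)
codebook = KMeans(n_clusters=CODEBOOK_SIZE, n_init=10, random_state=0).fit(dataset_tokens)

# Online: discretize the history tokens before they are fed back in as context.
def discretize_history(history_tokens: np.ndarray) -> np.ndarray:
    ids = codebook.predict(history_tokens)    # nearest-centroid index per token
    return codebook.cluster_centers_[ids]     # replace each token by its centroid

history = np.random.randn(1536, TOKEN_DIM).astype(np.float32)
quantized_history = discretize_history(history)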

4. Bidirectional Sampling Strategy

  • Supports reverse generation from end frames to start frames.
  • Combines bidirectional context from start and end frame anchors.
  • Breaks the causal prediction chain, effectively reducing observation bias.
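
Taken together, planned anti-drifting and bidirectional sampling amount to a particular generation schedule. The sketch below is a hypothetical outline, not FramePack's API: generate_keyframe and generate_section are placeholder callables.

# Hypothetical scheduling sketch: plan the distant keyframe endpoints first,
# then fill each section with BOTH of its endpoints in context, so the frames
# in between cannot drift away from the planned anchors.
def plan_and_fill(total_seconds, fps, section_seconds,
                  generate_keyframe, generate_section):
    n_sections = int(total_seconds / section_seconds)
    frames_per_section = int(section_seconds * fps)

    # 1) Planned anti-drifting: generate the endpoints for every section first.
    keyframes = [generate_keyframe(t=i * section_seconds) for i in range(n_sections + 1)]

    # 2) Bidirectional fill: each section is conditioned on its start AND end anchor.
    video = []
    for i in range(n_sections):
        video.extend(generate_section(start=keyframes[i],
                                      end=keyframes[i + 1],
                                      num_frames=frames_per_section))
    return video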

Performance Metrics

Generation Speed

  • RTX 4090 Desktop:
    • Unoptimized: 2.5 seconds/frame
    • With teacache: 1.5 seconds/frame
  • Laptop GPU (3070Ti/3060): Approximately 4-8 times slower than RTX 4090.
  • Supports real-time visual feedback (next-frame prediction feature).
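
Plugging the RTX 4090 figures above into the 60-second example gives a rough wall-clock estimate (simple arithmetic, not a benchmark):

# Rough wall-clock estimate for a 60 s, 30 fps video at the per-frame speeds
# quoted above (illustrative arithmetic only).
frames = 60 * 30  # 1800 frames

for label, sec_per_frame in [("RTX 4090, unoptimized", 2.5),
                             ("RTX 4090, with teacache", 1.5)]:
    minutes = frames * sec_per_frame / 60
    print(f"{label}: ~{minutes:.0f} minutes")  # ~75 and ~45 minutes respectively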

VRAM Requirements

  • Minimum: 6GB VRAM
  • Recommended: RTX 30XX/40XX/50XX series (supports fp16 and bf16)
  • Operating System: Windows or Linux

Training Efficiency

  • Achieves batch size 64 on a single 8×A100-80G node.
  • 480p resolution, 13B HunyuanVideo model, LoRA training.
  • Batch size 64 for window sizes 2 or 3; batch size 32 for window sizes 4 or 5.
  • Suitable for individual or lab-scale training.

Usage

Windows Installation (One-Click Package)

  1. Download the one-click installer:
    https://github.com/lllyasviel/FramePack/releases/download/windows/framepack_cu126_torch26.7z
    
  2. Unzip the file.
  3. Run the update script:
    update.bat
    
  4. Launch the program:
    run.bat
    
    Note: The first run requires downloading over 30GB of model files from HuggingFace.

Linux Installation

Requires Python 3.10 environment:

# Install PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install dependencies
pip install -r requirements.txt

# Launch GUI
python demo_gradio.py

Supported command-line arguments:

  • --share: Enable public link sharing.
  • --port: Specify port number.
  • --server: Specify server address.
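
For example, to bind the GUI to a specific address and port and expose a public share link (the flag names come from the list above; the address and port values are just examples):

python demo_gradio.py --server 0.0.0.0 --port 7860 --share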

Optional Acceleration Components

The project supports various attention mechanism optimizations:

  • PyTorch attention (default)
  • xformers
  • flash-attn
  • sage-attention

Example installation of sage-attention (Linux):

pip install sageattention==1.0.6

User Interface

Basic Workflow

  1. Left Panel: Upload initial image and write prompts.
  2. Right Panel: View generated video and latent space preview.
  3. Progress Display: Real-time progress bar for each segment and latent preview for the next segment.

Video Generation Mechanism

Because FramePack predicts the next frame section from what has already been generated, videos are produced segment by segment:

  • Initially, you might only see a short 1-second video.
  • Continue waiting, and more segments will be generated sequentially.
  • Eventually, the full-length video will be completed.
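
Conceptually, the generation loop looks like the hedged sketch below (placeholder function names, not the actual demo code): each iteration predicts the next segment from the prompt plus the history generated so far, appends it, and refreshes the preview, which is why the visible video grows roughly one second at a time.

# Hypothetical outline of the segment-by-segment loop (placeholder names only).
def generate_video(prompt, first_frame, num_segments,
                   predict_next_segment, write_preview):
    history = [first_frame]
    for _ in range(num_segments):
        segment = predict_next_segment(prompt=prompt, history=history)
        history.extend(segment)          # the video gets longer each iteration
        write_preview(history)           # what the UI shows while you wait
    return history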

Recommended Workflow

Rapid Prototyping:

  • Enable teacache acceleration.
  • Quickly test ideas and prompts.

Final Output:

  • Disable teacache.
  • Use the full diffusion process for high-quality results.

Note: Optimization methods like teacache, sage-attention, bnb quantization, and gguf may affect result quality. It is recommended to use them only during rapid iteration.


Prompt Writing Tips

Recommended Format

Concise, action-oriented prompts work best:

Subject + Action Description + Other Details

Examples:

  • "The girl dances gracefully, with clear movements, full of charm."
  • "The man dances powerfully, with clear movements, full of energy."
  • "The woman spins elegantly among cherry blossoms, with flowing sleeves."

ChatGPT Prompt Generation Template

You can use the following template to have ChatGPT assist in generating prompts:

You are an assistant that writes short, motion-focused prompts for animating images.

When the user sends an image, respond with a single, concise prompt describing visual motion
(such as human activity, moving objects, or camera movements). Focus only on how the scene
could come alive and become dynamic using brief phrases.

Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over
smaller or more subtle ones (like standing still, sitting, etc.).

Describe subject, then motion, then other things.
For example: "The girl dances gracefully, with clear movements, full of charm."

If there is something that can dance (like a man, girl, robot, etc.), then prefer to
describe it as dancing.

Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions,
or generate multiple options.

Version History

July 14, 2025

  • Uploaded pure text-to-video anti-drifting stress test results for FramePack-P1.
  • Used common prompts, no reference images needed.

June 26, 2025

  • Released FramePack-P1 result showcase.
  • Introduced planned anti-drifting and history discretization designs.

May 3, 2025

  • Released FramePack-F1 forward generation version.
  • Provided unidirectional prediction with greater dynamic range and fewer constraints.

Technical Architecture

Base Models

FramePack can be combined with existing video diffusion models:

  • HunyuanVideo: Primary base model used for testing (an improved variant; see "Model Improvements" below).
  • Wan 2.1: Official Wan model support.

Model Improvements (HunyuanVideo Version)

  1. Added the SigLip-Vision model (google/siglip-so400m-patch14-384) as a visual encoder (a loading example follows this list).
  2. Removed dependency on Tencent's internal MLLM.
  3. Froze Llama 3.1 as a pure text model.
  4. Continued training on high-quality data.
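
As a hedged illustration of point 1, the SigLip vision tower can be loaded from Hugging Face with the transformers library. This is a generic loading snippet, not FramePack's actual integration code, and "reference.png" is a placeholder input image.

# Generic example of loading the SigLip vision encoder named above;
# FramePack's own wiring may differ.
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

model_id = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(model_id)
vision_encoder = SiglipVisionModel.from_pretrained(model_id)

image = Image.open("reference.png").convert("RGB")       # placeholder start image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embeds = vision_encoder(**inputs).last_hidden_state  # visual tokens for conditioning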

Architecture Compatibility

  • Supports Text-to-Video and Image-to-Video.
  • Naturally supports both modes without architectural modifications.
  • Can be fine-tuned on existing pre-trained video diffusion models.

Application Scenarios

1. Image-to-Video

Transforms static images into dynamic videos, supporting detailed motion descriptions.

2. Long Video Generation

  • Generates coherent videos up to 60 seconds long.
  • Supports processing thousands of frames.
  • Maintains spatio-temporal consistency.

3. Prompt Travelling

Especially suitable for the F1 version, supporting gradual prompt changes during video generation.

4. Real-time Streaming

The F1 version supports streaming generation, suitable for real-time applications.


Community Resources

ComfyUI Integration

Online Usage

  • RunningHub platform offers free online usage.
  • Includes pre-configured workflows.

Important Notes

Official Website Statement

The ONLY official website: https://github.com/lllyasviel/FramePack

The following domains are fake and spam websites. Please do not visit or make payments:

  • framepack.co, frame_pack.co
  • framepack.net, frame_pack.net
  • framepack.ai, frame_pack.ai
  • framepack.pro, frame_pack.pro
  • framepack.cc, frame_pack.cc
  • framepackai.co and all other variations

Hardware Sensitivity

Next-frame segment prediction models are highly sensitive to subtle differences in noise and hardware:

  • Different devices may produce slightly different results.
  • The overall visual effect should remain similar.
  • In some cases, identical results can be obtained.

Performance Optimization Suggestions

If generation speed is significantly slower than reference speeds:

  1. Check that CUDA and PyTorch are installed correctly (a quick check follows this list).
  2. Confirm that GPU drivers are up to date.
  3. Close unnecessary background programs.
  4. Refer to the troubleshooting guide in Issue #151.
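
A quick way to verify the first item is the standard PyTorch check (nothing FramePack-specific):

# Confirm that PyTorch was built with CUDA and can see the GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("PyTorch CUDA build:", torch.version.cuda)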

Citation Information

If you use FramePack in your research, please cite the following papers:

@inproceedings{zhang2025framepack,
  title={Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models},
  author={Lvmin Zhang and Shengqu Cai and Muyang Li and Gordon Wetzstein and Maneesh Agrawala},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
}

@article{zhang2025framepackv1,
  title={Packing Input Frame Contexts in Next-Frame Prediction Models for Video Generation},
  author={Lvmin Zhang and Maneesh Agrawala},
  journal={arXiv},
  year={2025}
}

Project Significance

FramePack, through its innovative frame context compression and anti-drifting technologies, successfully reduces the memory cost of video diffusion to a constant level, making long video generation possible on consumer-grade hardware. This breakthrough enables:

  • Individual creators to generate high-quality long videos on laptops.
  • Researchers to train video models on lab-scale equipment.
  • Developers to more easily integrate video generation capabilities into applications.

FramePack truly makes video generation practical, just as Stable Diffusion made image generation accessible.
