ComfyUI wrapper for WanVideo, supporting Alibaba's Wan 2.1 series of AI video generation models.
A Detailed Introduction to the ComfyUI-WanVideoWrapper Project
Project Overview
ComfyUI-WanVideoWrapper is a wrapper plugin specifically developed for the ComfyUI platform, primarily designed to support WanVideo and related models. Developed and maintained by kijai, this project serves as an experimental "sandbox" environment for rapidly testing and implementing new AI video generation models and features.
Project Background
Due to the complexity of ComfyUI's core code and the developer's lack of extensive coding experience, it is often easier and faster to implement new models and features within a standalone wrapper than directly within the core system. This project was born from this very philosophy.
Design Philosophy
- Rapid Testing Platform: Serves as a quick validation environment for new features
- Personal Sandbox: An experimental platform open for everyone to use
- Avoid Compatibility Issues: Runs independently without affecting the stability of the main system
- Continuous Development: The code is always under development and may contain issues
Core Features
Supported WanVideo Model Series
This wrapper primarily supports Alibaba's open-source Wan 2.1 series, a family of advanced video generation models with leading performance:
Wan 2.1 Model Features:
- High Performance: Consistently outperforms existing open-source models and state-of-the-art commercial solutions in multiple benchmarks
- Bilingual Text Generation: The first video model able to render both Chinese and English text within generated video
- Multi-Resolution Support: Supports 480P and 720P video generation
- Physical Simulation: Generates videos that accurately simulate real-world physics and the interactions between objects
Model Specifications:
T2V-1.3B Model:
- Requires only 8.19 GB VRAM, compatible with almost all consumer-grade GPUs
- Can generate a 5-second 480P video in approximately 4 minutes on an RTX 4090
- Lightweight, suitable for general users
T2V-14B/I2V-14B Models:
- Achieves SOTA (state-of-the-art) performance among both open-source and closed-source models
- Supports complex visual scenes and motion patterns
- Suitable for professional-grade applications
Main Functional Modules
- Text-to-Video (T2V)
- Image-to-Video (I2V)
- Video Editing
- Text-to-Image
- Video-to-Audio
Technical Architecture
Core Technical Components
Wan 2.1 is built on the mainstream diffusion-transformer paradigm and achieves significant improvements in generation capability through a series of innovations:
- Wan-VAE: A novel 3D causal VAE architecture designed specifically for video generation; it improves spatio-temporal compression, reduces memory usage, and guarantees temporal causality (a minimal sketch follows this list)
- Scalable Training Strategy
- Large-scale Data Construction
- Automated Evaluation Metrics
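To make the temporal-causality point concrete, below is a minimal, illustrative PyTorch sketch of a causal 3D convolution, the basic building block such a VAE relies on. This is not Wan-VAE's actual code; the class name, kernel size, and padding scheme are assumptions for illustration only.

import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    # 3D convolution that is causal along the time axis: frame t can
    # only see frames <= t. Spatial dims use ordinary symmetric padding.
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1  # pad only on the past side of time
        self.conv = nn.Conv3d(
            in_ch, out_ch, kernel_size=kernel,
            padding=(0, kernel // 2, kernel // 2),  # no built-in time padding
        )

    def forward(self, x):  # x: (batch, channels, time, height, width)
        # Left-pad the time dimension so no future frames leak backward.
        x = nn.functional.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

video = torch.randn(1, 3, 17, 64, 64)  # 17 frames of 64x64 RGB
print(CausalConv3d(3, 8)(video).shape)  # torch.Size([1, 8, 17, 64, 64])

Padding only on the past side of the time axis is exactly what enforces causality: output frame t is computed without ever reading frames t+1 and beyond.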
Performance Characteristics
- Memory Efficiency: Wan-VAE can encode and decode 1080P videos of unbounded length without losing historical temporal information (a streaming sketch follows this list)
- GPU Compatibility: Supports running on consumer-grade GPUs
- Processing Capability: Supports long video generation and complex scene processing
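The unbounded-length claim above follows from processing frames in chunks while carrying encoder state across chunk boundaries, instead of holding the whole clip in memory. Below is a rough sketch of that streaming pattern under stated assumptions: encode_chunk is a hypothetical stand-in for a causal encoder call, not an actual Wan-VAE API.

def encode_streaming(frames, encode_chunk, chunk_size=16):
    # Encode an arbitrarily long frame sequence chunk by chunk.
    # encode_chunk(chunk, state) returns (latents, new_state); the state
    # carries trailing temporal features, so no history is lost at chunk
    # boundaries and memory use stays constant per chunk.
    state, latents = None, []
    for i in range(0, len(frames), chunk_size):
        z, state = encode_chunk(frames[i:i + chunk_size], state)
        latents.append(z)
    return latents

# Toy usage: "latents" are chunk sums, state remembers the last frame.
print(len(encode_streaming(list(range(100)), lambda c, s: (sum(c), c[-1]))))  # 7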
Installation and Usage
Installation Steps
Clone the repository into ComfyUI's custom_nodes directory:
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
Install Dependencies:
pip install -r requirements.txt
For the portable version, run from the portable root folder:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt
Model Download
Main model download addresses:
- Standard Models: https://huggingface.co/Kijai/WanVideo_comfy/tree/main
- FP8 Optimized Models (Recommended): https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled
Model File Structure
Place the downloaded model files in the corresponding ComfyUI directories (a download sketch follows the list):
- Text encoders → ComfyUI/models/text_encoders
- Clip vision → ComfyUI/models/clip_vision
- Transformer (main video model) → ComfyUI/models/diffusion_models
- VAE → ComfyUI/models/vae
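One convenient way to fetch a file straight into the right folder is the huggingface_hub client, which can download into a target directory. A minimal sketch; the filename below is a placeholder, so browse the repository pages above for the exact file you need.

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Kijai/WanVideo_comfy",
    filename="some_wanvideo_model.safetensors",  # placeholder, not a real file
    local_dir="ComfyUI/models/diffusion_models",  # match the folder list above
)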
Supported Extended Models
This wrapper also supports several related AI video generation models:
- SkyReels: A video generation model developed by Skywork
- WanVideoFun: An entertainment-oriented model developed by Alibaba PAI Team
- ReCamMaster: A camera-controlled video re-rendering model developed by Kuaishou's KwaiVGI team
- VACE: An all-in-one video creation and editing model from Alibaba Vision Lab
- Phantom: A multi-subject video generation model from ByteDance Research Institute
- ATI: A trajectory-instruction model for controllable video generation from ByteDance Research Institute
- Uni3C: A unified camera and human-motion control model for video generation from Alibaba DAMO Academy
- EchoShot: A multi-shot portrait video generation model
- MultiTalk: A multi-person dialogue video generation model
Application Cases and Examples
Long Video Generation Test
- 1025-Frame Test: Uses an 81-frame window with a 16-frame overlap (the window arithmetic is sketched after this list)
- 1.3B T2V Model: Used under 5 GB of VRAM on an RTX 5090, with a generation time of about 10 minutes
- Memory Optimization: Roughly 16 GB of memory used at 512x512x81, with 20 of 40 transformer blocks offloaded
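For reference, the window arithmetic behind the 1025-frame test is straightforward: each window advances by window minus overlap frames, and the final window is clamped to the end of the clip. A small sketch in plain Python (illustrative arithmetic, not the wrapper's actual scheduling code):

def sliding_windows(total_frames, window=81, overlap=16):
    # Yield (start, end) frame ranges; consecutive windows share
    # `overlap` frames so motion stays coherent across window seams.
    stride = window - overlap  # 65 new frames per window here
    start = 0
    while start + window < total_frames:
        yield (start, start + window)
        start += stride
    yield (max(0, total_frames - window), total_frames)  # clamped final window

print(list(sliding_windows(1025)))
# (0, 81), (65, 146), (130, 211), ..., (910, 991), (944, 1025)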
TeaCache Acceleration Optimization
- In the new version, threshold values should be set roughly 10x higher than before
- Recommended threshold values: 0.25-0.30
- Caching can start from step 0
- With more aggressive (higher) thresholds, start caching at a later step to avoid skipping early denoising steps (see the sketch below)
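Conceptually, TeaCache skips a sampling step when the model's modulated input has barely changed since the previous step, reusing the cached residual instead; the threshold bounds the accumulated relative change before a full recompute is forced. A simplified sketch of that decision logic, where the names and the polynomial rescaling are illustrative assumptions rather than the wrapper's exact implementation:

import torch

def should_skip(prev_x, cur_x, acc, thresh, coeffs):
    # Decide whether to reuse the cached residual for this step.
    # Returns (skip, new_accumulated_distance).
    rel = ((cur_x - prev_x).abs().mean() / prev_x.abs().mean()).item()
    # Rescale the raw distance with fitted polynomial coefficients
    # (highest power first), as TeaCache-style caching does.
    rescaled = sum(c * rel ** (len(coeffs) - 1 - i) for i, c in enumerate(coeffs))
    acc += abs(rescaled)
    if acc < thresh:
        return True, acc    # change is small: skip and keep accumulating
    return False, 0.0       # change is large: recompute and reset

# Toy usage: a nearly unchanged input stays under the threshold and is skipped.
a = torch.ones(8)
print(should_skip(a, a + 0.001, 0.0, 0.26, [1.0, 0.0]))  # (True, ~0.001)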
Technical Advantages
- Open-Source Ecosystem: Fully open-source, including source code and all models
- Leading Performance: Consistently outperforms existing open-source models and state-of-the-art commercial solutions in multiple internal and external benchmarks
- Comprehensive Coverage: Covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personalized video generation, encompassing up to 8 tasks
- Consumer-Friendly: The 1.3B model demonstrates excellent resource efficiency, requiring only 8.19GB VRAM and compatible with a wide range of consumer-grade GPUs
Project Status and Development
Future Development
- Not intended to compete with, or serve as a replacement for, native ComfyUI workflows
- The ultimate goal is to help explore newly released models and features
- Some features may eventually be integrated into the ComfyUI core system
Usage Recommendations
Applicable Scenarios
- AI video generation research and experimentation
- Rapid testing and validation of new models
- Creative video content production
- Educational and learning purposes
Important Notes
- The code is under continuous development and may have stability issues
- Recommended for testing and use in an isolated environment
- Requires a certain level of technical background and GPU resources
Conclusion
ComfyUI-WanVideoWrapper is an innovative AI video generation tool wrapper that provides users with convenient access to the latest video generation technologies. Based on Alibaba's open-source Wan 2.1 series models, this project not only maintains technological leadership but also embodies the collaborative spirit of the open-source community. Although the project is still under continuous development, its powerful features and extensive model support make it an important tool in the field of AI video generation.