The world's first infinite-length film generation model, using the Diffusion Forcing architecture to achieve cinematic-quality video generation.
SkyReels-V2: Infinite-Length Film Generation Model
Project Overview
SkyReels-V2, developed by SkyworkAI, is the world's first infinite-length film generation model. It uses an autoregressive Diffusion Forcing architecture and achieves state-of-the-art (SOTA) performance among publicly available models. The project marks a significant advance in video generation technology, producing high-quality, cinematic video content of theoretically unbounded length.
Core Technical Features
1. Diffusion Forcing Architecture
Diffusion Forcing is a training and sampling strategy that assigns independent noise levels to each token. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this method is equivalent to a form of partial masking: tokens with zero noise are fully unmasked, while fully noisy tokens are completely masked.
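The per-token noise idea above can be sketched in a few lines of Python. This is an illustrative toy, not the model's actual training code; the function names and the 1000-step schedule are assumptions made for the example:

```python
import random

def sample_noise_levels(num_tokens, num_steps=1000, rng=None):
    """Assign an independent noise level t_i in [0, num_steps] to each token.

    t_i = 0 means the token is clean, t_i = num_steps means pure noise."""
    rng = rng or random.Random(0)
    return [rng.randint(0, num_steps) for _ in range(num_tokens)]

def masking_view(noise_levels, num_steps=1000):
    """Interpret per-token noise levels as partial masks:
    0.0 = fully unmasked (clean token), 1.0 = fully masked (pure noise)."""
    return [t / num_steps for t in noise_levels]

levels = sample_noise_levels(8)
masks = masking_view(levels)
# A token at level 0 is fully visible; one at level 1000 is completely hidden,
# which is the partial-masking equivalence described above.
```

Because each token carries its own noise level, a sampler is free to denoise tokens on different schedules, e.g. keeping early frames nearly clean while later frames remain noisy.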
2. Multimodal Technology Integration
SkyReels-V2 integrates Multimodal Large Language Models (MLLMs), multi-stage pre-training, reinforcement learning, and Diffusion Forcing to optimize the full generation pipeline.
3. Video Caption Generator (SkyCaptioner-V1)
SkyCaptioner-V1 is fine-tuned from the Qwen2.5-VL-7B-Instruct base model for domain-specific video captioning, achieving the highest average accuracy across the evaluated captioning domains.
Model Variants
The project offers multiple model variants to meet different needs:
Diffusion Forcing Model Series
- SkyReels-V2-DF-1.3B-540P: Lightweight version, recommended resolution 544×960, 97 frames
- SkyReels-V2-DF-14B-540P: Standard version, suitable for 540P video generation
- SkyReels-V2-DF-14B-720P: High-resolution version, supports 720P video generation
Text-to-Video (T2V) Models
- SkyReels-V2-T2V-14B-540P: Specifically for text-to-video generation
- SkyReels-V2-T2V-14B-720P: High-resolution text-to-video model
Image-to-Video (I2V) Models
- SkyReels-V2-I2V-1.3B-540P: Lightweight image-to-video model
- SkyReels-V2-I2V-14B-540P: Standard image-to-video model
- SkyReels-V2-I2V-14B-720P: High-resolution image-to-video model
Technical Innovations
1. Reinforcement Learning Optimization
To avoid degradation of other metrics such as text alignment and video quality, the team ensured that preference data pairs were comparable in terms of text alignment and video quality, with only motion quality differing. Utilizing this enhanced dataset, a specialized reward model was first trained to capture general motion quality differences between paired samples.
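The source does not spell out the reward model's training objective, but a standard choice for learning from preference pairs like these is a Bradley-Terry style pairwise loss, sketched here as an illustration (the function name and scalar-reward framing are assumptions):

```python
import math

def pairwise_reward_loss(r_preferred, r_rejected):
    """Bradley-Terry pairwise loss: push the reward model to score the
    sample with better motion quality above its pair partner.

    loss = -log(sigmoid(r_preferred - r_rejected))
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the model already ranks the pair correctly the loss is small;
# when it ranks the pair the wrong way the loss grows.
loss_correct = pairwise_reward_loss(2.0, -1.0)
loss_wrong = pairwise_reward_loss(-1.0, 2.0)
```

Because the pairs are matched on text alignment and video quality, gradients from this loss target motion quality alone, which is the point of the curation step described above.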
2. Multi-Stage Training Pipeline
The project employs a four-stage training enhancement pipeline:
- Initial Concept-Balanced Supervised Fine-Tuning (SFT): To improve baseline quality
- Motion-Specific Reinforcement Learning (RL) Training: To address dynamic artifact issues
- Diffusion Forcing Framework: To enable long video synthesis
- Final High-Quality SFT: To refine visual fidelity
3. Resolution Progressive Training
Two consecutive high-quality Supervised Fine-Tuning (SFT) stages were implemented for 540p and 720p resolutions, with the initial SFT stage occurring immediately after pre-training but before the Reinforcement Learning stage.
Performance Metrics
Human Evaluation Results
In SkyReels-Bench evaluations:
- Text-to-Video Models: Excelled in instruction following (3.15) and maintained a competitive edge in consistency (3.35).
- Image-to-Video Models: SkyReels-V2-I2V achieved an average score of 3.29, comparable to proprietary models like Kling-1.6 (3.4) and Runway-Gen4 (3.39).
Automated Evaluation Results
In V-Bench evaluations: SkyReels-V2 surpassed all comparison models, including HunyuanVideo-13B and Wan2.1-14B, achieving the highest overall score (83.9%) and quality score (84.7%).
Application Scenarios
1. Story Generation
Capable of generating narrative video content of theoretically infinite length.
2. Image-to-Video Synthesis
Transforms static images into dynamic video sequences.
3. Camera Director Functionality
Provides professional camera movement and composition control.
4. Multi-Entity Consistent Video Generation
Enables multi-element composite video generation through the SkyReels-A2 system.
System Requirements
Hardware Requirements
- 1.3B model: Requires approximately 14.7GB peak VRAM for 540P video generation.
- 14B model: Requires approximately 51.2GB peak VRAM (Diffusion Forcing) or 43.4GB (T2V/I2V) for 540P video generation.
Software Environment
- Python 3.10.12
- Supports single-GPU and multi-GPU inference
- Integrated xDiT USP for accelerated inference
Installation and Usage
Basic Installation
# Clone the repository
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2
# Install dependencies
pip install -r requirements.txt
Text-to-Video Generation Example
model_id=Skywork/SkyReels-V2-T2V-14B-540P
python3 generate_video.py \
--model_id ${model_id} \
--resolution 540P \
--num_frames 97 \
--guidance_scale 6.0 \
--shift 8.0 \
--fps 24 \
--prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
Infinite-Length Video Generation Example
model_id=Skywork/SkyReels-V2-DF-14B-540P
# Synchronous inference to generate a 10-second video
python3 generate_video_df.py \
--model_id ${model_id} \
--resolution 540P \
--ar_step 0 \
--base_num_frames 97 \
--num_frames 257 \
--overlap_history 17 \
--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
--addnoise_condition 20 \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
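One way to read the flag values above (an interpretation of how the numbers line up, not documented behavior of generate_video_df.py): each extension chunk re-conditions on `overlap_history` frames, so it contributes `base_num_frames - overlap_history` new frames. Under that reading, `--num_frames 257` corresponds to one base chunk plus two extensions:

```python
def total_frames(base_num_frames, overlap_history, num_extensions):
    """Each extension chunk re-conditions on `overlap_history` frames,
    so it contributes (base_num_frames - overlap_history) new frames."""
    return base_num_frames + num_extensions * (base_num_frames - overlap_history)

frames = total_frames(97, 17, 2)  # 97 + 2 * (97 - 17) = 257
seconds = frames / 24             # duration at 24 fps
```

That yields 257 frames, roughly 10.7 seconds at 24 fps, consistent with the "10-second video" comment in the example.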
Advanced Features
1. Video Extension
Supports extending existing videos to create longer content.
2. Start/End Frame Control
Allows specifying the start and end frames of a video for precise control.
3. Prompt Enhancer
A prompt enhancement feature based on Qwen2.5-32B-Instruct, capable of expanding short prompts into more detailed descriptions.
4. Multi-GPU Acceleration
Supports multi-GPU parallel inference via xDiT USP, significantly boosting generation speed.
Related Projects
- SkyReels-A2: A controllable video generation framework capable of assembling arbitrary visual elements.
- SkyReels-V1: The first open-source human-centric video foundation model.
- SkyCaptioner-V1: A specialized video caption generation model.
Open Source Information
- GitHub Repository: https://github.com/SkyworkAI/SkyReels-V2
- Hugging Face Models: https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9
- Technical Paper: https://arxiv.org/pdf/2504.13074
- Online Demo: https://www.skyreels.ai/home
Summary
SkyReels-V2 represents a major step forward in AI video generation, particularly in long-form video synthesis. Beyond its technical innovations, it opens up new possibilities for creative applications such as drama production and virtual e-commerce, pushing the boundaries of controllable video generation.