The world's first infinite-length film generation model, using the Diffusion Forcing architecture to achieve cinematic-quality video generation.
SkyReels-V2: Infinite-Length Film Generation Model
Project Overview
SkyReels-V2, developed by SkyworkAI, is the world's first infinite-length film generation model. It uses an autoregressive Diffusion Forcing architecture and achieves state-of-the-art (SOTA) performance among publicly available models. The project marks a significant advance in video generation technology, producing high-quality, cinematic video content of theoretically unbounded length.
Core Technical Features
1. Diffusion Forcing Architecture
Diffusion Forcing is a training and sampling strategy that assigns independent noise levels to each token. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this method is equivalent to a form of partial masking: tokens with zero noise are fully unmasked, while fully noisy tokens are completely masked.
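The per-token noise idea above can be sketched in a few lines of Python. This is an illustrative toy, not the model's actual training code; the function names and the 1000-step schedule are assumptions made for the example:

```python
import random

def sample_noise_levels(num_tokens, num_steps=1000, rng=None):
    """Assign an independent noise level t_i in [0, num_steps] to each token.

    t_i = 0 means the token is clean, t_i = num_steps means pure noise."""
    rng = rng or random.Random(0)
    return [rng.randint(0, num_steps) for _ in range(num_tokens)]

def masking_view(noise_levels, num_steps=1000):
    """Interpret per-token noise levels as partial masks:
    0.0 = fully unmasked (clean token), 1.0 = fully masked (pure noise)."""
    return [t / num_steps for t in noise_levels]

levels = sample_noise_levels(8)
masks = masking_view(levels)
# A token at level 0 is fully visible; one at level 1000 is completely hidden,
# which is the partial-masking equivalence described above.
```

Because each token carries its own noise level, a sampler is free to denoise tokens on different schedules, e.g. keeping early frames nearly clean while later frames remain noisy.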
2. Multimodal Technology Integration
SkyReels-V2 integrates Multimodal Large Language Models (MLLMs), multi-stage pre-training, reinforcement learning, and Diffusion Forcing to optimize the full generation pipeline.
3. Video Caption Generator (SkyCaptioner-V1)
SkyCaptioner-V1 is fine-tuned from the Qwen2.5-VL-7B-Instruct base model for domain-specific video captioning, achieving the highest average accuracy across the evaluated captioning domains.
Model Variants
The project offers multiple model variants to meet different needs:
Diffusion Forcing Model Series
- SkyReels-V2-DF-1.3B-540P: Lightweight version, recommended resolution 544×960, 97 frames
- SkyReels-V2-DF-14B-540P: Standard version, suitable for 540P video generation
- SkyReels-V2-DF-14B-720P: High-resolution version, supports 720P video generation
Text-to-Video (T2V) Models
- SkyReels-V2-T2V-14B-540P: Specifically for text-to-video generation
- SkyReels-V2-T2V-14B-720P: High-resolution text-to-video model
Image-to-Video (I2V) Models
- SkyReels-V2-I2V-1.3B-540P: Lightweight image-to-video model
- SkyReels-V2-I2V-14B-540P: Standard image-to-video model
- SkyReels-V2-I2V-14B-720P: High-resolution image-to-video model
Technical Innovations
1. Reinforcement Learning Optimization
To avoid degradation of other metrics such as text alignment and video quality, the team ensured that preference data pairs were comparable in terms of text alignment and video quality, with only motion quality differing. Utilizing this enhanced dataset, a specialized reward model was first trained to capture general motion quality differences between paired samples.
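The source does not spell out the reward model's training objective, but a standard choice for learning from preference pairs like these is a Bradley-Terry style pairwise loss, sketched here as an illustration (the function name and scalar-reward framing are assumptions):

```python
import math

def pairwise_reward_loss(r_preferred, r_rejected):
    """Bradley-Terry pairwise loss: push the reward model to score the
    sample with better motion quality above its pair partner.

    loss = -log(sigmoid(r_preferred - r_rejected))
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the model already ranks the pair correctly the loss is small;
# when it ranks the pair the wrong way the loss grows.
loss_correct = pairwise_reward_loss(2.0, -1.0)
loss_wrong = pairwise_reward_loss(-1.0, 2.0)
```

Because the pairs are matched on text alignment and video quality, gradients from this loss target motion quality alone, which is the point of the curation step described above.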
2. Multi-Stage Training Pipeline
The project employs a four-stage training enhancement pipeline:
- Initial Concept-Balanced Supervised Fine-Tuning (SFT): To improve baseline quality
- Motion-Specific Reinforcement Learning (RL) Training: To address dynamic artifact issues
- Diffusion Forcing Framework: To enable long video synthesis
- Final High-Quality SFT: To refine visual fidelity
3. Resolution Progressive Training
Two consecutive high-quality Supervised Fine-Tuning (SFT) stages were implemented for 540p and 720p resolutions, with the initial SFT stage occurring immediately after pre-training but before the Reinforcement Learning stage.
Performance Metrics
Human Evaluation Results
In SkyReels-Bench evaluations:
- Text-to-Video Models: Excelled in instruction following (3.15) and maintained a competitive edge in consistency (3.35).
- Image-to-Video Models: SkyReels-V2-I2V achieved an average score of 3.29, comparable to proprietary models like Kling-1.6 (3.4) and Runway-Gen4 (3.39).
Automated Evaluation Results
In V-Bench evaluations: SkyReels-V2 surpassed all comparison models, including HunyuanVideo-13B and Wan2.1-14B, achieving the highest overall score (83.9%) and quality score (84.7%).
Application Scenarios
1. Story Generation
Capable of generating narrative video content of theoretically infinite length.
2. Image-to-Video Synthesis
Transforms static images into dynamic video sequences.
3. Camera Director Functionality
Provides professional camera movement and composition control.
4. Multi-Entity Consistent Video Generation
Enables multi-element composite video generation through the SkyReels-A2 system.
System Requirements
Hardware Requirements
- 1.3B model: Requires approximately 14.7GB peak VRAM for 540P video generation.
- 14B model: Requires approximately 51.2GB peak VRAM (Diffusion Forcing) or 43.4GB (T2V/I2V) for 540P video generation.
Software Environment
- Python 3.10.12
- Supports single-GPU and multi-GPU inference
- Integrated xDiT USP for accelerated inference
Installation and Usage
Basic Installation
# Clone the repository
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2
# Install dependencies
pip install -r requirements.txt
Text-to-Video Generation Example
model_id=Skywork/SkyReels-V2-T2V-14B-540P
python3 generate_video.py \
--model_id ${model_id} \
--resolution 540P \
--num_frames 97 \
--guidance_scale 6.0 \
--shift 8.0 \
--fps 24 \
--prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
Infinite-Length Video Generation Example
model_id=Skywork/SkyReels-V2-DF-14B-540P
# Synchronous inference to generate a 10-second video
python3 generate_video_df.py \
--model_id ${model_id} \
--resolution 540P \
--ar_step 0 \
--base_num_frames 97 \
--num_frames 257 \
--overlap_history 17 \
--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
--addnoise_condition 20 \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
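One way to read the flag values above (an interpretation of how the numbers line up, not documented behavior of generate_video_df.py): each extension chunk re-conditions on `overlap_history` frames, so it contributes `base_num_frames - overlap_history` new frames. Under that reading, `--num_frames 257` corresponds to one base chunk plus two extensions:

```python
def total_frames(base_num_frames, overlap_history, num_extensions):
    """Each extension chunk re-conditions on `overlap_history` frames,
    so it contributes (base_num_frames - overlap_history) new frames."""
    return base_num_frames + num_extensions * (base_num_frames - overlap_history)

frames = total_frames(97, 17, 2)  # 97 + 2 * (97 - 17) = 257
seconds = frames / 24             # duration at 24 fps
```

That yields 257 frames, roughly 10.7 seconds at 24 fps, consistent with the "10-second video" comment in the example.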
Advanced Features
1. Video Extension
Supports extending existing videos to create longer content.
2. Start/End Frame Control
Allows specifying the start and end frames of a video for precise control.
3. Prompt Enhancer
A prompt enhancement feature based on Qwen2.5-32B-Instruct, capable of expanding short prompts into more detailed descriptions.
4. Multi-GPU Acceleration
Supports multi-GPU parallel inference via xDiT USP, significantly boosting generation speed.
Related Projects
- SkyReels-A2: A controllable video generation framework capable of assembling arbitrary visual elements.
- SkyReels-V1: The first open-source human-centric video foundation model.
- SkyCaptioner-V1: A specialized video caption generation model.
Open Source Information
- GitHub Repository: https://github.com/SkyworkAI/SkyReels-V2
- Hugging Face Models: https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9
- Technical Paper: https://arxiv.org/pdf/2504.13074
- Online Demo: https://www.skyreels.ai/home
Summary
SkyReels-V2 represents a major step forward in AI video generation, particularly in long-form video synthesis. Beyond its technical innovations, it opens up new possibilities for creative applications such as drama production and virtual e-commerce, pushing the boundaries of controllable video generation.