Stability AI Generative Models Project Details
Project Overview
Stability AI's generative model library is an open-source project that provides a variety of advanced AI generative models, including image generation, video generation, and multi-view synthesis. The project adopts a modular design and supports the training and inference of various diffusion models.
Core Features
1. Modular Architecture
- Configuration-Driven Approach: Submodules are constructed and combined by calling the instantiate_from_config() function (sketched below).
- Cleaned Diffusion Model Class: Refactored from LatentDiffusion to DiffusionEngine.
- Unified Condition Handling: The GeneralConditioner class handles all types of conditional inputs.
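The configuration-driven pattern is simple enough to sketch. Below is a hedged, self-contained re-implementation of the idea for illustration; the repository ships its own equivalent helper, so treat this as an assumption about its behavior rather than its actual code:

# Self-contained sketch of the config-driven instantiation pattern.
import importlib

def instantiate_from_config(config: dict):
    # "target" is a dotted import path; "params" are constructor kwargs.
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))

# Hypothetical config; any importable class works the same way.
cfg = {"target": "collections.Counter", "params": {}}
counter = instantiate_from_config(cfg)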
2. Improved Model Architecture
- Denoiser Framework: Supports continuous-time and discrete-time models.
- Independent Samplers: Separates the guider from the sampler.
- Cleaned Autoencoding Model: Optimized encoder architecture.
Supported Models
SDXL (Stable Diffusion XL) Series
- SDXL-base-1.0: Base model, supports 1024x1024 resolution image generation.
- SDXL-refiner-1.0: Refinement model, used for image post-processing.
- SDXL-Turbo: Fast generation model.
SVD (Stable Video Diffusion) Series
- SVD: Image-to-video model; generates 14-frame videos at 576x1024 resolution.
- SVD-XT: Extended version; supports 25-frame generation.
SV3D (Stable Video 3D) Series
- SV3D_u: Orbit video generation based on a single image.
- SV3D_p: Supports 3D video generation with specified camera paths.
SV4D (Stable Video 4D) Series
- SV4D: Video-to-4D diffusion model for novel view video synthesis.
- Generates 40 frames (5 video frames × 8 camera views) at 576x576 resolution.
Technical Architecture
Denoiser Framework
- Continuous-Time Models: Supports more flexible time sampling.
- Discrete-Time Models: Traditional diffusion models, handled as a special case of the continuous-time framework.
- Configurable Components:
- Loss function weights (denoiser_weighting.py)
- Network preconditioning (denoiser_scaling.py; sketched below)
- Noise level sampling (sigma_sampling.py)
Installation and Usage
Environment Requirements
- Python 3.10+
- PyTorch 2.0+
- CUDA-capable GPU
Installation Steps
git clone https://github.com/Stability-AI/generative-models.git
cd generative-models
# Create virtual environment
python3 -m venv .pt2
source .pt2/bin/activate
# Install dependencies
pip3 install -r requirements/pt2.txt
pip3 install .
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
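After installation, a quick sanity check (assuming the package imports as sgm, the module name used throughout the repository):

python -c "import sgm; print(sgm.__file__)"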
Quick Start
Text-to-Image Generation (SDXL)
# Download model weights to the checkpoints/ folder
# Run Streamlit demo
streamlit run scripts/demo/sampling.py --server.port <your_port>
Image-to-Video Generation (SVD)
# Download SVD model
# Run simple video sampling
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png>
Multi-View Synthesis (SV3D)
# SV3D_u (Orbit Video)
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_u
# SV3D_p (Specified Camera Path)
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_p --elevations_deg 10.0
4D Video Synthesis (SV4D)
python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --output_folder outputs/sv4d
Training Configuration
Supported Training Types
- Pixel-Level Diffusion Models: Trained directly in pixel space.
- Latent Diffusion Models: Trained in latent space; requires a pre-trained VAE.
- Conditional Generative Models: Supports various conditions such as text and categories.
Training Example
# MNIST Conditional Generation Training
python main.py --base configs/example_training/toy/mnist_cond.yaml
# Text-to-Image Training
python main.py --base configs/example_training/txt2img-clipl.yaml
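Both example configs follow the target/params convention described under Modular Architecture. As a hedged sketch, a training config expands to a nested structure roughly like the dict below; the inner target paths are left as placeholders rather than the repository's real module paths:

# Assumed shape of a training config, expressed as the Python dict it parses to.
model_config = {
    "target": "sgm.models.diffusion.DiffusionEngine",
    "params": {
        "network_config": {"target": "...", "params": {}},      # denoising network
        "denoiser_config": {"target": "...", "params": {}},     # scaling/weighting
        "first_stage_config": {"target": "...", "params": {}},  # VAE, for latent models
        "conditioner_config": {"target": "...", "params": {}},  # conditional inputs
    },
}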
Data Processing
Data Pipeline
- Data pipeline supporting large-scale training.
- WebDataset format tar files.
- Map-style dataset support.
Data Format
example = {
    "jpg": x,  # image data (e.g., decoded image or raw bytes)
    "txt": "a beautiful image"  # caption string
}
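For illustration, such samples can be read with the third-party webdataset library; note that the project ships its own sdata datapipelines package, so this sketch is an alternative, not that package's API:

# Hedged sketch using the webdataset library; the shard pattern is hypothetical.
import webdataset as wds

dataset = (
    wds.WebDataset("data/shard-{000000..000009}.tar")
    .decode("pil")            # decode "jpg" entries into PIL images
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)
image, caption = next(iter(dataset))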
Model License
- SDXL-1.0: CreativeML Open RAIL++-M License
- SDXL-0.9: Research License
- SVD Series: Research Use License
Watermark Detection
The project uses the invisible-watermark library to embed invisible watermarks in generated images:
# Install watermark detection environment
python -m venv .detect
source .detect/bin/activate
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark
# Detect watermark
python scripts/demo/detect.py <filename>
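For reference, embedding (rather than detecting) a watermark with the invisible-watermark package looks roughly like this; the payload and the 'dwtDct' method name follow the library's documented usage, and the file names are illustrative:

# Hedged embedding sketch with the invisible-watermark package (imwatermark).
import cv2
from imwatermark import WatermarkEncoder

encoder = WatermarkEncoder()
encoder.set_watermark('bytes', b'StabilityAI')  # illustrative payload
bgr = cv2.imread('sample.png')                  # OpenCV loads images as BGR
bgr_wm = encoder.encode(bgr, 'dwtDct')          # DWT+DCT embedding method
cv2.imwrite('sample_wm.png', bgr_wm)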
Technical Features
1. High-Quality Generation
- SDXL supports 1024x1024 high-resolution image generation.
- SVD supports high-quality video generation.
- SV3D/SV4D supports multi-view and 4D video synthesis.
2. Flexible Conditional Control
- Supports various conditional inputs such as text, images, and vectors.
- Classifier-free guidance (sketched after this list).
- Configurable conditional dropout rate.
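A hedged sketch of classifier-free guidance as it is typically applied at sampling time; the function and argument names are illustrative, not the repository's:

# Blend conditional and unconditional outputs; scale > 1 pushes samples
# toward the condition.
import torch

def guided_prediction(model, x, sigma, cond, uncond, scale: float = 7.5):
    pred_cond = model(x, sigma, cond)      # conditioned forward pass
    pred_uncond = model(x, sigma, uncond)  # unconditioned forward pass
    return pred_uncond + scale * (pred_cond - pred_uncond)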
3. Advanced Sampling Techniques
- Multiple numerical solvers (a minimal Euler solver is sketched after this list).
- Configurable sampling steps and discretization methods.
- Supports guider wrappers.
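To make the solver idea concrete, here is a minimal Euler-method sampler over a decreasing noise schedule; a hedged sketch assuming an EDM-style denoiser D(x, sigma), not the repository's sampler classes:

# Minimal Euler sampler over decreasing sigmas (EDM parameterization).
import torch

@torch.no_grad()
def euler_sample(denoise, x, sigmas):
    # sigmas: 1-D tensor of decreasing noise levels, ending near 0.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma  # estimated dx/dsigma
        x = x + d * (sigma_next - sigma)     # Euler step toward sigma_next
    return x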
4. Research-Friendly
- Detailed technical reports and papers.
- Open-source code and model weights.
- Active community support.
Application Scenarios
- Art Creation: Text-to-art image generation.
- Content Production: Image-to-video content generation.
- 3D Modeling: Multi-view image generation.
- Research and Development: Diffusion model algorithm research.
- Education and Training: AI generation technology learning.
Project Advantages
- Modular Design: Easy to extend and customize.
- High Performance: Optimized training and inference code.
- Multi-Modal Support: Multiple generation tasks such as images, videos, and 3D.
- Continuous Updates: Regularly releases new models and features.
- Active Community: Rich documentation and example code.
This project brings together state-of-the-art generative models, providing researchers and developers with powerful tools to explore and apply generative AI technologies.
