Stability AI Generative Models Project Details
Project Overview
Stability AI's generative model library is an open-source project that provides a variety of advanced AI generative models, including image generation, video generation, and multi-view synthesis. The project adopts a modular design and supports the training and inference of various diffusion models.
Core Features
1. Modular Architecture
- Configuration-Driven Approach: Submodules are constructed and combined by calling the instantiate_from_config() function (sketched below).
- Cleaned Diffusion Model Class: Refactored from LatentDiffusion to DiffusionEngine.
- Unified Condition Handling: The GeneralConditioner class handles all types of conditional inputs.
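The configuration-driven pattern is simple enough to sketch. Below is a hedged, self-contained re-implementation of the idea for illustration; the repository ships its own equivalent helper, so treat this as an assumption about its behavior rather than its actual code:

# Self-contained sketch of the config-driven instantiation pattern.
import importlib

def instantiate_from_config(config: dict):
    # "target" is a dotted import path; "params" are constructor kwargs.
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))

# Hypothetical config; any importable class works the same way.
cfg = {"target": "collections.Counter", "params": {}}
counter = instantiate_from_config(cfg)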
2. Improved Model Architecture
- Denoiser Framework: Supports continuous-time and discrete-time models.
- Independent Samplers: Separates the guider from the sampler.
- Cleaned Autoencoding Model: Optimized encoder architecture.
Supported Models
SDXL (Stable Diffusion XL) Series
- SDXL-base-1.0: Base model, supports 1024x1024 resolution image generation.
- SDXL-refiner-1.0: Refinement model, used for image post-processing.
- SDXL-Turbo: Fast generation model.
SVD (Stable Video Diffusion) Series
- SVD: Image-to-video model; generates 14-frame videos at 576x1024 resolution.
- SVD-XT: Extended version; supports 25-frame generation.
SV3D (Stable Video 3D) Series
- SV3D_u: Orbit video generation based on a single image.
- SV3D_p: Supports 3D video generation with specified camera paths.
SV4D (Stable Video 4D) Series
- SV4D: Video-to-4D diffusion model for novel view video synthesis.
- Generates 40 frames (5 video frames × 8 camera views) at 576x576 resolution.
Technical Architecture
Denoiser Framework
- Continuous-Time Models: Supports more flexible time sampling.
- Discrete-Time Models: Traditional diffusion models, handled as a special case of the continuous-time framework.
- Configurable Components:
- Loss function weights (denoiser_weighting.py)
- Network preconditioning (denoiser_scaling.py; sketched below)
- Noise level sampling (sigma_sampling.py)
Installation and Usage
Environment Requirements
- Python 3.10+
- PyTorch 2.0+
- CUDA-capable GPU
Installation Steps
git clone https://github.com/Stability-AI/generative-models.git
cd generative-models
# Create virtual environment
python3 -m venv .pt2
source .pt2/bin/activate
# Install dependencies
pip3 install -r requirements/pt2.txt
pip3 install .
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
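After installation, a quick sanity check (assuming the package imports as sgm, the module name used throughout the repository):

python -c "import sgm; print(sgm.__file__)"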
Quick Start
Text-to-Image Generation (SDXL)
# Download model weights to the checkpoints/ folder
# Run Streamlit demo
streamlit run scripts/demo/sampling.py --server.port <your_port>
Image-to-Video Generation (SVD)
# Download SVD model
# Run simple video sampling
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png>
Multi-View Synthesis (SV3D)
# SV3D_u (Orbit Video)
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_u
# SV3D_p (Specified Camera Path)
python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_p --elevations_deg 10.0
4D Video Synthesis (SV4D)
python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --output_folder outputs/sv4d
Training Configuration
Supported Training Types
- Pixel-Level Diffusion Models: Trained directly in pixel space.
- Latent Diffusion Models: Trained in latent space; requires a pre-trained VAE.
- Conditional Generative Models: Supports various conditions such as text and categories.
Training Example
# MNIST Conditional Generation Training
python main.py --base configs/example_training/toy/mnist_cond.yaml
# Text-to-Image Training
python main.py --base configs/example_training/txt2img-clipl.yaml
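Both example configs follow the target/params convention described under Modular Architecture. As a hedged sketch, a training config expands to a nested structure roughly like the dict below; the inner target paths are left as placeholders rather than the repository's real module paths:

# Assumed shape of a training config, expressed as the Python dict it parses to.
model_config = {
    "target": "sgm.models.diffusion.DiffusionEngine",
    "params": {
        "network_config": {"target": "...", "params": {}},      # denoising network
        "denoiser_config": {"target": "...", "params": {}},     # scaling/weighting
        "first_stage_config": {"target": "...", "params": {}},  # VAE, for latent models
        "conditioner_config": {"target": "...", "params": {}},  # conditional inputs
    },
}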
Data Processing
Data Pipeline
- Data pipeline supporting large-scale training.
- WebDataset format tar files.
- Map-style dataset support.
Data Format
example = {
    "jpg": x,  # image data (e.g., decoded image or raw bytes)
    "txt": "a beautiful image"  # caption string
}
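For illustration, such samples can be read with the third-party webdataset library; note that the project ships its own sdata datapipelines package, so this sketch is an alternative, not that package's API:

# Hedged sketch using the webdataset library; the shard pattern is hypothetical.
import webdataset as wds

dataset = (
    wds.WebDataset("data/shard-{000000..000009}.tar")
    .decode("pil")            # decode "jpg" entries into PIL images
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)
image, caption = next(iter(dataset))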
Model License
- SDXL-1.0: CreativeML Open RAIL++-M License
- SDXL-0.9: Research License
- SVD Series: Research Use License
Watermark Detection
The project uses the invisible-watermark library to embed invisible watermarks in generated images:
# Install watermark detection environment
python -m venv .detect
source .detect/bin/activate
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark
# Detect watermark
python scripts/demo/detect.py <filename>
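For reference, embedding (rather than detecting) a watermark with the invisible-watermark package looks roughly like this; the payload and the 'dwtDct' method name follow the library's documented usage, and the file names are illustrative:

# Hedged embedding sketch with the invisible-watermark package (imwatermark).
import cv2
from imwatermark import WatermarkEncoder

encoder = WatermarkEncoder()
encoder.set_watermark('bytes', b'StabilityAI')  # illustrative payload
bgr = cv2.imread('sample.png')                  # OpenCV loads images as BGR
bgr_wm = encoder.encode(bgr, 'dwtDct')          # DWT+DCT embedding method
cv2.imwrite('sample_wm.png', bgr_wm)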
Technical Features
1. High-Quality Generation
- SDXL supports 1024x1024 high-resolution image generation.
- SVD supports high-quality video generation.
- SV3D/SV4D supports multi-view and 4D video synthesis.
2. Flexible Conditional Control
- Supports various conditional inputs such as text, images, and vectors.
- Classifier-free guidance (sketched after this list).
- Configurable conditional dropout rate.
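A hedged sketch of classifier-free guidance as it is typically applied at sampling time; the function and argument names are illustrative, not the repository's:

# Blend conditional and unconditional outputs; scale > 1 pushes samples
# toward the condition.
import torch

def guided_prediction(model, x, sigma, cond, uncond, scale: float = 7.5):
    pred_cond = model(x, sigma, cond)      # conditioned forward pass
    pred_uncond = model(x, sigma, uncond)  # unconditioned forward pass
    return pred_uncond + scale * (pred_cond - pred_uncond)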
3. Advanced Sampling Techniques
- Multiple numerical solvers (a minimal Euler solver is sketched after this list).
- Configurable sampling steps and discretization methods.
- Supports guider wrappers.
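To make the solver idea concrete, here is a minimal Euler-method sampler over a decreasing noise schedule; a hedged sketch assuming an EDM-style denoiser D(x, sigma), not the repository's sampler classes:

# Minimal Euler sampler over decreasing sigmas (EDM parameterization).
import torch

@torch.no_grad()
def euler_sample(denoise, x, sigmas):
    # sigmas: 1-D tensor of decreasing noise levels, ending near 0.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma  # estimated dx/dsigma
        x = x + d * (sigma_next - sigma)     # Euler step toward sigma_next
    return x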
4. Research-Friendly
- Detailed technical reports and papers.
- Open-source code and model weights.
- Active community support.
Application Scenarios
- Art Creation: Text-to-art image generation.
- Content Production: Image-to-video content generation.
- 3D Modeling: Multi-view image generation.
- Research and Development: Diffusion model algorithm research.
- Education and Training: AI generation technology learning.
Project Advantages
- Modular Design: Easy to extend and customize.
- High Performance: Optimized training and inference code.
- Multi-Modal Support: Multiple generation tasks such as images, videos, and 3D.
- Continuous Updates: Regularly releases new models and features.
- Active Community: Rich documentation and example code.
This project brings together state-of-the-art generative models, providing researchers and developers with powerful tools to explore and apply generative AI technologies.
