Stable Diffusion Project Detailed Introduction
Project Overview
Stable Diffusion is an open-source text-to-image generation model developed by Stability AI and built on Latent Diffusion Model (LDM) technology. The project performs high-resolution image synthesis and can generate high-quality images from plain text descriptions.
Project Address: https://github.com/Stability-AI/stablediffusion
Core Technical Features
1. Latent Diffusion Model Architecture
- Uses latent space for the diffusion process, which is more efficient than operating directly in pixel space.
- Employs a U-Net architecture as the denoising network.
- Integrates self-attention and cross-attention mechanisms.
2. Text Encoder
- Uses OpenCLIP ViT-H/14 as the text encoder.
- Supports complex text-conditioned control of generation.
- Able to understand detailed text descriptions and convert them into visual content.
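For illustration, the same OpenCLIP ViT-H/14 encoder can be loaded on its own with the open_clip package (an extra dependency, not part of the install steps below); this is only a sketch, and note that Stable Diffusion itself conditions on per-token hidden states internally, while the snippet prints the pooled embedding:

import torch
import open_clip

# Load OpenCLIP ViT-H/14 (LAION-2B weights, the commonly used pretrained tag) and its tokenizer
model, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a professional photograph of an astronaut riding a horse"])
with torch.no_grad():
    text_features = model.encode_text(tokens)  # pooled text embedding
print(text_features.shape)  # torch.Size([1, 1024])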
3. Multi-Resolution Support
- Stable Diffusion 2.1-v: 768x768 pixel output
- Stable Diffusion 2.1-base: 512x512 pixel output
- Supports training and inference at different resolutions.
Major Version History
Version 2.1 (December 7, 2022)
- Introduced a v-prediction model at 768x768 resolution and a base model at 512x512 resolution.
- Uses the same number of parameters and the same architecture as version 2.0.
- Fine-tuned from the 2.0 checkpoints with a less restrictive NSFW filter applied to the training data.
Version 2.0 (November 24, 2022)
- Completely new model with 768x768 resolution.
- Uses OpenCLIP-ViT/H as the text encoder.
- Trained from scratch, with the 768x768 model using the v-prediction objective.
Stable UnCLIP 2.1 (March 24, 2023)
- Supports image variation and mixing operations, conditioned on CLIP image embeddings of the input.
- Fine-tuned based on SD2.1-768.
- Provides two variants: Stable unCLIP-L and Stable unCLIP-H, conditioned on CLIP ViT-L and ViT-H image embeddings respectively.
Core Functionality
1. Text-to-Image Generation
Generates images from plain text descriptions:
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
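As an alternative sketch outside the repository's own scripts, the released checkpoints can also be driven through the diffusers library from the environment setup below (the Hugging Face model ID stabilityai/stable-diffusion-2-1 is assumed here):

import torch
from diffusers import StableDiffusionPipeline

# Load the 768x768 v-model in half precision and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a professional photograph of an astronaut riding a horse",
    height=768, width=768,
).images[0]
image.save("astronaut.png")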
2. Image Inpainting
Supports localized repair and editing of images by filling in masked regions:
python scripts/gradio/inpainting.py configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>
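A comparable diffusers sketch (assuming the stabilityai/stable-diffusion-2-inpainting checkpoint and placeholder file names) shows the expected inputs; white pixels in the mask mark the regions to repaint:

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB").resize((512, 512))  # placeholder input image
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))   # placeholder mask, white = repaint
result = pipe(prompt="a vase of flowers on a wooden table",
              image=init_image, mask_image=mask_image).images[0]
result.save("inpainted.png")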
3. Depth-Conditional Image Generation
Generates images conditioned on depth information (estimated with MiDaS) so that the structure of the input image is preserved:
python scripts/gradio/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>
4. Image Super-Resolution
Upscales images by a factor of 4:
python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>
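A rough diffusers equivalent (assuming the stabilityai/stable-diffusion-x4-upscaler checkpoint and a placeholder input file) takes both a prompt and the low-resolution image to enlarge:

import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("low_res.png").convert("RGB")  # e.g. a 128x128 placeholder input
upscaled = pipe(prompt="a white cat", image=low_res).images[0]
upscaled.save("upscaled_4x.png")  # output is 4x the input resolution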
5. Image-to-Image Conversion
The classic img2img workflow, which redraws an input image according to a text prompt:
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>
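Sketched with diffusers, the strength argument plays the same role as the --strength flag above: 0 keeps the input image unchanged, 1 ignores it entirely (file names are placeholders):

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.jpg").convert("RGB").resize((768, 768))  # placeholder input image
image = pipe(prompt="A fantasy landscape, trending on artstation",
             image=init_image, strength=0.8).images[0]
image.save("fantasy_landscape.png")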
Installation and Environment Configuration
Basic Environment
conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
Performance Optimization (Recommended)
Install the xformers library to improve GPU performance:
export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64==9.5.0
cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion
Intel CPU Optimization
Optimization configuration for Intel CPUs:
apt-get install numactl libjemalloc-dev
pip install intel-openmp
pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable
Technical Architecture Details
Model Components
- Encoder-Decoder Architecture: Uses an autoencoder with a downsampling factor of 8.
- Denoising U-Net: a U-Net with roughly 865 million parameters carries out the iterative denoising in the diffusion process.
- Text Encoder: OpenCLIP ViT-H/14 processes text input.
- Sampler: Supports various sampling methods such as DDIM, PLMS, and DPMSolver.
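To make these components concrete, here is a small inspection sketch via diffusers (attribute names follow the diffusers pipeline layout rather than this repository's LDM code, and parameter counts are approximate):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

def count_params(module):
    # Total number of learnable parameters in a submodule
    return sum(p.numel() for p in module.parameters())

print("U-Net parameters:       ", count_params(pipe.unet))          # roughly 865M
print("Text encoder parameters:", count_params(pipe.text_encoder))  # OpenCLIP ViT-H/14
print("Autoencoder parameters: ", count_params(pipe.vae))
print("Sampler / scheduler:    ", type(pipe.scheduler).__name__)    # swappable (DDIM, DPMSolver, ...)
print("VAE downsampling factor:", 2 ** (len(pipe.vae.config.block_out_channels) - 1))  # 8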
Memory Optimization
- Automatically enables memory-efficient attention mechanisms.
- Supports xformers acceleration.
- Provides FP16 precision options to save GPU memory.
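With diffusers these optimizations map onto a few explicit calls (a sketch; the xformers call only succeeds if the build described above is installed):

import torch
from diffusers import StableDiffusionPipeline

# FP16 weights roughly halve VRAM usage compared with FP32
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

pipe.enable_attention_slicing()  # lowers peak memory at a small speed cost
try:
    pipe.enable_xformers_memory_efficient_attention()  # use xformers kernels when available
except Exception as exc:
    print("xformers unavailable, using default attention:", exc)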
Application Scenarios
1. Artistic Creation
- Concept art design
- Illustration generation
- Style transfer
2. Content Production
- Marketing material creation
- Social media content
- Product prototype design
3. Research Applications
- Computer vision research
- Generative model research
- Multimodal learning
Ethical Considerations and Limitations
Data Bias
- The model reflects biases and misconceptions present in its training data.
- Not recommended for direct use in commercial services without adding additional safety mechanisms.
Content Safety
- A built-in invisible watermarking system helps identify AI-generated content (see the sketch after this list).
- Efforts are made to reduce explicit pornographic content, but caution is still required.
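The watermark is written with the invisible-watermark package from the install steps; a minimal embed/decode sketch follows (the exact watermark string is defined in the repository's scripts, so the "SDV2" tag here is an assumption to verify against scripts/txt2img.py):

import cv2
from imwatermark import WatermarkEncoder, WatermarkDecoder

wm_text = "SDV2"  # assumed tag; check scripts/txt2img.py for the string actually used

# Embed the invisible watermark into a generated image (OpenCV works on BGR arrays)
encoder = WatermarkEncoder()
encoder.set_watermark("bytes", wm_text.encode("utf-8"))
bgr = cv2.imread("generated.png")  # placeholder file name
cv2.imwrite("generated_wm.png", encoder.encode(bgr, "dwtDct"))

# Later, test whether an image carries the watermark
decoder = WatermarkDecoder("bytes", len(wm_text) * 8)  # expected length in bits
data = decoder.decode(cv2.imread("generated_wm.png"), "dwtDct")
print(data.decode("utf-8", errors="replace"))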
Usage Restrictions
- Weights are for research purposes only.
- Follow the CreativeML Open RAIL++-M license.
