
High-resolution text-to-image generation model based on latent diffusion models

MIT License · Python · 41.2k stars · Stability-AI/stablediffusion · Last Updated: 2024-10-10

Detailed Introduction to the Stable Diffusion Project

Project Overview

Stable Diffusion is an open-source text-to-image generation model developed by Stability AI, built on latent diffusion model (LDM) technology. The project performs high-resolution image synthesis, generating high-quality images from text descriptions.

Project Address: https://github.com/Stability-AI/stablediffusion

Core Technical Features

1. Latent Diffusion Model Architecture

  • Runs the diffusion process in a compressed latent space, which is far more efficient than operating directly in pixel space.
  • Employs a U-Net architecture as the denoising network.
  • Integrates self-attention and cross-attention mechanisms (all three components are inspected in the sketch below).
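
These components can be inspected concretely through the diffusers library that the installation section below pulls in. The following is a minimal sketch, assuming network access to the stabilityai/stable-diffusion-2-1 checkpoint on the Hugging Face Hub (the repository's own scripts load .ckpt files instead):

from diffusers import StableDiffusionPipeline

# Load the SD 2.1 pipeline; the architectural pieces listed above are attributes on it.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

print(type(pipe.vae).__name__)           # autoencoder mapping images to/from latent space
print(type(pipe.unet).__name__)          # U-Net denoiser that runs in latent space
print(type(pipe.text_encoder).__name__)  # text encoder feeding the cross-attention layers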

2. Text Encoder

  • Uses OpenCLIP ViT-H/14 as the text encoder (loaded standalone in the sketch below).
  • Supports complex text-conditional control.
  • Understands detailed text descriptions and converts them into visual content.
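
To see what this encoder does in isolation, here is a minimal sketch using the open_clip library (an assumption; the repository wires the encoder in through its YAML configs rather than this standalone API):

import open_clip

# ViT-H/14 with the LAION-2B pretrained weight tag published for this architecture.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a professional photograph of an astronaut riding a horse"])
# encode_text returns a pooled embedding; Stable Diffusion itself conditions
# the U-Net on per-token hidden states from an intermediate layer.
text_features = model.encode_text(tokens)
print(text_features.shape)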

3. Multi-Resolution Support

  • Stable Diffusion 2.1-v: 768x768 pixel output
  • Stable Diffusion 2.1-base: 512x512 pixel output
  • Supports training and inference at different resolutions (the implied latent sizes are worked out below).
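
Because the autoencoder compresses images by a factor of 8 per side (see Technical Architecture Details below), each output resolution maps to a much smaller 4-channel latent grid. A quick check:

# Latent grid sizes implied by the factor-8 autoencoder.
for size, model in [(768, "2.1-v"), (512, "2.1-base")]:
    latent = size // 8
    print(f"{model}: {size}x{size} pixels -> 4x{latent}x{latent} latent")
# 2.1-v: 768x768 pixels -> 4x96x96 latent
# 2.1-base: 512x512 pixels -> 4x64x64 latent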

Major Version History

Stable UnCLIP 2.1 (March 24, 2023)

  • Supports image variation and mixing operations.
  • Fine-tuned based on SD2.1-768.
  • Provides two variants: Stable unCLIP-L and Stable unCLIP-H.

Version 2.1 (December 7, 2022)

  • Introduced the 768x768 v model and the 512x512 base model.
  • Uses the same parameter count and architecture as version 2.0.
  • Fine-tuned on a dataset filtered with a more permissive NSFW threshold.

Version 2.0 (November 24, 2022)

  • A completely new model with 768x768 resolution.
  • Uses OpenCLIP-ViT/H as the text encoder.
  • Trained from scratch using the v-prediction objective.

Core Functionality

1. Text-to-Image Generation

Basic text description to image generation function:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
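
The same generation through diffusers (installed below) looks roughly like this; the Hub id stabilityai/stable-diffusion-2-1 and the CUDA device are assumptions:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# height/width mirror the --H/--W flags of the CLI above.
image = pipe(
    "a professional photograph of an astronaut riding a horse",
    height=768, width=768,
).images[0]
image.save("astronaut.png")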

2. Image Inpainting

Supports repairing and editing masked regions of an image:

python scripts/gradio/inpainting.py configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>
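
A diffusers equivalent is sketched below, assuming the stabilityai/stable-diffusion-2-inpainting Hub checkpoint and hypothetical input files:

import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB")  # image to edit
mask_image = Image.open("mask.png")                  # white pixels = region to repaint
result = pipe(prompt="a vase of flowers on the table",
              image=init_image, mask_image=mask_image).images[0]
result.save("inpainted.png")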

3. Depth-Conditional Image Generation

Generates images conditioned on depth information, preserving the spatial structure of the input:

python scripts/gradio/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>
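
In diffusers terms (a sketch, assuming the stabilityai/stable-diffusion-2-depth Hub checkpoint and a hypothetical input file):

import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("room.png").convert("RGB")
# With no explicit depth_map argument, the pipeline estimates depth with a
# MiDaS-style model, so the output keeps the input's spatial layout.
result = pipe(prompt="a cozy cabin interior", image=init_image, strength=0.7).images[0]
result.save("depth2img.png")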

4. Image Super-Resolution

4x super-resolution function:

python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>
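
A sketch of the same operation via diffusers, assuming the stabilityai/stable-diffusion-x4-upscaler Hub checkpoint:

import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("low_res.png").convert("RGB")              # hypothetical 128x128 input
upscaled = pipe(prompt="a white cat", image=low_res).images[0]  # 4x: 512x512 output
upscaled.save("upscaled.png")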

5. Image-to-Image Conversion

Classic img2img function:

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>
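
The diffusers counterpart, with the same strength value as the CLI flag above (higher strength adds more noise, so the result departs further from the input image):

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")  # hypothetical starting image
result = pipe(prompt="A fantasy landscape, trending on artstation",
              image=init_image, strength=0.8).images[0]
result.save("img2img.png")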

Installation and Environment Configuration

Basic Environment

conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .

Performance Optimization (Recommended)

Install the xformers library to improve GPU performance:

export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64==9.5.0

cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion

Intel CPU Optimization

Optimization configuration for Intel CPUs:

apt-get install numactl libjemalloc-dev
pip install intel-openmp
pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable
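
After installation, the extension is applied at the Python level. A minimal sketch with a stand-in module (in practice the loaded diffusion model would be passed to ipex.optimize):

import torch
import intel_extension_for_pytorch as ipex

# ipex.optimize rewrites weights and ops into CPU-friendly layouts.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model(torch.randn(1, 64))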

Technical Architecture Details

Model Components

  1. Encoder-Decoder Architecture: Uses an autoencoder with a downsampling factor of 8.
  2. U-Net Network: 865M parameter U-Net for the diffusion process.
  3. Text Encoder: OpenCLIP ViT-H/14 processes text input.
  4. Sampler: Supports various sampling methods such as DDIM, PLMS, and DPMSolver (swapped in one line in the sketch below).
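
In the diffusers port, switching samplers is a one-line scheduler swap; a sketch, reusing the Hub checkpoint assumed earlier:

from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
# Replace the default sampler with DPM-Solver++; DDIM etc. are swapped the same way.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)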

Memory Optimization

  • Automatically enables memory-efficient attention mechanisms.
  • Supports xformers acceleration.
  • Provides an FP16 precision option to save GPU memory (see the sketch after this list).
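
The repository's scripts enable these optimizations automatically where available; at the diffusers level the equivalent switches are explicit. A minimal sketch:

import torch
from diffusers import StableDiffusionPipeline

# FP16 halves weight memory; attention slicing trades speed for lower peak VRAM.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()  # requires the xformers install above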

Application Scenarios

1. Artistic Creation

  • Concept art design
  • Illustration generation
  • Style transfer

2. Content Production

  • Marketing material creation
  • Social media content
  • Product prototype design

3. Research Applications

  • Computer vision research
  • Generative model research
  • Multimodal learning

Ethical Considerations and Limitations

Data Bias

  • The model reflects biases and misconceptions present in its training data.
  • Not recommended for direct use in commercial services without additional safety mechanisms.

Content Safety

  • A built-in invisible watermarking system helps identify AI-generated content.
  • The training data was filtered to reduce explicit sexual content, but caution is still required.

Usage Restrictions

  • The weights are intended for research purposes.
  • Usage must follow the CreativeML Open RAIL++-M license.
