Memory-conditioned video generation framework for creating coherent multi-shot long-form narrative videos with cross-shot consistency

License: NOASSERTION · Language: Python · Repository: StoryMem by Kevin-thu · Stars: 0.6k · Last Updated: December 26, 2025

StoryMem: Multi-shot Long Video Storytelling with Memory

Overview

StoryMem is a cutting-edge AI framework developed by researchers from Nanyang Technological University (NTU) S-Lab and ByteDance that revolutionizes long-form video generation by enabling coherent, multi-shot narrative videos with cinematic quality. The system addresses a fundamental challenge in AI video generation: maintaining visual consistency and narrative coherence across multiple shots in extended storytelling scenarios.

Core Innovation

Memory-to-Video (M2V) Paradigm

The project introduces a novel Memory-to-Video (M2V) design that transforms pre-trained single-shot video diffusion models into multi-shot storytellers. This paradigm reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, inspired by human memory mechanisms.

Key Technical Components

  1. Dynamic Memory Bank: Maintains a compact, dynamically updated memory bank of keyframes extracted from previously generated shots
  2. Memory Injection: Stored memory is injected into single-shot video diffusion models via latent concatenation and negative RoPE (Rotary Position Embedding) shifts (a minimal code sketch follows this list)
  3. LoRA Fine-tuning: Achieves efficient adaptation with only Low-Rank Adaptation (LoRA) fine-tuning
  4. Semantic Keyframe Selection: Uses intelligent keyframe selection strategy with aesthetic preference filtering to ensure informative and stable memory throughout generation
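
To make the memory-injection step concrete, the following is a minimal sketch, not the repository's actual code, of how stored keyframe latents could be concatenated with the shot latents and assigned negative temporal RoPE positions; the tensor shapes and helper names are assumptions.

import torch

def temporal_rope_positions(num_memory: int, num_shot: int) -> torch.Tensor:
    # Memory frames receive negative temporal positions so the denoiser treats
    # them as "the past", while the shot keeps ordinary positions 0..num_shot-1.
    memory_positions = torch.arange(-num_memory, 0)
    shot_positions = torch.arange(0, num_shot)
    return torch.cat([memory_positions, shot_positions])

def inject_memory(shot_latents: torch.Tensor, memory_latents: torch.Tensor):
    """Prepend memory keyframe latents to the shot latents along the frame axis.

    shot_latents:   (B, F_shot, C, H, W) noisy latents of the shot being denoised
    memory_latents: (B, F_mem,  C, H, W) clean latents of stored keyframes
    Returns the concatenated latents and their temporal RoPE positions.
    """
    latents = torch.cat([memory_latents, shot_latents], dim=1)
    positions = temporal_rope_positions(memory_latents.shape[1], shot_latents.shape[1])
    return latents, positions

if __name__ == "__main__":
    shot = torch.randn(1, 8, 16, 60, 104)    # hypothetical latent shape
    memory = torch.randn(1, 3, 16, 60, 104)  # three stored keyframes
    latents, positions = inject_memory(shot, memory)
    print(latents.shape)        # torch.Size([1, 11, 16, 60, 104])
    print(positions.tolist())   # [-3, -2, -1, 0, 1, ..., 7]

In the full model these positions would feed the rotary embedding of the diffusion transformer; the point of the negative shift is that memory frames occupy positions strictly before the shot being generated rather than overlapping it.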

Technical Architecture

Base Models

StoryMem builds upon the Wan2.2 video generation framework:

  • Wan2.2 T2V-A14B: Text-to-Video MoE (Mixture of Experts) model
  • Wan2.2 I2V-A14B: Image-to-Video MoE model
  • StoryMem M2V LoRA: Memory-conditioned fine-tuned models

Generation Pipeline

The system operates through an iterative process (sketched in code after the list):

  1. Initial Shot Generation: Uses T2V model to generate the first shot as initial memory
  2. Iterative Shot Synthesis: Generates subsequent shots conditioned on memory bank
  3. Keyframe Extraction: Automatically extracts keyframes from each generated shot
  4. Memory Update: Updates memory bank with new keyframes for next iteration
  5. Cross-shot Consistency: Maintains character appearance, scene elements, and narrative flow
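
A high-level sketch of this loop is shown below, with placeholder callables standing in for the actual T2V/M2V models and the keyframe selector; none of the function names are the repository's API.

from typing import Callable, List

def storymem_loop(
    shot_prompts: List[str],
    generate_first_shot: Callable[[str], object],
    generate_with_memory: Callable[[str, List[object]], object],
    select_keyframes: Callable[[object], List[object]],
    max_memory_size: int = 10,  # illustrative cap on the compact memory bank
) -> List[object]:
    memory: List[object] = []   # dynamically updated memory bank of keyframes
    shots: List[object] = []
    for i, prompt in enumerate(shot_prompts):
        # Steps 1-2: first shot via plain text-to-video, later shots conditioned on memory.
        shot = generate_first_shot(prompt) if i == 0 else generate_with_memory(prompt, memory)
        shots.append(shot)
        # Steps 3-4: extract keyframes from the new shot and update the memory bank.
        memory.extend(select_keyframes(shot))
        memory = memory[-max_memory_size:]
    return shots

if __name__ == "__main__":
    # Toy usage with stand-in generators that just return strings.
    result = storymem_loop(
        ["A knight rides at dawn.", "The knight enters a dark forest."],
        generate_first_shot=lambda p: f"video<{p}>",
        generate_with_memory=lambda p, mem: f"video<{p} | {len(mem)} memory frames>",
        select_keyframes=lambda v: [f"keyframe of {v}"],
    )
    print(result)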

Advanced Features

MI2V (Memory + Image-to-Video)

Enables smooth transitions between adjacent shots by conditioning on both memory and the first frame of the next shot when no scene cut is intended. This creates seamless continuity in narrative flow.

MM2V (Memory + Motion-to-Video)

Supports memory conditioning with the first 5 motion frames, providing even smoother shot transitions by incorporating temporal motion information.

MR2V (Memory + Reference-to-Video)

Allows users to provide reference images as initial memory, enabling customized story generation with specific characters or backgrounds established from the outset.
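
These modes differ only in which inputs accompany the shot-level text prompt. The sketch below contrasts them with an illustrative data structure; the class and field names are assumptions, not the project's API.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ShotCondition:
    prompt: str                                                # shot-level text prompt
    memory_keyframes: List[str] = field(default_factory=list)  # M2V: memory bank keyframes
    first_frame: Optional[str] = None                          # MI2V: first frame of the shot
    motion_frames: List[str] = field(default_factory=list)     # MM2V: first 5 motion frames
    reference_images: List[str] = field(default_factory=list)  # MR2V: user-provided references

# M2V: memory only, used when the story cuts to a new scene.
m2v = ShotCondition("The knight rests by a campfire.", memory_keyframes=["kf_01.png", "kf_02.png"])

# MI2V: memory plus a first frame, used when no scene cut is intended.
mi2v = ShotCondition("The camera pans across the camp.", memory_keyframes=["kf_02.png"],
                     first_frame="shot2_last.png")

# MM2V: memory plus the first 5 motion frames for smoother temporal continuity.
mm2v = ShotCondition("The knight stands up.", memory_keyframes=["kf_02.png"],
                     motion_frames=[f"shot2_f{i:02d}.png" for i in range(5)])

# MR2V: reference images seed the initial memory for customized stories.
mr2v = ShotCondition("A new story about this character.", reference_images=["my_character.png"])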

ST-Bench: Evaluation Benchmark

To facilitate comprehensive evaluation, the researchers introduced ST-Bench, a diverse benchmark for multi-shot video storytelling containing:

  • 30 long story scripts spanning diverse styles
  • 8-12 shot-level text prompts per story
  • 300 total detailed video prompts describing characters, scenes, dynamics, shot types, and camera movements
  • Scene-cut indicators for proper shot transition handling

Performance Achievements

StoryMem demonstrates significant improvements over existing methods:

  • 28.7% improvement in cross-shot consistency over strong baselines
  • Superior visual quality: Maintains high aesthetic standards and prompt adherence
  • Efficient generation: Produces multi-shot outputs at roughly the computational cost of single-shot synthesis per shot
  • Minute-long videos: Capable of generating coherent narratives exceeding 60 seconds

Technical Specifications

System Requirements

  • Python 3.11
  • CUDA-compatible GPU
  • Flash Attention support
  • Sufficient VRAM for video diffusion models

Key Parameters

  • Output Resolution: Default 832×480, configurable
  • Max Memory Size: Default 10 shots, adjustable
  • Memory Management: Dynamic updates with semantic filtering
  • Random Seed: Reproducible generation support (an illustrative configuration sketch follows this list)
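
For illustration only, these parameters might be exposed through a configuration such as the following; the option names are assumptions, not the repository's actual CLI.

import argparse

def build_config() -> argparse.ArgumentParser:
    # Hypothetical settings mirroring the key parameters listed above.
    parser = argparse.ArgumentParser(description="Illustrative StoryMem-style generation settings")
    parser.add_argument("--width", type=int, default=832, help="output width in pixels")
    parser.add_argument("--height", type=int, default=480, help="output height in pixels")
    parser.add_argument("--max_memory_size", type=int, default=10,
                        help="number of shots whose keyframes stay in the memory bank")
    parser.add_argument("--seed", type=int, default=0, help="random seed for reproducible generation")
    return parser

if __name__ == "__main__":
    print(build_config().parse_args([]))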

Use Cases and Applications

  1. Narrative Video Creation: Generate complete stories with multiple scenes
  2. Character-Consistent Content: Maintain character identity across extended sequences
  3. Customized Storytelling: Use reference images for personalized narratives
  4. Cinematic Productions: Create videos with professional shot composition and transitions
  5. Educational Content: Generate explanatory videos with sequential scenes

Research Impact

The framework represents a significant advancement in AI video generation by:

  • Bridging the gap between single-shot quality and multi-shot consistency
  • Introducing practical memory mechanisms for temporal coherence
  • Providing efficient fine-tuning approach via LoRA
  • Establishing evaluation standards through ST-Bench
  • Enabling accessible long-form video creation

Implementation Details

Story Script Format

The system uses JSON-formatted story scripts with the following fields (an illustrative example follows the list):

  • story_overview: Narrative summary
  • scene_num: Sequential scene indexing
  • cut: Scene transition indicators (True/False)
  • video_prompts: Shot-level text descriptions
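
An illustrative script fragment built from these fields is shown below; the surrounding structure (for example, how scenes are nested) is an assumption and may differ from the repository's exact schema.

import json

story_script = {
    "story_overview": "A young astronomer discovers a comet and follows it through one long night.",
    "scenes": [
        {
            "scene_num": 1,
            "cut": True,    # start a new scene (scene transition)
            "video_prompts": "Wide establishing shot of a hilltop observatory at dusk, warm light, slow dolly-in.",
        },
        {
            "scene_num": 2,
            "cut": False,   # continuous with the previous shot, no scene cut
            "video_prompts": "Medium shot of the astronomer adjusting the telescope, handheld camera, shallow focus.",
        },
    ],
}

print(json.dumps(story_script, indent=2))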

Generation Workflow

  1. Load base models (T2V/I2V) and LoRA weights
  2. Parse story script with shot descriptions
  3. Generate initial shot or load reference images
  4. Enter iterative generation loop
  5. Extract and filter keyframes (see the keyframe-selection sketch after this list)
  6. Update memory bank
  7. Generate next shot conditioned on memory
  8. Repeat until story completion
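
Step 5 can be illustrated with a simple selection rule: keep frames that pass an aesthetic threshold and are not redundant with keyframes already kept. This is a sketch in the spirit of the semantic keyframe selection described above, not the actual implementation; the feature extractor and aesthetic scorer are assumed to exist upstream.

import numpy as np

def select_keyframes(frame_features, aesthetic_scores, sim_threshold=0.9, min_aesthetic=0.5):
    """Pick informative, stable keyframes from one generated shot.

    frame_features:   (F, D) array of unit-normalized per-frame features
    aesthetic_scores: (F,) array of per-frame aesthetic scores in [0, 1]
    Returns the indices of frames to add to the memory bank.
    """
    kept = []
    for i in range(len(frame_features)):
        if aesthetic_scores[i] < min_aesthetic:
            continue  # aesthetic preference filtering
        # Semantic filtering: skip frames too similar to an already-kept keyframe.
        if all(float(frame_features[i] @ frame_features[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(16, 8))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    scores = rng.uniform(size=16)
    print(select_keyframes(features, scores))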

Future Directions

The framework opens pathways for:

  • Extended video length capabilities
  • Enhanced character customization
  • Improved temporal consistency mechanisms
  • Multi-character story handling
  • Interactive storytelling applications

Citation

@article{zhang2025storymem,
  title={{StoryMem}: Multi-shot Long Video Storytelling with Memory},
  author={Zhang, Kaiwen and Jiang, Liming and Wang, Angtian and 
          Fang, Jacob Zhiyuan and Zhi, Tiancheng and Yan, Qing and 
          Kang, Hao and Lu, Xin and Pan, Xingang},
  journal={arXiv preprint},
  volume={arXiv:2512.19539},
  year={2025}
}

Acknowledgments

StoryMem builds upon the Wan2.2 framework and represents collaborative research between NTU S-Lab and ByteDance, advancing the state-of-the-art in AI-powered video storytelling.
