StoryMem: Multi-shot Long Video Storytelling with Memory
A memory-conditioned video generation framework for creating coherent, multi-shot, long-form narrative videos with cross-shot consistency
Overview
StoryMem is an AI video generation framework developed by researchers from Nanyang Technological University (NTU) S-Lab and ByteDance for long-form video generation, producing coherent, multi-shot narrative videos with cinematic quality. The system addresses a fundamental challenge in AI video generation: maintaining visual consistency and narrative coherence across multiple shots in extended storytelling scenarios.
Core Innovation
Memory-to-Video (M2V) Paradigm
The project introduces a novel Memory-to-Video (M2V) design that transforms pre-trained single-shot video diffusion models into multi-shot storytellers. This paradigm reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, inspired by human memory mechanisms.
Key Technical Components
- Dynamic Memory Bank: Maintains a compact, dynamically updated memory bank of keyframes extracted from previously generated shots
- Memory Injection: Stored memory is injected into single-shot video diffusion models via latent concatenation and negative RoPE (Rotary Position Embedding) shifts
- LoRA Fine-tuning: Achieves efficient adaptation with only Low-Rank Adaptation (LoRA) fine-tuning
- Semantic Keyframe Selection: Uses an intelligent keyframe selection strategy with aesthetic preference filtering to ensure informative and stable memory throughout generation (a minimal sketch of the memory bank and this filtering follows the list)
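To make the memory mechanism concrete, below is a minimal sketch of such a dynamic memory bank. It assumes a hypothetical `aesthetic_score` callable for the aesthetic-preference filter and caps the bank at 10 entries to mirror the default memory size; it is illustrative only and not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Minimal sketch of a dynamic keyframe memory bank (not the authors' code)."""
    max_size: int = 10                           # default memory size
    frames: list = field(default_factory=list)

    def update(self, candidate_keyframes, aesthetic_score, min_score=0.5):
        """Add keyframes from the latest shot, keeping only frames that
        pass the (hypothetical) aesthetic filter and capping the bank size."""
        kept = [f for f in candidate_keyframes if aesthetic_score(f) >= min_score]
        self.frames.extend(kept)
        if len(self.frames) > self.max_size:
            self.frames = self.frames[-self.max_size:]   # keep the bank compact

    def as_condition(self):
        """Frames to inject into the diffusion model, e.g. concatenated in
        latent space and given shifted (negative) RoPE positions so they
        read as past context rather than frames to be generated."""
        return list(self.frames)
```

In this reading, the negative RoPE shift simply places memory frames at positions before the current shot, so the model treats them as fixed past context; the actual injection happens inside the diffusion model and is only referenced in the comment above.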
Technical Architecture
Base Models
StoryMem builds upon the Wan2.2 video generation framework:
- Wan2.2 T2V-A14B: Text-to-Video MoE (Mixture of Experts) model
- Wan2.2 I2V-A14B: Image-to-Video MoE model
- StoryMem M2V LoRA: Memory-conditioned fine-tuned models
Generation Pipeline
The system operates through an iterative process (a minimal sketch follows the list):
- Initial Shot Generation: Uses T2V model to generate the first shot as initial memory
- Iterative Shot Synthesis: Generates subsequent shots conditioned on memory bank
- Keyframe Extraction: Automatically extracts keyframes from each generated shot
- Memory Update: Updates memory bank with new keyframes for next iteration
- Cross-shot Consistency: Maintains character appearance, scene elements, and narrative flow
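A minimal sketch of this loop is shown below, assuming hypothetical `t2v_generate`, `m2v_generate`, and `extract_keyframes` callables in place of the actual Wan2.2 models; it illustrates the iteration order rather than the released pipeline.

```python
def generate_story(shot_prompts, t2v_generate, m2v_generate,
                   extract_keyframes, max_memory_size=10):
    """Iterative multi-shot generation conditioned on a keyframe memory.

    All callables are hypothetical stand-ins:
      t2v_generate(prompt)          -> first shot (text-to-video)
      m2v_generate(prompt, memory)  -> subsequent shot (memory-to-video)
      extract_keyframes(shot)       -> keyframes for the memory bank
    """
    # 1. Initial shot generation: plain T2V output seeds the memory.
    shots = [t2v_generate(shot_prompts[0])]
    memory = list(extract_keyframes(shots[0]))

    for prompt in shot_prompts[1:]:
        # 2. Iterative shot synthesis conditioned on the memory bank.
        shot = m2v_generate(prompt, memory)
        shots.append(shot)

        # 3./4. Keyframe extraction and memory update for the next iteration.
        memory.extend(extract_keyframes(shot))
        memory = memory[-max_memory_size:]

    return shots
```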
Advanced Features
MI2V (Memory + Image-to-Video)
Enables smooth transitions between adjacent shots by conditioning on both memory and the first frame of the next shot when no scene cut is intended. This creates seamless continuity in narrative flow.
MM2V (Memory + Motion-to-Video)
Supports memory conditioning with the first 5 motion frames, providing even smoother shot transitions by incorporating temporal motion information.
MR2V (Memory + Reference-to-Video)
Allows users to provide reference images as initial memory, enabling customized story generation with specific characters or backgrounds established from the outset.
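The variants differ mainly in which inputs accompany the memory bank. The schematic dispatch below only illustrates that distinction; the function name and dictionary layout are assumptions, not the released interface.

```python
def condition_inputs(mode, memory, first_frame=None,
                     motion_frames=None, reference_images=None):
    """Schematic view of the conditioning inputs per StoryMem variant.
    Purely illustrative; the real models consume these inside the
    diffusion pipeline rather than as a plain dict."""
    if mode == "MI2V":
        # Memory + first frame of the next shot (no scene cut intended).
        return {"memory": memory, "first_frame": first_frame}
    if mode == "MM2V":
        # Memory + the first 5 motion frames for smoother transitions.
        return {"memory": memory, "motion_frames": motion_frames[:5]}
    if mode == "MR2V":
        # User-provided reference images act as the initial memory.
        return {"memory": reference_images if not memory else memory}
    # Plain M2V: memory conditioning only.
    return {"memory": memory}
```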
ST-Bench: Evaluation Benchmark
To facilitate comprehensive evaluation, the researchers introduced ST-Bench, a diverse benchmark for multi-shot video storytelling containing:
- 30 long story scripts spanning diverse styles
- 8-12 shot-level text prompts per story
- 300 total detailed video prompts describing characters, scenes, dynamics, shot types, and camera movements
- Scene-cut indicators for proper shot transition handling
Performance Achievements
StoryMem demonstrates significant improvements over existing methods:
- 28.7% improvement in cross-shot consistency over strong baselines
- Superior visual quality: Maintains high aesthetic standards and prompt adherence
- Efficient generation: Multi-shot outputs at a per-shot computational cost comparable to single-shot generation
- Minute-long videos: Capable of generating coherent narratives exceeding 60 seconds
Technical Specifications
System Requirements
- Python 3.11
- CUDA-compatible GPU
- Flash Attention support
- Sufficient VRAM for video diffusion models
Key Parameters
- Output Resolution: Default 832×480, configurable
- Max Memory Size: Default 10 shots, adjustable
- Memory Management: Dynamic updates with semantic filtering
- Random Seed: Fixed seeds for reproducible generation (an illustrative configuration follows this list)
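As a rough illustration, these knobs could be grouped into a configuration object along the following lines; the field names are illustrative rather than the repository's actual arguments.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Illustrative names only; the released scripts may expose these differently.
    width: int = 832                 # default output resolution (configurable)
    height: int = 480
    max_memory_size: int = 10        # default memory bank size
    seed: int = 0                    # fixed seed for reproducible runs
    semantic_filtering: bool = True  # dynamic memory updates with filtering
```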
Use Cases and Applications
- Narrative Video Creation: Generate complete stories with multiple scenes
- Character-Consistent Content: Maintain character identity across extended sequences
- Customized Storytelling: Use reference images for personalized narratives
- Cinematic Productions: Create videos with professional shot composition and transitions
- Educational Content: Generate explanatory videos with sequential scenes
Research Impact
The framework represents a significant advancement in AI video generation by:
- Bridging the gap between single-shot quality and multi-shot consistency
- Introducing practical memory mechanisms for temporal coherence
- Providing efficient fine-tuning approach via LoRA
- Establishing evaluation standards through ST-Bench
- Enabling accessible long-form video creation
Implementation Details
Story Script Format
The system uses JSON-formatted story scripts with the following fields (an example follows the list):
- story_overview: Narrative summary
- scene_num: Sequential scene indexing
- cut: Scene transition indicators (True/False)
- video_prompts: Shot-level text descriptions
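A hypothetical script using these fields might look as follows; the exact nesting is not specified above, so the layout (a top-level `shots` list) is an assumption, and the story content is invented purely for illustration.

```python
import json

# Hypothetical story script using the fields above.
story_script = {
    "story_overview": "A young astronomer tracks a comet over one long night.",
    "shots": [
        {"scene_num": 1, "cut": False,
         "video_prompts": "Medium shot, warm dusk light: a young astronomer "
                          "adjusts a rooftop telescope; slow push-in."},
        {"scene_num": 2, "cut": True,
         "video_prompts": "Wide shot, starry night: the same astronomer on a "
                          "hilltop as the comet appears; static camera."},
    ],
}

with open("story.json", "w") as f:
    json.dump(story_script, f, indent=2)
```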
Generation Workflow
- Load base models (T2V/I2V) and LoRA weights
- Parse story script with shot descriptions
- Generate initial shot or load reference images
- Enter iterative generation loop
- Extract and filter keyframes
- Update memory bank
- Generate next shot conditioned on memory
- Repeat until story completion (a driver sketch combining these steps follows)
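Tying the steps together, a top-level driver might look like the sketch below. It combines the earlier loop and conditioning ideas; `t2v`, `m2v`, `mi2v`, and `extract_keyframes` are hypothetical stand-ins for the loaded base models with the M2V LoRA applied, and the script layout follows the example above.

```python
import json

def run_story(script_path, t2v, m2v, mi2v, extract_keyframes, max_memory=10):
    """Illustrative driver for the workflow above (not the released code)."""
    with open(script_path) as f:
        script = json.load(f)                    # parse the story script

    memory, shots = [], []
    for i, shot_spec in enumerate(script["shots"]):
        prompt = shot_spec["video_prompts"]
        if i == 0:
            video = t2v(prompt)                  # initial shot seeds the memory
        elif shot_spec.get("cut", True):
            video = m2v(prompt, memory)          # scene cut: memory-only conditioning
        else:
            # No cut: also pass a bridging frame (assumed here to be the
            # previous shot's last frame) for a smooth MI2V transition.
            video = mi2v(prompt, memory, shots[-1][-1])
        shots.append(video)

        memory.extend(extract_keyframes(video))  # keyframe extraction + filtering
        memory = memory[-max_memory:]            # dynamic memory-bank update

    return shots
```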
Future Directions
The framework opens pathways for:
- Extended video length capabilities
- Enhanced character customization
- Improved temporal consistency mechanisms
- Multi-character story handling
- Interactive storytelling applications
Citation
@article{zhang2025storymem,
title={{StoryMem}: Multi-shot Long Video Storytelling with Memory},
author={Zhang, Kaiwen and Jiang, Liming and Wang, Angtian and
Fang, Jacob Zhiyuan and Zhi, Tiancheng and Yan, Qing and
Kang, Hao and Lu, Xin and Pan, Xingang},
journal={arXiv preprint},
volume={arXiv:2512.19539},
year={2025}
}
Resources
- Paper: arXiv:2512.19539
- Project Page: kevin-thu.github.io/StoryMem
- Code Repository: GitHub - Kevin-thu/StoryMem
- Model Weights: Hugging Face - Kevin-thu/StoryMem
Acknowledgments
StoryMem builds upon the Wan2.2 framework and represents collaborative research between NTU S-Lab and ByteDance, advancing the state-of-the-art in AI-powered video storytelling.