StoryMem: Multi-shot Long Video Storytelling with Memory
A memory-conditioned video generation framework for creating coherent, multi-shot, long-form narrative videos with cross-shot consistency
Overview
StoryMem is an AI video generation framework developed by researchers from Nanyang Technological University (NTU) S-Lab and ByteDance for long-form video generation, producing coherent, multi-shot narrative videos with cinematic quality. The system addresses a fundamental challenge in AI video generation: maintaining visual consistency and narrative coherence across multiple shots in extended storytelling scenarios.
Core Innovation
Memory-to-Video (M2V) Paradigm
The project introduces a novel Memory-to-Video (M2V) design that transforms pre-trained single-shot video diffusion models into multi-shot storytellers. This paradigm reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, inspired by human memory mechanisms.
Key Technical Components
- Dynamic Memory Bank: Maintains a compact, dynamically updated memory bank of keyframes extracted from previously generated shots
- Memory Injection: Stored memory is injected into single-shot video diffusion models via latent concatenation and negative RoPE (Rotary Position Embedding) shifts
- LoRA Fine-tuning: Achieves efficient adaptation with only Low-Rank Adaptation (LoRA) fine-tuning
- Semantic Keyframe Selection: Uses an intelligent keyframe selection strategy with aesthetic preference filtering to ensure informative and stable memory throughout generation (a minimal sketch of the memory bank and this filtering follows the list)
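To make the memory mechanism concrete, below is a minimal sketch of such a dynamic memory bank. It assumes a hypothetical `aesthetic_score` callable for the aesthetic-preference filter and caps the bank at 10 entries to mirror the default memory size; it is illustrative only and not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Minimal sketch of a dynamic keyframe memory bank (not the authors' code)."""
    max_size: int = 10                           # default memory size
    frames: list = field(default_factory=list)

    def update(self, candidate_keyframes, aesthetic_score, min_score=0.5):
        """Add keyframes from the latest shot, keeping only frames that
        pass the (hypothetical) aesthetic filter and capping the bank size."""
        kept = [f for f in candidate_keyframes if aesthetic_score(f) >= min_score]
        self.frames.extend(kept)
        if len(self.frames) > self.max_size:
            self.frames = self.frames[-self.max_size:]   # keep the bank compact

    def as_condition(self):
        """Frames to inject into the diffusion model, e.g. concatenated in
        latent space and given shifted (negative) RoPE positions so they
        read as past context rather than frames to be generated."""
        return list(self.frames)
```

In this reading, the negative RoPE shift simply places memory frames at positions before the current shot, so the model treats them as fixed past context; the actual injection happens inside the diffusion model and is only referenced in the comment above.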
Technical Architecture
Base Models
StoryMem builds upon the Wan2.2 video generation framework:
- Wan2.2 T2V-A14B: Text-to-Video MoE (Mixture of Experts) model
- Wan2.2 I2V-A14B: Image-to-Video MoE model
- StoryMem M2V LoRA: Memory-conditioned fine-tuned models
Generation Pipeline
The system operates through an iterative process (a minimal sketch follows the list):
- Initial Shot Generation: Uses T2V model to generate the first shot as initial memory
- Iterative Shot Synthesis: Generates subsequent shots conditioned on memory bank
- Keyframe Extraction: Automatically extracts keyframes from each generated shot
- Memory Update: Updates memory bank with new keyframes for next iteration
- Cross-shot Consistency: Maintains character appearance, scene elements, and narrative flow
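A minimal sketch of this loop is shown below, assuming hypothetical `t2v_generate`, `m2v_generate`, and `extract_keyframes` callables in place of the actual Wan2.2 models; it illustrates the iteration order rather than the released pipeline.

```python
def generate_story(shot_prompts, t2v_generate, m2v_generate,
                   extract_keyframes, max_memory_size=10):
    """Iterative multi-shot generation conditioned on a keyframe memory.

    All callables are hypothetical stand-ins:
      t2v_generate(prompt)          -> first shot (text-to-video)
      m2v_generate(prompt, memory)  -> subsequent shot (memory-to-video)
      extract_keyframes(shot)       -> keyframes for the memory bank
    """
    # 1. Initial shot generation: plain T2V output seeds the memory.
    shots = [t2v_generate(shot_prompts[0])]
    memory = list(extract_keyframes(shots[0]))

    for prompt in shot_prompts[1:]:
        # 2. Iterative shot synthesis conditioned on the memory bank.
        shot = m2v_generate(prompt, memory)
        shots.append(shot)

        # 3./4. Keyframe extraction and memory update for the next iteration.
        memory.extend(extract_keyframes(shot))
        memory = memory[-max_memory_size:]

    return shots
```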
Advanced Features
MI2V (Memory + Image-to-Video)
Enables smooth transitions between adjacent shots by conditioning on both memory and the first frame of the next shot when no scene cut is intended. This creates seamless continuity in narrative flow.
MM2V (Memory + Motion-to-Video)
Supports memory conditioning with the first 5 motion frames, providing even smoother shot transitions by incorporating temporal motion information.
MR2V (Memory + Reference-to-Video)
Allows users to provide reference images as initial memory, enabling customized story generation with specific characters or backgrounds established from the outset.
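The variants differ mainly in which inputs accompany the memory bank. The schematic dispatch below only illustrates that distinction; the function name and dictionary layout are assumptions, not the released interface.

```python
def condition_inputs(mode, memory, first_frame=None,
                     motion_frames=None, reference_images=None):
    """Schematic view of the conditioning inputs per StoryMem variant.
    Purely illustrative; the real models consume these inside the
    diffusion pipeline rather than as a plain dict."""
    if mode == "MI2V":
        # Memory + first frame of the next shot (no scene cut intended).
        return {"memory": memory, "first_frame": first_frame}
    if mode == "MM2V":
        # Memory + the first 5 motion frames for smoother transitions.
        return {"memory": memory, "motion_frames": motion_frames[:5]}
    if mode == "MR2V":
        # User-provided reference images act as the initial memory.
        return {"memory": reference_images if not memory else memory}
    # Plain M2V: memory conditioning only.
    return {"memory": memory}
```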
ST-Bench: Evaluation Benchmark
To facilitate comprehensive evaluation, the researchers introduced ST-Bench, a diverse benchmark for multi-shot video storytelling containing:
- 30 long story scripts spanning diverse styles
- 8-12 shot-level text prompts per story
- 300 total detailed video prompts describing characters, scenes, dynamics, shot types, and camera movements
- Scene-cut indicators for proper shot transition handling
Performance Achievements
StoryMem demonstrates significant improvements over existing methods:
- 28.7% improvement in cross-shot consistency over strong baselines
- Superior visual quality: Maintains high aesthetic standards and prompt adherence
- Efficient generation: Multi-shot outputs at a per-shot computational cost comparable to single-shot generation
- Minute-long videos: Capable of generating coherent narratives exceeding 60 seconds
Technical Specifications
System Requirements
- Python 3.11
- CUDA-compatible GPU
- Flash Attention support
- Sufficient VRAM for video diffusion models
Key Parameters
- Output Resolution: Default 832×480, configurable
- Max Memory Size: Default 10 shots, adjustable
- Memory Management: Dynamic updates with semantic filtering
- Random Seed: Fixed seeds for reproducible generation (an illustrative configuration follows this list)
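As a rough illustration, these knobs could be grouped into a configuration object along the following lines; the field names are illustrative rather than the repository's actual arguments.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Illustrative names only; the released scripts may expose these differently.
    width: int = 832                 # default output resolution (configurable)
    height: int = 480
    max_memory_size: int = 10        # default memory bank size
    seed: int = 0                    # fixed seed for reproducible runs
    semantic_filtering: bool = True  # dynamic memory updates with filtering
```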
Use Cases and Applications
- Narrative Video Creation: Generate complete stories with multiple scenes
- Character-Consistent Content: Maintain character identity across extended sequences
- Customized Storytelling: Use reference images for personalized narratives
- Cinematic Productions: Create videos with professional shot composition and transitions
- Educational Content: Generate explanatory videos with sequential scenes
Research Impact
The framework represents a significant advancement in AI video generation by:
- Bridging the gap between single-shot quality and multi-shot consistency
- Introducing practical memory mechanisms for temporal coherence
- Providing efficient fine-tuning approach via LoRA
- Establishing evaluation standards through ST-Bench
- Enabling accessible long-form video creation
Implementation Details
Story Script Format
The system uses JSON-formatted story scripts with the following fields (an example follows the list):
- story_overview: Narrative summary
- scene_num: Sequential scene indexing
- cut: Scene transition indicators (True/False)
- video_prompts: Shot-level text descriptions
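A hypothetical script using these fields might look as follows; the exact nesting is not specified above, so the layout (a top-level `shots` list) is an assumption, and the story content is invented purely for illustration.

```python
import json

# Hypothetical story script using the fields above.
story_script = {
    "story_overview": "A young astronomer tracks a comet over one long night.",
    "shots": [
        {"scene_num": 1, "cut": False,
         "video_prompts": "Medium shot, warm dusk light: a young astronomer "
                          "adjusts a rooftop telescope; slow push-in."},
        {"scene_num": 2, "cut": True,
         "video_prompts": "Wide shot, starry night: the same astronomer on a "
                          "hilltop as the comet appears; static camera."},
    ],
}

with open("story.json", "w") as f:
    json.dump(story_script, f, indent=2)
```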
Generation Workflow
- Load base models (T2V/I2V) and LoRA weights
- Parse story script with shot descriptions
- Generate initial shot or load reference images
- Enter iterative generation loop
- Extract and filter keyframes
- Update memory bank
- Generate next shot conditioned on memory
- Repeat until story completion (a driver sketch combining these steps follows)
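Tying the steps together, a top-level driver might look like the sketch below. It combines the earlier loop and conditioning ideas; `t2v`, `m2v`, `mi2v`, and `extract_keyframes` are hypothetical stand-ins for the loaded base models with the M2V LoRA applied, and the script layout follows the example above.

```python
import json

def run_story(script_path, t2v, m2v, mi2v, extract_keyframes, max_memory=10):
    """Illustrative driver for the workflow above (not the released code)."""
    with open(script_path) as f:
        script = json.load(f)                    # parse the story script

    memory, shots = [], []
    for i, shot_spec in enumerate(script["shots"]):
        prompt = shot_spec["video_prompts"]
        if i == 0:
            video = t2v(prompt)                  # initial shot seeds the memory
        elif shot_spec.get("cut", True):
            video = m2v(prompt, memory)          # scene cut: memory-only conditioning
        else:
            # No cut: also pass a bridging frame (assumed here to be the
            # previous shot's last frame) for a smooth MI2V transition.
            video = mi2v(prompt, memory, shots[-1][-1])
        shots.append(video)

        memory.extend(extract_keyframes(video))  # keyframe extraction + filtering
        memory = memory[-max_memory:]            # dynamic memory-bank update

    return shots
```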
Future Directions
The framework opens pathways for:
- Extended video length capabilities
- Enhanced character customization
- Improved temporal consistency mechanisms
- Multi-character story handling
- Interactive storytelling applications
Citation
@article{zhang2025storymem,
title={{StoryMem}: Multi-shot Long Video Storytelling with Memory},
author={Zhang, Kaiwen and Jiang, Liming and Wang, Angtian and
Fang, Jacob Zhiyuan and Zhi, Tiancheng and Yan, Qing and
Kang, Hao and Lu, Xin and Pan, Xingang},
journal={arXiv preprint},
volume={arXiv:2512.19539},
year={2025}
}
Resources
- Paper: arXiv:2512.19539
- Project Page: kevin-thu.github.io/StoryMem
- Code Repository: GitHub - Kevin-thu/StoryMem
- Model Weights: Hugging Face - Kevin-thu/StoryMem
Acknowledgments
StoryMem builds upon the Wan2.2 framework and represents collaborative research between NTU S-Lab and ByteDance, advancing the state-of-the-art in AI-powered video storytelling.