VideoLingo is an all-in-one tool for video translation, localization, and dubbing that aims to generate Netflix-quality subtitles. It eliminates awkward machine translations and multi-line subtitles, and adds high-quality dubbing, so that knowledge can be shared worldwide across language barriers.
*Chinese uses a separate punctuation-enhanced Whisper model.
Translation supports all languages; the available dubbing languages depend on the selected TTS method.
Add C:\Program Files\NVIDIA\CUDNN\v9.3\bin\12.6 to the system PATH.
choco install ffmpeg (via Chocolatey)
brew install ffmpeg (via Homebrew)
sudo apt install ffmpeg (Debian/Ubuntu)
git clone https://github.com/Huanshere/VideoLingo.git
cd VideoLingo
conda create -n videolingo python=3.10.0 -y
conda activate videolingo
python install.py
streamlit run st.py
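The installation steps above assume that git, conda, and ffmpeg can all be found on the PATH. As a minimal sketch (not part of VideoLingo itself), the check below reports which of those commands are missing. The helper name `missing_tools` and the `REQUIRED_TOOLS` tuple are hypothetical and exist only for this illustration.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumption: these are the command-line tools the install steps rely on.
REQUIRED_TOOLS = ("git", "conda", "ffmpeg")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be located on the PATH."""
    # shutil.which returns None when an executable is not found.
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter is injectable so the function can be exercised without touching the real environment.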
docker build -t videolingo .
docker run -d -p 8501:8501 --gpus all videolingo
Requires CUDA 12.4 and an NVIDIA driver version >550.
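Before starting the container, the driver requirement can be checked from the host. The sketch below is illustrative only: it queries the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares it with the threshold. It assumes that ">550" means any version newer than 550.0 (so 550.54 would pass), which is an interpretation and not something the requirement states. The function names are hypothetical.

```python
import subprocess
from typing import Optional, Tuple

# Assumption: ">550" is read as "strictly newer than 550.0".
MIN_DRIVER: Tuple[int, ...] = (550,)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '555.42.02' into (555, 42, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(version_text: str,
                         minimum: Tuple[int, ...] = MIN_DRIVER) -> bool:
    """Compare versions component by component using tuple ordering."""
    return parse_version(version_text) > minimum


def installed_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; None if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.strip().splitlines()
    # With several GPUs, nvidia-smi prints one line per device.
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not available; cannot determine the driver version.")
    elif driver_is_new_enough(version):
        print(f"Driver {version} satisfies the >550 requirement.")
    else:
        print(f"Driver {version} is too old; a version above 550 is required.")
```

This checks only the driver version. It does not verify the CUDA 12.4 requirement.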
VideoLingo supports OpenAI-Like API format and various TTS interfaces:
claude-3-5-sonnet
gpt-4.1
deepseek-v3
gemini-2.0-flash
azure-tts
openai-tts
siliconflow-fishtts
fish-tts
GPT-SoVITS
edge-tts
*custom-tts (Can modify custom TTS in custom_tts.py)
Audio Quality Impact: WhisperX transcription performance may be affected by video background noise. For videos with significant background music, enable vocal separation enhancement.
Numeric Character Handling: Subtitles ending with numbers or special characters may be truncated early because wav2vec cannot map numeric characters (e.g., "1") to their spoken form (e.g., "one").
Model Compatibility: Weaker models may cause errors during processing because responses must follow a strict JSON format.
Dubbing Perfection: Because speech rate and intonation differ between languages, and because the translation steps also affect the result, dubbing may not be perfect.
Multi-Language Recognition: When a video contains multiple languages, transcription retains only the primary language.
Multi-Character Dubbing: Dubbing multiple characters separately is not currently possible because WhisperX's speaker diarization is not reliable enough.
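To illustrate the Model Compatibility point, here is a minimal sketch of how a strict JSON requirement can make a weaker model fail, and how a caller might retry. It is not VideoLingo's actual implementation. The function names, the `ask` callable standing in for a model request, and the `translation` key in the usage example are all hypothetical.

```python
import json
from typing import Callable, Optional, Set


def parse_json_reply(reply: str, required_keys: Set[str]) -> dict:
    """Extract a JSON object from a model reply and check its keys."""
    # Tolerate extra prose around the object by slicing from the first
    # opening brace to the last closing brace.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(reply[start:end + 1])
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def ask_with_retry(ask: Callable[[str], str], prompt: str,
                   required_keys: Set[str], attempts: int = 3) -> dict:
    """Call the model until it returns valid JSON or attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return parse_json_reply(ask(prompt), required_keys)
        except ValueError as err:
            last_error = err
    raise RuntimeError(
        f"no valid JSON after {attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    # Stand-in for a model that fails once, then answers correctly.
    replies = iter(["Sorry, I cannot do that.",
                    'Sure: {"translation": "hola"}'])
    print(ask_with_retry(lambda _prompt: next(replies),
                         "Translate 'hello' to Spanish as JSON.",
                         {"translation"}))
```

A model that repeatedly returns malformed or incomplete JSON exhausts the retries and raises an error, which is the failure mode described above.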