GLM-4.5V and GLM-4.1V series: Open-source vision-language models for diverse multimodal reasoning, enhancing visual reasoning capabilities through reinforcement learning.
GLM-V Project Details
Project Overview
GLM-V is an open-source series of multimodal vision-language models from Zhipu AI (Z.ai), comprising two main model lines: GLM-4.5V and GLM-4.1V. The project explores the technical frontier of vision-language models on complex reasoning tasks, using reinforcement learning to substantially strengthen the models' multimodal understanding and reasoning capabilities.
GitHub Address: https://github.com/zai-org/GLM-V
Core Features
🚀 Key Capabilities
- Image Reasoning: Scene understanding, complex multi-image analysis, spatial recognition
- Video Understanding: Long video segmentation and event recognition
- GUI Tasks: Screen reading, icon recognition, desktop operation assistance
- Complex Chart and Long Document Parsing: Research report analysis, information extraction
- Precise Localization: Accurate grounding of visual elements within an image
🧠 Thinking Mode Switch
The model introduces a Thinking Mode switch, allowing users to balance between fast response and deep reasoning, similar to how the GLM-4.5 language model operates.
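Over an OpenAI-compatible endpoint (such as the vLLM or SGLang servers shown in the inference section below), the switch is typically passed through the chat template. The following is a minimal sketch rather than an official recipe: the enable_thinking flag inside chat_template_kwargs is an assumption and should be verified against the model card and chat template.

# Minimal sketch: toggling Thinking Mode over an OpenAI-compatible API.
# Assumptions: a local vLLM/SGLang server on port 8000 serving "glm-4.5v",
# and a chat template that accepts an "enable_thinking" switch via
# chat_template_kwargs (verify the exact parameter name).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5v",
    messages=[{"role": "user", "content": "Summarize the key risks in one sentence."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed flag name
)
print(response.choices[0].message.content)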
Model Architecture
GLM-4.5V
- Base Model: Based on Zhipu AI's next-generation flagship text base model GLM-4.5-Air
- Parameter Scale: 106B total parameters, 12B active parameters
- Performance: Achieves SOTA performance among models of comparable scale across 42 public vision-language benchmarks
- Technical Features:
  - Supports various visual content types
  - Full-spectrum visual reasoning capabilities
  - Efficient hybrid training
  - Focus on practical application scenarios
GLM-4.1V-9B-Thinking
- Base Model: Based on the GLM-4-9B-0414 base model
- Core Technology: Introduces a reasoning paradigm, utilizing RLCS (Reinforcement Learning with Curriculum Sampling)
- Performance Advantages:
  - Strongest performance among 10B-scale VLMs
  - Matches or surpasses the 72B-parameter Qwen2.5-VL on 18 benchmark tasks
  - Supports a 64k context length
  - Supports arbitrary aspect ratios and image resolutions up to 4K
  - Bilingual (Chinese and English) open-source release
Technical Innovation
Reasoning Mechanism
GLM-4.1V-9B-Thinking integrates a Chain-of-Thought reasoning mechanism, improving the accuracy, richness, and interpretability of its responses. It outperforms other 10B-scale models on 23 out of 28 benchmark tasks.
Reinforcement Learning Training
Training relies on scalable reinforcement learning: the RLCS method broadly improves the model's capabilities, with particularly strong gains in math, code, and logical reasoning tasks.
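Purely as an illustration of the curriculum-sampling idea, and not the project's actual implementation, a toy sampler that favors training items at the edge of the model's current ability could look like this:

# Illustrative toy sketch of curriculum sampling; NOT the project's RLCS code.
# Each training sample is weighted by p * (1 - p), where p is the model's
# current pass rate on it, so items that are neither trivially solved nor
# hopeless are drawn most often.
import random

def curriculum_weights(pass_rates, floor=0.05):
    # Weight peaks at pass_rate = 0.5 and never drops below `floor`.
    return [max(floor, p * (1.0 - p)) for p in pass_rates]

def sample_batch(samples, pass_rates, batch_size):
    return random.choices(samples, weights=curriculum_weights(pass_rates), k=batch_size)

samples = ["geometry_q1", "chart_q7", "gui_task_3", "code_q2"]  # hypothetical items
pass_rates = [0.05, 0.50, 0.90, 0.35]                           # fraction of rollouts solved
print(sample_batch(samples, pass_rates, batch_size=2))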
Installation and Usage
Environment Requirements
Runs on NVIDIA GPUs; Ascend NPU inference is also supported.
Install Dependencies
For SGLang and transformers:
pip install -r requirements.txt
For vLLM:
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
Inference Examples
Using vLLM Service
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5v \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}'
Using SGLang Service
python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
--tp-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--served-model-name glm-4.5v \
--port 8000 \
--host 0.0.0.0
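Calling the Server
Both commands expose an OpenAI-compatible API. A minimal client sketch, assuming the local port and served model name from the commands above; the image URL is a placeholder:

# Minimal client sketch against the OpenAI-compatible endpoint started above
# (port 8000, served model name "glm-4.5v"). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5v",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
                {"type": "text", "text": "describe this image"},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)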
Transformers Code Example
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch
MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.png"
},
{
"type": "text",
"text": "describe this image"
}
],
}
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path=MODEL_PATH,
torch_dtype="auto",
device_map="auto",
)
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens; special tokens are kept so the
# model's reasoning segment remains visible in the output.
output_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False
)
print(output_text)
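With skip_special_tokens=False, the decoded text typically contains the model's reasoning before the final answer. Below is a hedged post-processing sketch; the <think>...</think> tag format is an assumption based on the fine-tuning data example in the next section, so verify it against real model output.

# Hedged sketch: separate the reasoning segment from the final answer in the
# decoded output. The <think>...</think> tags are an assumption (they match
# the fine-tuning data format shown below); verify against actual output.
import re

def split_thinking(text):
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

demo = "<think>The photo shows a cat on a sofa.</think>\nA cat resting on a sofa."
print(split_thinking(demo))
# For the example above: reasoning, answer = split_thinking(output_text)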
Fine-tuning Support
The project supports fine-tuning using LLaMA-Factory. Dataset format example:
[
{
"messages": [
{
"content": "<image>Who are they?",
"role": "user"
},
{
"content": "<think>\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.</think>\n<answer>They're Kane and Goretzka from Bayern Munich.</answer>",
"role": "assistant"
}
],
"images": [
"mllm_demo_data/1.jpg"
]
}
]
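Before launching training, the dataset also has to be registered in LLaMA-Factory's data/dataset_info.json. The sketch below is a hedged example of such an entry: the dataset name glm_v_demo and file name are placeholders, and the column/tag keys follow LLaMA-Factory's sharegpt-style multimodal format, which should be verified against its documentation.

# Hedged sketch: register the dataset above in LLaMA-Factory's
# data/dataset_info.json. "glm_v_demo" and the file name are placeholders;
# key names follow LLaMA-Factory's sharegpt-style multimodal format.
import json

entry = {
    "glm_v_demo": {
        "file_name": "glm_v_demo.json",  # the JSON file shown above
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}

with open("data/dataset_info.json", "r+", encoding="utf-8") as f:
    info = json.load(f)
    info.update(entry)
    f.seek(0)
    json.dump(info, f, ensure_ascii=False, indent=2)
    f.truncate()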
Application Examples
GUI Agent
The project provides examples of GUI agents, demonstrating prompt construction and output processing strategies for mobile, PC, and web environments.
Desktop Assistant
An open-source desktop assistant application is also provided; once connected to GLM-4.5V, it can capture visual information from the PC screen via screenshots or screen recordings.
VLM Reward System
The VLM reward system used to train GLM-4.1V-Thinking is open-sourced and can be run locally:
python examples/reward_system_demo.py
Performance
Benchmark Achievements
- GLM-4.5V achieves SOTA performance among models of comparable scale across 42 public vision-language benchmarks.
- GLM-4.1V-9B-Thinking outperforms models of comparable parameter scale on 23 out of 28 benchmark tasks, and matches or surpasses Qwen2.5-VL-72B on 18 of them.
Optimization Improvements
Since the release of GLM-4.1V, the team has addressed many issues raised in community feedback. In GLM-4.5V, common problems such as repetitive thinking and output format errors have been mitigated.
Community and Support
- Online Experience: chat.z.ai
- API Interface: Z.ai API Platform
- Hugging Face: GLM-4.5V, GLM-4.1V-9B-Thinking
- Discord Community: Join the discussion
The GLM-V project represents a significant advancement in open-source multimodal AI, providing researchers and developers with powerful vision-language understanding and reasoning tools, and promoting the development of multimodal agents and complex visual reasoning applications.