GLM-4.5V and GLM-4.1V series: Open-source vision-language models for diverse multimodal reasoning, enhancing visual reasoning capabilities through reinforcement learning.

License: Apache-2.0 · Language: Python · Repository: zai-org/GLM-V · Stars: 1.4k · Last Updated: August 14, 2025

GLM-V Project Details

Project Overview

GLM-V is an open-source series of multimodal vision-language models from Zhipu AI (Z.ai), including two main models: GLM-4.5V and GLM-4.1V. This project aims to explore the technical frontiers of vision-language models in complex reasoning tasks, significantly enhancing the models' multimodal understanding and reasoning capabilities through reinforcement learning techniques.

GitHub Address: https://github.com/zai-org/GLM-V

Core Features

🚀 Key Capabilities

  • Image Reasoning: Scene understanding, complex multi-image analysis, spatial recognition
  • Video Understanding: Long video segmentation and event recognition
  • GUI Tasks: Screen reading, icon recognition, desktop operation assistance
  • Complex Chart and Long Document Parsing: Research report analysis, information extraction
  • Precise Localization: Pinpointing the exact position of visual elements in an image

🧠 Thinking Mode Switch

The model introduces a Thinking Mode switch, allowing users to balance between fast response and deep reasoning, similar to how the GLM-4.5 language model operates.
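
A minimal sketch of how the switch might be toggled when calling a locally served model through its OpenAI-compatible API (see the serving commands under "Inference Examples" below). The chat_template_kwargs / enable_thinking field follows the convention of the GLM-4.5 language models; applying it unchanged to GLM-4.5V is an assumption.

from openai import OpenAI

# Assumes a local GLM-4.5V server (vLLM or SGLang, see below) on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5v",
    messages=[{"role": "user", "content": "Summarize the key risks in this report."}],
    # Assumed flag, mirroring GLM-4.5: False = fast response, True (default) = deep reasoning.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)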

Model Architecture

GLM-4.5V

  • Base Model: Based on GLM-4.5-Air, Zhipu AI's next-generation flagship text foundation model
  • Parameter Scale: 106B total parameters, 12B active parameters
  • Performance: Achieves SOTA performance among models of comparable scale across 42 public vision-language benchmarks
  • Technical Features:
    • Supports various visual content types
    • Full-spectrum visual reasoning capabilities
    • Efficient hybrid training
    • Focus on practical application scenarios

GLM-4.1V-9B-Thinking

  • Base Model: Based on the GLM-4-9B-0414 base model
  • Core Technology: Introduces a reasoning-centric training paradigm based on RLCS (Reinforcement Learning with Curriculum Sampling)
  • Performance Advantages:
    • Strongest performance among 10B-level VLMs
    • Matches or surpasses Qwen2.5-VL-72B (72B parameters) on 18 benchmark tasks
    • Supports 64k context length
    • Supports arbitrary aspect ratios and up to 4k image resolution
    • Bilingual (Chinese and English) open-source version

Technical Innovation

Reasoning Mechanism

GLM-4.1V-9B-Thinking integrates the Chain-of-Thought reasoning mechanism, enhancing accuracy, richness, and interpretability. It outperforms other 10B-parameter models on 23 out of 28 benchmark tasks.
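
As the fine-tuning example later in this document shows, the reasoning and the final answer are wrapped in <think>...</think> and <answer>...</answer> tags. Below is a minimal sketch for separating the two segments from raw output, assuming that tag convention; served deployments that pass --reasoning-parser glm45 return the reasoning as a separate field instead.

import re

def split_thinking(output_text: str):
    """Split a GLM-4.1V-Thinking style response into (reasoning, answer)."""
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else output_text.strip()
    return reasoning, final

reasoning, final = split_thinking(
    "<think>The image shows two Bayern Munich players.</think><answer>They're Kane and Goretzka.</answer>"
)
print(final)  # They're Kane and Goretzka.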

Reinforcement Learning Training

The model is trained with scalable reinforcement learning; the RLCS method raises capability across the board, with particularly strong gains on math, code, and logical reasoning tasks.

Installation and Usage

Environment Requirements

Designed for NVIDIA GPUs; inference on Ascend NPUs is also supported.

Install Dependencies

For SGLang and transformers:

pip install -r requirements.txt

For vLLM:

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
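
After installing either stack, a quick sanity check (a small sketch, not part of the official instructions) confirms that the installed transformers build ships the GLM-4.5V classes used in the code example further below:

import transformers
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration  # present in GLM-4.5V-enabled builds

print("transformers version:", transformers.__version__)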

Inference Examples

Using vLLM Service

vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5v \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}'

Using SGLang Service

python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
--tp-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--served-model-name glm-4.5v \
--port 8000 \
--host 0.0.0.0
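
Both servers expose an OpenAI-compatible endpoint (the SGLang command binds port 8000 explicitly; vLLM defaults to the same port unless --port is set). The following client-side sketch sends an image plus a text question; the image URL is a placeholder.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5v",  # matches --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)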

Transformers Code Example

from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"

# A single-turn multimodal chat message: one image (referenced by URL)
# followed by a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://example.com/image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]

# Load the processor (tokenizer + image preprocessor) and the model weights.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",   # use the checkpoint's dtype (e.g. bfloat16) automatically
    device_map="auto",    # shard the MoE weights across available devices
)

# Apply the chat template, tokenize, and move tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens. Special tokens are
# kept (skip_special_tokens=False) so the thinking segment remains visible.
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False
)
print(output_text)
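
The same message schema extends naturally to the multi-image analysis scenario listed under key capabilities: add one image entry per image before the text instruction. A brief variation of the example above, reusing the processor and model already loaded (the URLs are placeholders):

# Multi-image input: several image entries followed by one instruction.
multi_image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart_2023.png"},
            {"type": "image", "url": "https://example.com/chart_2024.png"},
            {"type": "text", "text": "Compare the two charts and summarize the main differences."},
        ],
    }
]

inputs = processor.apply_chat_template(
    multi_image_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=8192)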

Fine-tuning Support

The project supports fine-tuning using LLaMA-Factory. Dataset format example:

[
    {
        "messages": [
            {
                "content": "<image>Who are they?",
                "role": "user"
            },
            {
                "content": "<think>\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.</think>\n<answer>They're Kane and Goretzka from Bayern Munich.</answer>",
                "role": "assistant"
            }
        ],
        "images": [
            "mllm_demo_data/1.jpg"
        ]
    }
]
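
For illustration, a small helper that writes records in this format (file name and content are placeholders); registering the resulting dataset with LLaMA-Factory still follows that project's own documentation:

import json

# One training record in the format shown above. The <image> token marks where
# the image is inserted; <think>/<answer> carry the reasoning and final answer.
record = {
    "messages": [
        {"role": "user", "content": "<image>Who are they?"},
        {
            "role": "assistant",
            "content": "<think>\nObserve the image and identify the players.</think>\n"
                       "<answer>They're Kane and Goretzka from Bayern Munich.</answer>",
        },
    ],
    "images": ["mllm_demo_data/1.jpg"],
}

with open("glm4v_sft_demo.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)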

Application Examples

GUI Agent

The project provides examples of GUI agents, demonstrating prompt construction and output processing strategies for mobile, PC, and web environments.
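
As a rough illustration of the output-processing side, the sketch below parses bounding boxes from a grounding response. The box markers and the 0-999 normalized coordinate scale are assumptions made for illustration only; consult the repository's GUI agent examples for the exact prompt and output conventions.

import re

def parse_boxes(response: str, width: int, height: int):
    """Hypothetical parser: extract pixel-space boxes from a grounding reply.

    Assumes coordinates wrapped as <|begin_of_box|>[x1,y1,x2,y2]<|end_of_box|>
    and normalized to 0-999; both details are assumptions, not the repo's spec.
    """
    boxes = []
    for match in re.findall(r"<\|begin_of_box\|>\[(.*?)\]<\|end_of_box\|>", response):
        x1, y1, x2, y2 = (float(v) for v in match.split(","))
        boxes.append((
            int(x1 / 999 * width), int(y1 / 999 * height),
            int(x2 / 999 * width), int(y2 / 999 * height),
        ))
    return boxes

print(parse_boxes("Click the icon at <|begin_of_box|>[100,200,300,400]<|end_of_box|>.", 1920, 1080))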

Desktop Assistant

An open-source desktop assistant application is also provided; once connected to GLM-4.5V, it can capture visual information from the PC screen via screenshots or screen recordings.

VLM Reward System

The VLM reward system used to train GLM-4.1V-Thinking is open-sourced and can be run locally:

python examples/reward_system_demo.py

Performance

Benchmark Achievements

  • GLM-4.5V achieves SOTA performance among models of comparable scale across 42 public vision-language benchmarks.
  • GLM-4.1V-9B-Thinking outperforms models of comparable parameter scale on 23 out of 28 benchmark tasks.
  • Matches or surpasses Qwen2.5-VL-72B (72B parameters) on 18 benchmark tasks.

Optimization Improvements

Since the release of GLM-4.1V, the team has addressed a large amount of community feedback. In GLM-4.5V, common problems such as repetitive thinking and malformed output have been mitigated.

Community and Support

The GLM-V project represents a significant advancement in open-source multimodal AI, providing researchers and developers with powerful vision-language understanding and reasoning tools, and promoting the development of multimodal agents and complex visual reasoning applications.

Star History Chart