Apple's Manzano Model Debuts: Hybrid Tokenizer Solves the Dual Challenges of Image Understanding and Generation

September 29, 2025
arXiv
3 min

Abstract

Apple's research team recently unveiled Manzano, a unified multimodal large language model with strong capabilities in both image understanding and image generation. Manzano employs a hybrid image tokenizer that lets a single architecture handle both task families. The model achieves industry-leading performance across multiple benchmarks, excelling in particular at text-rich image understanding tasks.


In September 2025, Apple's research team released Manzano (Spanish for "apple tree"), a unified multimodal large language model designed to resolve the performance trade-off between image understanding and image generation found in current open-source models.

According to the team's paper, Manzano is built around a hybrid image tokenizer. The architecture consists of three core components: a unified visual encoder, a large language model decoder, and an image decoder that renders the final output. Notably, the hybrid tokenizer produces two types of tokens from the same encoder: continuous tokens for understanding tasks and discrete tokens for generation tasks.
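The idea can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering, not Apple's implementation (which is unpublished): one shared encoder feeds a continuous adapter for the understanding path and a nearest-codebook quantizer for the generation path. All module names, dimensions, and the quantization scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HybridImageTokenizer(nn.Module):
    """Illustrative hybrid tokenizer: one shared encoder, two output paths."""

    def __init__(self, patch_dim=768, feat_dim=1024, llm_dim=2048, codebook_size=8192):
        super().__init__()
        # Shared visual encoder (a single projection stands in for a full ViT).
        self.encoder = nn.Linear(patch_dim, feat_dim)
        # Continuous path: project features into the LLM embedding space
        # for understanding tasks.
        self.continuous_adapter = nn.Linear(feat_dim, llm_dim)
        # Discrete path: a learned codebook; nearest-neighbor lookup yields
        # token ids the LLM can predict autoregressively for generation.
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, patches):
        feats = self.encoder(patches)                        # (B, N, feat_dim)
        cont_tokens = self.continuous_adapter(feats)         # understanding path
        codes = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        disc_ids = torch.cdist(feats, codes).argmin(dim=-1)  # (B, N) generation path
        return cont_tokens, disc_ids

tok = HybridImageTokenizer()
patches = torch.randn(2, 196, 768)   # dummy 14x14 grid of patch features
cont, ids = tok(patches)
print(cont.shape, ids.shape)         # (2, 196, 2048), (2, 196)
```

The property the paper emphasizes is that both token types come from one encoder, which reduces the representational conflict that arises when understanding and generation rely on separate visual front ends.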

Technically, Manzano's training is divided into three stages. The pre-training phase used 2.3 billion image-text pairs and 1 billion text-to-image pairs, totaling 1.6 trillion tokens. The language model decoder comes in four scales: 300 million, 1 billion, 3 billion, and 30 billion parameters. The image decoder ranges from 900 million to 3.52 billion parameters and supports output resolutions from 256 to 2048 pixels.
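For quick reference, the reported lineup can be captured in a small configuration sketch; the numbers are those stated above, while the names are ours, not an official API:

```python
# Reported Manzano scales (numbers as summarized above; names illustrative).
LLM_DECODER_PARAMS = [300e6, 1e9, 3e9, 30e9]   # four LLM decoder sizes
IMAGE_DECODER_RANGE = (0.9e9, 3.52e9)           # smallest/largest image decoder
OUTPUT_RESOLUTION_RANGE = (256, 2048)           # output side length, pixels
PRETRAIN_DATA = {
    "image_text_pairs": 2.3e9,
    "text_to_image_pairs": 1e9,
    "total_tokens": 1.6e12,
}
```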

On image understanding benchmarks, Manzano posts strong results. The 3-billion-parameter version scored 93.5 on DocVQA, 85.7 on OCRBench, and 69.8 on MathVista, while the 30-billion-parameter version ranked among the top performers on knowledge-reasoning benchmarks such as ScienceQA and MMMU.

Its image generation capabilities are equally strong. In automated evaluations such as GenEval and WISE, Manzano performed comparably to commercial systems such as GPT-4o and Google's Nano Banana (Gemini 2.5 Flash Image). Human evaluations gave the model high scores across three dimensions: structural integrity, instruction following, and aesthetic quality.

Notably, Manzano also supports various image editing functionalities, including instruction-based editing, style transfer, inpainting, outpainting, and depth estimation. These features are achieved by conditioning both the large language model and the diffusion decoder on a reference image simultaneously.
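A rough sketch of that dual-conditioning flow is shown below, with stub components standing in for the real (unpublished) modules; every name, signature, and shape here is hypothetical:

```python
import torch
import torch.nn as nn

class StubLLM(nn.Module):
    """Stand-in LLM: 'generates' random discrete image-token ids."""
    def generate(self, text_ids, image_embeds, n_tokens=196, vocab=8192):
        return torch.randint(0, vocab, (image_embeds.size(0), n_tokens))

class StubDiffusionDecoder(nn.Module):
    """Stand-in diffusion decoder: renders pixels from ids plus a reference."""
    def forward(self, token_ids, reference):
        return torch.zeros(token_ids.size(0), 3, 512, 512)

def edit_image(encoder, llm, decoder, ref_image, instruction_ids):
    # Encode the reference image once...
    ref_tokens = encoder(ref_image)                    # (B, N, d)
    # ...condition the LLM on it (alongside the text instruction) so it
    # emits discrete tokens describing the edited image...
    ids = llm.generate(text_ids=instruction_ids, image_embeds=ref_tokens)
    # ...and condition the diffusion decoder on it as well, so the rendered
    # pixels stay faithful to the reference's structure.
    return decoder(ids, reference=ref_tokens)

encoder = lambda img: torch.randn(img.size(0), 196, 1024)  # stub encoder
out = edit_image(encoder, StubLLM(), StubDiffusionDecoder(),
                 torch.randn(1, 3, 512, 512), torch.randint(0, 32000, (1, 12)))
print(out.shape)  # torch.Size([1, 3, 512, 512])
```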

Apple's research team emphasizes in their paper that Manzano's design philosophy centers on simplicity and scalability. The model employs a unified autoregressive objective function, eliminating the need for additional auxiliary losses or task-specific heads. Its components are clearly decoupled, facilitating independent scaling. Research indicates that scaling the language model decoder consistently leads to performance improvements in both understanding and generation tasks.
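To make that concrete, here is a minimal sketch of a single unified next-token objective over a mixed sequence of text and discrete image tokens; the shapes and vocabulary size are illustrative, and this is our reading of the objective rather than code from the paper:

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(logits, targets):
    """One cross-entropy for everything: text ids and discrete image-token
    ids share a joint vocabulary, so no auxiliary losses or task heads.

    logits:  (B, T, V) model outputs over the joint vocabulary
    targets: (B, T)    the same mixed sequence of token ids
    """
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for step t+1
        targets[:, 1:].reshape(-1),                   # shifted next-token targets
    )

B, T, V = 2, 16, 32000
loss = unified_ar_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss.item())
```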

Currently, Manzano has not been publicly released, and no demo is available. Apple's research team has shared only the academic paper, published on arXiv, and low-resolution image samples for the research community's reference.

Industry experts see Manzano as a new direction for unified multimodal models. Its hybrid tokenizer architecture effectively mitigates the conflict between visual understanding and generation tasks, offering new insights for the design of future multimodal AI systems. As models scale further and training methods mature, unified multimodal models are expected to serve a broader range of practical applications.