Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu1, Wenwei Zhang2, Lumin Xu3, Sheng Jin4, Zhonghua Wu5
Qingyi Tao5, Wentao Liu4, Wei Li1, Chen Change Loy1
1S-Lab, Nanyang Technological University     2Shanghai AI Laboratory     3The Chinese University of Hong Kong
4SenseTime Research and Tetras.AI     5SenseTime Research

Figure 1: An overview of image generation and understanding examples. All results are obtained by the proposed Harmon-1.5B, which uses a shared visual encoder for both tasks.

Abstract

Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM), which learns rich semantics via mask-and-reconstruct pre-training, and from its successful extension to masked autoregressive (MAR) image generation. We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art results on text-to-image generation benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks.
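The mask-and-reconstruct objective that motivates this design can be sketched in a few lines. This is an illustrative toy, not the actual Harmon or MIM training code: real MIM models (e.g., MAE-style pre-training) operate on ViT patch embeddings with a learned encoder/decoder, whereas here `patches` is just a list of scalars and `reconstruct` is any callable.

```python
import random

def mim_loss(patches, reconstruct, mask_ratio, rng):
    """Toy mask-and-reconstruct objective: zero out a random subset of
    patches, reconstruct, and score only the masked positions."""
    n = len(patches)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    # Corrupt the input: masked positions are replaced by a "mask token" (0.0).
    corrupted = [0.0 if i in masked else p for i, p in enumerate(patches)]
    recon = reconstruct(corrupted)
    # As in MAE-style training, the loss is computed on masked positions only.
    return sum((recon[i] - patches[i]) ** 2 for i in masked) / len(masked)
```

For instance, an identity "reconstructor" applied to all-ones patches with a 50% mask ratio yields a loss of exactly 1.0, since every masked patch was zeroed.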

Method


Figure 2: The overall framework of Harmon. (a) Image generation is performed in a masked autoregressive manner. (b) Image understanding is formulated as image-conditioned text autoregression. The MAR encoder is shared by both tasks.
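The masked autoregressive decoding in (a) can be sketched as follows. This is a hypothetical toy, not Harmon's implementation: `toy_predictor` is a random stand-in for the learned MAR decoder, which in practice predicts token content from the unmasked context. The loop captures the core schedule: all positions start masked, and at each step the most confident predictions are committed until the image is complete.

```python
import math
import random

def mar_generate(num_tokens, num_steps, predictor, rng):
    """Toy masked autoregressive decoding loop (not the actual Harmon code).

    Starts from fully masked tokens; each step predicts every masked
    position and commits the most confident predictions, following a
    linear unmasking schedule over num_steps."""
    tokens = [None] * num_tokens            # None marks a masked position
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        preds = {i: predictor(tokens, i, rng) for i in masked}  # (token, confidence)
        # Linear schedule: after this step, ceil(num_tokens * step / num_steps)
        # positions should be filled in total.
        target = math.ceil(num_tokens * step / num_steps)
        k = max(1, target - (num_tokens - len(masked)))
        # Commit the k most confident predictions this step.
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            tokens[i] = preds[i][0]
    return tokens

def toy_predictor(tokens, pos, rng):
    # Stand-in for the learned decoder: a random token id with a random confidence.
    return rng.randrange(16), rng.random()
```

With `num_tokens=16` and `num_steps=4`, each step commits four more tokens, so the output is fully unmasked after the final step.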

Benchmark Results

Table 1: Comparison with state-of-the-art models on multimodal question-answering benchmarks. Best and second best results are marked in bold and underlined respectively. † means employing a separate semantic encoder.

Model LLM Scale POPE↑ MME-P↑ MME-C↑ MMB↑ SEED↑ GQA↑ MMMU↑
ILLUME 7B 88.5 1445 - 65.1 72.9 - 38.2
VILA-U 7B 85.8 1402 - - 59.0 60.8 -
Show-o 1.3B 80.0 1097 248 51.6 54.4 58.0 26.7
Janus† 1.3B 87.0 1338 222 69.4 63.7 59.1 30.5
Janus-Pro† 1.5B 86.2 1444 268 75.5 68.3 59.3 36.3
Harmon-0.5B 0.5B 86.5 1148 260 59.8 62.5 56.3 34.2
Harmon-1.5B 1.5B 87.6 1155 321 65.5 67.1 58.9 38.9

Table 2: Text-to-image generation on MSCOCO-30K and MJHQ-30K. FID is used as the metric for both benchmarks. Best and second best results are marked in bold and underlined respectively.

Model MSCOCO-FID↓ MJHQ-FID↓
Show-o 9.24 15.18
LWM 12.68 17.77
VILA-U - 7.69
Janus 8.53 10.10
Janus-Pro-1.5B 16.08 9.53
Harmon-0.5B 8.86 6.08
Harmon-1.5B 8.39 5.15

Table 3: Comparison with state-of-the-art models on the GenEval benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.

Method Single Obj. Two Obj. Counting Colors Position Color Attri. Overall↑
Show-o 0.95 0.52 0.49 0.82 0.11 0.28 0.53
LWM 0.93 0.41 0.46 0.79 0.09 0.15 0.47
ILLUME 0.99 0.86 0.45 0.71 0.39 0.28 0.61
Janus 0.97 0.68 0.30 0.84 0.46 0.42 0.61
Janus-Pro-1.5B 0.98 0.82 0.51 0.89 0.65 0.56 0.73
Harmon-0.5B 0.99 0.80 0.57 0.87 0.55 0.48 0.71
Harmon-1.5B 0.99 0.86 0.66 0.85 0.74 0.48 0.76

Table 4: Comparison with state-of-the-art models on the WISE benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.

Method Cultural Time Space Biology Physics Chemistry Overall↑
Janus 0.16 0.26 0.35 0.28 0.30 0.14 0.23
Janus-Pro-1.5B 0.20 0.28 0.45 0.24 0.32 0.16 0.26
Orthus 0.23 0.31 0.38 0.28 0.31 0.20 0.27
VILA-U 0.26 0.33 0.37 0.35 0.39 0.23 0.31
Show-o 0.28 0.40 0.48 0.30 0.46 0.30 0.35
Harmon-1.5B 0.38 0.48 0.52 0.37 0.44 0.29 0.41

Visualization Results

Text-to-Image Generation Examples

Figure 3: Text-to-image generation examples.

Image Understanding Examples

Figure 4: Image understanding examples.

Citation

    @misc{wu2025harmon,
      title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
      author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2503.21979},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21979},
    }