Figure 1: An overview of image generation and understanding examples. All results are obtained by the proposed Harmon-1.5B, which uses a shared visual encoder for both tasks.
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM), which learns rich semantics via mask-and-reconstruct pre-training, and from its successful extension to masked autoregressive (MAR) image generation. Building on this, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art results on text-to-image generation benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks.
Figure 2: The overall framework of Harmon. (a) Image generation is performed in the masked autoregressive manner. (b) Image understanding is formulated as image-conditioned text autoregression. The MAR encoder is shared by both tasks.
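To make the generation path in Figure 2(a) concrete, below is a minimal, framework-free sketch of a masked autoregressive sampling schedule: all image-token positions start masked, and at each step the model predicts features for every position and commits a random subset. The `fake_predict` function is a hypothetical stand-in for Harmon's actual shared MAR encoder and output head; token count, feature dimension, and step count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def mar_sample(num_tokens=16, steps=4, dim=8, seed=0):
    """Sketch of MAR-style iterative unmasking (illustrative only)."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((num_tokens, dim))      # all positions start masked
    known = np.zeros(num_tokens, dtype=bool)  # which positions are committed
    order = rng.permutation(num_tokens)       # random generation order
    per_step = num_tokens // steps

    def fake_predict(toks):
        # Hypothetical stand-in for the shared encoder + generation head:
        # predicts a continuous feature vector for every position.
        return rng.standard_normal(toks.shape)

    for s in range(steps):
        pred = fake_predict(tokens)                   # predict all positions
        idx = order[s * per_step:(s + 1) * per_step]  # pick an unmasked subset
        tokens[idx] = pred[idx]                       # commit only that subset
        known[idx] = True

    assert known.all()  # every position is generated after the final step
    return tokens

print(mar_sample().shape)  # (16, 8)
```

The key contrast with plain (raster-order) autoregression is that each step conditions on all previously committed positions at once rather than a strict left-to-right prefix, which is what allows the same encoder to serve bidirectional understanding as in Figure 2(b).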
Table 1: Comparison with state-of-the-art models on multimodal question-answering benchmarks. Best and second best results are marked in bold and underlined respectively. † means employing a separate semantic encoder.
| Model | LLM Scale | POPE↑ | MME-P↑ | MME-C↑ | MMB↑ | SEED↑ | GQA↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|---|
| ILLUME | 7B | **88.5** | **1445** | - | 65.1 | **72.9** | - | <u>38.2</u> |
| VILA-U | 7B | 85.8 | 1402 | - | - | 59.0 | **60.8** | - |
| Show-o | 1.3B | 80.0 | 1097 | 248 | 51.6 | 54.4 | 58.0 | 26.7 |
| Janus† | 1.3B | 87.0 | 1338 | 222 | <u>69.4</u> | 63.7 | 59.1 | 30.5 |
| Janus-Pro† | 1.5B | 86.2 | <u>1444</u> | <u>268</u> | **75.5** | <u>68.3</u> | <u>59.3</u> | 36.3 |
| Harmon-0.5B | 0.5B | 86.5 | 1148 | 260 | 59.8 | 62.5 | 56.3 | 34.2 |
| Harmon-1.5B | 1.5B | <u>87.6</u> | 1155 | **321** | 65.5 | 67.1 | 58.9 | **38.9** |
Table 2: Text-to-image generation on MSCOCO-30K and MJHQ-30K. FID is used as the metric for both benchmarks. Best and second best results are marked in bold and underlined respectively.
| Model | MSCOCO-FID↓ | MJHQ-FID↓ |
|---|---|---|
| Show-o | 9.24 | 15.18 |
| LWM | 12.68 | 17.77 |
| VILA-U | - | 7.69 |
| Janus | <u>8.53</u> | 10.10 |
| Janus-Pro-1.5B | 16.08 | 9.53 |
| Harmon-0.5B | 8.86 | <u>6.08</u> |
| Harmon-1.5B | **8.39** | **5.15** |
Table 3: Comparison with state-of-the-art models on the GenEval benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.
| Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall↑ |
|---|---|---|---|---|---|---|---|
| Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| LWM | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| ILLUME | **0.99** | **0.86** | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| Janus-Pro-1.5B | <u>0.98</u> | <u>0.82</u> | 0.51 | **0.89** | <u>0.65</u> | **0.56** | <u>0.73</u> |
| Harmon-0.5B | **0.99** | 0.80 | <u>0.57</u> | <u>0.87</u> | 0.55 | <u>0.48</u> | 0.71 |
| Harmon-1.5B | **0.99** | **0.86** | **0.66** | 0.85 | **0.74** | <u>0.48</u> | **0.76** |
Table 4: Comparison with state-of-the-art models on the WISE benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.
| Method | Cultural | Time | Space | Biology | Physics | Chemistry | Overall↑ |
|---|---|---|---|---|---|---|---|
| Janus | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
| Janus-Pro-1.5B | 0.20 | 0.28 | 0.45 | 0.24 | 0.32 | 0.16 | 0.26 |
| Orthus | 0.23 | 0.31 | 0.38 | 0.28 | 0.31 | 0.20 | 0.27 |
| VILA-U | 0.26 | 0.33 | 0.37 | <u>0.35</u> | 0.39 | 0.23 | 0.31 |
| Show-o | <u>0.28</u> | <u>0.40</u> | <u>0.48</u> | 0.30 | **0.46** | **0.30** | <u>0.35</u> |
| Harmon-1.5B | **0.38** | **0.48** | **0.52** | **0.37** | <u>0.44</u> | <u>0.29</u> | **0.41** |
Figure 3: Text-to-image generation examples.
Figure 4: Image understanding examples.
```bibtex
@misc{wu2025harmon,
  title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
  author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
  year={2025},
  eprint={2503.21979},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.21979},
}
```