Figure 1: An overview of image generation and understanding examples. All results are obtained by the proposed Harmon-1.5B, which uses a shared visual encoder for both tasks.
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM), which learns rich semantics via mask-and-reconstruct pre-training, and from its successful extension to masked autoregressive (MAR) image generation. Building on this, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art results on text-to-image generation benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks.
Figure 2: The overall framework of Harmon. (a) Image generation is performed in the masked autoregressive manner. (b) Image understanding is formulated as image-conditioned text autoregression. The MAR encoder is shared by both tasks.
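To make the generation path in Figure 2(a) concrete, below is a minimal, framework-free sketch of a masked autoregressive sampling schedule: all image-token positions start masked, and at each step the model predicts features for every position and commits a random subset. The `fake_predict` function is a hypothetical stand-in for Harmon's actual shared MAR encoder and output head; token count, feature dimension, and step count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def mar_sample(num_tokens=16, steps=4, dim=8, seed=0):
    """Sketch of MAR-style iterative unmasking (illustrative only)."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((num_tokens, dim))      # all positions start masked
    known = np.zeros(num_tokens, dtype=bool)  # which positions are committed
    order = rng.permutation(num_tokens)       # random generation order
    per_step = num_tokens // steps

    def fake_predict(toks):
        # Hypothetical stand-in for the shared encoder + generation head:
        # predicts a continuous feature vector for every position.
        return rng.standard_normal(toks.shape)

    for s in range(steps):
        pred = fake_predict(tokens)                   # predict all positions
        idx = order[s * per_step:(s + 1) * per_step]  # pick an unmasked subset
        tokens[idx] = pred[idx]                       # commit only that subset
        known[idx] = True

    assert known.all()  # every position is generated after the final step
    return tokens

print(mar_sample().shape)  # (16, 8)
```

The key contrast with plain (raster-order) autoregression is that each step conditions on all previously committed positions at once rather than a strict left-to-right prefix, which is what allows the same encoder to serve bidirectional understanding as in Figure 2(b).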
Table 1: Comparison with state-of-the-art models on multimodal question-answering benchmarks. Best and second best results are marked in bold and underlined respectively. † means employing a separate semantic encoder.
| Model | LLM Scale | POPE↑ | MME-P↑ | MME-C↑ | MMB↑ | SEED↑ | GQA↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|---|
| ILLUME | 7B | **88.5** | **1445** | - | 65.1 | **72.9** | - | <u>38.2</u> |
| VILA-U | 7B | 85.8 | 1402 | - | - | 59.0 | **60.8** | - |
| Show-o | 1.3B | 80.0 | 1097 | 248 | 51.6 | 54.4 | 58.0 | 26.7 |
| Janus† | 1.3B | 87.0 | 1338 | 222 | <u>69.4</u> | 63.7 | 59.1 | 30.5 |
| Janus-Pro† | 1.5B | 86.2 | <u>1444</u> | <u>268</u> | **75.5** | <u>68.3</u> | <u>59.3</u> | 36.3 |
| Harmon-0.5B | 0.5B | 86.5 | 1148 | 260 | 59.8 | 62.5 | 56.3 | 34.2 |
| Harmon-1.5B | 1.5B | <u>87.6</u> | 1155 | **321** | 65.5 | 67.1 | 58.9 | **38.9** |
Table 2: Text-to-image generation on MSCOCO-30K and MJHQ-30K. FID is used as the metric for both benchmarks. Best and second best results are marked in bold and underlined respectively.
| Model | MSCOCO-FID↓ | MJHQ-FID↓ |
|---|---|---|
| Show-o | 9.24 | 15.18 |
| LWM | 12.68 | 17.77 |
| VILA-U | - | 7.69 |
| Janus | <u>8.53</u> | 10.10 |
| Janus-Pro-1.5B | 16.08 | 9.53 |
| Harmon-0.5B | 8.86 | <u>6.08</u> |
| Harmon-1.5B | **8.39** | **5.15** |
Table 3: Comparison with state-of-the-art models on the GenEval benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.
| Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall↑ |
|---|---|---|---|---|---|---|---|
| Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| LWM | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 |
| ILLUME | **0.99** | **0.86** | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 |
| Janus-Pro-1.5B | <u>0.98</u> | <u>0.82</u> | 0.51 | **0.89** | <u>0.65</u> | **0.56** | <u>0.73</u> |
| Harmon-0.5B | **0.99** | 0.80 | <u>0.57</u> | <u>0.87</u> | 0.55 | <u>0.48</u> | 0.71 |
| Harmon-1.5B | **0.99** | **0.86** | **0.66** | 0.85 | **0.74** | <u>0.48</u> | **0.76** |
Table 4: Comparison with state-of-the-art models on the WISE benchmark for text-to-image generation. Best and second best results are marked in bold and underlined respectively.
| Method | Cultural | Time | Space | Biology | Physics | Chemistry | Overall↑ |
|---|---|---|---|---|---|---|---|
| Janus | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
| Janus-Pro-1.5B | 0.20 | 0.28 | 0.45 | 0.24 | 0.32 | 0.16 | 0.26 |
| Orthus | 0.23 | 0.31 | 0.38 | 0.28 | 0.31 | 0.20 | 0.27 |
| VILA-U | 0.26 | 0.33 | 0.37 | <u>0.35</u> | 0.39 | 0.23 | 0.31 |
| Show-o | <u>0.28</u> | <u>0.40</u> | <u>0.48</u> | 0.30 | **0.46** | **0.30** | <u>0.35</u> |
| Harmon-1.5B | **0.38** | **0.48** | **0.52** | **0.37** | <u>0.44</u> | <u>0.29</u> | **0.41** |
Figure 3: Text-to-image generation examples.
Figure 4: Image understanding examples.
```bibtex
@misc{wu2025harmon,
  title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
  author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
  year={2025},
  eprint={2503.21979},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.21979},
}
```