Cong Wan (万聪)

M.S. Student · Research Intern @ ByteDance Seed

Machine Intelligence and Vision (MIV) Lab, Xi'an Jiaotong University
Advised by Prof. Yihong Gong · wancong[at]stu.xjtu.edu.cn

Email · Google Scholar · GitHub · CV

About

I am an M.S. student in Computer Science and Technology at Xi'an Jiaotong University (XJTU), advised by Prof. Yihong Gong. I received my B.S. in Mathematics from XJTU in 2024. I am currently a Research Intern at ByteDance Seed (led by Chang Zhou), working on foundation models and embodied AI. Previously, I interned at Microsoft Research Asia (world models) and Alibaba DAMO Academy (unified visual generation).

My research centers on multimodal foundation models — spanning model design, large-scale pre-training, data construction, and benchmark evaluation. I am especially interested in advancing the intelligence of unified multimodal models, so they can perceive, reason, and act more like genuinely intelligent systems.

Multimodal Foundation Models · World Models · Embodied AI / VLA · Visual Generation · Reinforcement Learning

News

Jul 2026

CoRe accepted to ACM Multimedia (MM) 2026 — a framework for cross-image comparative reasoning in VLMs. (arXiv)

Jun 2026

DataClaw preprint is on arXiv — agentic tailoring of multimodal data from raw streams. (project page)

May 2026

Four papers — DataClaw, WorldCanvas, ProSR, and Retrieve-then-Steer — under review at NeurIPS 2026.

Feb 2026

ReMoT accepted to CVPR 2026 as a Highlight (Top 3.8%).

Nov 2025

Started as a Research Intern at ByteDance Seed, working on foundation models & embodied AI.

Jun 2025

Joined the Machine Learning Group at Microsoft Research Asia as a Research Intern (world models).

Dec 2024

Joined Alibaba DAMO Academy as a Research Intern; released GRID: Omni Visual Generation (tech report).

Sep 2024

Started my M.S. at XJTU; PAP accepted to NeurIPS 2024.

Publications

First-author work is highlighted. Full list on Google Scholar.

NeurIPS 2026

DataClaw: Agentic Tailoring Multimodal Data from Raw Streams

Cong Wan, Zeyu Guo, Zijian Cai, Jiangyang Li, SongLin Dong, Lin Peng, Xiangyang Luo, Zhiheng Ma, Yihong Gong

NeurIPS 2026 · Under review

arXiv · Project · Code

NeurIPS 2026 · Under review

WorldCanvas: Context-Guided Embodied Video Synthesis with a Unified Visual Canvas

Cong Wan, Jingzhou You, SongLin Dong, Kailai Chen, Zhiheng Ma, Yihong Gong

NeurIPS 2026 · Under review

CVPR 2026

ReMoT: Reinforcement Learning with Motion Contrast Triplets

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong

CVPR 2026 · ★ Highlight · Top 3.8%

arXiv

NeurIPS 2026

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

Jiangyang Li, Cong Wan, Changjie Wu, SongLin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

NeurIPS 2026 · Under review

arXiv

NeurIPS 2026

Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

Jianchao Zhao, Huoren Yang, Yusong Hu, Yuyang Gao, Qiguan Ou, Cong Wan, SongLin Dong, Zhiheng Ma, Yihong Gong

NeurIPS 2026 · Under review

arXiv


ACM MM 2026

CoRe: A Comprehensive Framework for Cross-Image Comparative Reasoning in Vision-Language Models

Lin Peng, Cong Wan, Zeyu Guo, SongLin Dong, Yihong Gong

ACM Multimedia (MM) 2026

arXiv

CVPR 2026

Trajectory-Diversity-Driven Robust Vision-and-Language Navigation

Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong

CVPR 2026 Findings

arXiv

ICCV 2025

CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Yu Li, Shao-Lun Huang

ICCV 2025

arXiv · Code

ACM MM 2025

CIA: Class- and Instance-aware Adaptation for Vision-Language Models

Lin Peng, Cong Wan, Shaokun Wang, Xiang Song, Yuhang He, Yihong Gong

ACM Multimedia (MM) 2025

Paper

arXiv 2024

GRID: Omni Visual Generation

Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong

arXiv preprint

arXiv · Code

NeurIPS 2024

Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models

Cong Wan, Yuhang He, Xiang Song, Yihong Gong

NeurIPS 2024

arXiv · Code

IEEE TMM 2026

SDA: Structure-aware Distribution Alignment for Vision-Language Models

Lin Peng, Cong Wan, Shaokun Wang, Yuhang He, Yihong Gong

IEEE Transactions on Multimedia (TMM) 2026

Experience

ByteDance Seed · Nov 2025 – Present

Research Intern · Foundation Model, Embodied AI · Beijing · Led by Chang Zhou

VLM pre-training; unified VLA world model; interleaved image-text reasoning for robot manipulation.

Machine Learning Group, Microsoft Research Asia (MSRA) · Jun 2025 – Nov 2025

Research Intern · World Model · Beijing · Led by Li Zhao

Voxel-based world models; dynamic scene consistency.

Alibaba DAMO Academy · Dec 2024 – Jun 2025

Research Intern · AIGC, Unified Model · Hangzhou · Led by Hao Luo

Unified visual generation (GRID); large-scale generation/editing dataset construction.

Education

Xi'an Jiaotong University · 2024 – 2027

M.S. in Computer Science and Technology · Advised by Prof. Yihong Gong

MIV Lab · Foundation Models, Reinforcement Learning, World Models.

Xi'an Jiaotong University · 2020 – 2024

B.S. in Mathematics

Honors & Awards

National Scholarship · Graduate

Special-Class Scholarship, Xi'an Jiaotong University

Outstanding Undergraduate · 2024

National Postgraduate Mathematical Modeling Competition, Second Prize

National Undergraduate Mathematics Competition, Second Prize

Mathematical Contest in Modeling (MCM), Meritorious Winner

National High School Mathematics League, Second Prize

Academic Service

Conference Reviewer · NeurIPS

Journal Reviewer · IEEE Transactions on Multimedia (TMM)