Cong Wan (万聪)

M.S. Student · Research Intern @ ByteDance Seed

Ma-Chao Institute of Vision (MIV) Lab, Xi'an Jiaotong University
Advised by Prof. Yihong Gong  ·  wancong[at]stu.xjtu.edu.cn

About

I am an M.S. student in Computer Science and Technology at Xi'an Jiaotong University (XJTU), advised by Prof. Yihong Gong. I received my B.S. in Mathematics from XJTU in 2024. I am currently a Research Intern at ByteDance Seed (led by Chang Zhou), working on foundation models and embodied AI. Previously, I interned at Microsoft Research Asia (world models) and Alibaba DAMO Academy (unified visual generation).

My research centers on multimodal foundation models — spanning model design, large-scale pre-training, data construction, and benchmark evaluation. I am especially interested in advancing the intelligence of unified multimodal models, so they can perceive, reason, and act more like genuinely intelligent systems.

Multimodal Foundation Models · World Models · Embodied AI / VLA · Visual Generation · Reinforcement Learning

News

Jun 2026
DataClaw preprint is on arXiv — agentic tailoring of multimodal data from raw streams. (project page)
May 2026
Four papers — DataClaw, WorldCanvas, ProSR, and Retrieve-then-Steer — under review at NeurIPS 2026.
Feb 2026
ReMoT accepted to CVPR 2026 as a Highlight (Top 3.8%).
Nov 2025
Started as a Research Intern at ByteDance Seed, working on foundation models & embodied AI.
Jun 2025
Joined the Machine Learning Group at Microsoft Research Asia as a Research Intern (world models).
Sep 2024
Started my M.S. at XJTU; PAP accepted to NeurIPS 2024.

Publications

★ denotes representative (first-author) work. Full list on Google Scholar.

DataClaw: Agentic Tailoring Multimodal Data from Raw Streams
Cong Wan, Zeyu Guo, Zijian Cai, Jiangyang Li, SongLin Dong, Lin Peng, Xiangyang Luo, Zhiheng Ma, Yihong Gong
NeurIPS 2026 · Under review · ★ Representative
WorldCanvas: Context-Guided Embodied Video Synthesis with a Unified Visual Canvas
Cong Wan, Jingzhou You, SongLin Dong, Kailai Chen, Zhiheng Ma, Yihong Gong
NeurIPS 2026 · Under review · ★ Representative
ReMoT: Reinforcement Learning with Motion Contrast Triplets
Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
CVPR 2026 · ★ Highlight · Top 3.8%
ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs
Jiangyang Li, Cong Wan, Changjie Wu, SongLin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong
NeurIPS 2026 · Under review
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Jianchao Zhao, Huoren Yang, Yusong Hu, Yuyang Gao, Qiguan Ou, Cong Wan, SongLin Dong, Zhiheng Ma, Yihong Gong
NeurIPS 2026 · Under review
Trajectory-Diversity-Driven Robust Vision-and-Language Navigation
Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong
CVPR 2026 Findings
CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Yu Li, Shao-Lun Huang
ICCV 2025
CIA: Class- and Instance-aware Adaptation for Vision-Language Models
Lin Peng, Cong Wan, Shaokun Wang, Xiang Song, Yuhang He, Yihong Gong
ACM Multimedia (MM) 2025
GRID: Omni Visual Generation
Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong
arXiv preprint (2024) · ★ Representative
Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models
Cong Wan, Yuhang He, Xiang Song, Yihong Gong
NeurIPS 2024 · ★ Representative
SDA: Structure-aware Distribution Alignment for Vision-Language Models
Lin Peng, Cong Wan, Shaokun Wang, Yuhang He, Yihong Gong
IEEE Transactions on Multimedia (TMM) 2026

Experience

ByteDance Seed · Nov 2025 – Present
Research Intern · Foundation Model, Embodied AI · Beijing · Led by Chang Zhou
VLM pre-training; unified VLA world model; interleaved image-text reasoning for robot manipulation.
Machine Learning Group, Microsoft Research Asia (MSRA) · Jun 2025 – Nov 2025
Research Intern · World Model · Beijing · Led by Li Zhao
Voxel-based world models; dynamic scene consistency.
Alibaba DAMO Academy · Dec 2024 – Jun 2025
Research Intern · AIGC, Unified Model · Hangzhou · Led by Hao Luo
Unified visual generation (GRID); large-scale generation/editing dataset construction.

Education

Xi'an Jiaotong University · 2024 – 2027
M.S. in Computer Science and Technology · Advised by Prof. Yihong Gong
MIV Lab · Foundation Models, Reinforcement Learning, World Models.
Xi'an Jiaotong University · 2020 – 2024
B.S. in Mathematics

Honors & Awards

National Scholarship · Graduate
Special-Class Scholarship, Xi'an Jiaotong University
Outstanding Undergraduate · 2024
National Postgraduate Mathematical Modeling Competition — Second Prize
National Undergraduate Mathematics Competition — Second Prize
Mathematical Contest in Modeling (MCM) — Meritorious Winner
National High School Mathematics League — Second Prize

Academic Service

Conference Reviewer · NeurIPS
Journal Reviewer · IEEE Transactions on Multimedia (TMM)