Jilan Xu

jilanxu18 at fudan dot edu dot cn

I am currently a postdoctoral researcher at the Visual Geometry Group (VGG), University of Oxford, advised by Professor Andrew Zisserman. Before that, I received my Ph.D. degree from Fudan University, where I was advised by Professor Yuejie Zhang. I also collaborate closely with Professor Weidi Xie at Shanghai Jiao Tong University.

My research interests include multimodal video understanding, visual representation learning, and medical image analysis. I aspire to contribute to building medical AI agents that can heal the world and make it a better place for all humankind.

Google Scholar / Twitter / GitHub / Zhihu

📢 News

[01/2026] Two papers, Ego-Exo Survey and Video Mamba Suite, have been accepted to IJCV!
[09/2025] Three papers, EgoThinker, AOR, and EgoExoBench, have been accepted to NeurIPS 2025!
[09/2025] One paper, Vinci, has been accepted to IMWUT 2025!
[09/2025] We ranked 1st in (1) the Fair Disease Diagnosis Challenge and (2) the Multi-Source COVID-19 Detection Challenge @ ICCV 2025
[07/2025] One paper, Streamformer, has been accepted to ICCV 2025 as an Oral paper!!!
[03/2025] EgoExoLearn won the EgoVis 2023/2024 Distinguished Paper Award!!!
[01/2025] Three papers, EgoExo-Gen, EgoVideo, and CGBench, have been accepted to ICLR 2025!!!
[01/2025] Honored to be invited to give a talk on joint egocentric-exocentric video understanding at TechBeat
[05/2024] Our CVPR papers Egoinstructor and EgoExoLearn have been accepted to the 1st LPVL Workshop @ CVPR 2024

📑 Research

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
Technical Report
arXiv / project / code
OmniStream is a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs, supporting visual probing, streaming geometric reconstruction, VLM, and VLA.

Scaling Audio-Text Retrieval with Multimodal Large Language Models
Jilan Xu, Carl Thome, Danijela Horak, Weidi Xie, Andrew Zisserman
Technical Report
arXiv / code
AuroLA is a contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models as a unified backbone for retrieval.

Vinci: A Real-time Smart Assistant Based on Egocentric Vision-language Model for Portable Devices
Yifei Huang, Jilan Xu, Baoqi Pei, Lijin Yang, Mingfang Zhang, Yuping He, Guo Chen, et al.
IMWUT 2025
arXiv / code
A real-time vision-language system that can assist users with daily tasks, including scene understanding, grounding, summarization, and future planning.

Learning Streaming Video Representation via Multitask Training
Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
ICCV 2025, Oral
arXiv / project / code
A streaming video backbone that learns global, temporal, and spatial video features in a unified visual-textual alignment framework.

EgoExo-Gen: Egocentric Video Prediction by Watching Exocentric Videos
Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
ICLR 2025
arXiv
A cross-view video prediction model that predicts future egocentric video frames by leveraging paired exocentric video and text instructions.

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao
CVPR 2024
arXiv / project / code
A cross-view benchmark dataset that emulates the human demonstration-following process, containing recorded egocentric videos guided by exocentric-view demonstration videos.

Retrieval-Augmented Egocentric Video Captioning
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
CVPR 2024
arXiv / project / code
Given an egocentric video, Egoinstructor automatically retrieves relevant exocentric instructional videos to assist egocentric video captioning.

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie
CVPR 2023
arXiv / project / code
Training open-vocabulary semantic segmentation models with image-text pairs only, which enables zero-shot transfer to various segmentation datasets.

CREAM: Weakly supervised object localization via class re-activation mapping
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao
CVPR 2022
arXiv
A weakly-supervised object localization model that generates better CAMs via soft-clustering algorithms.

Does video-text pretraining help open vocabulary online action detection
Qingsong Zhao, Yi Wang, Jilan Xu, Yinan He, Zifan Song, Limin Wang, Yu Qiao, Cairong Zhao
NeurIPS 2024
arXiv
A zero-shot online action detector that leverages vision-language models and enables open-world temporal understanding.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
Tech report 2022
arXiv / code
A foundation model for video and video-text understanding, achieving state-of-the-art results on over 30 benchmark datasets.

AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation
Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang
NeurIPS 2025
arXiv / project / code
An anatomical ontology-guided reasoning model and a large instruction dataset for chest X-ray image understanding.

QMix: Quality-aware Learning with Mixed Noise for Robust Retinal Disease Diagnosis
Junlin Hou, Jilan Xu, Rui Feng, Hao Chen
IEEE Transactions on Medical Imaging 2025
arXiv
A noise learning framework that learns a robust disease diagnosis model under mixed noise scenarios.

Concept-Attention Whitening for Interpretable Skin Lesion Diagnosis
Junlin Hou, Jilan Xu, Hao Chen
MICCAI 2024
arXiv
An XAI framework that aligns the axes of the latent space with concepts of interest for interpretable skin lesion diagnosis.

Anatomical structure-guided medical vision-language pre-training
Qingqiu Li, Xiaohan Yan, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Shujun Wang
MICCAI 2024
arXiv
An anatomical structure-guided visual-text pre-training framework that leverages anatomical knowledge.

CMC_v2: Towards More Accurate COVID-19 Detection with Discriminative Video Priors
Junlin Hou, Jilan Xu, Nan Zhang, Yi Wang, Yuejie Zhang, Xiaobo Zhang, Rui Feng
ECCV 2022 AIMIA Workshop
arXiv / code
A Transformer-based model with contrastive representation enhancement. Winner of the 2nd COVID-19 Detection Challenge at ECCV 2022.

TCCNet: Temporally Consistent Context-Free Network for Semi-supervised Video Polyp Segmentation
Xiaotong Li, Jilan Xu, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao
IJCAI 2022, Oral paper
Co-training a model for semi-supervised video polyp segmentation, achieving comparable results using only 15% labeled data.

CMC-COV19D: Contrastive Mixup Classification for COVID-19 Diagnosis
Junlin Hou, Jilan Xu, Rui Feng, Yuejie Zhang, Fei Shan, Weiya Shi
ICCV 2021, AIMIA Workshop
paper / code
A ResNeSt-50 model combined with a contrastive mixup technique for 3D COVID-19 CT image classification. Winner of the 1st COVID-19 Detection Challenge.

Data-Efficient Histopathology Image Analysis with Deformation Representation Learning
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Chunyang Ruan, Tao Zhang, Weiguo Fan
BIBM 2020, Oral paper
Introducing a self-supervised deformation representation learning technique for histopathology image analysis.

🏆 Awards & Honors

Winner of the 6th-COV19D Competition Track 1 (Fair Disease Diagnosis Challenge) and Track 2 (Multi-Source COVID-19 Detection Challenge) @ ICCV 2025
Winner of 7 tracks (including Natural Language Queries, Visual Queries 2D, Short-term Object Interaction Anticipation, Long-term Action Anticipation, Body Pose, and Domain Adaptation for Action Recognition) in the 1st EgoVis Workshop @ CVPR 2024
Winner of the 4th-COV19D Competition Track 2 (COVID-19 Domain Adaptation Challenge) and 4th place in Track 1 (COVID-19 Detection Challenge) @ CVPR 2024
Winner of the MMAC Challenge Track 1 (Classification of Myopic Maculopathy) and Track 2 (Segmentation of Myopic Maculopathy Plus Lesions) @ MICCAI 2023
Winner of the 1st & 2nd COVID-19 Detection Challenge @ ICCV 2021 & ECCV 2022
Winner of the 1st COVID-19 Severity Detection Challenge @ ECCV 2022
VenusTech Enterprise Scholarship

💼 Working Experience

Shanghai AI Laboratory

Research Intern

Supervised by Dr. Yifei Huang, Yi Wang and Prof. Yu Qiao

Bell AI Lab, Shanghai

Research Intern

Supervised by Dr. Chenhui Ye

Google Winter AI Camp

🏆 Best Presentation Award

Morgan Stanley Technology

Software Engineering Intern

Supervised by Ray Zhou

🎓 Academic Services

Guest Editor

Journal of Imaging (IF=3.3)

Conference Reviewer

ICLR26/25, NeurIPS25/24/22, ECCV26/24, MICCAI25/24, CVPR25/24/23, ICCV25/23, ACMMM25, ICML26/25, SIGIR26

Journal Reviewer

Nature Communications, TPAMI, IJCV, TMM, Neurocomputing

Teaching Assistant (TA)

Data Structure, The Theory of Computation
