Wenbo Ji 嵇文博 (嵇 is pronounced jī / yí)

I am an M.Sc. student at the Technical University of Munich and a research intern at Agile Robots SE. I build generative and geometric models for interactive humans, dynamic 3D scenes, and robot learning.

My current work covers two topics: camera-controlled human motion video diffusion, with Prof. Matthias Nießner, and cross-embodiment video generation for dexterous manipulation.

I am seeking Ph.D. opportunities for Fall 2027 in 3D/4D scene representation, video generation, and robot world models.

Previously, I collaborated with Prof. Daniel Cremers, Prof. Nassir Navab, and Prof. Benjamin Busam on 3D reconstruction and tracking. I earned an M.Sc. in Computer Science from Tongji University and a B.Sc. in Information and Computing Science from Nanjing Tech University.

News

04/2026

I joined Agile Robots SE as a research intern working on world models for robot dexterous manipulation.

10/2025

Our paper CSG-Fusion received the Best Paper Award at the ICCV 2025 Workshop E2E3D (workshop track).

08/2025

I graduated from Tongji University with a Master's degree in Computer Science.

06/2025

Our paper LiteTracker was accepted to MICCAI 2025.

01/2025

Our paper RE0 was accepted to ICRA 2025.

08/2024

I joined ImFusion as a research intern working on dense video point tracking.

01/2024

I joined TUM-CVG and TUM-DI-LAB under the co-supervision of Prof. Yan Xia and Prof. Chuanxia Zheng.

05/2023

I joined the Technical University of Munich as a Master's student in Electrical Engineering and Information Technology.

Research Interests

Past
Built foundations in long-term tracking, 3D/4D reconstruction, and scene decomposition.
Now
Working on camera-controlled human motion video generation and video world models for robot dexterous manipulation.
Next
Unifying these threads into perception-action models of interactive humans and dynamic scenes.

3D/4D Scene Representation, Human-Centric Video Generation, Dynamic Visual Perception, Embodied World Models

Selected Publications

* equal contribution, † corresponding author

Figure: ViDS teaser showing identity-preserving portrait animation driven by 3D face normal maps.

Human-Centric Video Generation

ViDS: Video Diffusion Shader using 3D Face Tracking

Preprint, 2026

Authors

Wenbo Ji, Davide Davoli, Zhe Chen, Liam Schoneveld, Matthias Nießner, Jiapeng Tang†

Overview

3D face tracking-conditioned video diffusion for expressive, identity-preserving portrait animation from a single image, with autoregressive sampling for longer videos. On VFHQ, ViDS ranked first on 8 of 13 reported metrics.

3D/4D Scene Representation

CSG-Fusion: Consistent Sparse-View Gaussian Splatting via Matching-based Fusion

ICCV Workshop E2E3D, 2025 (Best Paper Award)

Authors

Yan Xia*†, Wenbo Ji*, Weirong Chen, Daniel Cremers

Overview

Matching-based fusion of sparse-view pointmaps into compact, cross-view-consistent 3D Gaussians. At 90% ScanNet++ overlap, it improved PSNR by 2.8 dB over Splatt3R while using approximately 124K fewer Gaussians.

Dynamic Visual Perception

LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking

MICCAI, 2025

Authors

Mert Asim Karaoglu, Wenbo Ji, Ahmed Abbas, Nassir Navab, Benjamin Busam, Alexander Ladikos†

Overview

Causal temporal feature reuse with prior-motion initialization for accurate, low-latency online tissue tracking. It ran approximately 7× faster than its predecessor and 2× faster than the prior state of the art, reaching a 95th-percentile (P95) latency of 29.67 ms for 1,024 points.

Dynamic Visual Perception

RE0: Recognize Everything with 3D Zero-shot Instance Segmentation

ICRA, 2025

Authors

Xiaohan Yan*, Zijian Jiang*, Yinghao Shuai*, Nan Wang, Xiaowei Song, Wenbo Ji, Ge Wu, Jinyu He, Gang Wei, Zhicheng Wang†

Overview

Training-free 3D zero-shot instance segmentation from multi-view masks and CLIP semantics.

Experiences

Embodied World Models

Video World Model for Robot Dexterous Manipulation

Apr 2026 - Now

Research Internship

Contributions

Developing a cross-embodiment video generation method that translates egocentric human demonstrations into robot-domain videos for downstream policy learning.

Mentors

Mahdi Mustapha Hamad (Agile Robots SE / WRD Group)

Human-Centric Video Generation

Human Motion Video Diffusion

March 2026 - Now

Master's Thesis

Contributions

Developing a camera-controlled video diffusion model for controllable synthesis of human motion and scene interactions across changing viewpoints.

Mentors

Yu Chi, Jiapeng Tang, Prof. Matthias Nießner (TUM Visual Computing)

Human-Centric Video Generation

Human Head Avatar Animation

April 2025 - March 2026

Research Internship

Contributions

Led the development of ViDS, an identity-preserving video diffusion method for long-form portrait animation.
ViDS ranked first on 8 of 13 VFHQ metrics, improving reenactment quality and identity preservation.

Mentors

Jiapeng Tang, Prof. Matthias Nießner (TUM Visual Computing)

Dynamic Visual Perception

Dense Point Tracking

Aug 2024 - April 2025

Research Internship

Contributions

Implemented LiteTracker’s online inference and EMA-flow initialization, and conducted low-latency tracking experiments.
LiteTracker remained competitive on STIR and SuPer while running ~7× faster than its predecessor and 2× faster than prior state of the art.

Mentors

Mert Asim Karaoglu (ImFusion & TUM CAMP)

Prof. Benjamin Busam, Prof. Nassir Navab (TUM CAMP)

Dr. Alexander Ladikos (ImFusion)

3D/4D Scene Representation

3D Scene Decomposition

Feb 2024 - Aug 2025

Guided Research

Contributions

Designed and evaluated CSG-Fusion, which received the Best Paper Award at the ICCV 2025 Workshop E2E3D.
Improved ScanNet++ PSNR by 2.8 dB over Splatt3R at 90% overlap with ~124K fewer Gaussians, and demonstrated zero-shot generalization on DTU.

Mentors

Prof. Daniel Cremers, Prof. Yan Xia, Weirong Chen (TUM CVG)

Prof. Chuanxia Zheng (Oxford VGG)

3D/4D Scene Representation

Large Scale 3D Scene Reconstruction

July 2023 - Sep 2023

Research Assistant

Contributions

Contributed to the development and experimental evaluation of a large-scale 3D scene reconstruction pipeline.

Mentors

Prof. Yiyi Liao (Zhejiang University)

Thesis

3D/4D Scene Representation

Endoscopic Scene Reconstruction with 4D Half Gaussian Splatting

2025

Master's Thesis

Overview

Developed a 4D Half-Gaussian splatting pipeline for deformable stereo endoscopic reconstruction with depth-prior initialization, HexPlane spatiotemporal deformation, and edge-aware depth regularization. Achieved a PSNR of 38.1 dB on EndoNeRF in comparisons with prior endoscopic GS/NeRF baselines; also evaluated on SCARED.

Technical Report

3D/4D Scene Representation

Object-Centric 3D Reconstruction and Decomposition

2025

TUM DI Lab Report

Authors

Wenbo Ji, Michael Neumayr, Nina Kirakosyan, Filip Skubacz

Overview

A TUM DI Lab report on object-centric 3D reconstruction and decomposition with 3D Gaussian Splatting.

Education

M.Sc. Electrical Engineering and Information Technology

Technical University of Munich

2023 – Now

Double-degree program with Tongji University (Tongji M.Sc. awarded 2025).
Thesis: camera-controlled human motion video diffusion at the Visual Computing Group.

M.Sc. Computer Science

Tongji University

2021 – 2025

Thesis: endoscopic scene reconstruction with 4D half-Gaussian splatting.

B.Sc. Information and Computing Science (Embedded Software)

Nanjing Tech University

2017 – 2021

A computing major within the Department of Mathematics.

Selected Awards

2023 - 2024, Munich, Germany

Deutscher Akademischer Austauschdienst (DAAD) Scholarship

Recognition

2019, Nanjing, China

National Encouragement Scholarship

Recognition

Projects

InfraLens

May 2026

Focus

AI Infrastructure Handbook

Overview

A static handbook for understanding how modern AI systems train, serve, generate, route, compress, and fail.

OpenUserStudyKit

March 2026

Focus

Reusable User Study Infrastructure

Overview

An open-source toolkit for building reusable user study questionnaires and experiment workflows.

LiteAvatar - WASM Version

Jan 2026

Focus

2D Audio-driven Human Avatar Animation

Overview

A lightweight audio-driven 2D avatar solution, based on Lite-avatar, that runs entirely in the browser via WebAssembly (WASM). No backend server is required.

Blog

Thoughts on research, 3D, video generation, and the occasional in-between.

Latest posts are available on the blog page.
