VLG | Computer Vision and Learning Group

Basic Information

I am a PhD student at Computer Vision and Learning Group (VLG), ETH Zürich, supervised by Professor Siyu Tang. Prior to this, I obtained my Master degree (2020) in Electrical Engineering and Information Technology, ETH Zürich, and Bachelor degree in Automation, Tsinghua University (2017).

My research focuses on human-scene interaction learning, human motion modelling and egocentric human understanding, particularly with the 3D scenes.

Talks

2nd International Ego4D Workshop @ ECCV 2022
EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices.

Microsoft Swiss Joint Research Center 2022
Egocentric Interaction Capture for Mixed Reality.

Microsoft Swiss Joint Research Center 2021
Learning Motion Priors for 4D Human Body Capture in 3D Scenes.

Awards

Qualcomm Innovative Fellowship Europe 2023

Social

Publications

RoHM: Robust Human Motion Reconstruction via Diffusion

Conference: Conference on Computer Vision and Pattern Recognition (CVPR 2024) oral presentation

Authors:Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo
Conditioned on noisy and occluded input data, RoHM reconstructs complete, plausible motions in consistent global coordinates.

Project PDF Code BibTeX

EgoGen: An Egocentric Synthetic Data Generator

Conference: Conference on Computer Vision and Pattern Recognition (CVPR 2024) oral presentation

Authors:Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang
EgoGen is new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.

Project PDF Code BibTeX

EgoHMR: Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Conference: International Conference on Computer Vision (ICCV 2023) oral presentation

Authors:Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang
We propose a novel scene-conditioned probabilistic method to recover the human mesh from an egocentric view image (typically with the body truncated) in the 3D environment.

Project PDF BibTeX

POV-Surgery: A Dataset for Egocentric Hand and Tool Pose Estimation During Surgical Activities

Conference: 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023) oral presentation

Authors:Rui Wang, Sophokles Ktistakis, Siwei Zhang, Mirko Meboldt, Quentin Lohmeyer
We propose POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer.

Project PDF BibTeX

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

Conference: European Conference on Computer Vision (ECCV 2022)

Authors:Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo and Siyu Tang
A large-scale dataset of accurate 3D human body shape, pose and motion of humans interacting in 3D scenes, with multi-modal streams from third-person and egocentric views, captured by Azure Kinects and a HoloLens2.

Project Code Dataset Challenge BibTeX

SAGA: Stochastic Whole-Body Grasping with Contact

Conference: European Conference on Computer Vision (ECCV 2022)

Authors:Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang
(* denotes equal contribution)
Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.

Project PDF Code BibTeX

Learning Motion Priors for 4D Human Body Capture in 3D Scenes

Conference: International Conference on Computer Vision (ICCV 2021) oral presentation

Authors:Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang
LEMO learns motion priors from a larger scale mocap dataset and proposes a multi-stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.

Project PDF Code BibTeX

Facial Emotion Recognition with Noisy Multi-task Annotations

Conference: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2021

Authors:Siwei Zhang, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool
To reduce human labelling effort on multi-task labels, we introduce a new problem of facial emotion recognition with noisy multi-task annotations.

PDF Code BibTeX

Neural architecture search as sparse supernet

Conference: The 35th AAAI Conference on Artificial Intelligence (AAAI) 2021

Authors:Yan Wu; Aoming Liu, Zhiwu Huang, Siwei Zhang, and Luc Van Gool
We model the NAS problem as a sparse supernet using a new continuous architecture representation with a mixture of sparsity constraints.

PDF BibTeX

PLACE: Proximity Learning of Articulation and Contact in 3D Environments

Conference: International Virtual Conference on 3D Vision (3DV) 2020

Authors:Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang
Automated synthesis of realistic humans posed naturally in a 3D scene is essential for many applications. In this paper we propose explicit representations for the 3D scene and the person-scene contact relation in a coherent manner.

Project PDF Code BibTeX

One-shot face reenactment

Conference: The British Machine Vision Conference (BMVC) 2019 spotlight presentation

Authors:Yunxuan Zhang, Siwei Zhang, Yue He, Cheng Li, Chen Change Loy, and Ziwei Liu
We propose a novel one-shot face reenactment learning system, that is able to disentangle and compose appearance and shape information for effective modeling.

PDF Code BibTeX

Basic Information

Talks

Awards

Social

Publications

RoHM: Robust Human Motion Reconstruction via Diffusion

Authors:Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo Conditioned on noisy and occluded input data, RoHM reconstructs complete, plausible motions in consistent global coordinates.

EgoGen: An Egocentric Synthetic Data Generator

Authors:Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang EgoGen is new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.

EgoHMR: Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Authors:Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang We propose a novel scene-conditioned probabilistic method to recover the human mesh from an egocentric view image (typically with the body truncated) in the 3D environment.

POV-Surgery: A Dataset for Egocentric Hand and Tool Pose Estimation During Surgical Activities

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

SAGA: Stochastic Whole-Body Grasping with Contact

Authors:Yan Wu*, Jiahao Wang*, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang (* denotes equal contribution) Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.

Learning Motion Priors for 4D Human Body Capture in 3D Scenes

Authors:Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang LEMO learns motion priors from a larger scale mocap dataset and proposes a multi-​stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.

Facial Emotion Recognition with Noisy Multi-task Annotations

Authors:Siwei Zhang, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool To reduce human labelling effort on multi-task labels, we introduce a new problem of facial emotion recognition with noisy multi-task annotations.

Neural architecture search as sparse supernet

Authors:Yan Wu*; Aoming Liu*, Zhiwu Huang, Siwei Zhang, and Luc Van Gool We model the NAS problem as a sparse supernet using a new continuous architecture representation with a mixture of sparsity constraints.

PLACE: Proximity Learning of Articulation and Contact in 3D Environments

One-shot face reenactment

Authors:Yunxuan Zhang, Siwei Zhang, Yue He, Cheng Li, Chen Change Loy, and Ziwei Liu We propose a novel one-shot face reenactment learning system, that is able to disentangle and compose appearance and shape information for effective modeling.

Authors:Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo
Conditioned on noisy and occluded input data, RoHM reconstructs complete, plausible motions in consistent global coordinates.

Authors:Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang
EgoGen is new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.

Authors:Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang
We propose a novel scene-conditioned probabilistic method to recover the human mesh from an egocentric view image (typically with the body truncated) in the 3D environment.

Authors:Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang
(* denotes equal contribution)
Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.

Authors:Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang
LEMO learns motion priors from a larger scale mocap dataset and proposes a multi-stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.

Authors:Siwei Zhang, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool
To reduce human labelling effort on multi-task labels, we introduce a new problem of facial emotion recognition with noisy multi-task annotations.

Authors:Yan Wu; Aoming Liu, Zhiwu Huang, Siwei Zhang, and Luc Van Gool
We model the NAS problem as a sparse supernet using a new continuous architecture representation with a mixture of sparsity constraints.

Authors:Yunxuan Zhang, Siwei Zhang, Yue He, Cheng Li, Chen Change Loy, and Ziwei Liu
We propose a novel one-shot face reenactment learning system, that is able to disentangle and compose appearance and shape information for effective modeling.