Welcome to the Computer Vision and Learning Group.


Our group conducts research in Computer Vision, focusing on perceiving and modeling humans.

We study computational models that enable machines to perceive and analyze human activities from visual input. We leverage machine learning and optimization techniques to build statistical models of humans and their behaviors. Our goal is to advance algorithmic foundations of scalable and reliable human digitalization, enabling a broad class of real-world applications. Our group is part of the Institute for Visual Computing (IVC) at the Department of Computer Science of ETH Zurich.

Featured Projects

In-depth look at our work.

ARAH: Animatable Volume Rendering of Articulated Human SDFs

Conference: European Conference on Computer Vision (ECCV 2022)

Authors: Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang

Given sparse multi-view videos, ARAH learns animatable clothed human avatars that have detailed pose-dependent geometry/appearance and generalize to out-of-distribution poses.

COINS: Compositional Human-Scene Interaction Synthesis with Semantic Control

Conference: European Conference on Computer Vision (ECCV 2022)

Authors: Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang

Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. We propose COINS, for COmpositional INteraction Synthesis with Semantic Control.

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

Conference: European Conference on Computer Vision (ECCV 2022)

Authors: Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo and Siyu Tang

EgoBody is a large-scale egocentric dataset for human 3D motion and social interactions in 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene.

SAGA: Stochastic Whole-Body Grasping with Contact

Conference: European Conference on Computer Vision (ECCV 2022)

Authors: Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang

Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.

KeypointNeRF: Generalizing Image-based Volumetric Avatars using Relative Spatial Encoding of Keypoints

Conference: European Conference on Computer Vision (ECCV 2022)

Authors: Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, Shunsuke Saito

KeypointNeRF is a generalizable neural radiance field for virtual avatars.

COAP: Compositional Articulated Occupancy of People

Conference: Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Authors: Marko Mihajlovic, Shunsuke Saito, Aayush Bansal, Michael Zollhoefer, Siyu Tang

COAP is a novel neural implicit representation for articulated human bodies that provides an efficient mechanism for modeling self-contacts and interactions with 3D environments.

MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images

Conference: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021)

Authors: Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, Siyu Tang

MetaAvatar is a meta-learned model that represents generalizable and controllable neural signed distance fields (SDFs) for clothed humans. It can be quickly fine-tuned to represent unseen subjects given as few as eight monocular depth images.


HALO: A Skeleton-Driven Neural Occupancy Representation for Articulated Hands

Conference: International Virtual Conference on 3D Vision (3DV 2021)

Authors: Korrawe Karunratanakul, Adrian Spurr, Zicong Fan, Otmar Hilliges, Siyu Tang

We present HALO, a neural occupancy representation for articulated hands that produces implicit hand surfaces from input skeletons in a differentiable manner.

The Power of Points for Modeling Humans in Clothing

Conference: International Conference on Computer Vision (ICCV 2021)

Authors: Qianli Ma, Jinlong Yang, Siyu Tang and Michael J. Black

We introduce POP — a point-based, unified model for multiple subjects and outfits that can turn a single, static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.

Learning Motion Priors for 4D Human Body Capture in 3D Scenes

Conference: International Conference on Computer Vision (ICCV 2021), oral presentation


Authors: Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang

LEMO learns motion priors from a large-scale mocap dataset and proposes a multi-stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.


We are More than Our Joints: Predicting how 3D Bodies Move

Conference: Conference on Computer Vision and Pattern Recognition (CVPR 2021)

Authors: Yan Zhang, Michael J. Black and Siyu Tang

"We are more than our joints", or MOJO for short, is a solution to stochastic motion prediction of expressive 3D bodies. Given a short motion from the past, MOJO generates diverse plausible motions in the near future.

Latest News

Here’s what we've been up to recently.