Basic Information
I’m a tenure-track assistant professor in the Department of Computer Science at ETH Zürich. I lead the Computer Vision and Learning Group (VLG) at the Institute of Visual Computing. Before joining ETH, I received an early career research grant to start my research group at the Max Planck Institute for Intelligent Systems in November 2017. Prior to that, I was a postdoctoral researcher at the same institute, advised by Dr. Michael Black. I obtained my PhD from the Max Planck Institute for Informatics in 2017, under the supervision of Professor Bernt Schiele. Before that, I received my Master’s degree in Media Informatics from RWTH Aachen University and my Bachelor’s degree in Computer Science from Zhejiang University, China.
My research focuses on computer vision and machine learning, specializing in perceiving and modeling humans. In my group, we study computational models that enable machines to perceive and analyze human pose, motion, and activities from visual input. We leverage machine learning and optimization techniques to build statistical models of humans and their behaviors. Our goal is to advance algorithmic foundations of scalable and reliable human digitalization, enabling a broad class of real-world applications.
Professional Activities
- Technical Papers Committee, SIGGRAPH 2022
- Area Chair: CVPR 2020, 2021, 2022; ECCV 2022; ICCV 2021
- Tutorial Chair: CVPR 2023; ACCV 2020
- Workshop Organization: EgoBody Benchmark, ECCV 2022
Honors and Awards
- Best Paper Award Finalist, CVPR 2022
- Best Paper Award Finalist, CVPR 2021
- Best Paper Award, 3DV 2020
- ELLIS Ph.D. Award 2019
- DAGM MVTec Dissertation Award 2018
- Best Paper Award, BMVC 2012
Funding
- Innosuisse Flagship Project: The Data-driven Transformation of Surgical Training for Proficiency-based Performance
- SNSF Grant: Learning to Create Realistic Human Avatars
- Microsoft Research Grant: First-Person-View Social Interaction Capture for Mixed Reality
- ETH Postdoctoral Fellowship (Sergey Prokudin): Robust and Controllable Neural Avatars
- FIFA Sponsored Research Agreement
- Facebook Research Gift
Exhibition
- Installation "Flight Assembled Architecture Revisited - Inhabiting the Virtual" at the Guggenheim Bilbao Museum. In collaboration with Gramazio Kohler Research, ETH Zurich. 08.04.2022 - 18.09.2022.
Keynote Talks
- CVMP 2021
Learning to Capture and Synthesize 3D Humans in 3D Scenes
- Amazon’s Computer Vision Conference 2021
Animatable Neural Bodies and Hands
Invited Talks
- CVPR 2022 workshop: Computer Vision in the Built Environment
Inhabiting a Virtual City
- 19th Conference on Robots and Vision 2022
Inhabiting a Virtual City
- ETH Robotics Innovation Day 2022
Human Motion Capture and Synthesis
- ETH AI+Art Conversation 2022
Human Motion Capture and Synthesis
- Baidu 2022
Inhabiting the Virtual
- AI4AEC Colloquium 2021
Learning to Capture and Synthesize 3D Humans in 3D Scenes
- SoMoF ICCV Workshop 2021
Animatable Neural Bodies and Hands
- Digital Festival Zurich 2021
- Google 2021
Capture and Synthesis of 3D Humans in 3D Scenes
- ETH Zurich Design++ Opening Event 2021
Capture and Synthesis of 3D Humans in 3D Scenes
- Disney Research, Zurich 2021
Animatable Neural Bodies and Hands
- Facebook, Zurich 2021
Animatable Neural Bodies and Hands
- ELLIS Computer Vision and Pattern Recognition Workshop 2021
- 3DGV 2021
Generating People Interacting with 3D Scenes and Objects
- ETH AI+X Series Future of Retail 2021
- ETH Zurich Industrial Day 2020
Learning to See and Generate People
- GCPR 2020
Generating People Interacting with 3D Scenes and Objects
Publications
Authors: Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang
Given sparse multi-view videos, ARAH learns animatable clothed human avatars that have detailed pose-dependent geometry/appearance and generalize to out-of-distribution poses.
Authors: Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang
Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. We propose COINS, for COmpositional INteraction Synthesis with Semantic Control.
Authors: Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo and Siyu Tang
EgoBody is a large-scale egocentric dataset for human 3D motion and social interactions in 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene.
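For readers who want a concrete sense of the SMPL-X fitting step, below is a minimal sketch using the open-source smplx Python package. It fits a single body to observed 3D joints with a simple joint-distance loss; the neutral-gender model, the regularizer weights, and the assumption that the observed joints follow the SMPL-X joint ordering are illustrative choices, not the actual EgoBody fitting pipeline.

```python
# Minimal sketch: fit an SMPL-X body to observed 3D joints.
# Illustrative only; NOT the EgoBody calibration/fitting code.
import torch
import smplx

def fit_smplx_to_joints(observed_joints, model_path, num_iters=200):
    """observed_joints: (J, 3) target joint positions in metres
    (e.g. from a multi-view RGB-D reconstruction); assumed to follow
    the SMPL-X joint ordering. model_path: folder with SMPL-X model files."""
    body = smplx.create(model_path, model_type='smplx', gender='neutral')

    # Free variables: shape, body pose, and the global rigid transform.
    betas = torch.zeros(1, 10, requires_grad=True)
    body_pose = torch.zeros(1, body.NUM_BODY_JOINTS * 3, requires_grad=True)
    global_orient = torch.zeros(1, 3, requires_grad=True)
    transl = torch.zeros(1, 3, requires_grad=True)

    optim = torch.optim.Adam([betas, body_pose, global_orient, transl], lr=0.01)
    for _ in range(num_iters):
        optim.zero_grad()
        out = body(betas=betas, body_pose=body_pose,
                   global_orient=global_orient, transl=transl)
        model_joints = out.joints[0, :observed_joints.shape[0]]
        # Data term (joint distance) plus simple pose/shape regularizers.
        loss = ((model_joints - observed_joints) ** 2).sum() \
               + 1e-3 * (body_pose ** 2).sum() + 1e-2 * (betas ** 2).sum()
        loss.backward()
        optim.step()

    return body(betas=betas, body_pose=body_pose,
                global_orient=global_orient, transl=transl).vertices
```

In practice such a fit is typically run per frame with additional terms (2D reprojection, temporal smoothness, scene contact), but the structure of the optimization loop stays the same.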
Authors: Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang
Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.
Authors: Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, Shunsuke Saito
KeypointNeRF is a generalizable neural radiance field for virtual avatars.
Authors: Marko Mihajlovic, Shunsuke Saito, Aayush Bansal, Michael Zollhoefer, and Siyu Tang
COAP is a novel neural implicit representation for articulated human bodies that provides an efficient mechanism for modeling self-contacts and interactions with 3D environments.
Authors: Vassilis Choutas, Lea Müller, Chun-Hao Paul Huang, Siyu Tang, Dimitrios Tzionas, Michael Black
We exploit the anthropometric measurements and linguistic shape attributes in several novel ways to train a neural network, called SHAPY, that regresses 3D human pose and shape from an RGB image.
Authors: Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
Authors: Hongwei Yi, Chun-Hao Paul Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, Michael Black
Humans are in constant contact with the world as they move through it and interact with it. This contact is a vital source of information for understanding 3D humans, 3D scenes, and the interactions between them.
Authors: Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, Siyu Tang
MetaAvatar is a meta-learned model that represents generalizable and controllable neural signed distance fields (SDFs) for clothed humans. It can be quickly fine-tuned to represent unseen subjects given as few as 8 monocular depth images.
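As a rough illustration of what fine-tuning a neural SDF on a few depth observations can look like, here is a generic sketch: the network architecture, loss weights, and Eikonal regularizer are assumptions for illustration and are not the MetaAvatar meta-learning procedure itself.

```python
# Illustrative sketch: fine-tune a small SDF MLP on surface points
# back-projected from a handful of depth maps. NOT the MetaAvatar code.
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1))

    def forward(self, x):        # x: (N, 3) query points
        return self.net(x)       # (N, 1) signed distances

def finetune(sdf, surface_points, steps=500, lr=1e-4):
    """surface_points: (N, 3) points assumed to lie on the body surface
    (SDF target = 0), e.g. back-projected from ~8 depth images."""
    opt = torch.optim.Adam(sdf.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Surface term: predicted distance at observed points should be zero.
        loss = sdf(surface_points).abs().mean()
        # Eikonal term: unit-norm gradients at random points in a
        # normalized bounding box (assumed to be [-1, 1]^3 here).
        rand = torch.rand_like(surface_points) * 2 - 1
        rand.requires_grad_(True)
        grad = torch.autograd.grad(sdf(rand).sum(), rand, create_graph=True)[0]
        loss = loss + 0.1 * ((grad.norm(dim=-1) - 1) ** 2).mean()
        loss.backward()
        opt.step()
    return sdf
```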
Authors: Korrawe Karunratanakul, Adrian Spurr, Zicong Fan, Otmar Hilliges, Siyu Tang
We present HALO, a neural occupancy representation for articulated hands that produces implicit hand surfaces from input skeletons in a differentiable manner.
Authors: Zicong Fan, Adrian Spurr, Muhammed Kocabas, Siyu Tang, Michael J. Black and Otmar Hilliges
In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
Authors: Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M. Rehg, Siyu Tang
We seek to reconstruct 4D second-person human body meshes that are grounded on the 3D scene captured in an egocentric view. Our method exploits 2D observations from the entire video sequence and the 3D scene context to optimize human body models over time, and thereby leads to more accurate human motion capture and more realistic human-scene interaction.
Authors: Qianli Ma, Jinlong Yang, Siyu Tang and Michael J. Black
We introduce POP — a point-based, unified model for multiple subjects and outfits that can turn a single, static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.
Learning Motion Priors for 4D Human Body Capture in 3D Scenes
Conference: International Conference on Computer Vision (ICCV 2021), oral presentation
Authors: Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang
LEMO learns motion priors from a large-scale mocap dataset and proposes a multi-stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.
On Self-Contact and Human Pose
Conference: Conference on Computer Vision and Pattern Recognition (CVPR 2021), oral presentation, candidate for Best Paper Award
Authors: Lea Müller, Ahmed A. A. Osman, Siyu Tang, Chun-Hao P. Huang and Michael J. Black
We develop new datasets and methods that significantly improve human pose estimation with self-contact.
Authors: Marko Mihajlovic, Yan Zhang, Michael J. Black and Siyu Tang
LEAP is a neural network architecture for representing volumetric animatable human bodies. It follows traditional human body modeling techniques and leverages a statistical human prior to generalize to unseen humans.
Authors: Shaofei Wang, Andreas Geiger and Siyu Tang
Registering point clouds of dressed humans to parametric human models is a challenging task in computer vision. We propose novel Piecewise Transformation Fields (PTF), a set of functions that learn 3D translation vectors, facilitating occupancy learning, joint-rotation estimation and mesh registration.
Authors: Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang and Michael J. Black
SCALE models 3D clothed humans with hundreds of articulated surface elements, resulting in avatars with realistic clothing that deforms naturally even in the presence of topological change.
Grasping Field: Learning Implicit Representations for Human Grasps
Conference: International Virtual Conference on 3D Vision (3DV) 2020, oral presentation, Best Paper Award
Authors: Korrawe Karunratanakul, Jinlong Yang, Yan Zhang, Michael Black, Krikamol Muandet, Siyu Tang
Capturing and synthesizing hand-object interaction is essential for understanding human behaviours, and is key to a number of applications including VR/AR, robotics and human-computer interaction.
Authors: Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang
Automated synthesis of realistic humans posed naturally in a 3D scene is essential for many applications. In this paper we propose explicit representations for the 3D scene and the person-scene contact relation in a coherent manner.
Authors: Yan Zhang, Michael J. Black, Siyu Tang
In this work, our goal is to generate significantly longer, or “perpetual”, motion: given a short motion sequence or even a static body pose, the goal is to generate non-deterministic ever-changing human motions in the future.
Authors: Miao Liu, Siyu Tang, Yin Li, and James M. Rehg
We address the challenging task of anticipating human-object interaction in first person videos. We adopt intentional hand movement as a future representation and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action.
Authors: Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang , Otmar Hilliges
We propose the ETH-XGaze dataset: a large scale (over 1 million samples) gaze estimation dataset with high-resolution images under extreme head poses and gaze directions.
Generating 3D People in Scenes without People
Conference: Computer Vision and Pattern Recognition (CVPR) 2020, oral presentation
Authors: Yan Zhang, Mohamed Hassan, Heiko Neumann, Michael J. Black, Siyu Tang
We present a fully-automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene.
Authors: Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black
CAPE is a Graph-CNN-based generative model for dressing 3D meshes of the human body. It is compatible with the popular body model SMPL and can generalize to diverse body shapes and poses. The CAPE Dataset provides SMPL mesh registrations of 4D scans of people in clothing, along with registered scans of the ground-truth body shapes under clothing.
Authors: Anurag Ranjan, David T. Hoffmann, Dimitrios Tzionas, Siyu Tang, Javier Romero, Michael J. Black
We created an extensive Human Optical Flow dataset containing images of realistic human shapes in motion together with ground truth optical flow. We then train two compact network architectures based on spatial pyramids, namely SpyNet and PWC-Net.
Authors: Jie Song, Bjoern Andres, Michael J. Black, Otmar Hilliges, Siyu Tang
We propose an end-to-end trainable framework to learn feature representations globally in a graph decomposition problem.