Our Publications

Keep up to date with what we're working on!


Authors: Yan Zhang, Sergey Prokudin, Marko Mihajlovic, Qianli Ma, Siyu Tang

DOMA is an implicit motion field modeled by a spatiotemporal SIREN network. The learned motion field can predict how novel points move in the same field.
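As a rough illustration of the idea, the sketch below builds a tiny SIREN-style network (sine activations) that maps a point and a time, (x, y, z, t), to a 3D displacement. The layer sizes, the frequency scale `OMEGA_0`, the random untrained weights, and all function names are assumptions made for illustration; this is not the DOMA implementation, and a real model would be fitted to observed motion first.

```python
import math
import random

OMEGA_0 = 30.0  # assumed frequency scale for the sine activations


def init_layer(n_in, n_out, rng, bound):
    """Return (weights, biases) drawn uniformly from [-bound, bound]."""
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weights, biases


def affine(layer, x):
    """Apply W x + b using plain Python lists."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def make_motion_field(hidden=16, seed=0):
    """Build an untrained field: (point, t) -> displacement (hypothetical helper)."""
    rng = random.Random(seed)
    first = init_layer(4, hidden, rng, 1.0 / 4)
    bound = math.sqrt(6.0 / hidden) / OMEGA_0
    second = init_layer(hidden, hidden, rng, bound)
    head = init_layer(hidden, 3, rng, bound)

    def displacement(point, t):
        # Space and time are concatenated into one 4D input coordinate.
        h = [math.sin(OMEGA_0 * v) for v in affine(first, list(point) + [t])]
        h = [math.sin(OMEGA_0 * v) for v in affine(second, h)]
        return affine(head, h)  # linear output layer: a 3D offset

    return displacement


def move(field, point, t):
    """Move a point by the displacement the field predicts at time t."""
    return [p + d for p, d in zip(point, field(point, t))]


field = make_motion_field()
moved = move(field, (0.1, -0.2, 0.3), t=0.5)
```

Because the field is a continuous function of position and time, any point can be queried, including points that were never part of the fitting data.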

Authors: Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan, Thabo Beeler, Supasorn Suwajanakorn, Siyu Tang

Diffusion Noise Optimization (DNO) can leverage existing human motion diffusion models as universal motion priors. We demonstrate its capability in motion editing tasks, where DNO can preserve the content of the original model and accommodate a diverse range of editing modes, including changing trajectory, pose, and joint location, and avoiding newly added obstacles.

Authors: Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo

Conditioned on noisy and occluded input data, RoHM reconstructs complete, plausible motions in consistent global coordinates.

Authors: Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang

EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.

Authors: Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang

We introduce a morphable diffusion model to enable consistent, controllable novel view synthesis of humans from a single image. Given a single input image and a morphable mesh with a desired facial expression, our method directly generates 3D-consistent and photo-realistic images from novel viewpoints, which can be used to reconstruct a coarse 3D model with off-the-shelf neural surface reconstruction methods such as NeuS2.

Authors: Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, Siyu Tang

Given a monocular video, 3DGS-Avatar learns clothed human avatars that model pose-dependent appearance and generalize to out-of-distribution poses, with a short training time and interactive rendering frame rates.

Authors: Marko Mihajlovic, Sergey Prokudin, Marc Pollefeys, Siyu Tang

ResField layers incorporate time-dependent weights into MLPs to effectively represent complex temporal signals.
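One way to picture time-dependent weights is a linear layer whose weight matrix is a shared base matrix plus a residual that changes with the frame index. The sketch below composes that residual from a few basis matrices mixed by per-frame coefficients. The factorized form, the class name, and the toy sizes are assumptions made for illustration, not the authors' implementation.

```python
class TimeDependentLinear:
    """Linear layer with W(t) = W_base + sum_r coeffs[t][r] * basis[r] (illustrative)."""

    def __init__(self, w_base, bias, basis, coeffs):
        self.w_base = w_base  # shared weights, n_out x n_in
        self.bias = bias      # shared bias, length n_out
        self.basis = basis    # list of residual basis matrices, each n_out x n_in
        self.coeffs = coeffs  # coeffs[frame] holds one mixing coefficient per basis matrix

    def weights_at(self, frame):
        """Compose the weight matrix used at the given frame."""
        w = [row[:] for row in self.w_base]
        for c, m in zip(self.coeffs[frame], self.basis):
            for i, row in enumerate(m):
                for j, v in enumerate(row):
                    w[i][j] += c * v
        return w

    def __call__(self, x, frame):
        w = self.weights_at(frame)
        return [sum(wij * xj for wij, xj in zip(row, x)) + b
                for row, b in zip(w, self.bias)]


layer = TimeDependentLinear(
    w_base=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
    basis=[[[0.0, 1.0], [0.0, 0.0]]],  # a single residual basis matrix
    coeffs=[[0.0], [2.0]],             # frame 0: no residual, frame 1: scaled residual
)
out_frame0 = layer([1.0, 3.0], frame=0)  # [1.0, 3.0]: base weights only
out_frame1 = layer([1.0, 3.0], frame=1)  # [7.0, 3.0]: base plus residual
```

With zero coefficients the layer reduces to an ordinary linear layer, so the time dependence lives entirely in the residual term.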


Authors: Ayça Takmaz*, Jonas Schult*, Irem Kaftan, Mertcan Akçay, Bastian Leibe, Robert Sumner, Francis Engelmann, Siyu Tang

We propose the first multi-human body-part segmentation model, called Human3D 🧑‍🤝‍🧑, that directly operates on 3D scenes. In an extensive analysis, we validate the benefits of training on synthetic data on multiple baselines and tasks.

Authors: Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, Siyu Tang

Interaction with environments is one core ability of virtual humans and remains a challenging problem. We propose a method capable of generating a sequence of natural interaction events in real cluttered scenes.

Authors: Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, Siyu Tang

The Guided Motion Diffusion (GMD) model can synthesize realistic human motion according to a text prompt, a reference trajectory, and key locations, while also avoiding hitting your toe on giant X-mark circles that someone dropped on the floor. No need to retrain diffusion models for each of these tasks!

Authors: Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang

We propose a novel scene-conditioned probabilistic method to recover the human mesh from an egocentric view image (typically with the body truncated) in the 3D environment.

Authors: Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, Siyu Tang

We propose to model dynamic surfaces with a point-based model, where the motion of a point over time is represented by an implicit deformation field. Working directly with points (rather than SDFs) allows us to easily incorporate various well-known deformation constraints, e.g. as-isometric-as-possible. We showcase the usefulness of this approach for creating animatable avatars in complex clothing.

Authors: Anpei Chen, Zexiang Xu, Xinyue Wei, Siyu Tang, Hao Su, Andreas Geiger

We present Dictionary Fields, a novel neural representation which decomposes a signal into a product of factors, each represented by a classical or neural field representation, operating on transformed input coordinates.

Authors: Theodora Kontogianni, Ekin Celikkan, Siyu Tang, Konrad Schindler

We present interactive object segmentation directly in 3D point clouds. Users provide feedback to a deep learning model in the form of positive and negative clicks to segment a 3D object of interest.

Authors: Korrawe Karunratanakul, Sergey Prokudin, Otmar Hilliges, Siyu Tang

We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting a high-fidelity appearance and geometry.

Authors: Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe

Mask3D predicts accurate 3D semantic instances, achieving state-of-the-art results on ScanNet, ScanNet200, S3DIS and STPLS3D.


Authors: Qianli Ma, Jinlong Yang, Michael J. Black and Siyu Tang

The power of point-based digital human representations further unleashed: SkiRT models dynamic shapes of 3D clothed humans including those that wear challenging outfits such as skirts and dresses.

Authors: Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo and Siyu Tang

A large-scale dataset of accurate 3D human body shape, pose and motion of humans interacting in 3D scenes, with multi-modal streams from third-person and egocentric views, captured by Azure Kinects and a HoloLens2.

Authors: Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang

Given sparse multi-view videos, ARAH learns animatable clothed human avatars that have detailed pose-dependent geometry/appearance and generalize to out-of-distribution poses.

Authors: Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang

Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. We propose COINS, for COmpositional INteraction Synthesis with Semantic Control.

Authors: Yan Wu*, Jiahao Wang*, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu and Siyu Tang
(* denotes equal contribution)

Our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object.

Authors: Marko Mihajlovic, Shunsuke Saito, Aayush Bansal, Michael Zollhoefer and Siyu Tang

COAP is a novel neural implicit representation for articulated human bodies that provides an efficient mechanism for modeling self-contacts and interactions with 3D environments.

Authors: Yan Zhang and Siyu Tang

We propose GAMMA, an automatic and scalable solution, to populate the 3D scene with diverse digital humans. The digital humans have 1) varied body shapes, 2) realistic and perpetual motions to reach goals, and 3) plausible body-ground contact.

Authors: Hongwei Yi, Chun-Hao Paul Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, Michael Black

Humans are in constant contact with the world as they move through it and interact with it. This contact is a vital source of information for understanding 3D humans, 3D scenes, and the interactions between them.

Authors: Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys

Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.

Authors: Vassilis Choutas, Lea Müller, Chun-Hao Paul Huang, Siyu Tang, Dimitrios Tzionas, Michael Black

We exploit the anthropometric measurements and linguistic shape attributes in several novel ways to train a neural network, called SHAPY, that regresses 3D human pose and shape from an RGB image.


Authors: Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, Siyu Tang

MetaAvatar is a meta-learned model that represents generalizable and controllable neural signed distance fields (SDFs) for clothed humans. It can be quickly fine-tuned to represent unseen subjects given as few as 8 monocular depth images.

Authors: Zicong Fan, Adrian Spurr, Muhammed Kocabas, Siyu Tang, Michael J. Black and Otmar Hilliges

In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.

Authors: Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M. Rehg, Siyu Tang

We seek to reconstruct 4D second-person human body meshes that are grounded on the 3D scene captured in an egocentric view. Our method exploits 2D observations from the entire video sequence and the 3D scene context to optimize human body models over time, and thereby leads to more accurate human motion capture and more realistic human-scene interaction.

Authors: Korrawe Karunratanakul, Adrian Spurr, Zicong Fan, Otmar Hilliges, Siyu Tang

We present HALO, a neural occupancy representation for articulated hands that produces implicit hand surfaces from input skeletons in a differentiable manner.

Authors: Qianli Ma, Jinlong Yang, Siyu Tang and Michael J. Black

We introduce POP — a point-based, unified model for multiple subjects and outfits that can turn a single, static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.

Authors: Siwei Zhang, Yan Zhang, Federica Bogo, Marc Pollefeys and Siyu Tang

LEMO learns motion priors from a large-scale mocap dataset and proposes a multi-stage optimization pipeline to enable 3D motion reconstruction in complex 3D scenes.

Authors: Lea Müller, Ahmed A. A. Osman, Siyu Tang, Chun-Hao P. Huang and Michael J. Black

We develop new datasets and methods that significantly improve human pose estimation with self-contact.

Authors: Yan Zhang, Michael J. Black and Siyu Tang

"We are more than our joints", or MOJO for short, is a solution to stochastic motion prediction of expressive 3D bodies. Given a short motion from the past, MOJO generates diverse plausible motions in the near future.

Authors: Marko Mihajlovic, Yan Zhang, Michael J. Black and Siyu Tang

LEAP is a neural network architecture for representing volumetric animatable human bodies. It follows traditional human body modeling techniques and leverages a statistical human prior to generalize to unseen humans.

Authors: Shaofei Wang, Andreas Geiger and Siyu Tang

Registering point clouds of dressed humans to parametric human models is a challenging task in computer vision. We propose novel piecewise transformation fields (PTF), a set of functions that learn 3D translation vectors and thereby facilitate occupancy learning, joint-rotation estimation and mesh registration.

Authors: Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang and Michael J. Black

SCALE models 3D clothed humans with hundreds of articulated surface elements, resulting in avatars with realistic clothing that deforms naturally even in the presence of topological change.


Authors: Korrawe Karunratanakul, Jinlong Yang, Yan Zhang, Michael Black, Krikamol Muandet, Siyu Tang

Capturing and synthesizing hand-object interaction is essential for understanding human behaviours, and is key to a number of applications including VR/AR, robotics and human-computer interaction.

Authors: Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang

Automated synthesis of realistic humans posed naturally in a 3D scene is essential for many applications. In this paper we propose explicit representations for the 3D scene and the person-scene contact relation in a coherent manner.

Authors: Yan Zhang, Michael J. Black, Siyu Tang

In this work, our goal is to generate significantly longer, or “perpetual”, motion: given a short motion sequence or even a static body pose, the goal is to generate non-deterministic, ever-changing human motions in the future.

Authors: Miao Liu, Siyu Tang, Yin Li, and James M. Rehg

We address the challenging task of anticipating human-object interaction in first-person videos. We adopt intentional hand movement as a future representation and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action.

Authors: Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, Otmar Hilliges

We propose the ETH-XGaze dataset: a large-scale (over 1 million samples) gaze estimation dataset with high-resolution images under extreme head poses and gaze directions.

Authors: Yan Zhang, Mohamed Hassan, Heiko Neumann, Michael J. Black, Siyu Tang

We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene.

Authors: Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black

CAPE is a Graph-CNN-based generative model for dressing 3D meshes of the human body. It is compatible with the popular body model SMPL and can generalize to diverse body shapes and body poses. The CAPE Dataset provides SMPL mesh registrations of 4D scans of people in clothing, along with registered scans of the ground-truth body shapes under clothing.

Authors: Anurag Ranjan, David T. Hoffmann, Dimitrios Tzionas, Siyu Tang, Javier Romero, Michael J. Black

We created an extensive Human Optical Flow dataset containing images of realistic human shapes in motion together with ground-truth optical flow. We then train two compact network architectures based on spatial pyramids, namely SpyNet and PWC-Net.

Authors: Jie Song, Bjoern Andres, Michael J. Black, Otmar Hilliges, Siyu Tang

We propose an end-to-end trainable framework to learn feature representations globally in a graph decomposition problem.