VLG | Computer Vision and Learning Group

Authors: Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang

Abstract

Our goal is to synthesize humans interacting with a given 3D scene, controlled by high-level semantic specifications given as pairs of action categories and object instances, e.g., “sit on the chair”. The key challenge of incorporating interaction semantics into the generation framework is learning a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding. Furthermore, inspired by the compositional nature of interactions, in which humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs. Our proposed generative model can naturally incorporate varying numbers of atomic interactions, which enables synthesizing compositional human-scene interactions without requiring composite interaction data.
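
As a rough illustration of the semantic specification described in the abstract, the sketch below represents an interaction as a set of atomic action-object pairs and composes several of them into one specification. It is not the authors' code: the names `AtomicInteraction`, `compose`, and `describe`, and the string labels used for actions and objects, are hypothetical and chosen only for this example. The generative model itself is not shown.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AtomicInteraction:
    """One atomic action-object pair (hypothetical representation)."""
    action: str     # action category, e.g. "sit on"
    object_id: str  # label of the object instance, e.g. "chair"


def compose(*interactions: AtomicInteraction) -> FrozenSet[AtomicInteraction]:
    """Combine any number of atomic pairs into one unordered specification."""
    return frozenset(interactions)


def describe(spec: FrozenSet[AtomicInteraction]) -> str:
    """Render a specification as text, in a stable (sorted) order."""
    ordered = sorted(spec, key=lambda a: (a.action, a.object_id))
    return " and ".join(f"{a.action} {a.object_id}" for a in ordered)


# A composite interaction built from two atomic pairs.
spec = compose(
    AtomicInteraction("sit on", "chair"),
    AtomicInteraction("touch", "table"),
)
print(describe(spec))
```

Using an unordered set reflects the idea that a composite interaction is a collection of atomic pairs of varying size, so a single-pair specification and a multi-pair one share the same type.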

Authors:

Kaifeng Zhao
PhD student, CAB G 65

Dr. Shaofei Wang
Beijing Institute for General Artificial Intelligence

Dr. Yan Zhang
Epic Games

Prof. Dr. Siyu Tang
Assistant Professor of Computer Science, CNB G 104

Links:

Project PDF Source BibTeX

COINS: Compositional Human-Scene Interaction Synthesis with Semantic Control

Conference: European Conference on Computer Vision (ECCV 2022)
