G-DReaM: Graph-conditioned Diffusion Retargeting across Multiple Embodiments

Anonymous Author(s)

G-DReaM can be uniformly applied to heterogeneous embodiments without requiring their motion data, even when their skeletons are non-homeomorphic.

Abstract

Retargeting motions from existing datasets to a specific robot is a critical step in transferring motion patterns from human behaviors to and across various robots. However, inconsistencies in topological structure, geometric parameters, and joint correspondence make it difficult to handle diverse embodiments with a unified retargeting architecture. In this work, we propose a novel unified graph-conditioned diffusion-based motion generation framework for retargeting reference motions across diverse embodiments. The intrinsic characteristics of heterogeneous embodiments are represented with a graph structure that effectively captures the topological and geometric features of different robots. This graph-based encoding further allows knowledge to be exploited at the joint level through customized attention mechanisms developed in this work. Since ground-truth motions for the desired embodiment are unavailable, we train the diffusion model with energy-based guidance formulated as retargeting losses. Our experiments validate that the proposed model, one of the first cross-embodiment motion retargeting methods in robotics, retargets motions across heterogeneous embodiments in a unified manner. Moreover, it demonstrates a certain degree of generalization to both diverse skeletal structures and similar motion patterns.
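The retargeting losses themselves are not detailed on this page; as an illustration only, the training objective can be read as replacing the usual ground-truth reconstruction target with an energy evaluated on the predicted clean motion,

\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{t}\Big[\, E\big(\hat{X}_0(X_t, t;\, \mathcal{M},\, g^{\mathcal{M},X})\big) \Big],
\qquad
E(\hat{X}_0) \;=\; \lambda_{\mathrm{pos}} E_{\mathrm{pos}} + \lambda_{\mathrm{ori}} E_{\mathrm{ori}} + \lambda_{\mathrm{smooth}} E_{\mathrm{smooth}},
\]

where the decomposition into position-tracking, orientation, and smoothness terms is our assumption about plausible retargeting losses, not the exact formulation used by G-DReaM.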

Overview

The denoising network is a transformer decoder that takes a noisy motion \(X_t\) as input, conditioned on the reference motion \(\mathcal{M}\) and the graphs \(g^{\mathcal{M},X} = \{ \phi_v^{\mathcal{M},X}, \phi_e^{\mathcal{M},X},\psi^{\mathcal{M},X}\}\). The input motion is tokenized at the joint level, with the base joint and the remaining joints embedded by independent encoders. Spatial and temporal attention then extract the relationships among all joints and the chronological relationships along the time window; in addition, the spatial attention absorbs the joint connectivity \(\psi\) to enrich the joint relationships. The remaining graph conditions are handled individually by a multi-conditional cross-attention. In particular, the reference motion condition \(\mathcal{M}\) is encoded by a similar transformer decoder and incorporates the joint correspondence \(\eta\) as an attention mask. Finally, the predicted motion \(\hat{X}_0\) is produced by an output decoder.
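As a rough illustration of how these pieces fit together, below is a minimal PyTorch-style sketch of one denoising block. The module names, feature shapes, and the way the connectivity bias and correspondence mask enter the attention are our assumptions, not the released implementation.

    import torch
    import torch.nn as nn


    class GraphConditionedBlock(nn.Module):
        """One denoising block: spatial attention biased by the joint connectivity psi,
        temporal attention along the time window, and cross-attention to the encoded
        reference motion masked by the joint correspondence eta."""

        def __init__(self, dim: int = 128, heads: int = 4):
            super().__init__()
            self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

        def forward(self, x, ref, psi, eta):
            # x:   (B, T, J, D)      joint-level tokens of the noisy motion X_t
            # ref: (B, T, J_ref, D)  encoded reference-motion tokens
            # psi: (J, J)            joint connectivity of the target skeleton
            # eta: (J, J_ref)        boolean joint correspondence
            B, T, J, D = x.shape

            # Spatial attention over joints; connected pairs get a zero bias,
            # unconnected pairs a soft penalty (one simple way to "absorb" psi).
            bias = torch.zeros(J, J, device=x.device, dtype=x.dtype).masked_fill(psi == 0, -1e4)
            xs = x.reshape(B * T, J, D)
            h = self.norms[0](xs)
            xs = xs + self.spatial_attn(h, h, h, attn_mask=bias)[0]

            # Temporal attention along the time window, independently per joint.
            xt = xs.reshape(B, T, J, D).permute(0, 2, 1, 3).reshape(B * J, T, D)
            h = self.norms[1](xt)
            xt = xt + self.temporal_attn(h, h, h)[0]

            # Cross-attention to the reference motion, restricted by the correspondence
            # mask; joints without any correspondence fall back to uniform attention.
            xq = xt.reshape(B, J, T, D).permute(0, 2, 1, 3).reshape(B * T, J, D)
            kv = ref.reshape(B * T, -1, D)
            mask = torch.zeros_like(eta, dtype=x.dtype).masked_fill(~eta, -1e4)
            h = self.norms[2](xq)
            xq = xq + self.cross_attn(h, kv, kv, attn_mask=mask)[0]

            # Feed-forward; the diffusion timestep and the remaining graph conditions
            # (phi_v, phi_e) would also condition each block but are omitted for brevity.
            out = xq + self.ff(self.norms[3](xq))
            return out.reshape(B, T, J, D)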

Multi-Embodiment Motion Retargeting

By leveraging graph structures, G-DReaM encodes both topological and geometric features of diverse robotic embodiments, allowing it to retarget motions across robots with varying link lengths, joint numbers, kinematic chains, and end-effectors in a unified way.
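For concreteness, here is a minimal sketch of how an embodiment might be packaged as the graph condition \(g = \{\phi_v, \phi_e, \psi\}\). The exact contents of the node and edge features (joint axes, link offsets and lengths) are our assumptions about what they could contain.

    from dataclasses import dataclass

    import numpy as np


    @dataclass
    class EmbodimentGraph:
        """Graph description of one skeleton, g = {phi_v, phi_e, psi}."""
        phi_v: np.ndarray  # (J, d_v)    node features, e.g. joint axes / limits
        phi_e: np.ndarray  # (J, J, d_e) edge features, e.g. link offsets and lengths
        psi: np.ndarray    # (J, J)      binary joint connectivity


    def graph_from_skeleton(parents, offsets, joint_axes):
        """Build a graph condition from a parent index list, per-joint link offsets
        (relative to the parent), and joint axes; the base joint has parent -1."""
        J = len(parents)
        psi = np.zeros((J, J), dtype=np.int64)
        phi_e = np.zeros((J, J, 4), dtype=np.float32)
        for j, p in enumerate(parents):
            if p < 0:
                continue  # base joint: no incoming link
            psi[j, p] = psi[p, j] = 1
            offset = np.asarray(offsets[j], dtype=np.float32)
            phi_e[j, p, :3] = phi_e[p, j, :3] = offset                 # geometric feature: offset
            phi_e[j, p, 3] = phi_e[p, j, 3] = np.linalg.norm(offset)   # and link length
        phi_v = np.asarray(joint_axes, dtype=np.float32)
        return EmbodimentGraph(phi_v=phi_v, phi_e=phi_e, psi=psi)

Each robot (Atlas, G1, H1, Cassie, and so on) yields a graph of its own size and geometry, which is what allows a single conditioned model to handle them uniformly.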

Retargeting results for the reference motions Fight, Kick, Stand Up, Throw Ball, and Stagger, each shown alongside the reference and transferred to Atlas, CL, G1, H1, Tron, and Cassie.

Embodiment Generalization

(Link Length Aspect)

G-DReaM demonstrates zero-shot generalization across embodiments with varying link lengths. The retargeted motion is automatically adjusted to fit the specific link lengths, enabling closer imitation of the reference motion.

Side-by-side comparison: Unscaled Case (left) vs. Scaled Case (right).

Here, the joint correspondence of the knee is removed to make the comparison more distinct. When the calf length is increased (right), the robot bends its knee more to better track the reference position of the ankle joint.

Embodiment Generalization

(Joint Correspondence Aspect)

G-DReaM demonstrates zero-shot generalization across embodiments with controllable joint correspondences. The retargeted motions can be automatically adjusted to align with the specified joint mappings.

Side-by-side comparison: Without Correspondence (left) vs. With Correspondence (right).

Here, we use the scaled calf-length setting to make the comparison more distinct. The green joints are required to match the reference motion, while the red joints are unconstrained. The results show that the generated motion adapts to better track the joints with defined correspondences: when the knee joint is unconstrained (left), the robot bends its knee to better align the ankle joint under the scaled configuration.

Embodiment Generalization

(Model Adaptation for New Embodiments)

For diverse unseen embodiments, the pretrained G-DReaM model can be adapted to new skeletons by conditioning it on the new skeleton graphs during training, while keeping the original motion dataset unchanged. Here, we show the retargeting results of the adapted model on the new robots \(\textit{Talos}\) and \(\textit{Valkyrie}\).
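A minimal sketch of what this adaptation step could look like is given below. The model interface (sample_noisy_motion, retargeting_energy, num_diffusion_steps) consists of placeholder names rather than the released API, and the energy term stands in for the retargeting losses described in the abstract.

    import torch


    def adapt_to_new_embodiment(model, reference_motions, new_graph, steps=10000, lr=1e-4):
        """Fine-tune a pretrained graph-conditioned denoiser for an unseen skeleton
        (e.g. Talos or Valkyrie), reusing the original reference-motion dataset.

        `model` is assumed to expose the pretrained denoiser, a helper that draws a
        noisy motion for a given skeleton, and the retargeting-energy losses; these
        names are placeholders, not the released API.
        """
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        for step, reference in zip(range(steps), reference_motions):
            t = torch.randint(0, model.num_diffusion_steps, (1,))  # random diffusion timestep
            x_t = model.sample_noisy_motion(new_graph, t)           # noisy motion on the new skeleton
            x0_hat = model(x_t, t, reference, new_graph)            # predicted clean motion

            # No ground-truth motion exists for the new embodiment, so the
            # energy-based retargeting losses stand in for a reconstruction loss.
            loss = model.retargeting_energy(x0_hat, reference, new_graph)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return model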

Retargeting results of the adapted model for the reference motions Fight, Kick, Stand Up, Throw Ball, and Stagger, each shown alongside the reference and transferred to Talos and Valkyrie.

Motion Generalization

G-DReaM also demonstrates zero-shot generalization to unseen motions. Specifically, we show that our model can successfully handle unseen motion sequences from the LAFAN1 Retargeting dataset.

Retargeting results for the unseen motions Jump, Kick, Run, and Walk, each shown alongside the reference and transferred to Atlas, CL, G1, H1, Tron, and Cassie.