The denoising network is based on a transformer decoder that takes the noisy motion \(X_t\) as input and is conditioned on the reference motion \(\mathcal{M}\) and the graphs \(g^{\mathcal{M},X} = \{ \phi_v^{\mathcal{M},X}, \phi_e^{\mathcal{M},X},\psi^{\mathcal{M},X}\}\).
The input motion is tokenized at the joint level, where the base joint and other joints are embedded by independent encoders.
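As a rough illustration of joint-level tokenization, the sketch below embeds the base joint and the remaining joints with two independent linear encoders and stacks the results into one token per joint per frame. The feature sizes (`BASE_DIM`, `JOINT_DIM`), the token width `D`, and the use of plain linear maps are assumptions made for the example, not details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, D = 4, 5, 8            # frames, joints, token width (all hypothetical)
BASE_DIM, JOINT_DIM = 9, 6   # assumed per-joint feature sizes

# Two independent linear encoders: one for the base joint,
# one shared by all remaining joints (random weights, illustration only).
W_base = rng.standard_normal((BASE_DIM, D)) * 0.1
W_joint = rng.standard_normal((JOINT_DIM, D)) * 0.1

def tokenize(base_feat, joint_feat):
    """base_feat: (T, BASE_DIM); joint_feat: (T, J-1, JOINT_DIM) -> (T, J, D)."""
    base_tok = base_feat @ W_base       # (T, D)
    joint_tok = joint_feat @ W_joint    # (T, J-1, D)
    # Put the base-joint token first, followed by the other joints.
    return np.concatenate([base_tok[:, None, :], joint_tok], axis=1)

tokens = tokenize(rng.standard_normal((T, BASE_DIM)),
                  rng.standard_normal((T, J - 1, JOINT_DIM)))
```

Here `tokens` has one `D`-dimensional vector for every joint in every frame, which is the layout the attention layers operate on.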
Spatial attention then captures the relationships among all joints, and temporal attention captures the chronological relationships across the time window.
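The factorization into spatial and temporal attention can be sketched as follows. This is a minimal single-head version that assumes tokens of shape `(T, J, D)` and, for brevity, uses the tokens directly as queries, keys, and values; a real layer would add learned projections, multiple heads, and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """x: (batch, n, d). Single-head attention with Q = K = V = x."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (batch, n, n)
    return softmax(scores) @ x

def spatial_attention(tokens):
    # Frames act as the batch; attention mixes the J joints within each frame.
    return self_attention(tokens)                          # (T, J, D)

def temporal_attention(tokens):
    # Joints act as the batch; attention mixes the T frames of each joint.
    per_joint = tokens.transpose(1, 0, 2)                  # (J, T, D)
    return self_attention(per_joint).transpose(1, 0, 2)    # back to (T, J, D)

tokens = np.random.default_rng(0).standard_normal((4, 5, 8))
out = temporal_attention(spatial_attention(tokens))
```

Splitting attention this way keeps each score matrix at size `J × J` or `T × T` instead of `(T·J) × (T·J)`.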
In addition, the spatial attention absorbs the joint connectivity \(\psi\) to enrich the joint relationships. For the other graph conditions,
we use multi-conditional cross attention to treat each of them individually. In particular, the reference motion condition \(\mathcal{M}\) is encoded by a similar transformer decoder and incorporates the joint correspondence \(\eta\) as an attention mask.
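One way to picture a correspondence used as an attention mask is the sketch below. It assumes that \(\eta\) can be written as a boolean matrix `eta` whose entry `(i, j)` is true when joint `i` of the noisy motion may attend to joint `j` of the encoded reference motion, and that the mask is applied inside a single-head cross attention. Both the representation and the place where the mask is applied are assumptions for illustration, and all sizes are hypothetical.

```python
import numpy as np

def masked_cross_attention(query, memory, mask):
    """query: (Jx, D), memory: (Jm, D), mask: (Jx, Jm) bool, True = may attend.

    Assumes every row of `mask` contains at least one True entry.
    """
    scores = query @ memory.T / np.sqrt(query.shape[-1])
    scores = np.where(mask, scores, -1e9)            # block non-corresponding joints
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ memory, weights

rng = np.random.default_rng(0)
x_tokens = rng.standard_normal((3, 8))   # joints of the noisy motion (hypothetical)
m_tokens = rng.standard_normal((4, 8))   # encoded reference-motion joints (hypothetical)

# Hypothetical correspondence between the two joint sets.
eta = np.array([[1, 0, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 1]], dtype=bool)

out, weights = masked_cross_attention(x_tokens, m_tokens, eta)
```

With this mask, a joint that corresponds to exactly one reference joint simply copies that joint's encoded token, while a joint with several correspondences receives a softmax-weighted mixture of them.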
Finally, an output decoder produces the predicted motion \(\hat{X}_0\).