MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

ICLR 2025
¹University of Alberta, ²Concordia University, ³Simon Fraser University,
⁴Snap Inc., ⁵Noah's Ark Lab, Huawei Canada
Model Overview

Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, we introduce sliding window local attention in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms state-of-the-art methods, which are typically GAN- or diffusion-based, in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, crowd animation, and beat-aligned dance generation, all using a single reference motion.
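
To make the idea of "quantized tokens" concrete, the following is a minimal sketch of the nearest-neighbor lookup at the core of vector quantization. It is not the MotionDreamer implementation: the function names `quantize` and `token_histogram`, the toy 2-D features, and the three-entry codebook are all hypothetical, and the codebook here is fixed rather than learned. The distribution regularization loss is not defined on this page, so it is not reproduced; the sketch only computes the empirical token usage that such a loss would presumably act on.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). Hypothetical helper for illustration."""
    tokens = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def token_histogram(tokens, codebook_size):
    """Empirical categorical distribution of token usage over the codebook."""
    counts = [0] * codebook_size
    for t in tokens:
        counts[t] += 1
    return [c / len(tokens) for c in counts]


# Toy data (assumed): four 2-D "motion features" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

tokens = quantize(features, codebook)
usage = token_histogram(tokens, len(codebook))
print(tokens)  # [0, 1, 2, 1]
print(usage)   # [0.25, 0.5, 0.25]
```

With a single reference motion the token sequence is short, so a few codebook entries can easily dominate the usage histogram; this is one plausible reason for regularizing the codebook distribution.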

Method Overview


(a) Overview of MotionDreamer, which is based on a localized generative masked transformer. The single reference motion \(\mathbf{m}_{1:L}\) is embedded as motion tokens \(\mathbf{c}\) by optimizing a codebook through vector quantization, with an additional codebook distribution regularization loss \(\mathcal{L}_{\text{token}}\). The Local-M transformer learns the local dependencies of motion tokens through sliding window local attention (SlidAttn) layers. Each SlidAttn layer unfolds the token sequence into overlapping windows and attends to the tokens within each window, using a learnable query and relative positional embeddings. The attention outputs are merged through overlap attention fusion (AttnFuse). (b) Visualization of the explicit distribution modeling for internal patterns. MotionDreamer learns to express and diversify combinations of internal patterns through an explicit categorical distribution over motion tokens, visualized here as the multiple token candidates that Local-M predicts given the previously generated ones.
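
The window-then-fuse pattern described in the caption can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's SlidAttn or AttnFuse layers: it uses a single head, takes the token embeddings themselves as queries, keys, and values (no learnable projections or learnable query), models the relative positional embedding as a hypothetical scalar bias table `rel_bias`, and fuses overlapping outputs by plain averaging, which the page does not confirm.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(x, rel_bias):
    """Single-head self-attention restricted to one window.
    rel_bias has 2*n - 1 entries, indexed by the offset (j - i) + n - 1."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            + rel_bias[j - i + n - 1]
            for j in range(n)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out


def sliding_window_attention(x, window, stride, rel_bias):
    """Unfold x into overlapping windows, attend inside each window,
    then average the outputs wherever windows overlap (assumed fusion rule)."""
    length, d = len(x), len(x[0])
    if length < window:
        raise ValueError("sequence must be at least one window long")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # extra window so the tail is covered
    sums = [[0.0] * d for _ in range(length)]
    counts = [0] * length
    for s in starts:
        for i, vec in enumerate(window_attention(x[s:s + window], rel_bias)):
            for k in range(d):
                sums[s + i][k] += vec[k]
            counts[s + i] += 1
    return [[v / counts[t] for v in sums[t]] for t in range(length)]


# Toy input (assumed): six 2-D token embeddings, window of 4, stride of 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
fused = sliding_window_attention(embeddings, window=4, stride=2, rel_bias=[0.0] * 7)
print(len(fused), len(fused[0]))
```

With these settings the windows cover positions 0 to 3 and 2 to 5, so positions 2 and 3 receive the average of two attention outputs. Because position 0 only ever appears in the first window, changing the last token cannot affect its output, which is the locality property that a sliding window is meant to enforce.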

Gallery of Generation

Comparisons with State-of-the-Art

Below are comparisons of our method against two others for two different reference motions.

Motion 1

Motion 2

Diversity Comparison 1

Diversity Comparison 2

Ablation Studies

codebook regularization loss

AttnFuse

FusedQnA vs. SlidAttn

Applications

crowd motions

temporal editing

beat-aligned dance generation 1

Failure Cases

In some cases, the generated motions become trapped in a single internal motion mode. For very short and highly dynamic reference motions, as in case 2, artifacts occur and the generations are not diverse enough.

case 1

case 2

BibTeX

@inproceedings{wang2025motiondreamer,
    title={MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer},
    author={Wang, Yilin and Guo, Chuan and Mu, Yuxuan and Javed, Muhammad Gohar and Zuo, Xinxin and Cheng, Li and Jiang, Hai and Lu, Juwei},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025}
}