Publications: Language to 3D
Text-to-motion generation, text-to-3D scene synthesis, etc.
- Text-Guided Interactive Scene Synthesis with Scene Prior Guidance
[Details] [PDF]
Shaoheng Fang, Haitao Yang, Raymond Mooney, Qixing Huang
In European Association for Computer Graphics, May 2025.
3D scene synthesis using natural language instructions has become a popular direction in computer graphics, with significant progress made by data-driven generative models recently. However, previous methods have mainly focused on one-time scene generation, lacking the interactive capability to generate, update, or correct scenes according to user instructions. To overcome this limitation, this paper focuses on text-guided interactive scene synthesis. First, we introduce the SceneMod dataset, which comprises 168k paired scenes with textual descriptions of the modifications. To support the interactive scene synthesis task, we propose a two-stage diffusion generative model that integrates scene-prior guidance into the denoising process to explicitly enforce physical constraints and foster more realistic scenes. Experimental results demonstrate that our approach outperforms baseline methods in text-guided scene synthesis tasks. Our system expands the scope of data-driven scene synthesis tasks and provides a novel, more flexible tool for users and designers in 3D scene generation. Code and dataset are available at https://github.com/bshfang/SceneMod.
ML ID: 439
- What is the Best Automated Metric for Text to Motion Generation?
[Details] [PDF] [Slides (PPT)] [Video]
Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney
In ACM SIGGRAPH Asia, December 2023.
There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.
ML ID: 422
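The abstract above distinguishes sample-level correlation (per-example metric scores vs. per-example human ratings) from model-level correlation (each model's average metric score vs. its average human rating). A minimal sketch of that distinction, using a hand-rolled Pearson correlation and purely illustrative numbers (the scores and model names below are invented for illustration, not taken from the paper):

```python
# Illustrative sketch: sample-level vs. model-level correlation between
# an automated metric and human quality judgments.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-sample scores: for each model, an automated metric
# score and a human rating for the same generated motions.
model_scores = {
    "model_a": {"metric": [0.71, 0.52, 0.66, 0.80], "human": [4.1, 3.0, 3.6, 4.5]},
    "model_b": {"metric": [0.40, 0.55, 0.35, 0.48], "human": [2.9, 3.4, 2.5, 3.1]},
    "model_c": {"metric": [0.60, 0.45, 0.58, 0.50], "human": [3.8, 2.8, 3.3, 3.2]},
}

# Sample-level: pool every (metric, human) pair across models and correlate.
all_metric = [s for m in model_scores.values() for s in m["metric"]]
all_human = [s for m in model_scores.values() for s in m["human"]]
sample_level = pearson(all_metric, all_human)

# Model-level: correlate each model's *mean* metric score with its *mean*
# human rating. Averaging can hide per-sample disagreement, which is why a
# metric can rank models well while correlating poorly sample by sample.
mean_metric = [sum(m["metric"]) / len(m["metric"]) for m in model_scores.values()]
mean_human = [sum(m["human"]) / len(m["human"]) for m in model_scores.values()]
model_level = pearson(mean_metric, mean_human)

print(f"sample-level r = {sample_level:.3f}, model-level r = {model_level:.3f}")
```

Real studies of this kind use many models and rank correlations (e.g. Spearman or Kendall) in addition to Pearson; this sketch only shows why the two levels of aggregation can disagree.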
- Generating Animated Videos of Human Activities from Natural Language Descriptions
[Details] [PDF] [Poster]
Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, Raymond J. Mooney
In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2018, December 2018.
Generating realistic character animations is of great importance in computer graphics and related domains. Existing approaches for this application involve a significant amount of human interaction. In this paper, we introduce a system that maps a natural language description to an animation of a humanoid skeleton. Our system is a sequence-to-sequence model that is pretrained with an autoencoder objective and then trained end-to-end.
ML ID: 369