DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation

Under review. Extended version of the ECCV 2024 paper DreamScene.

Haoran Li1, Yuli Tian1, Kun Lan1, Yong Liao1*, Lin Wang2, Pan Hui3, Yuyang Wang3, Yonghui Wang1, Peng Yuan Zhou4

1 University of Science and Technology of China   2 Nanyang Technological University  
3 Hong Kong University of Science and Technology (Guangzhou)  4 Aarhus University  

Abstract

Generating 3D scenes from natural language holds great promise for applications in gaming, film, and design. However, existing methods struggle with automation, 3D consistency, and fine-grained control. We present DreamScene, an end-to-end framework for high-quality, editable 3D scene generation from text or dialogue. DreamScene begins with a scene planning module, in which a GPT-4 agent infers object semantics and spatial constraints to construct a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Based on this layout, Formation Pattern Sampling (FPS) generates object geometry using multi-timestep sampling and reconstructive optimization, enabling fast and realistic synthesis. To ensure global consistency, DreamScene employs a progressive camera sampling strategy tailored to both indoor and outdoor settings. Finally, the system supports fine-grained scene editing, including object movement, appearance changes, and 4D dynamic motion. Experiments demonstrate that DreamScene surpasses prior methods in quality, consistency, and flexibility, offering a practical solution for open-domain 3D content creation.
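To make the layout step concrete, here is a minimal sketch of graph-based, collision-free placement: objects are nodes, placement dependencies are edges, and a candidate position is accepted only if its footprint overlaps nothing already placed. All names below (SceneObject, place_scene, proposals) are hypothetical illustrations, not DreamScene's actual API; the real planner derives its constraint graph and candidate positions from the GPT-4 agent's output.

from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    size: tuple            # (width, depth) footprint on the floor plane
    position: tuple = None

def overlaps(p, s, q, t):
    # Axis-aligned overlap test between two rectangular footprints.
    return all(abs(p[i] - q[i]) < (s[i] + t[i]) / 2 for i in range(2))

def place_scene(objects, depends_on, proposals):
    # depends_on[name]: objects that must be placed first (graph edges);
    # proposals(obj, placed): candidate positions satisfying obj's
    # spatial constraints relative to already-placed objects.
    placed = {}
    # Crude stand-in for a topological sort of the constraint graph.
    order = sorted(objects, key=lambda o: len(depends_on.get(o.name, ())))
    for obj in order:
        for pos in proposals(obj, placed):
            if all(not overlaps(pos, obj.size, p.position, p.size)
                   for p in placed.values()):
                obj.position = pos
                placed[obj.name] = obj
                break
        else:
            raise ValueError(f"no collision-free position for {obj.name}")
    return placed

Ordering by constraint count ensures anchor objects (a bed, a desk) are placed before the objects that depend on them, which is what keeps the greedy search collision-free in practice.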

Text-to-3D Scene

[Gallery: outdoor scenes ("There are many astronauts on the moon.", "An autumn park.") and indoor scenes ("A DSLR photo of a modern living room.", "A DSLR photo of a classical bedroom."), each shown as RGB and depth renderings.]
Quality Improvement

[Side-by-side comparison: DreamScene (ECCV 2024) vs. the current DreamScene.]

Approach

[Figure: Overview of the DreamScene framework.]

Our framework enables automatic 3D scene generation from natural language, supporting both direct descriptions and interactive dialogues. A GPT-4 agent first performs scene decomposition by inferring object semantics, layout constraints, and spatial relations, and constructs a constraint graph to plan collision-free object placements. Each object is generated using Formation Pattern Sampling (FPS), which integrates multi-timestep sampling, 3D Gaussian filtering, and reconstructive generation. These objects are placed into the global scene using predicted affine transformations. We then apply a three-stage camera sampling strategy to optimize the environment and ensure scene-wide consistency. DreamScene also supports structure-aware scene editing, including object relocation, appearance modification, and 4D editing.
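To make the camera schedule concrete, the sketch below enumerates poses for a three-stage progression: stage 0 shoots outward from the scene centre to optimize the environment, stage 1 orbits at mid-range looking inward, and stage 2 sweeps a wider orbit over several heights for scene-wide consistency. The radii, heights, and indoor/outdoor split are illustrative assumptions, not the paper's exact values.

import math

def camera_poses(stage: int, n: int = 12, indoor: bool = True):
    # Yield (position, look_at) pairs for one optimization stage.
    # Stage radii and heights are illustrative, not DreamScene's values.
    radius = (2.0, 3.5, 5.0)[stage] if indoor else (5.0, 15.0, 30.0)[stage]
    heights = ([1.5], [1.8], [1.0, 2.5, 4.0])[stage]
    for h in heights:
        for k in range(n):
            theta = 2 * math.pi * k / n
            ring = (radius * math.cos(theta), h, radius * math.sin(theta))
            if stage == 0:
                yield (0.0, h, 0.0), ring    # centre looking outward (environment)
            else:
                yield ring, (0.0, 1.0, 0.0)  # orbit looking at the scene centre

Each successive stage widens coverage, so later optimization steps see views that tie previously optimized regions together.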

Citation


@article{li2025dreamscene,
  title={DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation},
  author={Li, Haoran and Tian, Yuli and Lan, Kun and Liao, Yong and Wang, Lin and Hui, Pan and Zhou, Peng Yuan},
  journal={arXiv preprint arXiv:2507.13985},
  year={2025}
}