SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis
with Segmented Consistency Trajectory Distillation

Jiahao Zhu¹, ZiXuan Chen¹, Guangcong Wang², Xiaohua Xie¹, Yi Zhou¹

¹Sun Yat-Sen University, ²Great Bay University

Demo Video



Abstract


Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), which mitigates the imbalance by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and enforces consistency within each segment, which theoretically provides a significantly tighter upper bound on the distillation error. Additionally, we propose a distillation pipeline for swifter and more stable generation. Extensive experiments demonstrate that SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).


Method


An overview of SegmentDreamer: We begin by initializing a 3D representation \( \theta \) using a 3D generator, such as Point-E. In each iteration, we randomly render a batch of camera views \( \mathbf{z}_0 \) from \( \theta \) and diffuse them into \( \mathbf{z}_m \) with fixed noise \( \boldsymbol{\epsilon}^* \). Next, we transform \( \mathbf{z}_m \) into \(\tilde{\mathbf{z}}^{\boldsymbol{\Phi}}_t\) using either one-step or two-step unconditional deterministic sampling. During the denoising process, we first estimate \(\hat{\mathbf{z}}^{\boldsymbol{\Phi}}_s\) through one-step conditional deterministic sampling from \(\tilde{\mathbf{z}}^{\boldsymbol{\Phi}}_t\). Subsequently, we compute two consistency functions and use them to derive the loss \(\mathcal{L}_{\text{SCTD}}\), which is then used to optimize \(\theta\).
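The overview above relies on three generic building blocks: diffusing a rendering with a fixed noise sample, moving between timesteps with a deterministic sampler, and locating the sub-trajectory (segment) that a timestep falls in. The NumPy sketch below illustrates these pieces only. It is not the authors' implementation and does not compute \(\mathcal{L}_{\text{SCTD}}\). The linear-beta schedule, the DDIM-style update, the equally spaced segment edges, and every function name are assumptions made for illustration; `eps_pred` stands in for the output of a pretrained noise-prediction network.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative alpha products for a linear-beta schedule (assumed, not from the page)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(z0, eps, t, alphas_cumprod):
    """Forward-diffuse z0 to timestep t using a given, fixed noise sample eps."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ddim_step(z_t, eps_pred, t, s, alphas_cumprod):
    """One deterministic DDIM-style move from timestep t to timestep s.

    eps_pred is a placeholder for a network's noise prediction at (z_t, t).
    """
    a_t, a_s = alphas_cumprod[t], alphas_cumprod[s]
    # Recover the clean-sample estimate implied by the noise prediction.
    z0_pred = (z_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # Re-noise that estimate to timestep s with the same predicted noise.
    return np.sqrt(a_s) * z0_pred + np.sqrt(1.0 - a_s) * eps_pred

def segment_bounds(t, edges):
    """Return the (start, end) edges of the sub-trajectory that contains t."""
    idx = np.searchsorted(edges, t, side="right") - 1
    idx = min(max(idx, 0), len(edges) - 2)  # clamp so the last edge maps to the last segment
    return int(edges[idx]), int(edges[idx + 1])

# Hypothetical usage with random arrays standing in for rendered latents.
alphas_cumprod = make_alphas_cumprod()
rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4))        # stand-in for a batch of rendered views
eps_fixed = rng.standard_normal((2, 4)) # fixed noise reused across steps
z_m = diffuse(z0, eps_fixed, 300, alphas_cumprod)
z_t = ddim_step(z_m, eps_fixed, 300, 600, alphas_cumprod)
edges = np.array([0, 250, 500, 750, 1000])  # four equal segments (assumed)
start, end = segment_bounds(300, edges)
```

If the noise prediction were exact (here `eps_fixed` is passed in directly), the deterministic step from timestep 300 to 600 lands on the same point as diffusing `z0` straight to 600, which is the trajectory property that consistency-style objectives build on.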


Visual Comparisons


DreamFusion (~1h)

LucidDreamer (35~45)

Consistent3D (~2.4h)

Connect3D (1h~1.4h)

SegmentDreamer (32~38min)

"A DSLR photo of a car made out of cheese."

"A zoomed out DSLR photo of a robot made out of vegetables."

"A DSLR photo of a bald eagle."

"A DSLR photo of a bear dressed as a lumberjack."

"An amigurumi bulldozer."

"A DSLR photo of a corgi wearing a top hat."

"A plush toy of a corgi nurse."



Consistency Distillation Loss Comparisons


Consistent3D (CDS) (CFG: 7.5)

Consistent3D (CDS) (CFG: 20~40)

ConnectCD (GCS) (CFG: 7.5)

ConnectCD (GCS+BEG) (CFG: 7.5)

SCTD (Ours) (CFG: 7.5)

"A DSLR photo of a corgi wearing a top hat."

"A DSLR photo of a pig wearing a backpack."

"A DSLR photo of a tiger made out of yarn."

More Generated Results

"A DSLR photo of an astronaut riding a horse."

"A capybara wearing a top hat, low poly style."

"A baby dragon is spraying flames."

"A DSLR photo of the Mount Fuji, aerial view."

"A steampunk owl with mechanical wings."

"A DSLR photo of a LV handbag."

"A zoomed out DSLR photo of an origami hippo in a river."

"A delicious hamburger."

"A DSLR photo of a peacock on a surfboard."

"A DSLR photo of a robot dinosaur."

"A DSLR photo of an erupting volcano, aerial view."

"An airplane made out of wood."

Application

3D Head Generation

"A portrait of IRONMAN, white hair, head, photorealistic, 8K, HDR."

"A portrait of Captain America, white hair, head, photorealistic, 8K, HDR."

"A portrait of Kid Spiderman, blue hair, head, photorealistic, 8K, HDR."

"A portrait of white marble bust of BATMAN, head, 8K, HDR."

"A portrait of Hulk, head, photorealistic, 8K, HDR."


3D Avatar Generation

"An armored green-skin orc riding a vicious hog."

"Mulan, Anime, full body, with armor."

"black dragonborn, solo, red eyes, male, full body."

"A soldier, riding a tiger."

"A warrior with a red cape riding a horse."




Citation

@article{chen2024vividdreamer,
    title={SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation},
    author={Zhu, Jiahao and Chen, Zixuan and Wang, Guangcong and Xie, Xiaohua and Zhou, Yi},
    journal={arXiv preprint arXiv:xxxx.xxxxx},
    year={2025}
}
                

Acknowledgements


This project is supported by the Natural Science Foundation of China (No. 62072482) and by the Project of Guangdong Provincial Key Laboratory of Information Security Technology (Grant No. 2023B1212060026).
We also thank Lior Yariv for the website template.