TY - GEN
T1 - Carve3D
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Xie, Desai
AU - Li, Jiahao
AU - Tan, Hao
AU - Sun, Xin
AU - Shu, Zhixin
AU - Zhou, Yi
AU - Bi, Sai
AU - Pirk, Soren
AU - Kaufman, Arie E.
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d.
AB - Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d.
KW - 3D Generation
KW - 3D Reconstruction
KW - Alignment
KW - Diffusion Model
KW - Multi-view Consistency
KW - NeRF
KW - Reinforcement Learning
KW - Text-to-3D
UR - https://www.scopus.com/pages/publications/85207188847
U2 - 10.1109/CVPR52733.2024.00609
DO - 10.1109/CVPR52733.2024.00609
M3 - Conference contribution
AN - SCOPUS:85207188847
SN - 9798350353006
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 6369
EP - 6379
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -