Learning 3D Foundation Models from Images 

This is another central topic in my group. Foundation models have made remarkable progress in NLP and vision, and a common lesson is that large-scale, clean data matters more than the particular network architecture or training approach. Research on 3D deep learning, in contrast, has mostly been evaluated on relatively small-scale datasets. An important question is whether we would still see big gaps between different 3D neural representations if abundant 3D data were available. However, we do not, and likely never will, have 3D data at the scale of images. This motivates the idea of learning 3D foundation models from images.

There are several aspects to this field. For example, people have observed that 2D foundation models encode certain 3D knowledge: if the synthesized views of the same object are multi-view consistent, then we can recover a 3D object from them. A recent paper [Arxiv2024b] studied this problem using pretrained video foundation models.
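To make the consistency criterion concrete, below is a minimal numpy sketch (illustrative, not the method of [Arxiv2024b]) of one standard way to measure it: reproject pixels from one synthesized view into another using an estimated depth map and the relative camera pose, then compare colors. The function names, and the assumption that per-view depth and relative poses are available (e.g., from off-the-shelf estimators), are mine.

import numpy as np

def reproject(depth_a, K, R_ab, t_ab):
    # Lift every pixel of view A to 3D using its depth, then project into view B.
    # K: 3x3 intrinsics shared by both views; (R_ab, t_ab): pose of B relative to A.
    h, w = depth_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                   # back-projected camera rays
    pts_a = rays * depth_a.reshape(1, -1)           # 3D points in A's camera frame
    pts_b = R_ab @ pts_a + t_ab[:, None]            # same points in B's camera frame
    proj = K @ pts_b
    uv = proj[:2] / np.clip(proj[2], 1e-6, None)    # perspective divide
    return uv.T.reshape(h, w, 2)

def photometric_consistency(img_a, img_b, depth_a, K, R_ab, t_ab):
    # Mean absolute color error between view A and view B sampled at the
    # reprojected locations (nearest-neighbor lookup for simplicity).
    h, w, _ = img_a.shape
    uv = np.round(reproject(depth_a, K, R_ab, t_ab)).astype(int)
    valid = (uv[..., 0] >= 0) & (uv[..., 0] < w) & (uv[..., 1] >= 0) & (uv[..., 1] < h)
    err = np.abs(img_a[valid] - img_b[uv[valid][:, 1], uv[valid][:, 0]])
    return err.mean()

Low error across all view pairs indicates the synthesized views describe a single coherent 3D object, so a downstream reconstruction (e.g., fitting a neural field to the views) is likely to succeed.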

Another aspect is to perform neural rendering and train 3D foundation models from images. In a recent paper, LEAP [ICLR24], we studied how to learn 3D representations from sparse views for which no pose information is available. The paper was accepted to ICLR 2024.
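As a toy illustration of the pose-free setting (this is not LEAP's actual architecture), one can let a learnable 3D feature volume gather evidence from unposed image tokens through cross-attention, so no camera poses are needed to map 2D observations into 3D. The module and all dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

class PoseFreeVolume(nn.Module):
    # A learnable voxel grid whose features are filled by attending to image
    # tokens from all input views, without using any camera poses.
    def __init__(self, grid=16, dim=64, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(grid ** 3, dim))  # one query per voxel
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid = grid

    def forward(self, view_tokens):
        # view_tokens: (B, V * T, dim) tokens from V unposed views, T tokens each.
        b = view_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        vol, _ = self.attn(q, view_tokens, view_tokens)  # voxels attend to image evidence
        return vol.view(b, self.grid, self.grid, self.grid, -1)

# Usage: two unposed views, each encoded into 196 tokens of width 64.
tokens = torch.randn(1, 2 * 196, 64)
volume = PoseFreeVolume()(tokens)
print(volume.shape)  # torch.Size([1, 16, 16, 16, 64])

Such a volume can then be rendered into images and trained end-to-end with a photometric loss, which is the general recipe for learning 3D representations from image supervision alone.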

Besides distilling 3D geometric information from image-based foundation models, my group is also interested in transferring texture information from images to 3D shapes. I started working on this topic in [SIGA16]. A recent paper [Arxiv2024a] studies this problem using pre-trained text-to-image models, where the key challenge is to enforce multi-view consistency.
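As a minimal illustration of why multi-view consistency is the crux (this is not the optimization framework of [Arxiv2024a]): images generated independently per view disagree wherever they overlap on the surface, and the simplest way to obtain one coherent texture is a weighted per-texel fusion, the closed-form solution of a diagonal least-squares problem. The texel_ids maps (assumed to come from rasterizing the mesh UVs) and the weighting scheme are assumptions.

import numpy as np

def fuse_texture(images, texel_ids, weights, num_texels):
    # images: list of (H, W, 3) per-view generated images.
    # texel_ids: list of (H, W) int maps from pixels to texture texels (-1 = mesh not hit).
    # weights: list of (H, W) view-quality weights (e.g., downweighting grazing angles).
    # Returns (num_texels, 3) fused texel colors.
    accum = np.zeros((num_texels, 3))
    total = np.zeros(num_texels)
    for img, tid, w in zip(images, texel_ids, weights):
        hit = tid >= 0
        np.add.at(accum, tid[hit], img[hit] * w[hit][:, None])
        np.add.at(total, tid[hit], w[hit])
    return accum / np.clip(total, 1e-8, None)[:, None]

Naive fusion like this blurs wherever the views disagree, which is exactly why [Arxiv2024a] instead formulates an optimization that makes the generated views consistent with each other before they are baked into the texture.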

Hanwen Jiang, Haitao Yang, Qixing Huang and Georgios Pavlakos. Real3D: Scaling Up Large Reconstruction Models with Real-World Images. https://arxiv.org/abs/2406.08479

[Arxiv2024a] Zhengyi Zhao, Chen Song, Xiaodong Gu, Yuan Dong, Qi Zuo, Weihao Yuan, Zilong Dong, Liefeng Bo and Qixing Huang. An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes Using Pre-Trained Text-to-Image Models. https://arxiv.org/abs/2403.15559

[Arxiv2024b] Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo and Qixing Huang. VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model. https://arxiv.org/abs/2403.12010

[ICLR24] Hanwen Jiang, Zhenyu Jiang, Yue Zhao and Qixing Huang. LEAP: Liberate Sparse-view 3D Modeling from Camera Poses. International Conference on Learning Representations (ICLR), 2024.

[SIGA16] Tuanfeng Wang, Hao Su, Qixing Huang, Jingwei Huang, Leonidas Guibas and Niloy J. Mitra. Unsupervised Texture Transfer from Images to Model Collections. ACM Transactions on Graphics 35(6) (Proc. SIGGRAPH Asia 2016).