I'm a final-year Ph.D. student in the Computer Science Department at The University of Texas at Austin, where I am fortunate to be advised by Prof. Qixing Huang. My primary research interests lie in large vision-language models, large-scale 3D pre-training, 3D unsupervised/self-supervised learning, and point cloud processing.
Before that, I received my B.S. degree from the Department of Computer Science at Peking University in 2019 with First Class Honors.
I'm looking for full-time positions starting in summer 2024. If you are interested, please drop me an email!
News
- [2024/07] "ViGoR" is accepted at ECCV 2024!
- [2024/01] Both "MaskFeat3D" and "MVNet" are accepted at ICLR 2024!
- [2023/10] I started my internship with the NVIDIA Autonomous Vehicle Research Group in Santa Clara, CA.
Publications
We aim to improve the visual grounding capability of large vision-language models (LVLMs) by using fine-grained reward modeling.
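As a minimal illustration of the fine-grained idea (not ViGoR's actual implementation): instead of a single scalar reward per generated caption, each sentence or phrase receives its own grounding score, which is then broadcast into a dense per-token training signal. All names below are hypothetical placeholders.

```python
import torch

def fine_grained_reward(segment_scores: torch.Tensor,
                        segment_token_spans: list,
                        num_tokens: int) -> torch.Tensor:
    """Broadcast per-segment grounding scores to a per-token reward signal.

    segment_scores      -- one score per sentence/phrase, e.g. from a learned
                           reward model or human annotation (hypothetical)
    segment_token_spans -- (start, end) token indices of each segment
    num_tokens          -- length of the generated sequence
    """
    rewards = torch.zeros(num_tokens)
    for score, (start, end) in zip(segment_scores, segment_token_spans):
        rewards[start:end] = score  # every token in a segment shares its score
    return rewards

# toy example: a three-sentence caption whose middle sentence is poorly grounded
scores = torch.tensor([1.0, -0.5, 0.8])
spans = [(0, 7), (7, 15), (15, 22)]
print(fine_grained_reward(scores, spans, num_tokens=22))
```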
We introduce a new method for pre-training point cloud networks by leveraging large-scale pre-trained 2D networks. A multi-view consistency loss ensures that the 2D projections preserve 3D information by capturing pixel-wise correspondences across views.
[paper]
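A minimal sketch of what a pixel-wise multi-view consistency loss can look like, assuming per-pixel feature maps rendered from two views and correspondences given by the known camera geometry; this is a generic formulation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_view_consistency_loss(feat_a: torch.Tensor,
                                feat_b: torch.Tensor,
                                corr: torch.Tensor) -> torch.Tensor:
    """Penalize feature disagreement at corresponding pixels of two views.

    feat_a, feat_b -- (C, H, W) feature maps of the same scene rendered from
                      two viewpoints (e.g. by a pre-trained 2D backbone)
    corr           -- (N, 4) long tensor of pixel correspondences
                      (ya, xa, yb, xb), assumed known from camera geometry
    """
    fa = feat_a[:, corr[:, 0], corr[:, 1]]  # (C, N) features in view A
    fb = feat_b[:, corr[:, 2], corr[:, 3]]  # (C, N) features in view B
    # cosine distance keeps the loss invariant to feature magnitude
    return (1 - F.cosine_similarity(fa, fb, dim=0)).mean()

# toy usage with random features and a few correspondences
fa, fb = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
corr = torch.randint(0, 32, (128, 4))
print(multi_view_consistency_loss(fa, fb, corr))
```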
We introduce a novel method for 3D self-supervised pretraining of point clouds using Masked Autoencoders (MAEs). Diverging from traditional 3D MAEs that focus on reconstructing point positions, our proposed approach employs an attention-based decoder, independent of the encoder design, to recover high-order geometric features of the underlying 3D shape.
[paper]
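A minimal sketch of an encoder-independent, attention-based decoder in this spirit: learned queries for the masked patches cross-attend to the visible-patch tokens and regress a geometric target (here surface normals, one stand-in for higher-order features). Dimensions and the target choice are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskedFeatureDecoder(nn.Module):
    """Attention-based decoder for masked point-cloud patches (sketch)."""

    def __init__(self, dim=256, heads=8, points_per_patch=32):
        super().__init__()
        self.mask_query = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, points_per_patch * 3)  # a 3D normal per point

    def forward(self, visible_tokens, masked_pos_embed):
        # visible_tokens: (B, V, dim) from any encoder (decoder is encoder-agnostic)
        # masked_pos_embed: (B, M, dim) positional embeddings of masked patch centers
        q = self.mask_query + masked_pos_embed           # (B, M, dim) queries
        x, _ = self.cross_attn(q, visible_tokens, visible_tokens)
        normals = self.head(x)                           # (B, M, points_per_patch*3)
        return normals.reshape(*x.shape[:2], -1, 3)

# toy forward pass: 48 visible tokens, 16 masked patches
dec = MaskedFeatureDecoder()
out = dec(torch.randn(2, 48, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 32, 3])
```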
We introduce a new deep-learning model for segmenting 3D shapes represented as point clouds into primitive patches. It stands out by using hybrid feature representations, combining a learned semantic descriptor, two spectral descriptors based on geometric parameters, and an adjacency matrix highlighting sharp edges.
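As an illustration of the hybrid-feature idea, a spectral descriptor can be computed from the adjacency matrix and concatenated with a learned semantic descriptor. The construction below is a textbook spectral-embedding recipe under those assumptions, not the paper's exact descriptors.

```python
import torch

def spectral_descriptor(adjacency: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Toy spectral embedding of a point-adjacency graph.

    Takes a symmetric (N, N) adjacency matrix (e.g. with edges down-weighted
    across sharp creases) and returns the first k non-trivial eigenvectors of
    the graph Laplacian as per-point descriptors.
    """
    deg = adjacency.sum(dim=1)
    laplacian = torch.diag(deg) - adjacency
    eigvals, eigvecs = torch.linalg.eigh(laplacian)  # ascending eigenvalues
    return eigvecs[:, 1:k + 1]  # skip the constant eigenvector

# hybrid per-point features: learned semantic descriptor + spectral descriptor
n = 100
adj = (torch.rand(n, n) > 0.9).float()
adj = ((adj + adj.T) > 0).float()  # symmetrize
adj.fill_diagonal_(0)
semantic = torch.randn(n, 64)      # stand-in for a learned network output
hybrid = torch.cat([semantic, spectral_descriptor(adj)], dim=1)
print(hybrid.shape)                # torch.Size([100, 72])
```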
We present a novel approach for estimating relative poses between RGB-D scans, especially effective when the scans overlap little or not at all. The method performs scene completion and then matches the completed scans, using hybrid representations that combine 360-degree images, 2D image-based layouts, and planar patches to provide adaptable features for relative pose estimation.
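Once the completed scans are matched, a relative pose can be recovered from the point correspondences in closed form. Below is the standard Kabsch solution as a generic building block; the paper's scene completion and learned matching are out of scope here.

```python
import torch

def relative_pose_from_matches(src: torch.Tensor, dst: torch.Tensor):
    """Closed-form rigid pose from matched 3D points (Kabsch algorithm).

    src, dst -- (N, 3) corresponding points from the two scans.
    Returns (R, t) with dst ~ src @ R.T + t.
    """
    src_c, dst_c = src.mean(0), dst.mean(0)
    h = (src - src_c).T @ (dst - dst_c)              # 3x3 cross-covariance
    u, _, vt = torch.linalg.svd(h)
    s = torch.ones(3)
    s[2] = torch.sign(torch.linalg.det(vt.T @ u.T))  # guard against reflections
    r = vt.T @ torch.diag(s) @ u.T
    t = dst_c - r @ src_c
    return r, t

# sanity check: recover a random rigid motion from exact correspondences
q = torch.linalg.qr(torch.randn(3, 3)).Q
if torch.linalg.det(q) < 0:  # make q a proper rotation
    q[:, 0] = -q[:, 0]
src = torch.randn(50, 3)
dst = src @ q.T + torch.tensor([0.5, -1.0, 2.0])
r, t = relative_pose_from_matches(src, dst)
print(torch.allclose(r, q, atol=1e-4))  # True
```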
We introduce a deep-learning method featuring a multi-step inpainting process to address the beam-hardening and blooming artifacts that coronary calcium causes in cardiac computed tomography angiography (CTA) images.
[paper]
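A minimal sketch of a generic multi-step inpainting loop, assuming a binary calcium mask and any image-to-image network; it illustrates the iterative fill-in idea, not the paper's specific model.

```python
import torch
import torch.nn as nn

def multi_step_inpaint(image, calcium_mask, model: nn.Module, steps: int = 3):
    """Generic multi-step inpainting loop (illustrative only).

    image        -- (1, 1, H, W) CTA slice with blooming artifacts
    calcium_mask -- (1, 1, H, W) binary mask of the calcified region
    model        -- any network mapping a masked image to a full prediction
    Each step re-predicts the masked region from the current estimate,
    letting the fill-in sharpen progressively instead of in a single pass.
    """
    current = image * (1 - calcium_mask)  # zero out the artifact region
    for _ in range(steps):
        pred = model(current)
        current = image * (1 - calcium_mask) + pred * calcium_mask
    return current

# toy usage with a stand-in model
model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
out = multi_step_inpaint(torch.randn(1, 1, 64, 64),
                         (torch.rand(1, 1, 64, 64) > 0.95).float(), model)
print(out.shape)  # torch.Size([1, 1, 64, 64])
```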
Recent advancements in unsupervised learning have significantly narrowed the gap in modeling the development of the primate ventral visual stream with deep neural networks. Such networks were previously limited by their reliance on extensive supervised training, an implausible model of infant development, but now show promising results with unsupervised methods.
A short version was presented at the Conference on Cognitive Computational Neuroscience (CCN), 2019.
We present a novel approach for generating 3D scenes using deep neural networks. It utilizes parametric prior distributions learned from training data to regularize the network's outputs, and it predicts an over-complete set of attributes so that consistency constraints can be applied to eliminate infeasible predictions.
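A toy example of why an over-complete attribute set helps: if a network predicts both absolute object positions and pairwise offsets, the redundancy lets a consistency step reconcile (or reject) conflicting predictions. The reconciliation below is one simple illustrative choice, not the paper's formulation.

```python
import torch

def reconcile_overcomplete(pos: torch.Tensor, rel: torch.Tensor, w=0.5):
    """Toy consistency step for over-complete scene attributes.

    pos -- (N, 3) absolute object positions predicted by a network
    rel -- (N, N, 3) predicted pairwise offsets (pos[j] - pos[i])
    Averaging the absolute positions with the positions implied by the
    offsets is one simple way to enforce pos[j] = pos[i] + rel[i, j].
    """
    implied = (pos[:, None, :] + rel).mean(dim=0)  # (N, 3) offset-implied positions
    return w * pos + (1 - w) * implied

pos = torch.randn(5, 3)
rel = pos[None, :, :] - pos[:, None, :]  # perfectly consistent offsets
print(torch.allclose(reconcile_overcomplete(pos, rel), pos, atol=1e-6))  # True
```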
Introducing feedback loops and horizontal recurrent connections into a deep convolutional neural network enhances its robustness to noise and occlusion, suggesting that these modifications improve feedforward representations by injecting top-down semantic meaning.
[paper]
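A minimal sketch of a convolutional network with a top-down feedback loop, where a higher layer's activation is projected back into a lower layer on the next unrolled step; layer sizes and step count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeedbackConvNet(nn.Module):
    """Two-layer CNN unrolled in time with a top-down feedback connection."""

    def __init__(self, channels=16, steps=3):
        super().__init__()
        self.steps = steps
        self.low = nn.Conv2d(1, channels, 3, padding=1)
        self.high = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.feedback = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

    def forward(self, x):
        low = torch.relu(self.low(x))
        for _ in range(self.steps):
            high = torch.relu(self.high(low))
            # top-down feedback: upsample the higher layer and re-enter the loop,
            # letting later passes refine the feedforward representation
            low = torch.relu(self.low(x) + self.feedback(high))
        return high

net = FeedbackConvNet()
print(net(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 16, 16, 16])
```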
Teaching
TA with Etienne Vouga.
TA with Qiang Liu and Adam Klivans.