Inferring 3D Structure with a Statistical Image-Based Shape Model

Kristen Grauman, Gregory Shakhnarovich, Trevor Darrell

Abstract

We present an image-based approach to infer 3D structure parameters using a probabilistic “shape + structure” model. The 3D shape of a class of objects may be represented by sets of contours from silhouette views simultaneously observed from multiple calibrated cameras. Bayesian reconstructions of new shapes can then be estimated using a prior density constructed with a mixture model and probabilistic principal components analysis. We augment the shape model to incorporate structural features of interest; novel examples with missing structure parameters may then be reconstructed to obtain estimates of these parameters. Model matching and parameter inference are done entirely in the image domain and require no explicit 3D construction. Our shape model enables accurate estimation of structure despite segmentation errors or missing views in the input silhouettes, and works even with only a single input view. Trained on a dataset of thousands of pedestrian images generated from a synthetic model, our method accurately infers the 3D locations of 19 joints on the body from observed silhouette contours in real images.
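The prior over concatenated shape + structure vectors is built from a mixture of probabilistic PCA densities. As an illustration of the core inference step, here is a minimal sketch under a single-Gaussian simplification (one PPCA component instead of the mixture); the function names, dimensions, and random stand-in data are our own assumptions, not the authors' code:

```python
import numpy as np

def fit_ppca(Y, n_components):
    """Fit one probabilistic PCA density N(mu, W W^T + sigma^2 I) to
    training vectors Y (n_samples x n_dims), via the closed-form
    eigendecomposition solution of Tipping & Bishop (1999)."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    evals, evecs = np.linalg.eigh(Yc.T @ Yc / len(Y))
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order
    sigma2 = evals[n_components:].mean()              # discarded variance
    W = evecs[:, :n_components] * np.sqrt(
        np.maximum(evals[:n_components] - sigma2, 0.0))
    return mu, W @ W.T + sigma2 * np.eye(Y.shape[1])  # mean, full covariance

def infer_missing(mu, C, x_obs, obs_idx, mis_idx):
    """Conditional mean of the missing block given the observed block,
    under the joint Gaussian N(mu, C)."""
    C_oo = C[np.ix_(obs_idx, obs_idx)]
    C_mo = C[np.ix_(mis_idx, obs_idx)]
    return mu[mis_idx] + C_mo @ np.linalg.solve(C_oo, x_obs - mu[obs_idx])

# Each training vector stacks the multi-view silhouette contour points
# with the 3D coordinates of 19 joints (57 structure dimensions at the end).
rng = np.random.default_rng(0)
Y = rng.normal(size=(2000, 400 + 57))   # stand-in for real training data
mu, C = fit_ppca(Y, n_components=30)

obs_idx = np.arange(400)                # observed: contour features
mis_idx = np.arange(400, 457)           # missing: joint coordinates
joints = infer_missing(mu, C, Y[0, obs_idx], obs_idx, mis_idx).reshape(19, 3)
```

Treating the joint coordinates as missing dimensions of a single shape + structure vector is what lets pose estimation reduce to Gaussian conditioning, with no explicit 3D construction.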

ICCV 2003 paper on this work: pdf
Further details in the SM thesis: ps or pdf

(a) multi-view input
(b) extracted foreground silhouettes (two views withheld from input in the second example)
(c) 3D structure inferred from the shape

Inferring structure on real data with the proposed shape + structure model.
For each example above, the top row shows the original textured multi-view images, the middle row shows the extracted input silhouettes (views not used in reconstruction are omitted), and the bottom row shows the inferred joint locations as 3D stick figures rendered at several viewpoints. The shape model was trained on multi-view examples containing four views; nevertheless, as shown above, novel examples with fewer than four views (i.e., with one, two, or three views missing or withheld) can be matched to the model and their structure parameters inferred. To aid inspection, the stick figures are rendered from manually selected viewpoints that approximately align with the textured images. These results are typical of the proposed shape model and pose inference method: in general, estimation is accurate and agrees with the perceived body configuration.
 
 
 


(a) multi-view input
(b) extracted foreground silhouettes (two views withheld from input in the first example; three in the second)
(c) 3D structure inferred from the shape (from a single view in the second example)

Inferring structure on noisy synthetic data from only a single view.
The top row shows ground-truth silhouettes that are not in the training set. Noise is added to the contour points of the second view (middle row), and this single noisy view is matched to the multi-view shape model to infer the 3D joint locations (bottom row, solid blue). Ground-truth pose parameters (bottom row, dotted red) are overlaid on our inferred estimate. This example has an average pose error of 5 cm per joint; for error distributions over thousands of such examples, please see our paper.
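The per-joint figure quoted above is, we assume, the mean over the 19 joints of the Euclidean distance between the inferred and ground-truth 3D locations; with array names carried over from the sketches above (and joints_true hypothetical), the statistic is:

```python
# joints, joints_true: (19, 3) arrays of 3D joint locations, in cm.
mean_joint_error = np.linalg.norm(joints - joints_true, axis=1).mean()
```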
 
 
 


ground-truth contours (withheld)
noisy input silhouette (three views withheld)
3D structure inferred from the single-view shape (solid blue) and ground truth (dotted red)