Lesson 04
Training: the scene is the model
Why 3DGS "trains": analysis-by-synthesis, gradients through the renderer, densification.
The inverse problem
In 3DGS you have no measured geometry: you have calibrated photographs \(\{I_k, \pi_k\}\) and a sparse initialization cloud. The question flips around: find the scene \(\Theta\) such that, rendered from the known cameras, it reproduces the photos:
$$\Theta^* = \arg\min_\Theta \sum_k \mathcal{L}\big(R(\Theta;\pi_k),\, I_k\big), \qquad \mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,(1-\text{SSIM})$$
The parameters \(\Theta = \{\mu_i, s_i, q_i, \alpha_i, c_i\}_{i=1..N}\) (position, scale, rotation, opacity, and SH color coefficients) are latent: from the photos you cannot measure the covariance of a gaussian or its SH coefficients. You can only estimate them, and since the renderer (lesson 03) is differentiable, the estimation is gradient descent: the gradient of the photometric loss flows back from the pixels to every single gaussian. For the position, for instance:
$$\frac{\partial \mathcal L}{\partial \mu_i} = \sum_{p}\; \underbrace{\frac{\partial \mathcal L}{\partial R(p)}}_{\text{residual}} \cdot \underbrace{c_i\, w_i(p)\,\frac{(p-\mu_i')}{\sigma_i^2}}_{\text{from the gaussian kernel}}$$
This is statistical fitting in the fullest sense: the photos are the training data, generalization is novel view synthesis (measured on held-out views), and overfitting shows up as floaters: gaussians that explain the training views perfectly from wrong 3D positions.
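The position gradient above can be checked numerically on a toy problem. The sketch below is not the 3DGS renderer: it assumes a grayscale, purely additive 2D model \(R(p) = \sum_i c_i\, w_i(p)\) with isotropic gaussians and no alpha compositing, and a loss \(\mathcal L = \tfrac12 \sum_p (R(p) - I(p))^2\), so that \(\partial \mathcal L / \partial R(p)\) is exactly the residual. All function names are illustrative.

```javascript
// Toy additive 2D splatting (assumption: grayscale, no alpha compositing).
// R(p) = sum_i c_i * w_i(p), with w_i(p) = exp(-|p - mu_i|^2 / (2 sigma_i^2)).
function render(gaussians, size) {
  const img = new Float64Array(size * size);
  for (const g of gaussians) {
    for (let y = 0; y < size; y++) {
      for (let x = 0; x < size; x++) {
        const dx = x - g.mu[0], dy = y - g.mu[1];
        img[y * size + x] += g.c * Math.exp(-(dx * dx + dy * dy) / (2 * g.sigma * g.sigma));
      }
    }
  }
  return img;
}

// L = 1/2 * sum_p (R(p) - I(p))^2, so dL/dR(p) is the residual at pixel p.
function loss(gaussians, target, size) {
  const img = render(gaussians, size);
  let sum = 0;
  for (let k = 0; k < img.length; k++) {
    const r = img[k] - target[k];
    sum += 0.5 * r * r;
  }
  return sum;
}

// Analytic gradient: sum_p residual(p) * c_i * w_i(p) * (p - mu_i) / sigma_i^2.
function gradMu(gaussians, i, target, size) {
  const img = render(gaussians, size);
  const g = gaussians[i];
  const grad = [0, 0];
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      const residual = img[y * size + x] - target[y * size + x];
      const dx = x - g.mu[0], dy = y - g.mu[1];
      const w = Math.exp(-(dx * dx + dy * dy) / (2 * g.sigma * g.sigma));
      const k = (residual * g.c * w) / (g.sigma * g.sigma);
      grad[0] += k * dx;
      grad[1] += k * dy;
    }
  }
  return grad;
}

// Central finite differences on the same loss, used only as a reference.
function numericGradMu(gaussians, i, target, size, eps = 1e-5) {
  const grad = [0, 0];
  for (let d = 0; d < 2; d++) {
    const original = gaussians[i].mu[d];
    gaussians[i].mu[d] = original + eps;
    const up = loss(gaussians, target, size);
    gaussians[i].mu[d] = original - eps;
    const down = loss(gaussians, target, size);
    gaussians[i].mu[d] = original; // restore the parameter
    grad[d] = (up - down) / (2 * eps);
  }
  return grad;
}

const size = 16;
const target = render([{ mu: [9, 6], sigma: 2.5, c: 1.0 }], size);
const scene = [
  { mu: [6, 8], sigma: 3.0, c: 0.8 },
  { mu: [11, 4], sigma: 2.0, c: 0.3 },
];
const analytic = gradMu(scene, 0, target, size);
const numeric = numericGradMu(scene, 0, target, size);
console.log("analytic:", analytic, "numeric:", numeric);
```

The two vectors should agree to several decimal places. In the full renderer the same chain rule applies, but the per-pixel weight also carries the transmittance of the gaussians in front, which this toy model leaves out.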
Try it: 2D Gaussian Splatting in the browser
Below is a real optimizer, written in JavaScript: a 96×96 target image is reconstructed from isotropic 2D gaussians whose parameters \((\mu_i, \sigma_i, c_i)\) start out almost at random and are updated with Adam on the analytic gradients of the MSE. Press train and watch it converge; press densify when the loss stalls.
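For readers who want to see the update rule in isolation, here is a minimal sketch of Adam driven by analytic MSE gradients. It is not the code behind the demo: to stay short it fits a single 1D gaussian \((\mu, \sigma, c)\) to a 1D target, and the names `createAdam` and `lossAndGrad`, the learning rate, and the iteration count are all assumptions chosen for illustration. The moment-decay defaults are the usual ones from the Adam paper.

```javascript
// Adam over a flat parameter array; returns a step function that updates in place.
function createAdam(n, { lr = 0.05, beta1 = 0.9, beta2 = 0.999, eps = 1e-8 } = {}) {
  const m = new Float64Array(n); // running mean of gradients
  const v = new Float64Array(n); // running mean of squared gradients
  let t = 0;
  return function step(params, grads) {
    t += 1;
    for (let k = 0; k < n; k++) {
      m[k] = beta1 * m[k] + (1 - beta1) * grads[k];
      v[k] = beta2 * v[k] + (1 - beta2) * grads[k] * grads[k];
      const mHat = m[k] / (1 - Math.pow(beta1, t)); // bias correction
      const vHat = v[k] / (1 - Math.pow(beta2, t));
      params[k] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
    }
  };
}

// MSE between c * exp(-(x - mu)^2 / (2 sigma^2)) and the target, with its
// analytic gradient with respect to params = [mu, sigma, c].
function lossAndGrad(params, xs, target) {
  const [mu, sigma, c] = params;
  const n = xs.length;
  const grad = [0, 0, 0];
  let loss = 0;
  for (let k = 0; k < n; k++) {
    const d = xs[k] - mu;
    const w = Math.exp(-(d * d) / (2 * sigma * sigma));
    const residual = c * w - target[k];
    loss += (residual * residual) / n;
    const common = (2 * residual) / n;
    grad[0] += (common * c * w * d) / (sigma * sigma);
    grad[1] += (common * c * w * d * d) / (sigma * sigma * sigma);
    grad[2] += common * w;
  }
  return { loss, grad };
}

// Target: one gaussian with mu = 20, sigma = 4, c = 1, sampled on 48 points.
const xs = Array.from({ length: 48 }, (_, k) => k);
const target1d = xs.map((x) => Math.exp(-((x - 20) ** 2) / (2 * 4 * 4)));

const params = [14, 6, 0.5]; // deliberately wrong initial guess
const adamStep = createAdam(params.length);
const initialLoss = lossAndGrad(params, xs, target1d).loss;
for (let iter = 0; iter < 2000; iter++) {
  const { grad } = lossAndGrad(params, xs, target1d);
  adamStep(params, grad);
}
const finalLoss = lossAndGrad(params, xs, target1d).loss;
console.log("params:", params, "loss:", initialLoss, "->", finalLoss);
```

Because Adam normalizes each gradient by its running magnitude, the same learning rate works reasonably for parameters with very different scales, which is convenient when positions, widths, and colors are optimized together.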
Densification = architecture search
The densify button replicates the adaptive density control of 3DGS: every ~100 iterations the optimizer changes the number of parameters. Where the accumulated positional gradient is high (the residual is "asking" for capacity there) it clones small gaussians and splits large ones, and it prunes the nearly transparent ones. In representation-learning terms it is greedy model selection interleaved with the optimization, a close relative of matching pursuit: add atoms to the dictionary where the residual is large. (An elegant reformulation is 3DGS-MCMC, which treats densification and pruning as sampling moves.)
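One densify-and-prune pass can be sketched as follows. This is a simplified illustration, not the demo's code or the reference 3DGS implementation: the field names (`gradAccum`, `steps`, `opacity`), every threshold, and the small jitter applied to clones are assumptions. Gaussians with a high mean positional gradient are cloned if they are small and split into two smaller children if they are large; nearly transparent ones are dropped.

```javascript
// Standard normal sample via Box-Muller; `random` returns uniforms in [0, 1).
function sampleNormal(random) {
  const u = 1 - random(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * random());
}

// One densify-and-prune pass. All thresholds are illustrative assumptions.
function densifyAndPrune(gaussians, options = {}) {
  const {
    gradThreshold = 0.5, // mean positional-gradient norm that triggers growth
    sizeThreshold = 4.0, // sigma above which a gaussian is split, not cloned
    minOpacity = 0.05,   // gaussians below this opacity are pruned
    splitFactor = 1.6,   // children of a split shrink by this factor
    random = Math.random,
  } = options;
  // Copy a gaussian with a new center and width, resetting its statistics.
  const fresh = (g, mu, sigma) => ({ ...g, mu, sigma, gradAccum: 0, steps: 0 });
  const next = [];
  for (const g of gaussians) {
    if (g.opacity < minOpacity) continue; // prune: contributes almost nothing
    const meanGrad = g.steps > 0 ? g.gradAccum / g.steps : 0;
    if (meanGrad <= gradThreshold) {
      next.push(fresh(g, [...g.mu], g.sigma)); // well fitted: keep as is
    } else if (g.sigma <= sizeThreshold) {
      // Clone: keep the original and add a slightly jittered copy.
      next.push(fresh(g, [...g.mu], g.sigma));
      next.push(fresh(g, g.mu.map((m) => m + 0.1 * g.sigma * sampleNormal(random)), g.sigma));
    } else {
      // Split: replace the parent with two smaller children sampled inside it.
      for (let k = 0; k < 2; k++) {
        const mu = g.mu.map((m) => m + g.sigma * sampleNormal(random));
        next.push(fresh(g, mu, g.sigma / splitFactor));
      }
    }
  }
  return next;
}

const before = [
  { mu: [10, 10], sigma: 2, opacity: 0.01, gradAccum: 90, steps: 100 },  // pruned
  { mu: [30, 12], sigma: 3, opacity: 0.9, gradAccum: 10, steps: 100 },   // kept
  { mu: [50, 40], sigma: 2, opacity: 0.8, gradAccum: 80, steps: 100 },   // cloned
  { mu: [70, 60], sigma: 8, opacity: 0.7, gradAccum: 120, steps: 100 },  // split
];
const after = densifyAndPrune(before);
console.log(`${before.length} gaussians before, ${after.length} after`);
```

Resetting `gradAccum` and `steps` after each pass means the next decision is based only on gradients gathered since the last change in the number of gaussians.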
The difference, as a table
| | Traditional splatting (EWA, ~2001) | 3DGS (2023) |
|---|---|---|
| Nature | rendering technique (resampling) | learned representation (inverse problem) |
| Input | measured geometry (scanner + normals) | calibrated photos + sparse init |
| Parameters | derived from the data (density, normals) | latent, optimized end-to-end |
| Color | fixed per point | view-dependent SH, learned |
| # primitives | = number of samples | dynamic (densify/prune) |
| Quality measured by | anti-aliasing, continuity | photometric loss on held-out views |
| Failure modes | holes, blur | floaters (overfitting), speculars |
| In our repo | splat producer (stub) | gsplat producer (testing on the GPU box) |