Lesson 04

Training: the scene is the model

Why 3DGS "trains": analysis-by-synthesis, gradients through the renderer, densification.

The inverse problem

In 3DGS you have no measured geometry: you have calibrated photographs $\{I_k, \pi_k\}$ and a sparse initialization cloud. The question flips around: find the scene $\Theta$ such that, rendered from the known cameras, it reproduces the photos:

$$\Theta^* = \arg\min_\Theta \sum_k \mathcal{L}\big(R(\Theta;\pi_k),\, I_k\big), \qquad \mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}, \quad \mathcal{L}_{\text{D-SSIM}} = 1-\text{SSIM}$$
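To make the per-view loss concrete, here is a minimal sketch in plain Python. It reads the structural term as $1-\text{SSIM}$ and uses $\lambda = 0.2$, the value reported for 3DGS. Two things are assumptions made for brevity: images are flat lists of intensities in $[0,1]$, and `global_ssim` computes SSIM from whole-image statistics, whereas real implementations average it over local windows. The function names are illustrative, not taken from any library.

```python
def l1_loss(render, target):
    """Mean absolute error between two equally sized lists of intensities in [0, 1]."""
    return sum(abs(r - t) for r, t in zip(render, target)) / len(target)


def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics (simplification: no local windows)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def photometric_loss(render, target, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM), the mix used in the objective above."""
    return (1 - lam) * l1_loss(render, target) + lam * (1 - global_ssim(render, target))


target = [0.0, 0.5, 1.0, 0.5]
flat_render = [0.5, 0.5, 0.5, 0.5]   # right mean brightness, no structure
loss_perfect = photometric_loss(target, target)
loss_flat = photometric_loss(flat_render, target)
```

A perfect render gives a loss of zero, while the flat render is penalized by both terms: the L1 term sees the per-pixel error and the SSIM term sees the missing structure.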

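The parameters $\Theta = \{\mu_i, s_i, q_i, \alpha_i, c_i\}_{i=1..N}$ (position, scale, rotation quaternion, opacity, color coefficients) are latent: from the photos you cannot measure the covariance of a gaussian or its SH coefficients. You can only estimate them, and since the renderer (lesson 03) is differentiable, the estimation is gradient descent: the gradient of the photometric loss flows back from the pixels to every single gaussian. For the position, for instance, written schematically for an isotropic splat of screen-space width $\sigma_i$ centered at the projected mean $\mu_i'$, with kernel weight $w_i(p)$ at pixel $p$ and the occlusion and projection-Jacobian factors left out: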

$$\frac{\partial \mathcal L}{\partial \mu_i} = \sum_{p}\; \underbrace{\frac{\partial \mathcal L}{\partial R(p)}}_{\text{residual}} \cdot \underbrace{c_i\, w_i(p)\,\frac{(p-\mu_i')}{\sigma_i^2}}_{\text{from the gaussian kernel}}$$
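The formula can be checked numerically. The sketch below assumes the simplest setting in which it holds exactly: additive rendering of isotropic 2D gaussians and an MSE loss, so that $\partial\mathcal L/\partial R(p) = 2\,(R(p)-T(p))/P$ for $P$ pixels. It compares the analytic gradient with central finite differences. All names and the test scene are made up for illustration.

```python
import math


def render(gaussians, size):
    """Additive render of isotropic 2D gaussians (mx, my, sigma, c) on a size x size grid."""
    img = [[0.0] * size for _ in range(size)]
    for mx, my, sigma, c in gaussians:
        for y in range(size):
            for x in range(size):
                d2 = (x - mx) ** 2 + (y - my) ** 2
                img[y][x] += c * math.exp(-d2 / (2 * sigma ** 2))
    return img


def mse(img, target):
    size = len(target)
    return sum((img[y][x] - target[y][x]) ** 2
               for y in range(size) for x in range(size)) / size ** 2


def grad_mu(gaussians, i, target):
    """Analytic dL/dmu_i: sum over pixels of dL/dR(p) * c_i * w_i(p) * (p - mu_i) / sigma_i^2."""
    size = len(target)
    img = render(gaussians, size)
    mx, my, sigma, c = gaussians[i]
    gx = gy = 0.0
    for y in range(size):
        for x in range(size):
            dL_dR = 2.0 * (img[y][x] - target[y][x]) / size ** 2   # residual term
            w = math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2 * sigma ** 2))
            gx += dL_dR * c * w * (x - mx) / sigma ** 2
            gy += dL_dR * c * w * (y - my) / sigma ** 2
    return gx, gy


def numeric_grad_mu(gaussians, i, target, eps=1e-5):
    """Central finite differences on mu_i, as an independent check."""
    size = len(target)
    grads = []
    for axis in (0, 1):
        vals = []
        for sign in (+1, -1):
            g = [list(p) for p in gaussians]
            g[i][axis] += sign * eps
            vals.append(mse(render(g, size), target))
        grads.append((vals[0] - vals[1]) / (2 * eps))
    return tuple(grads)


SIZE = 16
target = render([(9.0, 7.0, 2.5, 1.0)], SIZE)   # "photo": one gaussian at (9, 7)
guess = [(6.0, 8.0, 3.0, 0.8)]                  # estimate: misplaced, too wide, too dim
analytic = grad_mu(guess, 0, target)
numeric = numeric_grad_mu(guess, 0, target)
```

The two gradients agree to within the finite-difference error. The x component is negative, so a descent step moves the gaussian to the right, towards the target center.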

This is statistical fitting in the fullest sense: the photos are the training data, generalization is novel view synthesis (measured on held-out views), and overfitting shows up as floaters, i.e. gaussians that explain the training views perfectly from wrong 3D positions.
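Measuring generalization needs a train/test split over the views and an image metric. The sketch below is one possible setup, not the protocol of this course: it holds out every 8th view (a common convention in novel-view-synthesis benchmarks, assumed here) and scores a render with PSNR. Images are again flat lists of intensities in $[0,1]$.

```python
import math


def split_views(view_ids, hold_every=8):
    """Hold out every hold_every-th view for testing (assumed convention)."""
    train = [v for i, v in enumerate(view_ids) if i % hold_every != 0]
    test = [v for i, v in enumerate(view_ids) if i % hold_every == 0]
    return train, test


def psnr(render, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    err = sum((r - t) ** 2 for r, t in zip(render, target)) / len(target)
    return float("inf") if err == 0 else 10 * math.log10(peak ** 2 / err)


train_ids, test_ids = split_views(list(range(300)))
example_psnr = psnr([0.5, 0.5], [0.4, 0.6])   # error of 0.1 per pixel, about 20 dB
```

A large gap between PSNR on `train_ids` and on `test_ids` is the quantitative signature of the overfitting described above.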

Try it: 2D Gaussian Splatting in the browser

Below is a real optimizer, written in JavaScript: a 96×96 target image is reconstructed from isotropic 2D gaussians whose parameters $(\mu_i, \sigma_i, c_i)$ start out almost at random and are updated with Adam on the analytic gradients of the MSE. Press train and watch it converge; press densify when the loss stalls.

[Interactive demo: panels for the target $T$, the render $R(\Theta)$ and the loss (log scale), with an option to show the gaussians as circles at 1σ.]

Fig. 1 (interactive). Additive rendering is used for simplicity ($R(p)=\sum_i c_i\,e^{-\|p-\mu_i\|^2/2\sigma_i^2}$) instead of the transmittance compositing of lesson 03. The gradient mechanics are the same (chain rule from the pixel residual through the kernel), although compositing adds opacity and transmittance factors. Notice how the gaussians migrate, widen and shrink to explain the image: none of them was "measured", they are all estimated.
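For readers without the interactive page, here is a one-dimensional analogue of the demo in plain Python. It is not the JavaScript behind the figure, only a sketch under the same assumptions: additive rendering, MSE loss, analytic gradients for $(\mu_i, \sigma_i, c_i)$ and Adam updates. The signal, the learning rate and the clamp on $\sigma$ are arbitrary choices.

```python
import math
import random


def render(params, xs):
    """Additive 1D render: R(x) = sum_i c_i * exp(-(x - mu_i)^2 / (2 sigma_i^2))."""
    return [sum(c * math.exp(-(x - mu) ** 2 / (2 * s * s)) for mu, s, c in params)
            for x in xs]


def loss_and_grads(params, xs, target):
    """MSE and its analytic gradient with respect to every (mu, sigma, c)."""
    out = render(params, xs)
    n = len(xs)
    loss = sum((r - t) ** 2 for r, t in zip(out, target)) / n
    grads = []
    for mu, s, c in params:
        g_mu = g_s = g_c = 0.0
        for x, r, t in zip(xs, out, target):
            dL_dR = 2.0 * (r - t) / n
            d = x - mu
            w = math.exp(-d * d / (2 * s * s))
            g_mu += dL_dR * c * w * d / (s * s)
            g_s += dL_dR * c * w * d * d / (s ** 3)
            g_c += dL_dR * w
        grads.append([g_mu, g_s, g_c])
    return loss, grads


def train(target, xs, n_gauss=6, steps=300, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Random centers, fixed initial width and a small amplitude.
    params = [[rng.uniform(xs[0], xs[-1]), 1.5, 0.1] for _ in range(n_gauss)]
    m = [[0.0] * 3 for _ in params]   # Adam first-moment estimates
    v = [[0.0] * 3 for _ in params]   # Adam second-moment estimates
    b1, b2, eps = 0.9, 0.999, 1e-8
    history = []
    for step in range(1, steps + 1):
        loss, grads = loss_and_grads(params, xs, target)
        history.append(loss)
        for i, g in enumerate(grads):
            for k in range(3):
                m[i][k] = b1 * m[i][k] + (1 - b1) * g[k]
                v[i][k] = b2 * v[i][k] + (1 - b2) * g[k] ** 2
                m_hat = m[i][k] / (1 - b1 ** step)
                v_hat = v[i][k] / (1 - b2 ** step)
                params[i][k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            params[i][1] = max(params[i][1], 0.3)   # keep sigma away from zero
    return params, history


xs = [float(x) for x in range(32)]
target = render([(8.0, 2.0, 1.0), (22.0, 3.0, 0.6)], xs)   # two-bump signal to recover
params, history = train(target, xs)
```

Plotting `history` on a log scale gives the same kind of loss curve as the figure. The fitted `params` need not match the two generating gaussians one to one, since six primitives can explain two bumps in many ways.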

Densification = architecture search

The densify button replicates the adaptive density control of 3DGS: every ~100 iterations the optimizer changes the number of parameters. Where the accumulated positional gradient is high (the residual is "asking" for capacity there) it clones small gaussians and splits large ones, and it prunes the nearly transparent ones. In representation-learning terms it is greedy model selection interleaved with the optimization, a close relative of matching pursuit: add atoms to the dictionary where the residual is large. (The elegant reformulation is 3DGS-MCMC: densification and pruning as sampling moves.)
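One densification pass can be sketched as follows. This is a simplified illustration for isotropic 2D gaussians, not the code of 3DGS or of the demo. The `Gaussian` class and `densify_and_prune` are hypothetical names. The gradient threshold, opacity threshold and split factor follow values commonly quoted for 3DGS, and `size_thresh` is a placeholder in pixel units.

```python
import random
from dataclasses import dataclass, replace


@dataclass
class Gaussian:
    mu: tuple                 # (x, y) center
    sigma: float              # isotropic scale
    opacity: float
    grad_accum: float = 0.0   # sum of positional-gradient norms since the last pass
    n_obs: int = 0            # number of iterations accumulated


def densify_and_prune(gaussians, grad_thresh=0.0002, size_thresh=2.0,
                      min_opacity=0.005, split_factor=1.6, rng=None):
    """One pass: prune transparent gaussians, clone small ones and split large ones
    whose average positional gradient exceeds grad_thresh."""
    rng = rng or random.Random(0)
    out = []
    for g in gaussians:
        if g.opacity < min_opacity:
            continue                                  # prune: nearly transparent
        avg_grad = g.grad_accum / max(g.n_obs, 1)
        fresh = replace(g, grad_accum=0.0, n_obs=0)   # copy with statistics reset
        if avg_grad <= grad_thresh:
            out.append(fresh)                         # well fitted: keep unchanged
        elif g.sigma <= size_thresh:
            out += [fresh, replace(fresh)]            # clone: small, region under-covered
        else:
            for _ in range(2):                        # split: large, replace by two smaller
                center = tuple(m + rng.gauss(0.0, g.sigma) for m in g.mu)
                out.append(replace(fresh, mu=center, sigma=g.sigma / split_factor))
    return out


scene = [
    Gaussian((0.0, 0.0), 1.0, 0.001),                            # transparent: pruned
    Gaussian((1.0, 1.0), 1.0, 0.5, grad_accum=0.1, n_obs=100),   # small, high gradient: cloned
    Gaussian((2.0, 2.0), 4.0, 0.5, grad_accum=0.1, n_obs=100),   # large, high gradient: split
    Gaussian((3.0, 3.0), 1.0, 0.5, grad_accum=0.0, n_obs=100),   # low gradient: kept
]
new_scene = densify_and_prune(scene)
```

The parameter count changes from 4 to 5 gaussians in this pass, which is the "architecture search" step: the optimizer state would then be extended to cover the new primitives before training resumes.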

The difference, as a table

Aspect	Traditional splatting (EWA, ~2001)	3DGS (2023)
Nature	rendering technique (resampling)	learned representation (inverse problem)
Input	measured geometry (scanner + normals)	calibrated photos + sparse init
Parameters	derived from the data (density, normals)	latent, optimized end-to-end
Color	fixed per point	view-dependent SH, learned
# primitives	= number of samples	dynamic (densify/prune)
Quality measured by	anti-aliasing, continuity	photometric loss on held-out views
Failure modes	holes, blur	floaters (overfitting), speculars
In our repo	`splat` producer (stub)	`gsplat` producer (testing on the GPU box)

Prediction for our test — the model will be strong inside the envelope of the training viewpoints (along the recorded trajectory: our use case!) and will degrade outside it: with 300 poses over ~0.9 m of path, novel views far from the path will show floaters. That is not a bug: it is small-dataset variance. This is also why the init from our point cloud matters: the landscape is non-convex and the gradient is local — starting near the right basin is half the job.