Lesson 04

Training: the scene is the model

Why 3DGS "trains": analysis-by-synthesis, gradients through the renderer, densification.

The inverse problem

In 3DGS you have no measured geometry: you have calibrated photographs \(\{I_k, \pi_k\}\) and a sparse initialization cloud. The question flips around: find the scene \(\Theta\) such that, rendered from the known cameras, it reproduces the photos:

$$\Theta^* = \arg\min_\Theta \sum_k \mathcal{L}\big(R(\Theta;\pi_k),\, I_k\big), \qquad \mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}$$
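As a rough illustration of the per-image loss in the objective above, here is a minimal Python sketch. Several things are assumptions made for brevity and are not taken from this lesson: the function names, grayscale images stored as flat lists with values in [0, 1], SSIM computed over the whole image as a single window (practical implementations average it over small sliding windows), the dissimilarity term taken as 1 − SSIM, and the default \(\lambda = 0.2\), which is the value reported in the original 3DGS paper.

```python
def l1_loss(render, target):
    """Mean absolute error between two equally sized flat pixel lists."""
    return sum(abs(r - t) for r, t in zip(render, target)) / len(target)


def global_ssim(x, y, dynamic_range=1.0):
    """SSIM with the whole image treated as one window (a simplification)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def photometric_loss(render, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, with D-SSIM assumed to be 1 - SSIM."""
    d_ssim = 1.0 - global_ssim(render, target)
    return (1.0 - lam) * l1_loss(render, target) + lam * d_ssim
```

The L1 term penalizes per-pixel differences, while the SSIM term compares local means, variances and covariance, so it reacts to structural errors that a small per-pixel residual can hide.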

The parameters \(\Theta = \{\mu_i, s_i, q_i, \alpha_i, c_i\}_{i=1..N}\) (position, scale, rotation, opacity and color coefficients of each gaussian) are latent: from the photos you cannot measure the covariance of a gaussian (built from \(s_i\) and \(q_i\)) or its SH coefficients \(c_i\). You can only estimate them, and since the renderer (lesson 03) is differentiable, the estimation runs by gradient descent: the gradient of the photometric loss flows back from the pixels to every single gaussian. For the position, for instance:

$$\frac{\partial \mathcal L}{\partial \mu_i} = \sum_{p}\; \underbrace{\frac{\partial \mathcal L}{\partial R(p)}}_{\text{residual}} \cdot \underbrace{c_i\, w_i(p)\,\frac{(p-\mu_i')}{\sigma_i^2}}_{\text{from the gaussian kernel}}$$

This is statistical fitting in the fullest sense: the photos are the training data, generalization is novel view synthesis (measured on held-out views), and overfitting shows up as floaters, i.e. gaussians that explain the training views perfectly from wrong 3D positions.
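The position gradient above can be sanity-checked numerically. The sketch below is a toy under stated assumptions, not the 3DGS rasterizer: gaussians are isotropic and 2D, color is a scalar, rendering is a plain additive sum with no transmittance, and the loss is MSE. All names (`weight`, `render`, `grad_mu`, and so on) are hypothetical. It compares the analytic gradient with central finite differences on the same loss.

```python
import math


def weight(p, mu, sigma):
    """Isotropic kernel w_i(p) = exp(-|p - mu|^2 / (2 sigma^2))."""
    d2 = (p[0] - mu[0]) ** 2 + (p[1] - mu[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def render(pixels, gaussians):
    """Additive render R(p) = sum_i c_i * w_i(p) (assumed toy model)."""
    return [sum(g["c"] * weight(p, g["mu"], g["sigma"]) for g in gaussians)
            for p in pixels]


def mse(pixels, gaussians, target):
    r = render(pixels, gaussians)
    return sum((a - b) ** 2 for a, b in zip(r, target)) / len(target)


def grad_mu(pixels, gaussians, target, i):
    """Analytic dL/dmu_i: sum over pixels of
    dL/dR(p) * c_i * w_i(p) * (p - mu_i) / sigma_i^2."""
    g = gaussians[i]
    r = render(pixels, gaussians)
    gx = gy = 0.0
    for p, rp, tp in zip(pixels, r, target):
        dL_dR = 2.0 * (rp - tp) / len(target)  # residual term of the MSE
        k = g["c"] * weight(p, g["mu"], g["sigma"]) / g["sigma"] ** 2
        gx += dL_dR * k * (p[0] - g["mu"][0])
        gy += dL_dR * k * (p[1] - g["mu"][1])
    return gx, gy


def numeric_grad_mu(pixels, gaussians, target, i, eps=1e-6):
    """Central finite differences on the same loss, one axis at a time."""
    out = []
    for axis in (0, 1):
        losses = []
        for sign in (+1, -1):
            moved = [dict(g, mu=list(g["mu"])) for g in gaussians]
            moved[i]["mu"][axis] += sign * eps
            losses.append(mse(pixels, moved, target))
        out.append((losses[0] - losses[1]) / (2 * eps))
    return tuple(out)


pixels = [(x, y) for y in range(8) for x in range(8)]
truth = [{"mu": [4.0, 3.0], "sigma": 1.5, "c": 1.0}]
guess = [{"mu": [2.5, 4.5], "sigma": 2.0, "c": 0.7}]
target = render(pixels, truth)

analytic = grad_mu(pixels, guess, target, 0)
numeric = numeric_grad_mu(pixels, guess, target, 0)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two printed pairs agree to several decimal places, the formula and the renderer are consistent. The same check is a cheap safeguard whenever gradients are written by hand instead of obtained by automatic differentiation.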

Try it: 2D Gaussian Splatting in the browser

Below is a real optimizer, written in JavaScript: a 96×96 target image is reconstructed from isotropic 2D gaussians whose parameters \((\mu_i, \sigma_i, c_i)\) start out almost at random and are updated with Adam on the analytic gradients of the MSE. Press train and watch it converge; press densify when the loss stalls.

[Interactive figure, three panels: target \(T\), render \(R(\Theta)\), loss (log scale)]
Fig. 1 — interactive. Additive rendering for simplicity (\(R(p)=\sum_i c_i\,e^{-\|p-\mu_i\|^2/2\sigma_i^2}\)) instead of the transmittance compositing of lesson 03 — the gradient mechanics are identical. Notice how the gaussians migrate, widen and shrink to explain the image: none of them was "measured", they are all estimated.
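For readers who prefer to read the loop rather than click it, here is a rough Python sketch of the same idea as Fig. 1. It is not the JavaScript behind the demo: the image size (12×12 instead of 96×96), the number of gaussians, the learning rate, the clamp on \(\sigma\) and all names are assumptions chosen so that pure Python stays fast.

```python
import math
import random

SIZE = 12  # tiny image (assumption) so the pure-Python loops stay fast


def render(params):
    """Additive render; params is a flat list [mx, my, sigma, c] per gaussian."""
    img = []
    for y in range(SIZE):
        for x in range(SIZE):
            v = 0.0
            for k in range(0, len(params), 4):
                mx, my, s, c = params[k:k + 4]
                v += c * math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2 * s * s))
            img.append(v)
    return img


def loss_and_grad(params, target):
    """MSE and its analytic gradient with respect to every parameter."""
    img = render(params)
    n = len(target)
    grad = [0.0] * len(params)
    loss = 0.0
    for idx, (r, t) in enumerate(zip(img, target)):
        x, y = idx % SIZE, idx // SIZE
        loss += (r - t) ** 2 / n
        dL_dR = 2.0 * (r - t) / n
        for k in range(0, len(params), 4):
            mx, my, s, c = params[k:k + 4]
            d2 = (x - mx) ** 2 + (y - my) ** 2
            w = math.exp(-d2 / (2 * s * s))
            grad[k] += dL_dR * c * w * (x - mx) / (s * s)      # d/d mx
            grad[k + 1] += dL_dR * c * w * (y - my) / (s * s)  # d/d my
            grad[k + 2] += dL_dR * c * w * d2 / s ** 3         # d/d sigma
            grad[k + 3] += dL_dR * w                           # d/d c
    return loss, grad


def adam_fit(params, target, steps=300, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Update params in place with Adam; return the loss at every step."""
    m = [0.0] * len(params)
    v = [0.0] * len(params)
    history = []
    for t in range(1, steps + 1):
        loss, grad = loss_and_grad(params, target)
        history.append(loss)
        for j, g in enumerate(grad):
            m[j] = b1 * m[j] + (1 - b1) * g
            v[j] = b2 * v[j] + (1 - b2) * g * g
            m_hat = m[j] / (1 - b1 ** t)  # bias-corrected first moment
            v_hat = v[j] / (1 - b2 ** t)  # bias-corrected second moment
            params[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        for k in range(2, len(params), 4):
            params[k] = max(params[k], 0.3)  # keep sigma positive (assumed clamp)
    return history


random.seed(0)
# Target: two known gaussians. Initial guess: four gaussians at random positions.
target = render([3.0, 4.0, 1.5, 1.0, 8.0, 7.0, 2.0, 0.6])
params = []
for _ in range(4):
    params += [random.uniform(2, 9), random.uniform(2, 9), 2.0, 0.3]
history = adam_fit(params, target)
print("first loss:", history[0], "last loss:", history[-1])
```

Using more gaussians than the target contains is deliberate: as in the demo, the optimizer has spare capacity and decides by itself where to spend it.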

Densification = architecture search

The densify button replicates the adaptive density control of 3DGS: every ~100 iterations the optimizer changes the number of parameters — it clones or splits the gaussians where the accumulated positional gradient is high (the residual is "asking" for capacity there), and prunes the transparent ones. In representation-learning terms it is greedy model selection interleaved with the optimization, a close relative of matching pursuit: add atoms to the dictionary where the residual is large. (The elegant reformulation is 3DGS-MCMC: densification and pruning as sampling moves.)
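One density-control pass can be sketched as follows. This is a simplified illustration, not the reference 3DGS code: the dictionary layout, the function name and the isotropic `sigma` are hypothetical, the thresholds are placeholder defaults, and pruning is applied in the same pass as cloning and splitting.

```python
import random


def densify_and_prune(gaussians, grad_accum, n_steps,
                      grad_threshold=0.0002, size_threshold=1.0,
                      min_opacity=0.005, split_factor=1.6, rng=random):
    """Return a new list of gaussians after one density-control pass.

    gaussians:  list of dicts with keys "mu" (list of floats), "sigma", "opacity"
    grad_accum: summed positional-gradient norms, one float per gaussian
    n_steps:    number of iterations the sums were accumulated over
    """
    out = []
    for g, acc in zip(gaussians, grad_accum):
        if g["opacity"] < min_opacity:
            continue  # prune: nearly transparent, contributes almost nothing
        if acc / n_steps <= grad_threshold:
            out.append(g)  # small average gradient: keep unchanged
        elif g["sigma"] > size_threshold:
            # split: a large gaussian is replaced by two smaller ones whose
            # centers are sampled from the original gaussian
            for _ in range(2):
                mu = [rng.gauss(m, g["sigma"]) for m in g["mu"]]
                out.append({"mu": mu, "sigma": g["sigma"] / split_factor,
                            "opacity": g["opacity"]})
        else:
            # clone: a small gaussian is duplicated; the optimizer is then
            # expected to move the two copies apart
            out.append(g)
            out.append({"mu": list(g["mu"]), "sigma": g["sigma"],
                        "opacity": g["opacity"]})
    return out


gaussians = [
    {"mu": [0.0, 0.0], "sigma": 0.5, "opacity": 0.001},  # transparent -> pruned
    {"mu": [1.0, 1.0], "sigma": 0.5, "opacity": 0.8},    # low gradient -> kept
    {"mu": [2.0, 2.0], "sigma": 0.5, "opacity": 0.8},    # high gradient, small -> cloned
    {"mu": [3.0, 3.0], "sigma": 3.2, "opacity": 0.8},    # high gradient, large -> split
]
grad_accum = [1.0, 0.001, 1.0, 1.0]
new_gaussians = densify_and_prune(gaussians, grad_accum, n_steps=100)
print(len(gaussians), "->", len(new_gaussians))  # 4 -> 5
```

The parameter count changes between optimizer steps, which is why any per-parameter optimizer state (such as the Adam moments) has to be resized along with the list of gaussians.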

The difference, as a table

| | Traditional splatting (EWA, ~2001) | 3DGS (2023) |
|---|---|---|
| Nature | rendering technique (resampling) | learned representation (inverse problem) |
| Input | measured geometry (scanner + normals) | calibrated photos + sparse init |
| Parameters | derived from the data (density, normals) | latent, optimized end-to-end |
| Color | fixed per point | view-dependent SH, learned |
| # primitives | = number of samples | dynamic (densify/prune) |
| Quality measured by | anti-aliasing, continuity | photometric loss on held-out views |
| Failure modes | holes, blur | floaters (overfitting), speculars |
| In our repo | splat producer (stub) | gsplat producer (testing on the GPU box) |

Prediction for our test — the model will be strong inside the envelope of the training viewpoints (along the recorded trajectory: our use case!) and will degrade outside it: with 300 poses over ~0.9 m of path, novel views far from the path will show floaters. That is not a bug: it is small-dataset variance. This is also why the init from our point cloud matters: the landscape is non-convex and the gradient is local — starting near the right basin is half the job.