This article describes some work that, while interesting, didn’t really break new scientific ground and thus wasn’t published in usual venues. This is best considered a case study in how to apply probabilistic inference to a visually interesting problem. The associated code and notebook were originally created and open sourced in 2024, when I was in Google Research.
We start with a video of a static (without moving or deforming objects) scene recorded by a possibly moving color (as opposed to depth) camera. We’d like to determine the 3D geometry of the scene, as well as the poses (positions and orientations) of the camera in each of the video frames. Structure from Motion (SfM) is a technique for estimating those quantities. This is a common pre-processing step before using a more flexible technique like Neural Radiance Fields (NeRFs) or Gaussian Splats to reconstruct a photorealistic double of the scene captured in the video. These techniques let you use the reconstructed geometry to generate novel views of the scene for visualization purposes, or import it into your favorite 3D modeling package or a video game (after a bit of manual post-processing).
The task of reconstructing a 3D scene geometry in the presence of sensor noise (e.g. in low light conditions, poor sensor calibration, sensor malfunction etc.) is ill-posed. Even with perfect sensors, certain 3D scenes produce highly ambiguous videos when filmed by color cameras. In certain situations, parts of a scene are visible only in a few frames, which does not provide enough information to disambiguate the possible interpretations of the scene geometry.
A standard solution is to discard parts of the scene that are seen in only a few of the frames and then use robust regression techniques (such as RANSAC ) to estimate the remaining quantities (such as camera poses). Can we do better?
In this article we’ll use Bayesian modeling and approximate Markov Chain Monte Carlo (MCMC) Bayesian inference to quantify the uncertainty about the scene geometry and camera poses. This means is that we’ll estimate the scene geometry and camera poses that are consistent with the data at hand (the video), and how probable they are.
This article is accompanied by a Jupyter notebook, which implements this case study in Python. As a special feature, it shows how to use interactive widgets to implement an interactive version of MCMC which lets you vary the model and inference hyperparameters in close-to-real time.
Modeling an observed phenomenon requires specifying the forward model, which deterministically maps the systems internal (aka unobserved, hidden or commonly latent) state to some observable values. For example, in this article we will say that the scene geometry and camera poses are the latents and the observables are the color values which could be seen by a perfect (noiseless) camera. We’ll get into the details of what those are concretely in the next section, but for now let’s group all the latents into a vector \(Z\) and the observables into a vector \(\hat{X}\). Let’s define the forward model as the mapping \(f\) from \(Z\) to \(\hat{X}\): \(\hat{X} = f(Z)\). \(f\) could be a 3D renderer, such as one used in a video game or 3D modeling package.
To make things Bayesian, we need to place probability distributions over all the quantities in the model. First, let us describe the observation model: how the outputs of the model get measured by the sensors we have (the camera). We posit that the sensors we use are not perfect and are inherently noisy. We place some distribution over that noise: \(P(X ; \hat{X})\). \(X\) represents the actual, noisy, observations as reported by the real camera. It is convenient to combine the forward model with the noise distribution and define the observation distribution or the likelihood: \(P(X | Z) = P(X ; \hat{X} = f(Z))\). For example, a standard (though not necessarily good) choice of a likelihood would be an independent Gaussian noise distribution that adds a little bit of independent noise to each camera camera pixel, for each frame.
We also need to place a distribution over \(Z\), \(P(Z)\). While \(P(X | Z)\) in principle corresponds to a real property of the system (the noise introduced by the camera), \(P(Z)\) is specific to the observer (us) and represents the beliefs we have about the system state before we made any observations a priori. \(P(Z)\) is termed the prior distribution. For example, we might know that we’re going to be filming outside, on Earth, so we probably wouldn’t expect any geometry above us (where the sky is).
Given these two distributions, we can form two other related distributions. First, we have the joint distribution over \(X\) and \(Z\), \(P(X, Z) = P(X | Z) P(Z)\). It is a complete description of how we think the system works, and which sets of \(X\)’s and \(Z\)’s are more or less probable. Then there’s \(P(Z | X)\) which is defined as:
\[\begin{align*} P(Z | X) = \frac{P(X | Z) P(Z)}{P(X)}. \end{align*}\]
This distribution is called the posterior, and represents our beliefs about the system state after we observed the observations \(X\). The ability to coherently go from the prior \(P(Z)\) to the posterior \(P(Z | X)\) is the reason we went through all this effort of coming up with the prior and likelihood distributions. This belief updating is what originally attracted me to Bayesian formalism. The posterior is often a difficult quantity to evaluate, requiring many approximations. The process of obtaining the posterior is called Bayesian inference.
The posterior is of particular interest, and is the final object that this article will aim to characterize. The posterior will fully describe the uncertainty we have about the scene 3D geometry and camera poses given the observed video. By its nature as a probability distribution, it will assign different sets of states different probabilities. For example, if the set of probable camera poses for a particular frame under the posterior covers a lot of different possibilities, then we can say that we’re uncertain about that camera pose for that frame. We can go be even more specific about what aspects of the state we’re uncertain about, as will be shown in the future sections.
In this article is we will specify all the terms necessary to form the joint distribution and then perform Bayesian inference conditioned on the observations.
Let us start by specifying the model. We will specify the forward model, the observation model and the prior.
We are after a mechanistic description of how the video is generated given the sequence camera poses and the scene geometry. Lets specify the outputs first.
Typically, a video is a sequence of 2D arrays of RGB color triplets. This is a somewhat difficult representation to deal with, so for this article we will assume that the video has been pre-processed to extract keypoint tracks. A keypoint track is a sequence of 2D positions on the “screen” (the camera plane) of an identifiable 3D location in the scene as seen by the camera. These keypoint tracks can be obtained by using a keypoint detector such as SIFT and then stitching the detected keypoints into a track. Alternatively, there are modern neural-net based solutions that generate the tracks directly, like CoTracker. In this article, we’ll be using synthetic data, which means we can generate keypoint tracks directly from the known (to us, not the probabilistic model) 3D geometry and camera poses (see below for how we generate the synthetic data).
Let’s say we have num_keypoints
number of keypoint
tracks. As output, our forward model will return an array of floats
keypoint_screen_positions_mean
with shape
[num_frames, num_keypoints, 2]
, where the last dimensions
will be the normalized screen coordinates, ranging from -1 to 1. We call
them the mean
positions because they’re assumed to be
noiseless (and we’ll use a symmetric noise distribution later on).
For inputs, first we have the 3D positions of each keypoint,
keypoint_world_positions
. Since we assume a static scene,
this dense array will not have a num_frames
dimension, and
will simply be a float array with shape [num_keypoints, 3]
.
The last dimension are the XYZ “world” coordinates measured in, let’s
say, meters.
The last input will be the camera poses. For the camera translation,
we’ll simply use an array camera_positions
with shape
[num_frames, 3]
. For the camera rotation, it is convenient
to use quaternions, as they are compact and are easy to do
gradient-based probabilistic inference on. This will be an array
camera_quaternions
with shape
[num_frames, 4]
.
This completes the specification of the API of the forward model, it will look something like this (in Python):
def forward_model(
keypoint_world_positions,
camera_positions,
camera_quaternions,
):# Implementation here.
return keypoint_screen_positions_mean
We’ll implement the forward model by assuming a pinhole camera model, and then perform the standard camera transformation pipeline. This isn’t super interesting, and there are many articles elsewhere that describe this process. A quick summary of it is, for each keypoint and camera pose, we’ll perform the following transformation:
def project_keypoint(keypoint_world_position, camera_position, camera_quaternion):
= make_camera_world_matrix(camera_position, camera_quaternion)
camera_world_matrix = inv(camera_world_matrix) @ keypoint_world_position
keypoint_camera_position = keypoint_camera_position[:2] / keypoint_camera_position[2]
keypoint_screen_position_mean return keypoint_screen_position_mean
This function produces the screen coordinates of a single keypoint. Here is a graphical illustration of what is going on.
One interesting property of this projection transformation is that it is invariant to global scale and the position and orientation of any one of the cameras. The former is easy to see, because in the penultimate line in the snippet above you can multiply the first two coordinates and the last coordinates by the same number (a global rescaling) does not change the screen positions of the keypoints. These two invariants are somewhat problematic, as they impede the interpretability of the model. We’ll solve these two problems via a careful choice of the prior.
To apply this transformation to multiple waypoints we’ll be using
JAX’s vmap
program transformation to vectorize this
function:
def forward_model(
keypoint_world_positions,
camera_positions,
camera_quaternions,
):return jax.vmap(project_keypoint)(keypoint_world_positions, camera_positions, camera_quaternions)
For the observation model our goal is to produce a distribution over
the observations, the random variable we’ll call
keypoint_screen_positions
.
One complication with the keypoints is that due to occlusion and a
finite field-of-view of the camera, it can sometimes happen that a
keypoint is not visible at a particular frame. For simplicity, we won’t
be modeling this, and will simply assume that this is an input to the
model. This will be an array keypoint_visibility
of shape
[num_frames, num_keypoints]
, where the elements indicate
whether the corresponding keypoint is visible in the corresponding
frame.
For the actual distribution, we’ll use an isotropic Student’s T distribution centered at the mean keypoint screen positions. For simplicity, we’ll assume that each keypoint’s distribution is conditionally independent of the remaining keypoints, and also conditionally independent across all frames. Unlike the more “default” choice of a Gaussian, a Student’s T distribution has configurable heavy tails (via the degrees-of-freedom hyperparameter, which we’ll treat as a hyperparameter of the model), which make it more robust to extreme values (outliers). While we will be using synthetic data in this work which has no outliers, this is an important consideration for real data. Even in the synthetic setting, we benefit from having a more “relaxed” notion of error, as it’ll help the approximate inference converge a bit faster.
To implement this distribution we will use TensorFlow Probability’s JAX backend. This will let us express the conditional independence, as well as take into account of the keypoint visibility, which we implement via masking (a masked distribution will discard the log probability contributions of elements which are masked).
tfd.Independent(
tfd.Masked(
tfd.StudentT(
observation_noise_degrees,
keypoint_screen_position_mean,
observation_noise_scale,
),
keypoint_visibility[..., jnp.newaxis],
),3,
)
observation_noise_scale
controls the magnitude of the
observation noise. We can infer this as well, although for this article
we’ll be setting it to a fixed value.
From the previous two sections, we can see that our latents are:
keypoint_world_positions
, shape:
[num_frames, num_keypoints, 3]
camera_positions
, shape:
[num_frames, 3]
camera_quaternions
, shape:
[num_frames, 4]
observation_noise_scale
, shape: []
(scalar)Let us specify the priors for each one. What kind of prior should we use? I like to think there are two schools of Bayesians, the “Prior Bayesians” and the “Posterior Bayesians”.
To a Prior Bayesian, the prior is the primary object of interest. They want to generate samples from it, and to evaluate log densities of various latent configurations against it. The probabilistic inference, while an important application to a Prior Bayesian, is merely one application amongst many. It is an afterthought in many cases. The current programme of so-called “Generative AI” is firmly in the Prior Bayesian tradition. In Generative AI we generate samples from priors over images, videos, text and audio. The main challenge to a Prior Bayesian is how to specify the prior, and this is typically done via machine learning from large datasets.
To a Posterior Bayesian, the prior is a nuisance: they’re often reluctant to actually go through the effort of injecting prior information into it and instead use so-called uninformative priors. Posterior Bayesians are interested in the posterior. Probabilistic inference is the end-goal for them, thus they focus more on the likelihood and using a lot of data to condition their models on. In the context of 3D reconstruction, the way we fit NeRFs and Gaussian Splats is closely related to the methodologies of Posterior Bayesians. While probabilistic inference is rarely used in those domains, the amount of data use and the relatively low amount of prior information is very familiar to a Posterior Bayesian.
Both schools of thought have their place. In this article, we will proceed in the Posterior Bayesian tradition, because constructing informative priors is very difficult and we’ll have a lot of data to work with (the keypoint tracks over many frames). However, what we’ll find later on is that some keypoints are only seen from just a few camera positions and thus are poorly constrained by the observations. We’ll of course do our best to quantify our uncertainty about those points, but we’ll essentially give up on them since our priors contain almost no information. A Prior Bayesian will be able to go further, and rely on a considerable amount of information in their informative prior. A stark illustration of this is what happens if only a single frame is available. The model in this article will not do anything useful in that case. A model in the Prior Bayesian tradition, on the other hand, will actually be able to make a very decent guess at the scene geometry even from a single frame.
We’ll treat most keypoints in keypoint_world_positions
as having been sampled from a large-variance normal distribution
centered at the ground truth scene center. For a few keypoints (lets say
4 or so) we’ll cheat and outright tell the model where they are by
placing a very tight Gaussian centered at their ground truth positions.
We do this to simplify the interpretability of the inference output, but
in the real world this would be equivalent to, e.g., recognizing a
particular object via a machine learned object detector.
For most frames of camera_positions
and
camera_quaternions
we’ll also treat them as being sampled
from some uninformative prior (we won’t be assuming smooth camera
motion). For the first frame, however, we’ll fix the camera position and
orientation to the ground truth values.
By using these two very informative prior choices for the first camera and the 4 keypoint positions we solve the invariance issues mentioned in the Forward Model section. The model and inference would work fine without these two hacks, but it’ll be annoying to interpret the results since they would otherwise be arbitrarily shifted, scaled and rotated.
For the last variable, observation_noise_scale
, we’ll
just use a standard Inverse Gamma distribution.
Two of the random variables, camera_quaternions
and
observation_noise_scale
, have complicated supports, being
the unit four-dimensional sphere and the set of positive reals
respectively. The inference method we’ll use, Hamiltonian Monte Carlo,
cannot handle those supports by default, so we need to do something
about this.
For camera_quaternions
we’ll instead put a “donut”
shaped distribution in the unconstrained \(\mathbb{R}^4\) space, which places a lot of
density near the surface of the four dimensional unit sphere, but
relaxes the norm constraint a bit. We normalize these “raw” quaternions
before using them in the forward model. The specification of the donut
distribution is a little complex, but roughly speaking it is a product
distribution of the vector direction on the four sphere (a uniform one
in this case), and a distribution over the norm of the vector (a
suitably scaled Gamma distribution). Here is a plot of the density of a
2D version of the donut distribution.
For observation_noise_scale
, we’ll use a
change-of-variables identity and instead place a distribution on “raw”
observation noise scale on the real line, which we can then constrain to
be positive before using it in the observation model via the softplus
function.
To actually implement the model in Python we’ll use a small probabilistic programming language (PPL) of my design. The minimal job of a PPL is to let the user specify the probabilistic model (a joint distribution over some random variables), sample from it and evaluate the log density. My PPL is based on effect handling, where the model is specified via sampling statements, but non-standard execution lets you additionally do the other operations of interest. Here is a sample model:
def model():
= sample("x", tfd.Normal(0.0, 1.0))
x = sample("y", tfd.Normal(x, 1.0))
y return [x, y]
This model has two random variables (“x” and “y”), and represents a
random function which takes no inputs and returns two scalars. The
sample
calls represent effects, which we can handle via
“effect handlers”. An example of an effect handler is
Sample
, which causes the effects to return a random sample
from the associated distribution.
For convenience, I provide wrapper functions like
model_sample
and model_log_prob
(and plenty of
others) which apply a bunch of handlers and then execute the model. For
example, model_log_prob
is implemented as:
def model_log_prob(model_fn, value):
with apply_handlers(SetValues(value), LogProb()) as trace:
= model_fn()
res return trace["log_prob"], res
It uses the SetValues
handler to fix the random
variables to a particular value and the LogProb
handler to
evaluate and sum up the log densities. Note how we also return the
result of evaluating the model, which is convenient for returning
debugging and visualization information (e.g. we’ll use this mechanism
to return the keypoint_screen_positions_mean
as well as
other diagnostics and metrics). You don’t need a PPL to do probabilistic
modeling and inference, but using one can make the code more succinct,
and if it’s a simple PPL like this one, it can do so without bogging you
down in abstractions.
We can’t do probabilistic inference without conditioning the model on some data. For the purposes of this article, we will use synthetic data that we will generate ourselves. We’ll generate data by arranging some objects and creating a camera track. To generate the keypoint tracks we will some points from the object surfaces and then use the forward model. To compute keypoint visibility, we will perform simple raycasting from each point to the camera, only marking visible those keypoints that have an uninterrupted path to the camera. Here’s is the resultant video. The point colors are just for visualization.
Here’s a video where I pan around the scene. The red track are the ground truth camera poses. Note the 4 points that appear larger than others, these are the points whose ground truth positions are given to the model to set the overall scale as described above when we specified the prior.
We can use our PPL and the synthetic data to visualize the observation noise. This is an important check, as if we assume too much noise, then the likelihood will be too uninformative, and we will not be able to generate inferences anywhere close to the ground truth.
To do this check we will use both the SetValues
and
Sample
effect handles to generate the observations given
the ground truth latents. This will give us a sense of how precisely the
observations are perceived by the model. This is a type of prior
predictive check, where we sample from (a part of) the prior to
verify that our prior generates samples which are reasonable. We’ll wrap
this up this procedure in a utility function:
def model_cond_sample(model_fn, value, seed):
with apply_handlers(
lambda e: e.value())
SetValues(value), Sample(seed), Collect(as trace:
) = model_fn()
res
return trace["results"], res
We then generate 10 samples from the noise distribution while using
observation_noise_scale=0.01
, and make the following
plot:
The white crosses are the noise-free data, and the red crosses are the noised observations. The noise clearly has an effect, but information (relative and absolute positions of the points) is mostly preserved.
Unlike the joint distribution, for most models we cannot directly compute the normalized posterior density and sample from it. Even restricting ourselves to merely generating samples from it does not simplify our job much. As a result, we are forced to resort to approximate methods. There is a plethora of concrete algorithms we can use, but the gold standard is Markov Chain Monte Carlo (MCMC). MCMC is a stateful algorithm that uses a transition kernel \(T(s^{(t + 1)} | s^{(t)})\) which samples a successor state \(s^{(t + 1)}\) given the current state \(s^{(t)})\). It turns out that its often possible to construct a \(T\) such that \(p(s^{(t + 1)})\) as \(t \rightarrow \infty\) approaches a distribution of your choice, the so called target distribution. Many MCMC algorithms only need the target distribution to be computed up to a normalizing constant, which means that it’s perfectly suited to our problem. We set the state to be our latents and choose the target distribution to be the unnormalized posterior distribution.
The initial state for MCMC can be sampled from any distribution, but it will take some number of steps before the bias from choosing a bad distribution decays enough to be dominated by the true target distribution variance. This phase of the markov chain is referred to as the “warmup phase”. We typically discard the states from this phase, as they are manifestly not distributed as the target distribution. That said, in some sense the only interesting things in MCMC sampling happen during this phase, so you shouldn’t discard the states without first doing some plots and diagnostics. One thing the states are used for is to adapt MCMC algorithm hyperparameters. MCMC algorithms typically don’t converge to the right distribution if you adapt hyperparameters like that, so the standard thing to do is to turn off the adaptation towards the end of the warmup phase. How do we know when the warmup phase is done? There exist a number of diagnostics, such as potential scale reduction \(\hat{R}\), which can be used to practically assess convergence.
MCMC is typically used to compute posterior averages: we generate a bunch fo states \(s^{(t)}\) and then average them to estimate the posterior expectation of \(s\). For visualization purposes, we don’t want to wait for multiple steps, as that is not sufficiently responsive for realtime interactivity. Instead, we will run multiple instances of the MCMC sampler (multiple chains) and compute the averages at the same values of \(t\), but across chains. We’ll use 4 chains which isn’t much, but enough to give us an idea of the uncertainty of \(s\). Using multiple chains is a good idea for utilizing the parallel compute capabilities of the GPU anyway (here’s a nice review of running MCMC on modern hardware).
Since all our variables are continuous and the joint density is differentiable (we made sure of this when constructing the forward model, the prior and the likelihood) we can use Hamiltonian Monte Carlo (HMC) as the MCMC method. HMC is a method of choice for posterior inference in these conditions due to its ability to scale to very high dimensional problems. HMC in general has three main hyperparameters, the integration time (or length), the integration step size (just “step size”) and the mass matrix.
As a very rapid introduction to how HMC works, it uses the gradient information of the target distribution to generate a proposed state, that is then either accepted or rejected using the Metropolis Hastings acceptance criterion. The integration length (up to a point) governs how far away the proposed state is from the current one (the farther away it is, the more efficient the MCMC algorithm is). The step size governs the quality of the numerical integration that is used to generate the proposal, higher quality requires a smaller step size which in turn leads to the proposal more likely to be accepted. Unfortunately, overly small step sizes are computationally expensive since many small steps need to be taken to generate a proposal and target distribution’s gradients need to be evaluated once for each of those steps. The mass matrix is too complex to explain in this article, but it acts as a pre-conditioner for the HMC algorithm.
There is a large literature for choosing these hyperparameters, but for exposition and simplicity we’ll use a variant which makes the following choices:
Overall, the practitioner will only need to set the mean number of leapfrog steps, which is both instructive and fun. For implementation, we’ll use the battle tested implementation from FunMC, which is a JAX-based library which is helpful for adhoc implementation of novel MCMC algorithms.
The posterior of this model is somewhat challenging for HMC due to the various non-linearities in the forward model. As a result, the chains can get “stuck”, where all the proposals end up being rejected causing the step size to be adapted to nearly zero. Luckily, this seems to mostly happen before the chains have yet to converge, so we are relatively free to use heuristics to solve this issue. A simple heuristic is to use the resampling procedure from Sequential Monte Carlo, and resample the chain states in accordance to their posterior log density. This is very much unprincipled, but it has the effect that stuck chains get discarded and replaced by chains that got lucky. This heuristic relies on the likelihood being relatively strong and prior being relatively weak, so stuck chains end up having very low log likelihoods.
One way to reduce the prevalence of stuck chains is to change how we feed data into the algorithm. The issue is that the initial state \(s\) could be very far from states which are typical for the posterior, which can cause all proposals to be rejected due to extremely inaccurate gradients. We need a way to effectively initialize the state. What works well is to do posterior inference while adding more and more frames to the likelihood. The state that targets the posterior for the first N frames ends up being a decent initialization for the posterior for N + 1 frames. One small difficulty is dealing with keypoints that were not visible during the prior frames. Additionally, this is also the case for all the keypoints for the first frame. A simple solution used in SFM is to unproject the keypoints using the camera pose. For the first frame, we are given the first camera pose. For the other frames, we use the previous frame’s camera pose (this assumes that the camera doesn’t move much between frames). The trick of copying the previous camera pose is also how we’ll initialize the camera pose for inference.
Everything that we’ve described so far can be coded into an offline inference procedure (and in fact, we’ll do that towards the end of the article), but that would not be that interesting. Instead, we’ll use Jupyter, interactive widgets and three.js to implement an interactive inference GUI. The user will add new frames, set the HMC hyperparameters all while observing how the inference is proceeding. The system will visualize the uncertainty by placing “blobs” that represent the posterior mean location and uncertainty of the 3D positions of the keypoints. These blobs are shown by computing the mean and covariance of keypoint locations across the latest state of the MCMC chains. The implementation is not super difficult, the primary difficulty is making sure we can handle errors since the computation is happening in a background thread. Here is a screenshot of the interactive UI:
At the top we have a bunch of controls that alter the inference (including the “angry” buttons that stop the inference, resample chains etc).
In the middle we have the inference display, consisting of a 3D view and a few traces. The 3D view shows uncertainty blobs for the inferred keypoint positions, and multiple possible camera frustums for the camera poses. The size of the yellow dots indicate the reconstruction error of a selected frame.
Lastly, on the bottom we have some textual output. You can play around with this simulation in the notebook. Here’s a video of me running the inference using this UI. I am running this on my AMD Radeon RX 7900 XTX. Don’t forget to turn on subtitles for this one (I didn’t feel like narrating it).
If you end up running this yourself, here is a few things to try:
Examine the effect of the observation_noise_scale
on
the quality of the reconstruction.
Try varying the number of leapfrog steps for HMC, fewer leapfrog steps simulate more quickly, but is it a worth it in terms of convergence speed?
Does incremental inference (using auto-advance) appear to reach convergence faster than adding all the frames at once?
It’s straightforward to run the inference without interactivity and then do offline analysis of the plots. Here is inference ran for 1500 steps, 500 of which are treated a warmup. We don’t use incremental inference here (for simplicity), so we need to occasionally resample stuck chains. Here is a plot of the traces of the HMC chain states:
The left column shows all the states, while the right column only
shows the states from the last 100 steps. As you can see, we
automatically resampled at step 100 (seen by the sudden jump in
target_log_prob
), but otherwise had no stuck chains. The
red lines show the ground truth values for the latents. We see that in
general infererred states are close to the ground truth, but not in all
cases. This could be indicative of multiple things, such as the
inference being broken (not impossible if the posterior geometry is
gnarly enough) or the prior not corresponding to the synthetic data
(this is more likely).
We can also diagnose convergence using the \(\hat{R}\) diagnostic. It works on multiple chains of scalars, so what we can do is compute it for all the scalar latents and elements of vector latents. Without getting into extensions of \(\hat{R}\), a set of MCMC chains is considered converged if \(\hat{R} < 1.01\) or even more stringently \(\hat{R} < 1.001\). Let us see what we actually get:
These plots are histograms, grouped by label with log counts on the y axis. For camera positions and quaternions we are close to being converged. Raw quaternions are not (as a reminder, “raw” quaternions are not normalized, and have the donut prior), which is not ideal but not super important: it means that HMC struggles exploring the ficticious quaternion norm degree of freedom that we introduced. The keypoint world positions are a bit more concerning, since they are something we’re actually interested in. One thing we can do to try to debug this is see if this is dependent on how many frames the keypoint is seen in.
Indeed, there is a strong dependence. We can see that for keypoints which are only seen from a few frames, HMC struggles to explore the full posterior, but for the rest, the convergence is far better. Typically, when you’re faced with HMC convergence issues (absent numerics issues), the solution is to either increase the number of leapfrog steps or to improve the geometry of the posterior through preconditioning. The latter is generally more efficient, but it can be difficult to construct an appropriate preconditioner. A less typical solution is to improve the prior, which would be a more “prior Bayesian” approach.
This article described a probabilistic model and inference of a structure from motion 3D reconstruction task. It also presented an interactive UI for this, enabled by Jupyter. The model and inference are both relatively simple, but there are some issues with convergence and overall speed. Should you use this in practice? The advantage of the method described here is that there are strong theoretic guarantees about its performance and very robust and reliable diagnostic methods to detect deviation from theory. If this is not important to you, there are many methods (some based on deep learning) which can be significantly faster. That said, it is possible that with more work we could make the method in this work perhaps 10x faster. This isn’t good enough for realtime application, but perhaps it could be used to validate faster methods.
Original content Copyright SiegeLord's Abode 2006-2025. All rights reserved.