This guide isn’t complete yet. I will keep updating it, adding new information about the other filters/techniques, and maybe expand on what I’ve already done, but for now this should be enough to get you to understanding the problem and the Kalman filter in the linear, all-Gaussian case, as well as the ensemble Kalman filter.

I will also add more on the linear Kalman filter soon by deriving it a second way (by leading with the assumption that we have a linear update and minimising variance). This will make the Kalman filter feel more natural, since from the Bayesian derivation it might seem like more of an algebraic coincidence than the natural way to go about things. This proof doesn’t use Bayes’ theorem but still arrives at the exact same formulae.

The Problem We are Trying to Solve

We are trying to model a dynamic system: a system that evolves over time. This could be almost anything, such as the weather, the position and velocity of an object, a population over time, or the volatility of a stock. Anything that changes over time. We are looking specifically at dynamic systems that are governed by, or can be modelled with, differential equations. We might also look at systems that aren’t described by differential equations and instead use a state transition matrix directly.

The problem is that we don’t know the initial state of the system exactly. We can make observations/measurements, but our observations aren’t exact; they have error/noise. (This could be due to a sensor only having finite precision.) So we can’t completely trust our measurements, and if we are slightly off, this error can compound over time as we make predictions, leaving us wildly off from the true value.

Another problem we have to deal with is that our model may not be perfect. Even if we knew the exact state of the system at some point, our model might not be able to perfectly predict the next true value, and again this error may compound over time. This could be because there are external conditions we can’t account for, or because the system is too complicated to model every single part, so we simplify a bit. E.g. we can’t model every single particle in a system: it is too complicated to deal with, we don’t have enough computation power, and we can’t make measurements of the whole system.

The Solution / Core Loop of Bayesian Data Assimilation

Instead of the model making predictions of a single value of the state, we model the state as a random variable and keep track of its probability distribution. That is, instead of predicting that the current state of the system is one specific value, we talk about the probability that the true state falls within some region, and we take the mean of this distribution as our guess for the true state. This also allows us to model how confident we are in the guess.

The two core functions of any such model are

  1. Prediction: how we take our current distribution estimate and get a new distribution of the state at some later time.
  2. Update: when we receive an observation of the state (which may not cover every variable we are trying to model), how do we update our distribution? To what extent do we prefer the noisy observation over our noisy prediction? How do we combine the two to get a more accurate estimate of the state?

Prerequisites

Brief Note on Vectors, Matrices and Linear Algebra

Vectors

The system that we are trying to model will have more than one variable. Instead of treating them as separate variables, we group them together as a vector. However, a vector is not “something with direction and magnitude”. That only applies to certain vector spaces that have geometric meaning; vectors are not inherently geometric objects. In the case of our project, the vectors we are dealing with are just the most natural mathematical way to group things, and they do not have any geometric meaning/properties unless specified.

The definition of a vector is much more abstract, but really it boils down to whether something can be added together and multiplied by a scalar (vector-times-vector multiplication is not required). Many different mathematical objects, such as functions, can be vectors. A vector space refers to a whole set of vectors, but don’t worry too much about the details of this. For the purpose of the project we will only be talking about vectors in $\mathbb{R}^n$, so they can be thought of as simply a list of real numbers. In code this will simply be a 1D array of floats (more specifically an np.array), and the dimension of the vector is just the length of the list. The set of $n$-dimensional vectors is denoted as $\mathbb{R}^n$. For a vector $v$ with 5 variables we can say that $v \in \mathbb{R}^5$.
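For instance, a minimal sketch of what this looks like in NumPy (the numbers are arbitrary):

```python
import numpy as np

# A vector in R^5: just a 1D array of floats.
v = np.array([1.0, 2.5, -0.3, 4.0, 0.0])

print(v.shape)   # (5,) -- the dimension n is the length of the array
print(v + v)     # vectors can be added together...
print(2.0 * v)   # ...and multiplied by a scalar
```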

Matrices

Matrices are a two-dimensional grid of numbers (the actual definition is a bit more abstract than this, but this definition is enough for the project). Generally this grid is square, but it can be non-square, although those only come up very briefly in the project with the observation map. We can have an $n \times m$ matrix ($n$ rows, $m$ columns) and we denote the set of these matrices as $\mathbb{R}^{n \times m}$. However $\mathbb{R}^{n \times m} \neq \mathbb{R}^{nm}$; it is also not an actual multiplication in the “exponent”, although the two spaces are isomorphic. It is often better to think of a matrix as a list of vectors: an $n \times m$ matrix can either be interpreted as $n$ different $m$-dimensional row vectors or $m$ different $n$-dimensional column vectors. In code this will be a 2D array of floats, again more specifically an np.array.

In this project matrices are used for three purposes

  1. As a function on vectors - focus on column vectors.
  2. A way to represent systems of equations - focus on row vectors.
  3. The covariance matrix - just a grid of numbers. Its matrix properties and the properties of its row and column vectors aren’t really relevant.

For the project, knowing much about matrices as functions or systems of equations isn’t too vital, but it is helpful in some cases. The most common occurrence of matrices is the covariance matrix. Matrices as functions on vectors come up quite a bit as the state transition matrix, but depending on the system we are modelling this is not always possible. Representing systems of equations with matrices is only applicable to a very specific type of equation system and in most cases won’t work.

“Linear”

The word linear will come up a few times in this project. This does not refer to a linear polynomial $y = mx + c$. It refers to a linear function, which is a function $f: V \to W$ (where $V, W$ are vector spaces) that satisfies

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$
  2. $f(\alpha u) = \alpha f(u)$ for all scalars $\alpha$

This definition isn’t really important; all you really need to know is that linear refers to a “nice” type of function that can be represented as a matrix. If a linear function’s domain and codomain are the same vector space, such as $f: \mathbb{R}^n \to \mathbb{R}^n$, then $f$ is a linear transform.

Annoyingly, $y = mx + c$ isn’t actually linear with this definition; it’s affine (it is only linear when $c = 0$). But don’t really worry about this.

Matrices as functions

I won’t go into too much detail as it’s not really necessary, but matrices are linear transforms on vectors. Because of the properties of linear functions, they are completely determined by where the basis vectors end up after the transformation. For $\mathbb{R}^n$ these are the unit vectors $e_1, \dots, e_n$. A matrix’s column vectors are where each of the basis vectors end up after the transform. Matrix-vector multiplication is how these functions are applied. E.g. if $A \in \mathbb{R}^{n \times n}$ and $v \in \mathbb{R}^n$, then $Av$ is the result of applying the linear transform $A$ to the vector $v$. And matrix multiplication is function composition, so if $C = AB$ then $C$ is a new matrix and $Cv = A(Bv)$ is the composition of the functions $A$ and $B$. Like regular function composition it is read right to left: we apply $B$ first, then $A$.
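A small NumPy sketch of this, using two illustrative matrices (my own example, not from the project code):

```python
import numpy as np

# A 90-degree anticlockwise rotation of R^2, defined by where the basis
# vectors end up: e1 -> (0, 1) and e2 -> (-1, 0) are the columns of A.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
B = np.array([[2.0, 0.0],   # a scaling: x by 2, y by 3
              [0.0, 3.0]])

v = np.array([1.0, 1.0])

print(A @ v)        # apply the transform A to v
print(A @ (B @ v))  # apply B first, then A...
print((A @ B) @ v)  # ...which is the same as applying the composed matrix AB
```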

Matrices as Systems of Equations

The linear system of equations:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 &= b_1 \\ a_{21}x_1 + a_{22}x_2 &= b_2 \end{aligned}$$

is equivalent to this matrix equation:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \quad \text{i.e. } Ax = b$$

Notice how we have just taken each row of coefficients and put it in as a row vector. If we have a linear system of equations $Ax = b$, we can solve it by inverting the matrix to get $x = A^{-1}b$. Inverting a matrix by hand is complicated, but you don’t need to worry about it; in code it can be done with np.linalg.solve(A, np.eye(len(A))), np.linalg.inv(A), or np.linalg.pinv(A). solve has the best performance and stability, so use that unless you have reason to use something different.
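In practice, if you just want to solve $Ax = b$, you can pass b to solve directly rather than forming the inverse at all. A minimal sketch with made-up numbers:

```python
import numpy as np

# Solve the system Ax = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)         # preferred: fast and numerically stable
x_via_inv = np.linalg.inv(A) @ b  # works, but less stable

print(x)                      # [2. 3.]
print(np.allclose(A @ x, b))  # True
```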

Differential Equations

There are different types: ordinary differential equations (ODEs), partial (PDEs) and stochastic (SDEs). We will mainly be looking at ODEs for now. An ODE is an equation consisting of derivatives (and possibly higher-order derivatives, e.g. second and third derivatives, however I haven’t dealt with any of these in the context of Bayesian data assimilation yet). If we have an ODE in multiple variables, such as the Lorenz equations

$$\frac{dx}{dt} = \sigma(y - x), \qquad \frac{dy}{dt} = x(\rho - z) - y, \qquad \frac{dz}{dt} = xy - \beta z$$

we can group the variables into a vector $u = (x, y, z)^T$ and write the equation as

$$\frac{du}{dt} = f(u)$$

where $f$ is a vector function, e.g.

$$f(u) = \begin{pmatrix} \sigma(y - x) \\ x(\rho - z) - y \\ xy - \beta z \end{pmatrix}$$

And if the set of equations is linear then we can write the function as a matrix and get $\frac{du}{dt} = Au$.

For the project we won’t really be solving ODEs analytically (exactly); we will be approximating them numerically using methods like Euler’s method or RK4.
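As a sketch of the idea: one Euler step just moves the state a small amount along the derivative, $u_{k+1} = u_k + \Delta t\,f(u_k)$. A minimal NumPy version for the Lorenz system (the parameter values are the classic choices, an assumption here):

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz parameters

def lorenz(u):
    """Right-hand side f(u) of du/dt = f(u) for u = (x, y, z)."""
    x, y, z = u
    return np.array([SIGMA * (y - x),
                     x * (RHO - z) - y,
                     x * y - BETA * z])

def euler_step(u, dt):
    """One step of Euler's method: u_{k+1} = u_k + dt * f(u_k)."""
    return u + dt * lorenz(u)

u = np.array([1.0, 1.0, 1.0])
for _ in range(1000):          # integrate forward 10 time units
    u = euler_step(u, dt=0.01)
print(u)
```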

Continuous Random Variables

I’ve copied all of the following from my previous notes. You don’t need to learn all of this in detail; you just need to understand at a high level what a probability distribution is and what variance/covariance is, for both single and multiple variables. We don’t really care about actual probabilities (using integrals); we just care about the distribution.

Definition

A continuous random variable is a random variable which can take any real value in an interval. For a continuous random variable $X$, $P(X = x) = 0$ for any single value $x$. Instead probabilities are defined over an interval, e.g. for $a \le b$ we consider $P(a \le X \le b)$.

Probability Density Function

In order to do this we have a probability density function $f_X$ which satisfies

  1. $f_X(x) \ge 0$
  2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$ (as total probability sums to 1)

and probabilities are given by $P(a \le X \le b) = \int_a^b f_X(x)\,dx$.

Expected Value

The Expected Value, denoted $E[X]$, of $X$ is the mean value of $X$. It is defined/found as

$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$

This comes from a generalisation of the discrete case $E[X] = \sum_x x\,P(X = x)$.

Variance

Variance is defined as

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$$

Since $E[X]$ is the mean, $X - E[X]$ is the deviation of $X$ from the mean. So variance is the mean squared distance of $X$ from the mean. It is a measure of spread of the variable/data.

It has the properties that:

  1. $\mathrm{Var}(X) \ge 0$.
  2. It weights larger deviations more than small ones, hence the squaring.

We then define standard deviation as

$$\sigma = \sqrt{\mathrm{Var}(X)}$$

This takes us from squared distance back to distance, to get spread in the original units. However it is NOT quite the average distance, but we use it over the mean absolute deviation $E[\,|X - E[X]|\,]$ as the standard deviation has some nicer properties.

Dealing with Multiple Random Variables

When we have multiple random variables we consider them as a vector in $\mathbb{R}^n$, e.g. $X = (X_1, X_2, \dots, X_n)^T$. We can consider how individual variables in $X$ are distributed by just considering $X_i$ as a single continuous random variable. But also many concepts generalise nicely to all $n$ variables at once.

Joint Probability Density Function

Again, we don’t really care about actual probabilities or the integral that much, but for the sake of understanding how distributions work over a vector/multiple variables, and for completeness: a joint probability density function is a function $f_X(x_1, \dots, x_n)$ where

$$P(X \in R) = \int_R f_X(x_1, \dots, x_n)\,dx_1 \cdots dx_n$$

This means to integrate over a region/volume. Basically it takes the case of an integral finding an area under a curve over a range and generalises it to finding the $n$-dimensional volume over a region. Thinking of it too geometrically isn’t really helpful here though. It is just a generalisation of the idea that we can’t assign individual probabilities for a continuous variable and we have to talk about the probability that the variable lies within a given region. Here a region is just the generalisation of a range to $n$ variables.
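For our purposes, the features of a multivariate distribution we actually track are its mean vector and covariance matrix. A small sketch estimating both from samples (the toy correlated data is my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples of a 2-variable random vector; X2 is correlated with X1,
# so the off-diagonal entries of the covariance matrix are non-zero.
x1 = rng.normal(0.0, 1.0, size=10_000)
x2 = 0.5 * x1 + rng.normal(0.0, 0.5, size=10_000)
samples = np.stack([x1, x2])  # shape (2, 10000): rows are variables

print(samples.mean(axis=1))   # empirical mean of each variable
print(np.cov(samples))        # empirical 2x2 covariance matrix
```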

Bayes Theorem

For a hypothesis $H$ and evidence/data $E$, Bayes’ theorem states

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Terminology

  1. $P(H)$ is called the Prior. This is our current belief/hypothesis about $H$ before we see the data/evidence. It is our starting point.
  2. $P(E \mid H)$ is called the Likelihood. It is the probability of observing the evidence if our hypothesis were true. Effectively it measures how compatible our hypothesis/current beliefs are with the evidence we just observed.
  3. $P(E)$ is called the Evidence or Marginal Likelihood. This is the probability of observing $E$ across all possible hypotheses. It serves as a normalisation constant.
  4. $P(H \mid E)$ is called the Posterior. This is our updated belief in our hypothesis after accounting for the new evidence $E$.

Note: often Bayes’ Theorem is written as $P(H \mid E) \propto P(E \mid H)\,P(H)$ and we leave out the normalisation factor (the evidence), as it is often hard to compute, and because we don’t have to compute it if we have conjugacy, that is, when the posterior PDF is in the same family as the prior PDF but just with new parameter values.

This theorem is stated for specific events, but it can be easily generalised to full probability distributions by replacing all the probability functions $P$ with densities $p$. (Continuous distributions are always denoted by a lowercase $p$.)

Bayes’ theorem allows us to take our current belief about the state and, when we make a new observation, update our distribution to be more accurate according to the observation we just made.

Normal Distribution

The normal distribution is the most common distribution in the project. Don’t worry about knowing the actual function for it. All you need to know is that it is symmetric about the mean and the probability density exponentially decays as you move further from the mean; the (co)variance controls how fast this decays. It works with multiple variables using vectors and a covariance matrix. It is a very simple distribution that comes up all the time in the real world. A key property that makes it useful in this context is that applying a linear (or affine) transform to a normal distribution gives you another normal distribution.

It is denoted as $\mathcal{N}(\mu, \Sigma)$ where $\mu$ is the mean and $\Sigma$ is the covariance matrix. If we are only dealing with one variable, then the variance is the standard deviation squared, so we write $\mathcal{N}(\mu, \sigma^2)$. If we want the actual value (probability density) at a specific point we can write $\mathcal{N}(x; \mu, \Sigma)$: basically $x$ is the value we are plugging into the function, $\mu, \Sigma$ are the parameters, and the semicolon just serves to separate them, making it clear which is which. The normal distribution is often called a Gaussian. These are the exact same thing.
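A quick sketch checking the linear/affine property empirically (the parameters and matrices here are arbitrary assumptions): if $x \sim \mathcal{N}(\mu, \Sigma)$ and $y = Ax + b$, then $y \sim \mathcal{N}(A\mu + b,\, A\Sigma A^T)$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = x @ A.T + b  # apply the affine map to every sample

# y is again Gaussian, with mean A mu + b and covariance A Sigma A^T:
print(y.mean(axis=0), A @ mu + b)    # empirical vs exact mean
print(np.cov(y.T))                   # empirical covariance...
print(A @ Sigma @ A.T)               # ...vs the exact covariance
```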


Setup and Notation

Model Equation

The state vector will be denoted $u$, and $u_k$ is the state vector at time $t_k$. Remember this is a random variable, not a single value. The number of variables/the state dimension is $n$.

We will have a model $\Psi$ that takes in the previous state and predicts the next state:

$$u_{k+1} = \Psi(u_k) + \xi_k, \qquad \xi_k \sim \mathcal{N}(0, Q)$$

$u_k$, $u_{k+1}$ and $\xi_k$ are all random variables here. $\Psi$ is the function that predicts the next state given the current one. This also makes a key assumption called the Markov assumption, which is $p(u_{k+1} \mid u_1, \dots, u_k) = p(u_{k+1} \mid u_k)$. This means that the next state is only dependent on the previous state and nothing else.

The term $\xi_k$ is a zero-mean Gaussian. It doesn’t affect the mean of $u_{k+1}$; it serves to spread out the distribution of $u_{k+1}$, as the model may not be perfect, so after each model step we may want to increase the variance on our guess to represent growing uncertainty. If you want to set model uncertainty to zero you should instead make it really small, e.g. Q = 1e-7 * np.eye(NUM_VARIABLES) * TIME_STEP, as making it exactly zero can cause errors.

When modelling ODEs with discrete steps, $Q$ is timestep dependent, so in code you should have Q = ... * TIME_STEP.

You can also think of this equation as telling us that, for a fixed $u_k$, $u_{k+1}$ is modelled by a normal with mean $\Psi(u_k)$ and variance $Q$, which is the variance we gain each model step due to model limitations. This can also be written as $u_{k+1} \mid u_k \sim \mathcal{N}(\Psi(u_k), Q)$. This just means that the probability distribution of $u_{k+1}$ given a fixed $u_k$ is modelled by this normal.

Just for extra detail: if we explicitly want to get our new distribution we use the Chapman-Kolmogorov equation (only valid under the Markov assumption):

$$p(u_{k+1}) = \int p(u_{k+1} \mid u_k)\,p(u_k)\,du_k$$

This is a generalisation of the discrete case $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$. Basically, across all possible $u_k$, sum $p(u_{k+1} \mid u_k)$ weighted by the probability of $u_k$ actually occurring. Replacing the probabilities with densities and replacing the discrete sum with an integral gets us the exact same form as above.

We almost never actually use this integral directly. We either use it implicitly (like with the linear Kalman filter) or approximate it directly or indirectly with Monte-Carlo sampling in the Ensemble Kalman Filter.

Observations

Don’t worry about this too much, but: when we make observations of the state we write them as

$$y_k = h(u_k) + \eta_k, \qquad \eta_k \sim \mathcal{N}(0, R)$$

The intuition for what this equation represents is similar to the model equation. However, even though $y_k$ is a random variable, we only get a single value of it when we make an observation.

The $\eta_k$ just represents the noise in our measurements. $h$ is the observation map, and it tells us, given a true state, what vector we actually measure. It can be non-linear, but for now we will assume that we can either observe a variable directly or not observe it at all. (In more complex cases you can only observe a non-linearly transformed version of the variable.) So $h$ is a linear function and is therefore a matrix, although generally a non-square one. (We now denote it with a regular $H$ as it is a matrix.) All you need to know about it is that when we can’t measure all variables and we want to perform an update step, we need to map our state into the observation space. Basically we apply $H$, and now we have a vector that is of the same format/variables as the measurement, and we can perform calculations between our observation and our current state guess.
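For example, a sketch of a direct-observation $H$ for a 3-variable state where only the first and third variables are measured (an assumed setup):

```python
import numpy as np

# State is (x, y, z) but we can only observe x and z.
# Each row of H picks out one observed variable.
H = np.array([[1.0, 0.0, 0.0],   # observe variable 1 (x)
              [0.0, 0.0, 1.0]])  # observe variable 3 (z)

u = np.array([1.2, -0.7, 3.4])   # some state vector
print(H @ u)                     # [1.2 3.4] -- the state in observation space
```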


Kalman Filter

The Kalman Filter is a subset of Bayesian Data Assimilation, but it covers both propagating our distribution through the model and the update step with Bayes’ theorem. It is enough on its own to work for the project. We can also look at particle filters, which replace Kalman filters, or 4D variational methods, which can be built on top of Kalman filters (although I don’t really know anything about these).

There are many variations on the Kalman filter: extended, unscented, ensemble, etc. They achieve the exact same thing as a regular Kalman filter, but they deal with more complex cases and different assumptions, e.g. non-linear models and/or non-Gaussian distributions. They do this in varying ways.

Linear Kalman Filter

This is the original Kalman Filter. It assumes that every distribution is Gaussian and the model is linear.

Propagation

For our model equation we have

$$u_{k+1} = M u_k + \xi_k, \qquad \xi_k \sim \mathcal{N}(0, Q)$$

where the model is now a (linear) state transition matrix $M$.

The nice thing about every distribution being Gaussian is that if a random variable is modelled by a Gaussian, then any linear (or affine) transformation of that variable will also be modelled by a Gaussian. So if $u_k \sim \mathcal{N}(m_k, C_k)$ ($C$ is used to denote our guess covariance), we can apply a simple formula to get $M u_k \sim \mathcal{N}(M m_k, M C_k M^T)$. Then we have that $u_{k+1} = M u_k + \xi_k$. If we have a random variable that is the sum of two independent normally distributed variables, then the resultant distribution is a simple Gaussian where you add the means and covariances of each, so we have $u_{k+1} \sim \mathcal{N}(M m_k, M C_k M^T + Q)$. So

$$\hat{m}_{k+1} = M m_k, \qquad \hat{C}_{k+1} = M C_k M^T + Q$$

where the hats denote our predicted (prior) mean and covariance before seeing the next observation.
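A minimal sketch of this predict step in NumPy (the helper name and the example $M$, $Q$ and initial values are my own assumptions):

```python
import numpy as np

def kalman_predict(m, C, M, Q):
    """Propagate the Gaussian N(m, C) through the linear model u -> M @ u."""
    m_hat = M @ m              # predicted mean
    C_hat = M @ C @ M.T + Q    # predicted covariance, inflated by model noise Q
    return m_hat, C_hat

# Example with made-up numbers: a 2-variable state.
M = np.array([[1.0, 0.1],
              [0.0, 1.0]])
m = np.array([0.0, 1.0])
C = 0.1 * np.eye(2)
Q = 1e-4 * np.eye(2)
print(kalman_predict(m, C, M, Q))
```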

Update

When we receive an observation, we want to update our distribution with Bayes’ theorem in the following form:

$$p(u_{k+1} \mid y_{1:k+1}) = \frac{p(y_{k+1} \mid u_{k+1})\,p(u_{k+1} \mid y_{1:k})}{p(y_{k+1} \mid y_{1:k})}$$

Here the colon in the subscript denotes a list, e.g. $y_{1:k} = (y_1, y_2, \dots, y_k)$. Identifying the 4 key components:

  1. Posterior - $p(u_{k+1} \mid y_{1:k+1})$. This is basically updating our distribution on the state to include the information gained by our measurements.
  2. Likelihood - $p(y_{k+1} \mid u_{k+1})$. This is the probability of our measurement if $u_{k+1}$ were the real state. It comes directly from the observation model/noise: $y_{k+1} = H u_{k+1} + \eta_{k+1}$. Therefore $p(y_{k+1} \mid u_{k+1}) = \mathcal{N}(y_{k+1}; H u_{k+1}, R)$.
  3. Prior - $p(u_{k+1} \mid y_{1:k})$. This is the probability distribution we had before seeing the current measurement. It was the result of propagating the posterior at $t_k$ through our model, as we saw earlier. Our current belief is modelled by $\mathcal{N}(\hat{m}_{k+1}, \hat{C}_{k+1})$.
  4. Evidence - $p(y_{k+1} \mid y_{1:k})$. The probability of observing $y_{k+1}$ under the model. In practice we often don’t compute/care about this.

Substituting in our distributions and ignoring the normalising evidence constant we get

$$p(u_{k+1} \mid y_{1:k+1}) \propto \mathcal{N}(y_{k+1}; H u_{k+1}, R)\;\mathcal{N}(u_{k+1}; \hat{m}_{k+1}, \hat{C}_{k+1})$$

The following is lots of algebra; you don’t really need to know this, as we have an exact formula for the above expression so we can easily find our posterior, but for completeness here is some proof. Also note that we are actually multiplying the probability density functions here; this is NOT the same as finding the product of two normally distributed variables. The product of two normal PDFs is proportional to a normal PDF, so using this fact we don’t have to compute this directly and we can use the formula

$$\mathcal{N}(x; m_1, C_1)\,\mathcal{N}(x; m_2, C_2) \propto \mathcal{N}(x; m_3, C_3)$$

where

$$C_3 = (C_1^{-1} + C_2^{-1})^{-1}, \qquad m_3 = C_3\,(C_1^{-1} m_1 + C_2^{-1} m_2)$$

However, our two normals are currently in different variables ($y_{k+1}$ and $u_{k+1}$). But we can get them in the same variable by, instead of viewing $\mathcal{N}(y_{k+1}; H u_{k+1}, R)$ as a function in $y_{k+1}$ for a fixed $u_{k+1}$, viewing it as a function in $u_{k+1}$ for a fixed $y_{k+1}$. Then with some complicated algebra we get that, as a function of $u_{k+1}$, it is proportional to a Gaussian PDF in $u_{k+1}$, so now we have a product of two Gaussian PDFs in the same variable and can apply the multiplication formula above.

Putting this together we get that $C_{k+1} = (\hat{C}_{k+1}^{-1} + H^T R^{-1} H)^{-1}$ and $m_{k+1} = C_{k+1}(\hat{C}_{k+1}^{-1}\hat{m}_{k+1} + H^T R^{-1} y_{k+1})$. So we have a closed formula for the new mean and covariance once we receive an observation. Here we can do some more algebra and get the following:

$$d_{k+1} = y_{k+1} - H\hat{m}_{k+1}, \qquad S_{k+1} = H \hat{C}_{k+1} H^T + R$$

$$K_{k+1} = \hat{C}_{k+1} H^T S_{k+1}^{-1}$$

$$m_{k+1} = \hat{m}_{k+1} + K_{k+1}\,d_{k+1}, \qquad C_{k+1} = (I - K_{k+1} H)\,\hat{C}_{k+1}$$

The innovation vector $d_{k+1}$ is effectively the error in our prediction. Since it is the difference of two normally distributed variables, it is modelled by a normal itself, that is $d_{k+1} \sim \mathcal{N}(0, S_{k+1})$. However, we only receive one observation, so we only have a single value for it, not the whole distribution. We can see $S_{k+1}$ is the sum of two terms: $H\hat{C}_{k+1}H^T$ is our prior uncertainty (projected into measurement space), meaning we are uncertain about what the error is because we are uncertain in our prediction; and the term $R$ reflects our observation uncertainty. Then we have the Kalman gain matrix $K_{k+1}$, which is essentially a measure of how much we trust the model vs how much we trust the measurement. It is calculated from the prior uncertainty covariance and the measurement uncertainty covariance, and we update our mean and covariance accordingly. Note that this form using the Kalman gain is algebraically equivalent to how we derived it from Bayes’ theorem. In this specific case you could get away without the “intermediate step” of the Kalman gain and just use the Gaussian multiplication formula we found above. However, the idea of the Kalman gain is used in the other variations of the Kalman filter, and it also gives nice intuition for how the filter works.
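A minimal sketch of the update step in NumPy, following the formulas above (the helper name is hypothetical):

```python
import numpy as np

def kalman_update(m_hat, C_hat, y, H, R):
    """Update the prior N(m_hat, C_hat) with an observation y of noise cov R."""
    d = y - H @ m_hat        # innovation: the error in our prediction
    S = H @ C_hat @ H.T + R  # innovation covariance
    # Kalman gain K = C_hat H^T S^-1; solve is more stable than inv.
    K = C_hat @ H.T @ np.linalg.solve(S, np.eye(S.shape[0]))
    m = m_hat + K @ d                          # posterior mean
    C = (np.eye(len(m_hat)) - K @ H) @ C_hat   # posterior covariance
    return m, C
```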

This exact formula can also be derived by first assuming that we have a linear update, e.g. $m_{k+1} = \hat{m}_{k+1} + K(y_{k+1} - H\hat{m}_{k+1})$, and then finding the optimal $K$ by minimising the posterior covariance.

Ensemble Kalman Filter

The ensemble Kalman filter is one way to use the Kalman filter we’ve just derived but with different assumptions. The ensemble filter works with non-linear models, and since non-linearities do not preserve Gaussians, our distributions will have to be non-Gaussian. However, we will still model observation noise and model uncertainty as Gaussians; it’s just that our actual state estimate distribution can’t be Gaussian.

Propagate

To get around this, instead of transforming a whole distribution, we take random samples (called particles) from our distribution. Now that we have an array of single fixed values (particles) of possible state vectors, we apply our state transition to each of them individually to get a new set of particles.

But since we only have finitely many particles, we can only approximate the features of our distribution. Instead of having one mean that we can nicely propagate through the model like in the linear Kalman filter, we have an array of particles, and we take their mean, which is simply a discrete arithmetic mean.
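A minimal sketch of the propagate step (the function name is hypothetical; `step` is any one-step model, e.g. an Euler or RK4 step of the ODE, and particles are stored as rows):

```python
import numpy as np

rng = np.random.default_rng(0)

def enkf_predict(particles, step, Q):
    """Propagate each particle through the (possibly non-linear) model `step`,
    then add a draw of model noise N(0, Q) to each particle."""
    n = particles.shape[1]  # particles has shape (n_particles, n)
    propagated = np.array([step(p) for p in particles])
    noise = rng.multivariate_normal(np.zeros(n), Q, size=len(particles))
    return propagated + noise
```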

Update

For the update step we will use the linear Kalman filter like before, by treating our distribution as a Gaussian. To do this we simply compute the mean and covariance of the particles and then treat our distribution as though it were modelled by a Gaussian with that mean and covariance. To actually perform the update we compute the Kalman gain with the exact same formula, but using the covariance of the particle distribution as our prior covariance. Then for each particle $u^{(i)}$ we have

$$u^{(i)} \leftarrow u^{(i)} + K\,(y^{(i)} - H u^{(i)})$$

Here, although we have only one measurement $y$, we form a whole array of perturbed measurements with $y^{(i)} = y + \eta^{(i)}$, $\eta^{(i)} \sim \mathcal{N}(0, R)$. Basically, for each particle we add a little bit of noise to the observation we received, using the distribution that we believe the observation noise follows. The reason we inject this noise is that otherwise we would be updating the particles to have the correct mean, but we need some way for the particle distribution’s covariance to account for the observation covariance, so we add random noise to reflect this. Without it there is no way for the ensemble to “see” the randomness in the measurements.
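A minimal sketch of this perturbed-observation update (the function name is hypothetical; particles are stored as rows):

```python
import numpy as np

rng = np.random.default_rng(0)

def enkf_update(particles, y, H, R):
    """Treat the particle cloud as a Gaussian and apply the Kalman gain to
    each particle, pairing it with a perturbed copy of the observation."""
    C = np.cov(particles.T)  # empirical prior covariance of the particles
    S = H @ C @ H.T + R
    K = C @ H.T @ np.linalg.solve(S, np.eye(S.shape[0]))
    # One perturbed observation y + eta_i per particle, eta_i ~ N(0, R):
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=len(particles))
    innovations = y_pert - particles @ H.T  # one innovation per particle
    return particles + innovations @ K.T
```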