This guide isn’t complete yet. I will keep updating it, adding new information about the other filters/techniques, and maybe expand on what I’ve already done, but for now this should be enough to get you up to speed on the problem, the Kalman filter (KF) in the linear, all-Gaussian case, and the ensemble Kalman filter (EnKF).


The Problem We are Trying to Solve

We are trying to model a dynamic system - a system that evolves over time - and make predictions about the state of the system. This could be anything really, such as the weather, the position and velocity of an object, a population over time, or the volatility of a stock. Just anything that changes over time. We are looking specifically at dynamic systems that are governed by/can be modelled with differential equations. However, all these techniques also apply to dynamic systems that are modelled without an ODE.

The problem is that we don’t know the exact state of the system. We can make observations/measurements, but these aren’t exact as they have error/noise (this could be due to a sensor only having finite precision, for example). So we can’t completely trust our observations, and if we are slightly off, this error can compound over time and the predictions we make could become wildly off from the true value.

Another problem that we have to deal with is that our model may not be perfect and/or the system may be stochastic (random/non-deterministic). Even if we did know the exact state of the system at some point, our model might not be able to perfectly predict the true value, and again this error may compound over time. This could be because there are external conditions we can’t account for, or because the system is too complicated to model every single part, so we simplify a bit. E.g. we can’t model every single particle in a system: it is too complicated to deal with, we don’t have enough computational power, and we can’t make measurements of the whole system.

The Solution / Core Loop of Bayesian Data Assimilation

Instead of the model making predictions of a single value of the state, we model the state as a random variable and keep track of its probability distribution. That is, instead of predicting that the current state of the system is a specific value, we talk about the probability that the true state falls within some region. We take the mean of this random variable as our guess for the true state. This also allows us to model how confident we are in the guess.

The two core functions of any model are:

  1. How we take our current distribution estimate and get a new distribution of the state at some later time. We will refer to this as the propagation step.
  2. When we receive an observation of the state (which may not cover every variable we are trying to model), how do we update the distribution? To what extent do we prefer the noisy observation over our noisy prediction? How do we combine these two to get a more accurate estimate of the state? We will refer to this as the update step.
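As a sketch of this core loop (all names and the toy 1D model here are made up for illustration; the concrete propagate/update rules are derived later in the guide):

```python
import numpy as np

def run_filter(belief, observations, propagate, update):
    """Generic Bayesian data assimilation loop: alternate propagation and,
    when a measurement arrives, an update. `belief` is whatever object
    represents the current distribution, e.g. a (mean, variance) pair."""
    history = []
    for y in observations:
        belief = propagate(belief)        # step the distribution forward in time
        if y is not None:                 # measurements may not arrive every step
            belief = update(belief, y)    # fold the noisy observation in
        history.append(belief)
    return history

# Tiny 1D illustration: random-walk model, direct noisy observation.
def propagate(b):
    mean, var = b
    return mean, var + 0.1               # process noise inflates variance

def update(b, y):
    mean, var = b
    gain = var / (var + 0.5)             # how much we trust the observation
    return mean + gain * (y - mean), (1 - gain) * var

beliefs = run_filter((0.0, 1.0), [1.0, None, 2.0], propagate, update)
print(len(beliefs))  # 3
```

Note the toy update is exactly the scalar weighted-average idea that the KF update section makes precise.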

Prerequisites

Brief Note on Vectors, Matrices and Linear Algebra

Vectors

The system that we are trying to model will have more than 1 variable. Instead of treating them as separate variables we group them together as a vector. A vector is not “something with direction and magnitude”: this only applies to certain vector spaces that have geometric meaning, but vectors are not inherently geometric objects. In the case of our project, the vectors we are dealing with are just the most natural mathematical way to group things and they do not have any geometric meaning/properties unless specified.

For the purpose of the project we will only be talking about vectors in $\mathbb{R}^n$, so they can be thought of as simply a list of real numbers. In code this will simply be a 1D array of floats (more specifically an np.array), and the dimension of the vector is just the length of the list. The set of $n$-dimensional vectors is denoted as $\mathbb{R}^n$. For a vector $v$ with 5 variables we can say that $v \in \mathbb{R}^5$.

Matrices

Matrices are a two dimensional grid of numbers (the actual definition is a bit more abstract than this, but this definition is enough for the project). Generally this grid is square but it can be non-square, although those only come up very briefly in the project with the observation map. We can have an $m \times n$ matrix ($m$ rows, $n$ columns) and we denote the set of these matrices as $\mathbb{R}^{m \times n}$ (not an actual multiplication in the “exponent”). However, it is often better to think of a matrix as a list of vectors. An $m \times n$ matrix can be interpreted either as $m$ different $n$-dimensional row vectors or as $n$ different $m$-dimensional column vectors. In code this will be a 2D array of floats, again more specifically an np.array.

In this project matrices are used for three purposes

  1. As a function on vectors - focus on column vectors.
  2. A way to represent systems of equations - focus on row vectors.
  3. The covariance matrix - just a grid of numbers. Its matrix properties and the properties of its row and column vectors aren’t really relevant.

For the project knowing much about matrices as functions or systems of equations isn’t too vital but it is helpful in some cases. The most common occurrence of matrices is the covariance. Matrices as functions on vectors comes up quite a bit as the state transition matrix but depending on the system we are modelling it is not always possible. Representing systems of equations with matrices is only applicable with a very specific type of equation system and for most cases won’t work.

“Linear”

The word linear will come up a few times in this project. This does not refer to a linear polynomial $y = mx + c$. It refers to a linear function, which is a function $f : V \to W$ (where $V, W$ are vector spaces) that satisfies

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$
  2. $f(cv) = c f(v)$ for all scalars $c$ and $v \in V$

This definition isn’t really too important; all you really need to know is that linear refers to a “nice” type of function and leads to simpler properties. A linear function can be represented as a matrix. If a linear function’s domain and codomain are the same vector space, such as $f : \mathbb{R}^n \to \mathbb{R}^n$, then $f$ is a linear transform.

Annoyingly, $y = mx + c$ isn’t actually linear with this definition; it’s affine (linear map + translation). These terms are often used interchangeably as they have similar properties, and many times when we say linear we also include affine.

Matrices as functions

I won’t go into too much detail as it’s not really necessary, but matrices are linear transforms on vectors. Because of the properties of linear functions, a linear transform is fully determined by where the basis vectors end up after the transformation. For $\mathbb{R}^n$ these are the unit vectors $e_1, \dots, e_n$. A matrix’s column vectors are where each of the basis vectors end up after the transform. Matrix-vector multiplication is how these functions are applied: if $A \in \mathbb{R}^{m \times n}$ and $v \in \mathbb{R}^n$ then $Av$ is the result of applying the linear transform $A$ to the vector $v$. And matrix multiplication is function composition, so if $C = AB$ then $C$ is a new matrix and is the composition of the functions $A$ and $B$. Like regular function composition it is read right to left: we apply $B$ first, then $A$.

Matrices as Systems of Equations

The linear system of equations:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= b_1 \\ &\ \,\vdots \\ a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n &= b_m \end{aligned}$$

Is equivalent to this matrix equation

$$Ax = b, \qquad A \in \mathbb{R}^{m \times n},\ x \in \mathbb{R}^n,\ b \in \mathbb{R}^m$$

Notice how we have just taken each row of coefficients and put them as a row vector of $A$. If we have a linear system of equations we can solve it by inverting the matrix to get $x = A^{-1}b$. In code you should try to not compute inverses directly and instead opt to use solve. If calculating $A^{-1}B$ or $A^{-1}v$ use np.linalg.solve(A, B) or likewise np.linalg.solve(A, v). If you are computing $AB^{-1}$ then use np.linalg.solve(B.T, A.T).T. Although np.linalg.inv(A) does exist you should never really use it; always prefer np.linalg.solve as it has better performance and stability.
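A quick sketch of the preferred pattern (the numbers here are made up):

```python
import numpy as np

# Toy system: 3x + y = 9, x + 2y = 8. Solve A x = b without forming A^{-1}.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)        # preferred: more stable and faster
x_inv = np.linalg.inv(A) @ b     # works, but avoid in real code

print(x)  # [2. 3.]
```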

Transpose

If we have a matrix $A$ then we can compute its transpose $A^T$, and it is just swapping the rows and columns of the matrix. It turns an $m \times n$ matrix into an $n \times m$ matrix. More formally $(A^T)_{ij} = A_{ji}$, where $A_{ij}$ is the element in the $i$th row and $j$th column. It also applies to row or column vectors by treating them as $1 \times n$ or $n \times 1$ matrices.

You will most commonly see it in matrix multiplications with forms like $A P A^T$; this is the matrix equivalent of scaling by a square, e.g. $a^2 p$ in the scalar case. You will also see, with a vector $v$, the product $v v^T$; this is a generalisation of $v^2$ in the scalar case. For an $n$-dimensional vector it produces an $n \times n$ matrix.
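A quick NumPy sketch of these shapes (the matrices are made up for illustration):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])

# Outer product v v^T: an n x n matrix generalising v^2.
outer = np.outer(v, v)
print(outer.shape)  # (3, 3)

# The "sandwich" form A P A^T, as used later for covariance propagation.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])   # 2x3, so the result is 2x2
P = np.eye(3)
S = A @ P @ A.T
print(S.shape)  # (2, 2)
```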

Differential Equations

There are different types: ordinary differential equations (ODEs), partial (PDEs) and stochastic (SDEs). We will mainly be looking at ODEs for now. An ODE is an equation consisting of derivatives (and possibly higher order derivatives, e.g. second and third derivatives, however I haven’t dealt with any of these in the context of Bayesian data assimilation yet). If we have an ODE in multiple variables, such as the Lorenz equations

$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$

We can group the variables into a vector $\mathbf{x} = (x, y, z)^T$ and write the equation as

$$\dot{\mathbf{x}} = f(\mathbf{x})$$

where $f$ is a vector function, e.g. for the Lorenz equations

$$f(\mathbf{x}) = \begin{pmatrix} \sigma(y - x) \\ x(\rho - z) - y \\ xy - \beta z \end{pmatrix}$$

And if the set of equations is linear then we can write the function as a matrix $A$ and get $\dot{\mathbf{x}} = A\mathbf{x}$.

For this project we will be discretising the ODE, meaning instead of looking at the whole continuous solution we care about the solution at discrete points of time. For the most part we won’t really be solving ODEs analytically (exactly); we will be approximating using methods like Euler’s method, RK4, or DOP853. The exception is linear ODEs, where we will use matrix exponentials.
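As an illustration, here is a minimal sketch of discretising the Lorenz system with Euler’s method and RK4 (the step size, initial condition, and the classic parameter values $\sigma = 10$, $\rho = 28$, $\beta = 8/3$ are assumptions for the example, not fixed by the guide):

```python
import numpy as np

# Lorenz-63 right-hand side with the classic parameters (an assumption).
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(x):
    return np.array([
        SIGMA * (x[1] - x[0]),
        x[0] * (RHO - x[2]) - x[1],
        x[0] * x[1] - BETA * x[2],
    ])

def euler_step(f, x, dt):
    # One step of Euler's method: x_{k+1} = x_k + dt * f(x_k)
    return x + dt * f(x)

def rk4_step(f, x, dt):
    # One step of classical RK4: a weighted average of four slope estimates.
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

x = np.array([1.0, 1.0, 1.0])
for _ in range(100):          # discretise: 100 steps of size 0.01
    x = rk4_step(lorenz, x, 0.01)
print(x.shape)  # (3,)
```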

Continuous Random Variables

You don’t need to learn all of this in detail; you just need to understand at a high level what a random variable and a probability distribution are, and what variance/covariance is. We generally don’t really care about evaluating any actual probabilities using integrals; we just care about the distribution/random variable as a whole.

Definition

A continuous random variable is a random variable which can take any real value in an interval. For a continuous random variable $X$, $P(X = x) = 0$ for any single value $x$. Instead probabilities are defined over an interval, e.g. for $a \le b$ we consider $P(a \le X \le b)$.

Probability Density Function

In order to do this we have a probability density function $f$ which satisfies

  1. $f(x) \ge 0$ for all $x$
  2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$ as total probability sums to $1$
  3. $P(a \le X \le b) = \int_a^b f(x)\,dx$

Expected Value

The Expected Value, denoted $E[X]$, of $X$ is the mean value of $X$. It is defined/found as

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$

This comes from a generalisation of the discrete case $E[X] = \sum_i x_i P(X = x_i)$.

Variance

Variance is defined as

$$\operatorname{Var}(X) = E\left[(X - E[X])^2\right]$$

Since $E[X]$ is the mean, $X - E[X]$ is the deviation of $X$ from the mean. So variance is the mean squared distance of $X$ from the mean. It is a measure of spread of the variable/data.

It has the properties that:

  1. $\operatorname{Var}(X) \ge 0$, with equality only if $X$ is constant.
  2. It weights larger deviations more than small ones, hence the squaring.

We then define standard deviation as

$$\sigma = \sqrt{\operatorname{Var}(X)}$$

to go from squared distance to distance, i.e. to get spread in the original units. However it is NOT quite the average distance $E[|X - E[X]|]$, but we use it over the average distance as the standard deviation has some nicer properties.

Dealing with Multiple Random Variables

When we have multiple random variables we consider them as a random vector in $\mathbb{R}^n$, e.g. $X = (X_1, \dots, X_n)^T$. We can consider how an individual variable $X_i$ in $X$ is distributed by just considering $X_i$ as a single continuous random variable. But many concepts also generalise nicely to all $n$ variables at once.

Joint Probability Density Function

Again we don’t really care about actual probabilities or the integral that much, but for the sake of understanding how distributions work over a vector/multiple variables, and for completeness: a Joint Probability Density Function is a function $f : \mathbb{R}^n \to \mathbb{R}$ where

$$P(X \in R) = \int_R f(x)\,dx$$

for a region $R \subseteq \mathbb{R}^n$.

This means we integrate over a region/volume. Basically it takes the case of an integral finding an area under a curve over a range and generalises it to finding the $n$-dimensional volume over a region. Thinking of it too geometrically isn’t really helpful here though. It is just a generalisation of the idea that we can’t assign individual probabilities for a continuous variable and we have to talk about the probability that the variable lies within a given region. Here a region is just the generalisation of a range to $n$ variables.

Bayes Theorem

Bayes’ theorem states

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where $H$ is our hypothesis and $E$ is the evidence/data we observe.

Terminology

  1. $P(H)$ is called the Prior. This is what our current belief/hypothesis about $H$ is before we see the data/evidence. It is our starting point.
  2. $P(E \mid H)$ is called the Likelihood. It is the probability of observing the evidence if our hypothesis was true. Effectively it measures how well our hypothesis/current beliefs are compatible with the evidence we just observed.
  3. $P(E)$ is called the Evidence or Marginal Likelihood. This is the probability of observing $E$ across all possible hypotheses. It serves as a normalisation constant.
  4. $P(H \mid E)$ is called the Posterior. This is our updated belief in our hypothesis after accounting for the new evidence $E$.

Note: often Bayes’ theorem is written as $P(H \mid E) \propto P(E \mid H)\,P(H)$ and we leave out the normalisation factor (the evidence), as it is often hard to compute, and we don’t have to compute it if we have conjugacy, that is, the posterior PDF is in the same family as the prior PDF just with new parameter values.

This theorem as stated is for one specific event, but it can easily be generalised to a full probability distribution by replacing all the probability functions $P$ with a density $p$. (Continuous distributions are always denoted by a lowercase $p$.)

Bayes theorem allows us to take our current belief and obtain a more accurate belief by incorporating new evidence.

Normal Distribution

The normal distribution is the most common distribution. Its PDF is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

You don’t need to know the actual function - it will only be used in some proofs. $\exp(x)$ just means $e^x$.

All you need to know is that it is symmetric about the mean and the probability density exponentially decays as you move further from the mean. The (co)variance controls how fast this decays. A key property that makes it useful in this context is that when applying a linear (or affine) transform to a normally distributed random variable, the resultant random variable will be normally distributed and the new mean and covariance can be easily calculated exactly.

It is denoted as $\mathcal{N}(\mu, \Sigma)$ where $\mu$ is the mean and $\Sigma$ is the covariance matrix. If we are only dealing with one variable then the variance is the standard deviation squared, so we write $\mathcal{N}(\mu, \sigma^2)$. If we want the actual value (probability density) at a specific point we can write $\mathcal{N}(x; \mu, \Sigma)$: basically $x$ is the value we are plugging into the function, $\mu$ and $\Sigma$ are the parameters, and the semi-colon just serves to separate them, making it clear which is which. The normal distribution is often called a Gaussian. These are the exact same thing.
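A small sketch of the affine-transform property: push samples of a Gaussian through $y = Ax + b$ and compare the sample statistics against the exact formulas $A\mu + b$ and $A\Sigma A^T$ (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, 0.0])

# Exact mean/covariance of the affine-transformed Gaussian y = A x + b
mu_y = A @ mu + b           # -> [2, -4]
Sigma_y = A @ Sigma @ A.T   # -> [[4, 3], [3, 4]]

# Monte Carlo check: sample statistics should match the exact values
xs = rng.multivariate_normal(mu, Sigma, size=100_000)
ys = xs @ A.T + b
print(np.allclose(ys.mean(axis=0), mu_y, atol=0.1))  # True
print(np.allclose(np.cov(ys.T), Sigma_y, atol=0.1))  # True
```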

Markov Assumption and Markov Chains

“A Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”. This effectively means that if we have some random sequence of events, such as moves in a game of chess, the next state only depends on the current state, meaning we don’t care about how we got to our current state (e.g. what moves were actually played) in order to predict the next state. Mathematically, if we have a sequence of random variables $X_0, X_1, X_2, \dots$ that form a Markov chain then $P(X_{k+1} \mid X_0, X_1, \dots, X_k) = P(X_{k+1} \mid X_k)$. Explicitly, the Markov assumptions we use are $p(x_k \mid x_{0:k-1}) = p(x_k \mid x_{k-1})$ and, for our observations, $p(y_k \mid x_{0:k}, y_{1:k-1}) = p(y_k \mid x_k)$. The subscript $0{:}k$ just means the sequence $x_0, x_1, \dots, x_k$.

Time Homogeneous

We will only be looking at time homogeneous systems. All this means is that the system has no dependence on absolute time, only on state. There is no absolute frame of reference, only relative.


Setup and Notation

General Setup

The current state at time $k$ will be denoted as $x_k$. It is a random vector. Although physically the TRUE state can only have one single value, we represent our uncertainty by always viewing the true state as a random variable. This is the key Bayesian way of thinking. More precisely, with the KF we keep track of the conditional distribution $p(x_k \mid y_{1:k})$.

If we are dealing with a system that does not evolve with time but just with discrete steps you may see $x_n$ instead. Likewise any subscript $k$ may be seen as $n$ instead.

Our estimate for the state will be denoted as $\hat{x}_k$. The hat ($\hat{\ }$) on any variable indicates it is an estimator. It is the single output value of the model, being the optimum estimate given our distribution. For the KF and its variants, and most other cases, it is defined as the mean of our distribution: $\hat{x}_k = E[x_k]$. In more complex cases the optimum estimate isn’t necessarily the mean, but for all our cases it is.

When talking about the state at time $k$ before and after an observation: $\hat{x}_k^+$ is after the observation (the posterior) and $\hat{x}_k^-$ indicates our estimate at time $k$ before the observation (the prior - conditioned on $y_{1:k-1}$ rather than $y_{1:k}$).

You may in other sources see $\hat{x}_{k|k}$ and $\hat{x}_{k|k-1}$ instead of $\hat{x}_k^+$ and $\hat{x}_k^-$ respectively; the subscript $k|k$ means at time $k$ given all observations up to and including time $k$. Likewise the subscript $k|k-1$ means at time $k$ given all observations up to and including time $k-1$.

Model Equation

Our model equation will take in some state and return the next state. It may also take in a control vector $u_k$; this is a deterministic input that we know and feed to the model. An example of where we would have this is if we are tracking an autonomous vehicle/robot and we are giving it commands. (Including $u_k$ doesn’t complicate any maths, it only affects the mean propagation.)

Our model equation is

$$x_{k+1} = f_k(x_k) + g_k(u_k) + w_k, \qquad w_k \sim \mathcal{N}(0, Q_k)$$

Again, $x_k$ here is a random variable, not a single value. $f_k$ is our state transition model, $g_k$ is the control input model, $u_k$ is the control vector. $w_k$ is process noise: how we model uncertainty growing due to model imperfections and/or the system being stochastic. $Q_k$ is the process noise covariance at time $k$; however, often the process noise doesn’t change over time and $Q$ is constant.

Since $w_k$ is zero mean it has no effect on the mean of $x_{k+1}$. Keep in mind that our state transition, control model and process noise covariance are all allowed to vary over time; however, often in practice they stay the same and the subscript may be omitted.

For the regular KF our model needs to be linear (or affine, which can happen if you include the control vector). So we get the model equation

$$x_{k+1} = F_k x_k + B_k u_k + w_k$$

The control model can be non-linear, as we only require linearity in $x_k$. However this only works if our control vector is not random; sometimes we can have a random control vector, and in this case $g_k$ needs to be linear. In this case you may see it written as a matrix $B_k$ (or $B$). The control input is always deterministic and does not depend on the state, so no matter the state the same control input gives the same additive term on the state.

However there are some cases where the control input model can depend on the state, e.g. we get $x_{k+1} = F_k x_k + g(x_k, u_k) + w_k$, but this won’t work with the regular KF. In general I haven’t looked into having the control vector, but it doesn’t really change anything else equation-wise whether you have it or not, so we will mostly be ignoring it for now. But if you do want to include a deterministic control vector it is as simple as just adding the $B_k u_k$ term to the mean propagation, and everything else stays the same. If you want the control vector to be non-deterministic/a random variable, other things might change.

Observations

We can make observations about our system and we denote an observation at time $k$ as $y_k$. It has dimension $m$, often with $m \le n$ (the state dimension). Before we actually make the observation it is a random variable, as our observations aren’t exact. But we will only receive one value for the observation; this is our realised value and it is what the KF conditions on. Our observations won’t be in the exact same form as the state vector, as we may not observe every variable in the state, or we may observe only a transformed/related version of a variable, e.g. we are trying to predict velocity but can only observe speed. So we have the observation equation

$$y_k = h_k(x_k) + v_k, \qquad v_k \sim \mathcal{N}(0, R_k)$$

Again to emphasise, we are treating $x_k$ as a random vector of the true state here. $h_k$ is the observation map: it is a function that maps state space to observation space, e.g. given the true state, what observations do we actually make? It can be non-linear, but for now we will assume that if we can observe a variable we can observe it directly, meaning $h_k$ can be represented as a non-square matrix $H_k$ (or the square identity matrix if we observe every variable at once). Again $H_k$ can vary with time but the subscript is often omitted. We will be dealing with time varying $H_k$ as we may be observing different variables at different times. Measurements may not be made every step.

$H$ is used so we can transform our estimate into the same form as the observation - what observations would our estimate give - so the terms become compatible and we can compute the innovation $y_k - H\hat{x}_k^-$.


Kalman Filter (KF)

The Kalman filter covers both propagation and update. It requires certain assumptions/properties, those being that the model is linear/affine and that all probability distributions are Gaussian. It also requires the observation map to be linear. The Gaussians allow us to keep track of just the mean and covariance, and the calculations on the distribution are all exact; the linear model is required to preserve Gaussian distributions, as propagating a Gaussian through a linear model gives us back a Gaussian. So we have $x_k \sim \mathcal{N}(\hat{x}_k, P_k)$ where $P_k$ is the covariance of our distribution. Like before we may want to differentiate between our covariance at time $k$ before and after an update/observation. We will use $P_k^+$ to denote after and $P_k^-$ to denote before. Again in other sources you may see $P_{k|k}$ and $P_{k|k-1}$.

Propagation

The propagation is fairly simple. We have our distribution on the state and we put it through the model equation to get

$$x_{k+1} = F x_k + B u_k + w_k$$

Again, since $x_k$ was Gaussian, due to the affine nature of our model $x_{k+1}$ must be too. To calculate our estimate of this new distribution we have

$$\hat{x}_{k+1}^- = E[x_{k+1}] = E[F x_k + B u_k + w_k] = F E[x_k] + B u_k + 0 = F \hat{x}_k^+ + B u_k$$

The key part of the KF that made this so simple is that $F$ is linear, so we could pull it out of the expectation. And $B u_k$ (being the affine part) is deterministic, so its expectation is itself. Also key is the noise being zero mean, so the expectation makes it vanish.

Propagating the covariance is simple as we have the distribution: we just calculate $\operatorname{Cov}(x_{k+1}) = E\left[(x_{k+1} - \hat{x}_{k+1}^-)(x_{k+1} - \hat{x}_{k+1}^-)^T\right]$, substituting in $x_{k+1} = F x_k + B u_k + w_k$ and $\hat{x}_{k+1}^- = F \hat{x}_k^+ + B u_k$, and therefore $x_{k+1} - \hat{x}_{k+1}^- = F(x_k - \hat{x}_k^+) + w_k$. Performing the expectation/matrix algebra we get

$$P_{k+1}^- = F P_k^+ F^T + Q$$
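The propagation step can be sketched in a few lines (the constant-velocity model here is a made-up example):

```python
import numpy as np

def kf_propagate(x_hat, P, F, Q, B=None, u=None):
    """One KF propagation step: mean through the model, covariance F P F^T + Q."""
    x_pred = F @ x_hat
    if B is not None and u is not None:
        x_pred = x_pred + B @ u
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

# Toy constant-velocity model: state = [position, velocity], dt = 1
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Q = 0.01 * np.eye(2)
x_hat = np.array([0.0, 1.0])
P = np.eye(2)

x_pred, P_pred = kf_propagate(x_hat, P, F, Q)
print(x_pred)  # [1. 1.]
```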

Update

There are multiple ways to derive the following KF equations we will start with MMSE - minimum mean square error - as it is much more intuitive and motivates the why/how better. Then we will derive it again with Bayes’ theorem.

MMSE Derivation

When we receive an observation we have two differing estimates of the state: one being our current estimate $\hat{x}_k^-$ and the other being our realised measurement $y_k$. We want to combine them in a way that gives the optimum estimate of the true state.

Just for now, for ease and clarity, we will ignore the observation map and assume $y_k$ measures all the variables of $x_k$. We will combine the two by taking a weighted average $\hat{x}_k^+ = (I - K)\hat{x}_k^- + K y_k$. Here $K$ is our weight and is called the Kalman gain. $K$ is different each update step so you may see $K_k$, but the subscript is often omitted. This equation only covers the mean; however, we can update the whole random variable with

$$x_k^+ = (I - K)x_k^- + K y_k$$

This is more commonly written as $\hat{x}_k^+ = \hat{x}_k^- + K(y_k - \hat{x}_k^-)$. The vector $y_k - \hat{x}_k^-$ is called the innovation vector and is the difference between our estimate and the observation; it is also generally given its own symbol, commonly $d_k$. In this form we have $\hat{x}_k^+ = \hat{x}_k^- + K d_k$. Effectively we can think of the innovation as the amount of new information and the gain as how sensitive we are to new information.

This innovation form is always used because, when we introduce the observation map, $y_k$ and $\hat{x}_k^-$ may not be in the same form/have the same dimension, so taking an average of them won’t work directly; hence we prefer the innovation-gain form. The weighted average is only for intuition (though when $H = I$ the two forms are algebraically equivalent).

Introducing back the observation map we get the equation

$$\hat{x}_k^+ = \hat{x}_k^- + K\left(y_k - H\hat{x}_k^-\right)$$

The goal of the KF is: how do we find the optimum $K$ given our current knowledge? The answer is that we pick $K$ such that the resultant distribution has minimal (co)variance (in practice this means the minimal trace of the covariance matrix).

In order to go from just the mean update equation we have right now to the whole distribution update, we just replace the estimate with the random variable, so we get

$$x_k^+ = x_k^- + K\left(y_k - H x_k^-\right)$$

Here we are treating $y_k$ as a random variable, not just the realised value we receive. Then we want to find the covariance of this distribution

$$P_k^+ = \operatorname{Cov}(x_k^+) = (I - KH)\,P_k^-\,(I - KH)^T + K R K^T$$

We want to find $K$ such that $\operatorname{tr}(P_k^+)$ is minimal. This involves taking partial derivatives with respect to the matrix $K$, but after all that we find that

$$K = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1}$$

Now that we’ve found $K$ we can update our mean with the linear update $\hat{x}_k^+ = \hat{x}_k^- + K(y_k - H\hat{x}_k^-)$; here for $y_k$ we use the single realised value that we received as our observation. We then update our covariance using $P_k^+ = (I - KH)P_k^-$ and we are done with the update step.
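The update step can be sketched as follows (the numbers are made up; note the gain is computed with np.linalg.solve rather than an explicit inverse, as advised earlier):

```python
import numpy as np

def kf_update(x_pred, P_pred, y, H, R):
    """One KF update step using the gain K = P^- H^T (H P^- H^T + R)^{-1}."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    # K = P^- H^T S^{-1}, computed via solve: K^T = solve(S^T, (P^- H^T)^T)
    K = np.linalg.solve(S.T, (P_pred @ H.T).T).T
    x_new = x_pred + K @ (y - H @ x_pred)    # mean update with the innovation
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example: 2D state, we observe only the first component.
x_pred = np.array([1.0, 1.0])
P_pred = np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])

x_new, P_new = kf_update(x_pred, P_pred, np.array([2.0]), H, R)
print(x_new)  # [1.5 1. ]
```

The observed component moves halfway toward the measurement (prior and observation variance are equal here), while the unobserved component is untouched.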

Bayes’ Theorem Derivation

Although this derivation is definitely less intuitive, more algebra heavy, and doesn’t really motivate the idea of the Kalman gain well, we will have to use this method for non-Kalman-filter methods of BDA. It also doesn’t make the linear update assumption; it just so happens that in the all-Gaussian case the linear update is optimal.

When we receive an observation $y_k$ we want to update our distribution with Bayes’ theorem in the following form

$$p(x_k \mid y_{1:k}) = \frac{p(y_k \mid x_k)\;p(x_k \mid y_{1:k-1})}{p(y_k \mid y_{1:k-1})}$$

Identifying the 4 key components

  1. Posterior - $p(x_k \mid y_{1:k})$. This is basically updating our distribution on the state to include the information gained by our measurement.
  2. Likelihood - $p(y_k \mid x_k)$. This is the probability of our measurement if $x_k$ was the real state. Comes directly from the observation model/noise: $y_k = H x_k + v_k$ with $v_k \sim \mathcal{N}(0, R)$. Therefore $p(y_k \mid x_k) = \mathcal{N}(y_k; H x_k, R)$.
  3. Prior - $p(x_k \mid y_{1:k-1})$. This is the probability distribution we had before seeing the current measurement. It was the result of propagating the posterior at $k-1$ through our model as we saw earlier. Our current belief is modelled by $\mathcal{N}(\hat{x}_k^-, P_k^-)$.
  4. Evidence - $p(y_k \mid y_{1:k-1})$. The probability of observing $y_k$ under the model. In practice we often don’t compute/care about this.

Substituting in our distributions and ignoring the normalising evidence constant we get

$$p(x_k \mid y_{1:k}) \propto \mathcal{N}(y_k; H x_k, R)\;\mathcal{N}(x_k; \hat{x}_k^-, P_k^-)$$

Note we are actually multiplying the probability density functions here; this is NOT the same as finding the distribution of the product of two normally distributed variables.

The product of two normal PDFs is proportional to a normal PDF, so using this fact we don’t have to compute this directly and we can use the formula $\mathcal{N}(x; \mu_1, \Sigma_1)\,\mathcal{N}(x; \mu_2, \Sigma_2) \propto \mathcal{N}(x; \mu_3, \Sigma_3)$ where

$$\Sigma_3 = \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right)^{-1}, \qquad \mu_3 = \Sigma_3\left(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2\right)$$

However our two normals are currently in different variables ($y_k$ and $x_k$). But we can get them in the same variable by, instead of viewing the likelihood $\mathcal{N}(y_k; H x_k, R)$ as a function of $y_k$ for a fixed $x_k$, viewing it as a function of $x_k$ for fixed $y_k$. Then with some complicated algebra we get that, as a function of $x_k$, it is proportional to a Gaussian in $x_k$, so now we have two Gaussian PDFs in the same variable.

Putting this together we get that

$$P_k^+ = \left((P_k^-)^{-1} + H^T R^{-1} H\right)^{-1}, \qquad \hat{x}_k^+ = P_k^+\left((P_k^-)^{-1}\hat{x}_k^- + H^T R^{-1} y_k\right)$$

So we have a closed formula for the new mean and covariance once we receive an observation. We can then do some rearranging (using the Woodbury matrix identity) to factor out a matrix and get the exact same KF equations as before, with the same gain $K$.
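As a sanity check, we can verify numerically on a small made-up example that this Bayes/information form gives the same posterior as the gain form (np.linalg.inv is used here purely for clarity of the formulas):

```python
import numpy as np

P_prior = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
x_prior = np.array([1.0, -1.0])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
y = np.array([2.0])

# Information (Bayes) form of the update
P_post = np.linalg.inv(np.linalg.inv(P_prior) + H.T @ np.linalg.inv(R) @ H)
x_post = P_post @ (np.linalg.inv(P_prior) @ x_prior + H.T @ np.linalg.inv(R) @ y)

# Gain (MMSE) form of the update
S = H @ P_prior @ H.T + R
K = P_prior @ H.T @ np.linalg.inv(S)
x_gain = x_prior + K @ (y - H @ x_prior)
P_gain = (np.eye(2) - K @ H) @ P_prior

print(np.allclose(x_post, x_gain), np.allclose(P_post, P_gain))  # True True
```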


Ensemble Kalman Filter (EnKF)

The EnKF is one way to use the KF equations we’ve just derived but with different assumptions. The EnKF works with non-linear models, and since non-linearities do not preserve Gaussians, our distributions will have to be non-Gaussian. We will still model observation noise ($v_k$) and process noise ($w_k$) as Gaussians; it’s just that our actual state estimate distribution can’t stay Gaussian.

The set up and notation stay the same except now our model function is non-linear, so we have

$$x_{k+1} = f(x_k) + w_k, \qquad w_k \sim \mathcal{N}(0, Q)$$

To manage dealing with arbitrary distributions we deal with an array of samples (particles), and instead of acting with exact maths on a whole distribution we use the samples to estimate properties of the distribution. This method of using random samples to approximate computations that would otherwise be very costly/complex is called Monte Carlo.

We take our initial belief, which for example could be a Gaussian, and we take some random samples from it, forming an array of particles which is our distribution. Our update and propagate steps will then act on the array of particles. We can then get our estimate by taking the simple mean of the particles and our covariance with the sample covariance (applying Bessel’s correction - only to the covariance, not the mean).

To denote the $i$th particle in the array we will use $x_k^{(i)}$.

Propagate

Propagating our distribution is very simple. We treat each particle individually and propagate it through the model, so

$$x_{k+1}^{(i)} = f\left(x_k^{(i)}\right) + w_k^{(i)}, \qquad w_k^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0, Q)$$

Here any control input is constant across all particles, as we are only considering deterministic input. Each $w_k^{(i)}$ is a single random sample from the normal, and “iid” means independent (and) identically distributed. Basically, for each particle, to add in process noise we add an independent random amount of noise sampled from the process noise distribution.
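A minimal sketch of the EnKF propagation (the non-linear model and all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

def enkf_propagate(particles, f, Q):
    """Propagate each particle through the (possibly non-linear) model f
    and add an independent process-noise sample per particle."""
    n_particles, dim = particles.shape
    noise = rng.multivariate_normal(np.zeros(dim), Q, size=n_particles)
    return np.array([f(p) for p in particles]) + noise

# Toy non-linear model (made up for illustration)
def f(x):
    return np.array([x[0] + 0.1 * np.sin(x[1]), 0.9 * x[1]])

Q = 0.01 * np.eye(2)
particles = rng.multivariate_normal(np.zeros(2), np.eye(2), size=500)
particles = enkf_propagate(particles, f, Q)
print(particles.shape)  # (500, 2)
```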

Update

For the update step we will use the linear Kalman filter update like before, by approximating our distribution with a Gaussian. To do this we simply compute the mean and sample covariance of the particles and then treat our distribution as though it were modelled by a Gaussian with that mean and covariance.

To actually perform this update we compute the Kalman gain like before, using the sample covariance as $P_k^-$. Then for each particle we have

$$x_k^{(i),+} = x_k^{(i),-} + K\left(y_k^{(i)} - H x_k^{(i),-}\right)$$

And although we have only one measurement, we form a whole array of perturbed measurements with $y_k^{(i)} = y_k + v_k^{(i)}$, $v_k^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0, R)$. Basically for each particle we add a little bit of noise to the observation we received, using the distribution we believe the observation noise follows, in this case $\mathcal{N}(0, R)$. The reason we inject this noise is that we would otherwise be updating the particles to have the correct mean, but we need to have some method for the particle distribution’s covariance to account for the observation covariance, so we add random noise to reflect this. Without this there is no way for the ensemble to “see” the randomness in the measurements.
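A minimal sketch of the EnKF update with perturbed observations (numbers made up; the gain uses the ensemble’s sample covariance as $P_k^-$):

```python
import numpy as np

rng = np.random.default_rng(7)

def enkf_update(particles, y, H, R):
    """EnKF update with perturbed observations. The sample covariance of the
    ensemble plays the role of P^- in the Kalman gain."""
    P = np.cov(particles.T)                     # sample covariance (Bessel)
    S = H @ P @ H.T + R
    K = np.linalg.solve(S.T, (P @ H.T).T).T     # K = P H^T S^{-1}
    # One independently perturbed measurement per particle
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=len(particles))
    innovations = y_pert - particles @ H.T
    return particles + innovations @ K.T

# Toy setup: 2D state, observe the first component only (made-up numbers)
particles = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=1000)
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
updated = enkf_update(particles, np.array([1.0]), H, R)
print(updated.shape)  # (1000, 2)
```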


Extended Kalman Filter (EKF)

The EKF is another way of dealing with a non-linear model. It is pretty simple: the EKF preserves Gaussians by applying a linear approximation of the non-linear model function instead of the function itself.

It makes this linear approximation with the Jacobian, a matrix of partial derivatives. It is the generalisation of the derivative to a function $f : \mathbb{R}^n \to \mathbb{R}^m$. It is a linear function and can be represented as an $m \times n$ matrix with entries $J_{ij} = \partial f_i / \partial x_j$.

Propagate

We propagate our mean using the non-linear model function, meaning $\hat{x}_{k+1}^- = f(\hat{x}_k^+)$. However, in order to keep a normal distribution, we propagate the covariance with a linear approximation. We expand our function using the Taylor series around the mean

$$f(x) \approx f(\hat{x}_k) + \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_k}\left(x - \hat{x}_k\right)$$

Here $\left.\frac{\partial f}{\partial x}\right|_{\hat{x}_k}$ is precisely the Jacobian of $f$ at $\hat{x}_k$, so call it $F_k$. So we have $f(x) \approx f(\hat{x}_k) + F_k(x - \hat{x}_k)$, and putting this in our model equation we have

$$x_{k+1} \approx f(\hat{x}_k) + F_k\left(x_k - \hat{x}_k\right) + w_k$$

We want to calculate the new covariance of this, so we get

$$P_{k+1}^- = F_k P_k^+ F_k^T + Q$$

Note this is the exact same form as the KF but with $F$ replaced by the Jacobian $F_k$.
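A minimal sketch of the EKF propagation, with a made-up non-linear model and its hand-derived Jacobian:

```python
import numpy as np

def ekf_propagate(x_hat, P, f, jacobian, Q):
    """EKF propagation: mean through the non-linear model, covariance through
    the Jacobian evaluated at the mean."""
    F = jacobian(x_hat)
    return f(x_hat), F @ P @ F.T + Q

# Toy non-linear model (made up for illustration)
def f(x):
    return np.array([x[0] + 0.1 * x[1] ** 2, 0.5 * x[1]])

def jacobian(x):
    # Partial derivatives of f, computed by hand
    return np.array([[1.0, 0.2 * x[1]],
                     [0.0, 0.5]])

x_hat = np.array([1.0, 2.0])
P = np.eye(2)
Q = 0.01 * np.eye(2)

x_pred, P_pred = ekf_propagate(x_hat, P, f, jacobian, Q)
print(x_pred)  # [1.4 1. ]
```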

Update

Since we’ve forced our distribution to be gaussian via linear approximations we can use the exact same update step as the KF with no modifications.