This guide isn’t complete yet. I will keep updating it, adding information about the other filters/techniques, and I may expand on what I’ve already done, but for now this should be enough to get you up to speed on the problem and the Kalman filter (KF) in the linear, all-Gaussian case, as well as the ensemble Kalman filter (EnKF).
The Problem We are Trying to Solve
We are trying to model a dynamic system - a system that evolves over time - and make predictions about the state of the system. This could be anything really: the weather, the position and velocity of an object, a population over time, the volatility of a stock - anything that changes over time. We are looking specifically at dynamic systems that are governed by, or can be modelled with, differential equations; however, all of these techniques also apply to dynamic systems that are modelled without an ODE.
The problem is that we don’t know the exact state of the system. We can make observations/measurements, but these aren’t exact as they have error/noise (for example, a sensor only has finite precision). So we can’t completely trust our observations, and if we are slightly off, this error can compound over time and the predictions we make can drift wildly from the true value.
Another problem we have to deal with is that our model may not be perfect and/or the system may be stochastic (random/non-deterministic). Even if we did know the exact state of the system at some point, our model might not be able to perfectly predict the true value, and again this error may compound over time. This could be because there are external conditions we can’t account for, or because the system is too complicated to model every single part, so we simplify a bit. E.g. we can’t model every single particle in a system: it is too complicated to deal with, we don’t have enough computational power, and we can’t make measurements of the whole system.
The Solution / Core Loop of Bayesian Data Assimilation
Instead of the model predicting a single value for the state, we model the state as a random variable and keep track of its probability distribution. That is, instead of predicting that the current state of the system is a specific value, we talk about the probability that the true state falls within some region, and we take the mean of this random variable as our guess for the true state. This also lets us model how confident we are in that guess.
The two core functions of any model are:
- How we take our current distribution estimate and produce a new distribution of the state at some later time. We will refer to this as the propagation step.
- When we receive an observation of the state (which may not cover every variable we are trying to model), how do we update the distribution? To what extent do we prefer the noisy observation versus our noisy prediction, and how do we combine the two to get a more accurate estimate of the state? We will refer to this as the update step.
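The two steps above can be sketched as a simple predict/correct loop. This is only a structural skeleton: `propagate` and `update` here are made-up stand-ins, not a real filter (the concrete versions come later with the KF).

```python
# Skeleton of the Bayesian data assimilation loop.
# `propagate` and `update` are hypothetical stand-ins for whichever
# filter (KF, EnKF, ...) is actually being used.

def propagate(mean, cov):
    # Propagation step: push our current belief forward one time step.
    return mean, cov  # stand-in: a real model would change both

def update(mean, cov, observation):
    # Update step: blend the noisy prediction with the noisy observation.
    return (mean + observation) / 2, cov / 2  # stand-in blending rule

mean, cov = 0.0, 1.0            # initial belief: guess and uncertainty
observations = [1.0, 0.5, 0.8]  # made-up measurement stream

for y in observations:
    mean, cov = propagate(mean, cov)  # prediction
    mean, cov = update(mean, cov, y)  # correction
```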
Prerequisites
Brief Note on Vectors, Matrices and Linear Algebra
Vectors
The system that we are trying to model will have more than one variable. Instead of treating them as separate variables we group them together as a vector. A vector is not “something with direction and magnitude” - that only applies to certain vector spaces that have geometric meaning; vectors are not inherently geometric objects. In the case of our project, the vectors we are dealing with are just the most natural mathematical way to group things, and they do not have any geometric meaning/properties unless specified.
For the purposes of the project we will only be talking about vectors in $\mathbb{R}^n$ - that is, a list of $n$ real numbers (stored in code as an np.array) - and the dimension of the vector is just the length of the list. The set of all such vectors of dimension $n$ is denoted $\mathbb{R}^n$.
Matrices
Matrices are a two dimensional grid of numbers (the actual definition is a bit more abstract than this, but this definition is enough for the project). Generally this grid is square, but it can be non-square, although those only come up very briefly in the project with the observation map. In code a matrix is a two dimensional np.array.
In this project matrices are used for three purposes:
- As a function on vectors - focus on column vectors.
- A way to represent systems of equations - focus on row vectors.
- The covariance matrix - just a grid of numbers. Its matrix properties and the properties of its row and column vectors aren’t really relevant.
For the project, knowing much about matrices as functions or as systems of equations isn’t too vital, but it is helpful in some cases. The most common occurrence of matrices is the covariance. Matrices as functions on vectors come up quite a bit as the state transition matrix, but depending on the system we are modelling this is not always possible. Representing systems of equations with matrices is only applicable to a very specific type of equation system and for most cases won’t work.
“Linear”
The word linear will come up a few times in this project. It does not refer to a linear polynomial $f(x) = mx + c$; a function $f$ is linear if $f(a\mathbf{u} + b\mathbf{v}) = a f(\mathbf{u}) + b f(\mathbf{v})$ for all vectors $\mathbf{u}, \mathbf{v}$ and scalars $a, b$.
This definition isn’t really too important; all you really need to know is that linear refers to a “nice” type of function that leads to simpler properties. A linear function between finite-dimensional vector spaces can be represented as a matrix. If a linear function’s domain and codomain are the same vector space, such as $f: \mathbb{R}^n \to \mathbb{R}^n$, then it is a linear transform.
Annoyingly, a function of the form $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ is also often called linear even though it is strictly affine - this is the “or affine” caveat that comes up later.
Matrices as functions
I won’t go into too much detail as it’s not really necessary, but matrices are linear transforms on vectors. Because of the properties of linear functions, a transform is completely determined by where the basis vectors end up after the transformation: the columns of the matrix are the images of the basis vectors. For example, the matrix $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ sends $(1, 0)$ to $(0, 1)$ and $(0, 1)$ to $(-1, 0)$ - a rotation by $90°$.
Matrices as Systems of Equations
The linear system of equations:
$$a_{11}x_1 + a_{12}x_2 = b_1$$
$$a_{21}x_1 + a_{22}x_2 = b_2$$
is equivalent to this matrix equation:
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$
Notice how we have just taken each row of coefficients and put it in as a row vector.
If we have a linear system of equations $A\mathbf{x} = \mathbf{v}$ (or a matrix equation $AX = B$) we can solve it numerically with np.linalg.solve(A, v) or likewise np.linalg.solve(A, B). If you are computing $AB^{-1}$, transpose the problem and use np.linalg.solve(B.T, A.T).T.
Although np.linalg.inv(A) does exist, you should never really use it; always prefer np.linalg.solve as it has better performance and numerical stability.
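A minimal numpy sketch of the above (the matrices and right-hand sides here are made-up examples):

```python
import numpy as np

# Solving A x = b: prefer np.linalg.solve over forming the inverse.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_solve = np.linalg.solve(A, b)   # preferred: faster and more stable
x_inv = np.linalg.inv(A) @ b      # works, but avoid in practice

# For X = B A^{-1} (a "right division"), transpose the problem:
# X A = B  <=>  A^T X^T = B^T.
B = np.eye(2)
X = np.linalg.solve(A.T, B.T).T   # here X is just A^{-1} since B = I
```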
Transpose
If we have a matrix $A$ then we can compute its transpose $A^T$ by swapping its rows and columns, i.e. $(A^T)_{ij} = A_{ji}$ (in code, A.T).
You will most commonly see it in matrix multiplications with forms like $FPF^T$ or $HPH^T$, which is how covariance matrices transform under linear maps.
Differential Equations
There are different types: ordinary differential equations (ODEs), partial (PDEs) and stochastic (SDEs). We will mainly be looking at ODEs for now. An ODE is an equation consisting of derivatives (and possibly higher order derivatives, e.g. second and third derivatives - however I haven’t dealt with any of these in the context of Bayesian data assimilation yet). If we have an ODE in multiple variables, such as the Lorenz equations
$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$
we can group the variables into a vector
$$\mathbf{u} = (x, y, z)^T$$
and write the whole system as
$$\dot{\mathbf{u}} = f(\mathbf{u})$$
where $f$ returns the vector of right-hand sides above. And if the set of equations is linear then we can write the function as a matrix $A$ and get
$$\dot{\mathbf{u}} = A\mathbf{u}$$
For this project we will be discretising the ODE, meaning that instead of looking at the whole continuous solution we care about the solution at discrete points in time. For the most part we won’t really be solving ODEs analytically (exactly); we will approximate using methods like Euler’s method, RK4 or DOP853. The exception is linear ODEs, where we will use matrix exponentials.
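As a sketch of what discretisation looks like, here are Euler and RK4 steps applied to the Lorenz system (the standard parameter values $\sigma = 10$, $\rho = 28$, $\beta = 8/3$ are assumed):

```python
import numpy as np

# Lorenz right-hand side with the classic parameter values.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(u):
    x, y, z = u
    return np.array([SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z])

def euler_step(f, u, dt):
    # One Euler step: follow the derivative in a straight line.
    return u + dt * f(u)

def rk4_step(f, u, dt):
    # One classic Runge-Kutta 4 step: weighted average of four slopes.
    k1 = f(u)
    k2 = f(u + dt / 2 * k1)
    k3 = f(u + dt / 2 * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

u = np.array([1.0, 1.0, 1.0])   # made-up initial condition
for _ in range(100):            # 100 discrete steps of size 0.01
    u = rk4_step(lorenz, u, 0.01)
```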
Continuous Random Variables
You don’t need to learn all of this in detail; you just need to understand at a high level what a random variable and a probability distribution are, and what variance/covariance is. We generally don’t care about evaluating actual probabilities using integrals; we just care about the distribution/random variable as a whole.
Definition
A continuous random variable is a random variable that can take any value in a continuous range, rather than from a discrete set. Because there are uncountably many possible values, the probability of any single exact value is zero, so instead we talk about the probability of the variable falling within some interval.
Probability Density Function
In order to do this we have a probability density function $f_X$, where the probability of $X$ falling in an interval is the area under the curve:
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
and, as total probability sums to 1,
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
Expected Value
The expected value, denoted $E[X]$, is defined as
$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx$$
This comes from a generalisation of the discrete case $E[X] = \sum_i x_i\, P(X = x_i)$, replacing the sum with an integral and the probabilities with the density.
Variance
Variance is defined as
$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$
Since expectation is linear, this expands to the often more convenient form $\mathrm{Var}(X) = E[X^2] - E[X]^2$.
It has the properties that:
- It measures the average squared deviation from the mean, so it is never negative.
- It weights larger deviations more than small ones - hence the squaring.
We then define standard deviation as
$$\sigma = \sqrt{\mathrm{Var}(X)}$$
to go from squared distance back to distance, giving the spread in the original units. It is NOT quite the average distance from the mean, but we use it over the mean absolute deviation because it has much nicer mathematical properties.
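A quick numerical illustration of mean, variance and standard deviation using random samples (the distribution parameters here are arbitrary illustrative values):

```python
import numpy as np

# Draw samples from a normal distribution and compute sample statistics.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=100_000)

mean = samples.mean()
var = samples.var()   # average squared deviation from the mean
std = samples.std()   # square root of the variance: back to original units

# With many samples these should land near the true values (5 and 2).
```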
Dealing with Multiple Random Variables
When we have multiple random variables we consider them together as a random vector in $\mathbb{R}^n$ - a vector whose components are each random variables.
Joint Probability Density Function
Again, we don’t really care about actual probabilities or the integrals that much, but for the sake of understanding how distributions work over a vector/multiple variables, and for completeness: a joint probability density function is a function $f_{X,Y}(x, y)$ such that
$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$$
where $A$ is some region of values the pair can fall in.
This generalises in the obvious way to any number of variables: the density becomes a function on $\mathbb{R}^n$ and probabilities become $n$-dimensional integrals.
Bayes Theorem
Bayes’ theorem states that
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
where $H$ is a hypothesis and $E$ is some observed evidence.
Terminology
- $P(H)$ is called the Prior. This is our current belief/hypothesis before we see the data/evidence. It is our starting point.
- $P(E \mid H)$ is called the Likelihood. It is the probability of observing the evidence if our hypothesis were true. Effectively it measures how compatible our hypothesis/current beliefs are with the evidence we just observed.
- $P(E)$ is called the Evidence or Marginal Likelihood. This is the probability of observing $E$ across all possible hypotheses. It serves as a normalisation constant.
- $P(H \mid E)$ is called the Posterior. This is our updated belief in our hypothesis after accounting for the new evidence $E$.
Note
Often Bayes’ theorem is written as
$$P(H \mid E) \propto P(E \mid H)\,P(H)$$
since the evidence is just a normalising constant that doesn’t depend on the hypothesis.
The theorem as stated is for one specific event, but it can be easily generalised to a full probability distribution by replacing all the probability functions with probability density functions.
Bayes theorem allows us to take our current belief and obtain a more accurate belief by incorporating new evidence.
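A small worked example of Bayes’ theorem with made-up numbers, showing all four components (a test for a rare condition; the rates are illustrative, not real data):

```python
# Bayes' theorem on a made-up discrete example.
prior = 0.01               # P(H): 1% of cases have the condition
likelihood = 0.95          # P(E|H): test positive given the condition
false_positive = 0.05      # P(E|not H): test positive without it

# Evidence P(E): total probability of a positive test over both hypotheses.
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior P(H|E): our updated belief after seeing a positive test.
posterior = likelihood * prior / evidence
```

Note how the posterior (about 16%) is far from the 95% test accuracy: the low prior dominates, which is exactly the prior/likelihood trade-off the filters below formalise.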
Normal Distribution
The normal (Gaussian) distribution is the most common distribution. Its PDF is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma^2$ is the variance.
You don’t need to know the actual function - it will only be used in some proofs.
All you need to know is that it is symmetric at the mean and the probability density exponentially decays as you move further from the mean. The (co)variance controls how fast this decays. A key property that makes it useful in this context is that when applying a linear (or affine) transform to a normal distributed random variable, the resultant random variable will be normally distributed and the mean and the new covariance can be easily calculated exactly.
It is denoted as $X \sim \mathcal{N}(\mu, \sigma^2)$, or in the multivariate case $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, P)$ where $P$ is the covariance matrix.
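The key closure property can be checked empirically: pushing samples of a Gaussian through a linear map $A$ gives a sample mean close to $A\boldsymbol{\mu}$ and a sample covariance close to $APA^T$ (the particular $\boldsymbol{\mu}$, $P$ and $A$ below are made up):

```python
import numpy as np

# Empirical check: if x ~ N(mu, P) then A x ~ N(A mu, A P A^T).
rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])

x = rng.multivariate_normal(mu, P, size=200_000)
y = x @ A.T                 # apply the linear map to every sample

emp_mean = y.mean(axis=0)   # should be close to A @ mu
emp_cov = np.cov(y.T)       # should be close to A @ P @ A.T
```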
Markov Assumption and Markov Chains
“A Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”.
This effectively means that if we have some random sequence of events, such as moves in a game of chess, the next state only depends on the current state: we don’t care about how we got to our current state (e.g. which moves were actually played) in order to predict the next state.
Mathematically, a sequence of random variables $X_1, X_2, \ldots$ forms a Markov chain if
$$P(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) = P(X_{n+1} \mid X_n)$$
Time Homogeneous
We will only be looking at time homogeneous systems. All this means is that the system has no dependence on absolute time, only on state: there is no absolute frame of reference, only relative.
Setup and Notation
General Setup
The true state of the system at time $t_k$ will be denoted $\mathbf{x}_k$.
If we are dealing with a system that does not evolve with continuous time but just with discrete steps, you may see the index treated simply as a step counter.
Our estimate for the state will be denoted as $\hat{\mathbf{x}}_k$, with covariance $P_k$.
When talking about the state at time $t_k$ we distinguish the prior estimate $\hat{\mathbf{x}}_k^-$ (after propagation, before incorporating the observation) from the posterior estimate $\hat{\mathbf{x}}_k^+$ (after incorporating the observation), and similarly $P_k^-$ and $P_k^+$.
You may in other sources see the equivalent notation $\hat{\mathbf{x}}_{k \mid k-1}$ and $\hat{\mathbf{x}}_{k \mid k}$.
Model Equation
Our model equation will take in some state and return the next state. It may also take in a control vector $\mathbf{u}_k$: a deterministic input that we know and feed to the model. An example of where we would have this is if we are tracking an autonomous vehicle/robot that we are giving commands to. (Including $\mathbf{u}_k$ doesn’t complicate any maths; it only affects the mean propagation.)
Our model equation is
$$\mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k) + \mathbf{w}_k$$
Again, here $\mathbf{x}_k$ is a random variable, and $\mathbf{w}_k$ is the process noise: the random error our model makes at each step.
Since the model isn’t perfect, we model its error as zero-mean Gaussian noise, $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, Q)$, where $Q$ is the process noise covariance.
For the regular KF our model needs to be linear (or affine - this can happen when you include the control vector), so we get the model equation
$$\mathbf{x}_{k+1} = F\mathbf{x}_k + B\mathbf{u}_k + \mathbf{w}_k$$
where $F$ is the state transition matrix and $B$ is the control matrix.
The control term can be non-linear, as we only require linearity in $\mathbf{x}_k$: the control input is deterministic, so it just shifts the mean.
However there are some cases where the matrices change between steps; we then write $F_k$ and $B_k$, and everything below still goes through step by step.
Observations
We can make observations of our system; we denote an observation at time $t_k$ by $\mathbf{y}_k$. Observations are related to the state through the observation map $H$ plus noise:
$$\mathbf{y}_k = H\mathbf{x}_k + \mathbf{v}_k, \qquad \mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, R)$$
where $R$ is the observation noise covariance. The map $H$ lets us observe only some of the state variables (or combinations of them).
Again, to emphasise: we are treating $\mathbf{y}_k$ as a random variable; the measurement we actually receive is one sample of it.
Kalman Filter (KF)
The Kalman filter covers both propagation and update. It requires certain assumptions, namely that the model is linear/affine and that all probability distributions are Gaussian. It also requires the observation map to be linear.
The Gaussians allow us to keep track of just the mean and covariance, and the calculations on the distribution are all exact; the linear model is required to preserve Gaussian distributions, as propagating a Gaussian through a linear model gives us back a Gaussian.
So we have the state modelled as $\mathbf{x}_k \sim \mathcal{N}(\hat{\mathbf{x}}_k, P_k)$.
Propagation
The propagation is fairly simple.
We have our distribution on the state, $\mathbf{x}_k \sim \mathcal{N}(\hat{\mathbf{x}}_k^+, P_k^+)$, and propagate the mean directly through the model:
$$\hat{\mathbf{x}}_{k+1}^- = F\hat{\mathbf{x}}_k^+ + B\mathbf{u}_k$$
Again, since the process noise $\mathbf{w}_k$ has zero mean, it doesn’t affect the mean, only the covariance.
The key part of the KF that makes this so simple is that a linear map preserves Gaussians: if $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, P)$ then $F\mathbf{x} \sim \mathcal{N}(F\boldsymbol{\mu}, FPF^T)$.
So propagating the covariance is simple; adding the independent process noise we get
$$P_{k+1}^- = F P_k^+ F^T + Q$$
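A minimal sketch of the propagation step in numpy, assuming a made-up 1D constant-velocity model (state = position and velocity) for concreteness:

```python
import numpy as np

# Hypothetical constant-velocity model: position += velocity * dt.
dt = 0.1
F = np.array([[1.0, dt],
              [0.0, 1.0]])       # state transition matrix
Q = 0.01 * np.eye(2)             # process noise covariance (made up)

x_hat = np.array([0.0, 1.0])     # current mean estimate
P = np.eye(2)                    # current covariance estimate

# Propagate: mean through the model, covariance as F P F^T + Q.
x_hat_pred = F @ x_hat
P_pred = F @ P @ F.T + Q
```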
Update
There are multiple ways to derive the following KF equations. We will start with MMSE - minimum mean square error - as it is much more intuitive and motivates the why/how better. Then we will derive it again with Bayes’ theorem.
MMSE Derivation
When we receive an observation we have two differing estimates of the state: our current (predicted) estimate $\hat{\mathbf{x}}^-$ with covariance $P^-$, and the observation $\mathbf{y}$ with noise covariance $R$.
Just for now, for ease and clarity, we will ignore the observation map and assume we observe the state directly ($H = I$). A natural way to combine the two estimates is a weighted average:
$$\hat{\mathbf{x}}^+ = (1 - k)\,\hat{\mathbf{x}}^- + k\,\mathbf{y}$$
for some weight $0 \le k \le 1$.
This is more commonly written as
$$\hat{\mathbf{x}}^+ = \hat{\mathbf{x}}^- + k\,(\mathbf{y} - \hat{\mathbf{x}}^-)$$
where $\mathbf{y} - \hat{\mathbf{x}}^-$ is called the innovation: the difference between what we observed and what we predicted.
This innovation form is always used because when we introduce the observation map the state and the observation live in different spaces, so a direct weighted average no longer makes sense.
Introducing back the observation map we get the equation
$$\hat{\mathbf{x}}^+ = \hat{\mathbf{x}}^- + K\,(\mathbf{y} - H\hat{\mathbf{x}}^-)$$
where the scalar weight has become a matrix $K$, the Kalman gain.
The goal of the KF is to find the optimum $K$: the one that minimises the mean squared error of the estimate, $E\left[\lVert \mathbf{x} - \hat{\mathbf{x}}^+ \rVert^2\right]$.
In order to go from just the mean update equation we have right now to the whole distribution update, we replace the estimates with the random variables themselves:
$$\mathbf{x}^+ = \mathbf{x}^- + K\,(\mathbf{y} - H\mathbf{x}^-)$$
Here we are treating $\mathbf{x}^-$ and $\mathbf{y}$ as random variables, so the update defines a new random variable $\mathbf{x}^+$ whose mean and covariance we can compute.
We want to find the $K$ that minimises the mean squared error, which is the trace of the posterior covariance $P^+$. Expanding $P^+$ in terms of $K$, differentiating with respect to $K$ and setting the derivative to zero gives
$$K = P^- H^T \left(H P^- H^T + R\right)^{-1}$$
Now that we’ve found $K$, we can substitute it back to get the updated covariance:
$$P^+ = (I - KH)\,P^-$$
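The update step can be sketched as follows; the predicted mean/covariance and the observation are made-up values (observing only the first state component), and the gain is computed with `np.linalg.solve` rather than an explicit inverse, as recommended earlier:

```python
import numpy as np

# Hypothetical setup: 2D state, we observe only the first component.
H = np.array([[1.0, 0.0]])          # observation map
R = np.array([[0.25]])              # observation noise covariance

x_pred = np.array([0.1, 1.0])       # predicted (prior) mean
P_pred = np.array([[1.02, 0.1],
                   [0.1, 1.01]])    # predicted (prior) covariance

y = np.array([0.3])                 # incoming noisy observation

# Kalman gain K = P H^T (H P H^T + R)^{-1}, via solve instead of inv:
# K S = P H^T  <=>  S^T K^T = (P H^T)^T.
S = H @ P_pred @ H.T + R            # innovation covariance
K = np.linalg.solve(S.T, (P_pred @ H.T).T).T

x_post = x_pred + K @ (y - H @ x_pred)    # updated mean
P_post = (np.eye(2) - K @ H) @ P_pred     # updated covariance
```

Note the updated mean lands between the prediction and the observation, and the position uncertainty shrinks.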
Bayes’ Theorem Derivation
Although this derivation is definitely less intuitive, more algebra heavy, and doesn’t motivate the idea of the Kalman gain as well, we will have to use this method for non-Kalman-filter methods of BDA. It also doesn’t assume a linear update; it just so happens that in the all-Gaussian case the linear update is optimal.
When we receive an observation $\mathbf{y}_k$, we want to use Bayes’ theorem to compute the posterior distribution $p(\mathbf{x}_k \mid \mathbf{y}_k)$ from our prior $p(\mathbf{x}_k)$.
Identifying the 4 key components:
- Posterior - $p(\mathbf{x}_k \mid \mathbf{y}_k)$. This is what we want: updating our distribution on the state to include the information gained by our measurement.
- Likelihood - $p(\mathbf{y}_k \mid \mathbf{x}_k)$. This is the probability of our measurement if $\mathbf{x}_k$ were the real state. It comes directly from the observation model/noise: since $\mathbf{y}_k = H\mathbf{x}_k + \mathbf{v}_k$ with $\mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, R)$, we have $p(\mathbf{y}_k \mid \mathbf{x}_k) = \mathcal{N}(H\mathbf{x}_k, R)$.
- Prior - $p(\mathbf{x}_k)$. This is the probability distribution we had before seeing the current measurement. It was the result of propagating the posterior at $t_{k-1}$ through our model, as we saw earlier: our current belief, modelled by $\mathcal{N}(\hat{\mathbf{x}}_k^-, P_k^-)$.
- Evidence - $p(\mathbf{y}_k)$. The probability of observing $\mathbf{y}_k$ under the model. In practice we often don’t compute/care about this, as it is just a normalising constant.
Substituting in our distributions and ignoring the normalising evidence constant we get
$$p(\mathbf{x}_k \mid \mathbf{y}_k) \propto \mathcal{N}(\mathbf{y}_k;\, H\mathbf{x}_k,\, R)\; \mathcal{N}(\mathbf{x}_k;\, \hat{\mathbf{x}}_k^-,\, P_k^-)$$
Note we are actually multiplying the probability density functions here; this is NOT the same as finding the distribution of the product of two normally distributed variables.
The product of two normal PDFs is proportional to a normal PDF, so using this fact we don’t have to compute the product directly. In one dimension, $\mathcal{N}(\mu_1, \sigma_1^2) \cdot \mathcal{N}(\mu_2, \sigma_2^2) \propto \mathcal{N}(\mu, \sigma^2)$ with
$$\sigma^2 = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1}, \qquad \mu = \sigma^2\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)$$
and the analogous precision-weighted formula holds for multivariate normals.
However, our two normals are currently in different variables ($\mathbf{x}_k$ for the prior, $\mathbf{y}_k$ for the likelihood), so we first have to rewrite the likelihood as a function of $\mathbf{x}_k$ before the product formula applies.
Putting this together (after a fair amount of algebra) we get that the posterior is Gaussian with exactly the KF update equations from before:
$$\hat{\mathbf{x}}^+ = \hat{\mathbf{x}}^- + K\,(\mathbf{y} - H\hat{\mathbf{x}}^-), \qquad P^+ = (I - KH)\,P^-$$
with $K = P^- H^T (H P^- H^T + R)^{-1}$.
Ensemble Kalman Filter (EnKF)
EnKF is one way to use the KF equations we’ve just derived, but with different assumptions. EnKF works with non-linear models, and since non-linear maps do not preserve Gaussians, our distributions will have to be non-Gaussian. We will still model the observation noise ($\mathbf{v}_k$) as Gaussian.
The setup and notation stay the same, except now our model function is non-linear, so we have
$$\mathbf{x}_{k+1} = f(\mathbf{x}_k) + \mathbf{w}_k$$
To manage dealing with arbitrary distributions we use an array of samples (particles), and instead of doing exact maths on a whole distribution we use the samples to estimate the distribution’s properties. This method of using random samples to approximate computations that would otherwise be very costly/complex is called Monte Carlo.
We take our initial belief - which, for example, could be a Gaussian $\mathcal{N}(\hat{\mathbf{x}}_0, P_0)$ - and draw $N$ independent samples from it to form the initial ensemble.
To denote the $i$-th particle at time $t_k$ we write $\mathbf{x}_k^{(i)}$.
Propagate
Propagating our distribution is very simple. We treat each particle individually and propagate it through the model, so
$$\mathbf{x}_{k+1}^{(i)} = f(\mathbf{x}_k^{(i)}) + \mathbf{w}_k^{(i)}$$
Here $\mathbf{w}_k^{(i)} \sim \mathcal{N}(\mathbf{0}, Q)$ is a fresh process noise sample drawn independently for each particle.
Update
For the update step we will use the linear Kalman filter like before by approximating our distribution with a gaussian. To do this we simply compute the mean and sample covariance of the particles and then treat our distribution as though it was modelled by a gaussian with that mean and covariance.
To actually perform this update we compute the Kalman gain like before, using the sample covariance $\hat{P}$ in place of $P^-$:
$$K = \hat{P} H^T \left(H \hat{P} H^T + R\right)^{-1}$$
And although we only have one measurement $\mathbf{y}_k$, we perturb it with fresh observation noise for each particle (so the ensemble spread stays statistically consistent), then update each particle with the usual innovation form:
$$\mathbf{y}_k^{(i)} = \mathbf{y}_k + \mathbf{v}_k^{(i)}, \quad \mathbf{v}_k^{(i)} \sim \mathcal{N}(\mathbf{0}, R), \qquad \mathbf{x}_k^{(i),+} = \mathbf{x}_k^{(i),-} + K\left(\mathbf{y}_k^{(i)} - H\mathbf{x}_k^{(i),-}\right)$$
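One EnKF update step can be sketched like this; the ensemble, observation map and noise levels are all made-up illustrative values:

```python
import numpy as np

# One EnKF analysis step with perturbed observations.
rng = np.random.default_rng(2)
N = 500                                  # ensemble size
H = np.array([[1.0, 0.0]])               # observe first state component
R = np.array([[0.25]])                   # observation noise covariance

# Forecast ensemble: particles scattered around a made-up mean.
particles = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=N)

# Sample mean and covariance stand in for the Gaussian moments.
x_mean = particles.mean(axis=0)
P = np.cov(particles.T)

# Kalman gain from the sample covariance (via solve, not inv).
S = H @ P @ H.T + R
K = np.linalg.solve(S.T, (P @ H.T).T).T

# Perturb the single observation once per particle, update each one.
y = np.array([1.4])
y_pert = y + rng.multivariate_normal([0.0], R, size=N)
particles = particles + (y_pert - particles @ H.T) @ K.T
```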
Extended Kalman Filter (EKF)
EKF is another way of dealing with a non-linear model, and it is pretty simple: EKF preserves Gaussians by applying a linear approximation of the model function instead of the non-linear function itself.
It makes this linear approximation with the Jacobian: the matrix of partial derivatives. It is the generalisation of the derivative to a function $f: \mathbb{R}^n \to \mathbb{R}^m$, with entries
$$(J_f)_{ij} = \frac{\partial f_i}{\partial x_j}$$
Propagate
We propagate our mean using the full non-linear model function, meaning
$$\hat{\mathbf{x}}_{k+1}^- = f(\hat{\mathbf{x}}_k^+)$$
For the covariance we locally approximate $f$ around the current mean by its Jacobian,
$$F_k = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_k^+}$$
We want to calculate the new covariance, so using this linear approximation we get
$$P_{k+1}^- = F_k P_k^+ F_k^T + Q$$
Note this is the exact same form as the KF, but with $F_k$ being a local linear approximation (recomputed at each step) rather than an exact model matrix.
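A sketch of EKF propagation for a made-up pendulum-like model, showing the mean going through the non-linear $f$ while the covariance uses the Jacobian:

```python
import numpy as np

# Hypothetical pendulum-like model: state = [angle, angular velocity].
dt = 0.1

def f(x):
    # Non-linear discrete-time model (one Euler step of the dynamics).
    return np.array([x[0] + dt * x[1], x[1] - dt * np.sin(x[0])])

def jacobian(x):
    # Matrix of partial derivatives of f, evaluated at x.
    return np.array([[1.0, dt],
                     [-dt * np.cos(x[0]), 1.0]])

Q = 0.001 * np.eye(2)            # process noise covariance (made up)
x_hat = np.array([0.5, 0.0])     # current mean estimate
P = 0.1 * np.eye(2)              # current covariance estimate

# Mean goes through the full non-linear model...
x_pred = f(x_hat)
# ...while the covariance uses the local linearisation F P F^T + Q.
F = jacobian(x_hat)
P_pred = F @ P @ F.T + Q
```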
Update
Since we’ve forced our distribution to be gaussian via linear approximations we can use the exact same update step as the KF with no modifications.