This guide isn’t complete yet. I will keep updating it, adding new information about the other filters/techniques, and maybe expand on what I’ve already done, but for now this should be enough to get you up to understanding the problem and the Kalman filter in the linear, all-Gaussian case, as well as the ensemble Kalman filter.
I will also add more on the linear Kalman filter soon by deriving it a second way (by leading with the assumption that we have a linear update and minimising variance). This will make the Kalman filter feel more natural, as the first derivation can seem like an algebraic coincidence rather than the natural way to go about things. That proof doesn’t use Bayes’ theorem but still arrives at the exact same formulae.
The Problem We are Trying to Solve
We are trying to model a dynamic system - a system that evolves over time. This could be almost anything: the weather, the position and velocity of an object, a population over time, the volatility of a stock - anything that changes over time. We are looking specifically at dynamic systems that are governed by, or can be modelled with, differential equations. We might also look at systems that are not described by differential equations but instead use a state transition matrix directly.
The problem is that we don’t know the initial state of the system exactly. We can make observations/measurements, but our observations aren’t exact - they have error/noise. (This could be due to a sensor only having finite precision.) So we can’t completely trust our measurements, and if we are slightly off, this error can compound over time as we make predictions, leaving us wildly off from the true value.
Another problem we have to deal with is that our model may not be perfect. Even if we knew the exact state of the system at some point, our model might not be able to perfectly predict the true value, and again this error may compound over time. This could be because there are external conditions we can’t account for, or because the system is too complicated to model every single part, so we simplify a bit. E.g. we can’t model every single particle in a system: it is too complicated, we don’t have enough computational power, and we can’t measure the whole system.
The Solution / Core Loop of Bayesian Data Assimilation
Instead of the model predicting a single value for the state, we model the state as a random variable and keep track of its probability distribution. That is, instead of predicting that the current state of the system is one specific value, we talk about the probability that the true state falls within some region. We take the mean of this distribution as our guess for the true state. This also lets us model how confident we are in the guess.
The two core functions of any model are:
- How we take our current distribution estimate and produce a new distribution for the state at some later time.
- When we receive an observation of the state (which may not include every variable we are trying to model), how do we update the distribution? To what extent do we prefer the noisy observation versus our noisy prediction? How do we combine the two to get a more accurate estimate of the state?
Prerequisites
Brief Note on Vectors, Matrices and Linear Algebra
Vectors
The system that we are trying to model will have more than one variable. Instead of treating them as separate variables we group them together as a vector. However, a vector is not “something with direction and magnitude”. That only applies to certain vector spaces that have geometric meaning; vectors are not inherently geometric objects. The vectors we deal with in this project are just the most natural mathematical way to group things, and they have no geometric meaning/properties unless specified.
The actual definition of a vector is much more abstract, but it really boils down to whether something can be added together and multiplied by a scalar (vector-times-vector multiplication is not required). Many different mathematical objects, such as functions, can be vectors. A vector space refers to a whole set of vectors, but don’t worry too much about the details of this.
For the purposes of the project we will only be talking about vectors that are ordered lists of real numbers (represented in code as an np.array), and the dimension of the vector is just the length of the list.
Matrices
Matrices are a two-dimensional grid of numbers. (The actual definition is a bit more abstract than this, but this definition is enough for the project.) Generally this grid is square, but it can be non-square; non-square matrices only come up very briefly in the project, with the observation map. In code a matrix can be represented as a 2D np.array.
In this project matrices are used for three purposes:
- As a function on vectors - focus on column vectors.
- A way to represent systems of equations - focus on row vectors.
- The covariance matrix - just a grid of numbers. Its matrix properties and the properties of its row and column vectors aren’t really relevant.
For the project, knowing much about matrices as functions or as systems of equations isn’t vital, but it is helpful in some cases. The most common occurrence of matrices is the covariance matrix. Matrices as functions on vectors come up quite a bit as the state transition matrix, but depending on the system we are modelling this is not always possible. Representing systems of equations with matrices only applies to a very specific type of equation system and won’t work in most cases.
“Linear”
The word linear will come up a few times in this project. It does not refer to a linear polynomial $y = mx + c$. A function $f$ is linear if $f(a\mathbf{u} + b\mathbf{v}) = a\,f(\mathbf{u}) + b\,f(\mathbf{v})$ for all vectors $\mathbf{u}, \mathbf{v}$ and scalars $a, b$.
This definition isn’t really important; all you really need to know is that linear refers to a “nice” type of function that can be represented as a matrix. If a linear function’s domain and codomain are the same vector space, such as $f : \mathbb{R}^n \to \mathbb{R}^n$, then it is a linear transform.
Annoyingly, a “linear” polynomial $y = mx + c$ is not linear in this sense unless $c = 0$; functions of the form $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ are called affine.
Matrices as functions
I won’t go into too much detail as it’s not really necessary, but matrices are linear transforms on vectors. Because of the properties of linear functions, a linear transform is completely determined by where the basis vectors end up after the transformation: the columns of the matrix are exactly the images of the basis vectors.
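As a quick illustration (a minimal sketch with an arbitrary example matrix), the columns of a matrix are exactly where it sends the standard basis vectors, and applying the matrix respects linearity:

```python
import numpy as np

# A matrix acting as a function on vectors: f(v) = A @ v
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# The image of each basis vector is the corresponding column of A
print(A @ e1)   # first column of A
print(A @ e2)   # second column of A

# Linearity: f(a*u + b*v) = a*f(u) + b*f(v)
u, v = np.array([1.0, 2.0]), np.array([-1.0, 05e-1])
assert np.allclose(A @ (3 * u + 2 * v), 3 * (A @ u) + 2 * (A @ v))
```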
Matrices as Systems of Equations
The linear system of equations:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 &= b_1 \\ a_{21}x_1 + a_{22}x_2 &= b_2 \end{aligned}$$

is equivalent to this matrix equation:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$

Notice how we have just taken each row of coefficients and put it in as a row of the matrix.
If we have a linear system of equations $A\mathbf{x} = \mathbf{b}$, we can solve it in code with np.linalg.solve(A, b). If you need the inverse of a matrix you can use np.linalg.solve(A, np.eye(len(A))), np.linalg.inv(A), or np.linalg.pinv(A).
solve has the best performance and numerical stability, so use it unless you have a reason to use something different.
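A minimal sketch (with an arbitrary example system) comparing the two approaches:

```python
import numpy as np

# A small linear system A x = b (values chosen arbitrarily for illustration)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Preferred: solve directly (faster and more numerically stable)
x = np.linalg.solve(A, b)

# Equivalent in exact arithmetic, but slower and less stable:
# explicitly form the inverse, then multiply
x_via_inv = np.linalg.inv(A) @ b

assert np.allclose(x, x_via_inv)
```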
Differential Equations
There are different types: ordinary differential equations (ODEs), partial (PDEs) and stochastic (SDEs). We will mainly be looking at ODEs for now. An ODE is an equation consisting of derivatives (possibly higher-order derivatives, e.g. second and third derivatives - however I haven’t dealt with any of these in the context of Bayesian data assimilation yet). If we have an ODE in multiple variables, such as the Lorenz equations:

$$\begin{aligned} \frac{dx}{dt} &= \sigma(y - x) \\ \frac{dy}{dt} &= x(\rho - z) - y \\ \frac{dz}{dt} &= xy - \beta z \end{aligned}$$

We can group the variables into a vector and write

$$\frac{d\mathbf{u}}{dt} = f(\mathbf{u})$$

where $\mathbf{u} = (x, y, z)$ and $f$ is given by the right-hand sides above.

And if the set of equations is linear then we can write the function as a matrix and get

$$\frac{d\mathbf{u}}{dt} = A\mathbf{u}$$
For the project we won’t really be solving ODEs analytically (exactly); we will approximate solutions using methods like Euler’s method or RK4.
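As a sketch of how this looks in practice, here is Euler’s method applied to the Lorenz equations (the step size, initial condition, and classic parameter values $\sigma=10$, $\rho=28$, $\beta=8/3$ are illustrative choices, not project settings):

```python
import numpy as np

# Classic Lorenz parameter values (illustrative choice)
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def f(u):
    """Right-hand side of the Lorenz equations, u = (x, y, z)."""
    x, y, z = u
    return np.array([SIGMA * (y - x),
                     x * (RHO - z) - y,
                     x * y - BETA * z])

def euler_step(u, dt):
    """One explicit Euler step: u_{k+1} = u_k + dt * f(u_k)."""
    return u + dt * f(u)

u = np.array([1.0, 1.0, 1.0])   # arbitrary initial condition
dt = 0.01
for _ in range(1000):           # integrate for 10 time units
    u = euler_step(u, dt)
```

RK4 has the same shape: it just combines four evaluations of `f` per step instead of one, giving much better accuracy for the same step size.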
Continuous Random Variables
I’ve copied all of the following from my previous notes. You don’t need to learn it all in detail; you just need to understand at a high level what a probability distribution is and what variance/covariance are, for both single and multiple variables. We don’t really care about actual probabilities (computed with integrals), we just care about the distribution.
Definition
A continuous random variable is a random variable that can take any value in a continuous range (e.g. any real number), rather than one of a countable set of values. Because of this, the probability of it taking any single exact value is zero, so we instead talk about the probability of it falling within a region.
Probability Density Function
In order to do this we have a probability density function $f_X$ such that

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

*as total probability sums to $1$, we require $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
Expected Value
The expected value, denoted $E[X]$, is

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx$$

This comes from a generalisation of the discrete case $E[X] = \sum_i x_i\, P(X = x_i)$.
Variance
Variance is defined as

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$

Since expectation is linear, this expands to $\mathrm{Var}(X) = E[X^2] - E[X]^2$.
It has the properties that:
- It weights larger deviations more than small ones. Hence the squaring.
We then define standard deviation as

$$\sigma = \sqrt{\mathrm{Var}(X)}$$

to go from squared distance back to distance, giving spread in the original units. However, it is NOT quite the average distance from the mean. We use it over the mean absolute deviation $E[\,|X - E[X]|\,]$ because it has much nicer mathematical properties.
Dealing with Multiple Random Variables
When we have multiple random variables we group them as a random vector $\mathbf{X} = (X_1, \dots, X_n)$.
Joint Probability Density Function
Again, we don’t really care about the actual probabilities or the integral that much, but for the sake of understanding how distributions work over a vector/multiple variables, and for completeness: a joint probability density function is a function $f_{\mathbf{X}}$ of the whole vector such that

$$P(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}$$

where $A$ is a region of possible state values.

This generalises the single-variable density: probability is now the volume under the density surface over the region $A$, and the total integral over all space is $1$.
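In code we rarely touch these integrals; we estimate the mean and covariance of a multivariate distribution directly from samples. A minimal sketch (the particular mean and covariance values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from a 2D normal with a known mean and covariance
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
samples = rng.multivariate_normal(mu, Sigma, size=50_000)

# Sample mean and sample covariance (rowvar=False: each ROW is one sample)
mean_est = samples.mean(axis=0)
cov_est = np.cov(samples, rowvar=False)

# With many samples these should be close to the true mu and Sigma
```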
Bayes’ Theorem
Terminology
Bayes’ theorem states:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

- $P(H)$ is called the Prior. This is our current belief/hypothesis about $H$ before we see the data/evidence. It is our starting point.
- $P(E \mid H)$ is called the Likelihood. It is the probability of observing the evidence if our hypothesis were true. Effectively it measures how compatible our hypothesis/current beliefs are with the evidence we just observed.
- $P(E)$ is called the Evidence or Marginal Likelihood. This is the probability of observing $E$ across all possible hypotheses. It serves as a normalisation constant.
- $P(H \mid E)$ is called the Posterior. This is our updated belief in our hypothesis after accounting for the new evidence $E$.
Note
Often Bayes’ theorem is written as

$$P(H \mid E) \propto P(E \mid H)\,P(H)$$

dropping the evidence term, since it is just a normalising constant.
This theorem is stated for a single specific event, but it can easily be generalised to a full probability distribution by replacing the probabilities $P$ with probability density functions.
Bayes theorem allows us to take our current belief about the state and when we make a new observation update our distribution to be more accurate according to the observation we just made.
Normal Distribution
The normal distribution is the most common distribution in the project. Don’t worry about knowing the actual function for it. All you need to know is that it is symmetric at the mean and the probability density exponentially decays as you move further from the mean. The (co)variance controls how fast this decays.
It works with multiple variables using vectors and a covariance matrix. It is a very simple distribution that comes up all the time in the real world.
A key property that makes it useful in this context is that applying a linear (or affine) transform to a normally distributed variable gives another normally distributed variable: if $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ then $A\mathbf{x} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b},\; A\Sigma A^\top)$.
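This property can be checked empirically by sampling. A minimal sketch (the particular $\boldsymbol{\mu}$, $\Sigma$, $A$ and $\mathbf{b}$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([5.0, -1.0])

# Sample x ~ N(mu, Sigma) and apply the affine map y = A x + b to each sample
x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = x @ A.T + b

# The theory predicts y ~ N(A mu + b, A Sigma A^T); the sample mean and
# covariance of y should match those parameters closely
```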
It is denoted as $\mathcal{N}(\mu, \sigma^2)$ in one variable, or $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with a mean vector and covariance matrix in the multivariate case.
Setup and Notation
Model Equation
The state vector will be denoted $\mathbf{x}_k$, where $k$ indexes the (discrete) time step.

We will have a model that takes in the previous state and predicts the next state:

$$\mathbf{x}_k = f(\mathbf{x}_{k-1}) + \mathbf{w}_k, \qquad \mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, Q)$$

The $\mathbf{w}_k$ term is the model noise/uncertainty, with covariance $Q$. Even if $Q$ for your system is unknown, use something small like Q = 1e-7 * np.eye(NUM_VARIABLES) * TIME_STEP, as making it exactly zero can cause numerical errors.

When modelling ODEs with discrete steps, scale the model noise with the step size, i.e. Q = ... * TIME_STEP.

You can also think of this equation as telling us that for a fixed previous state $\mathbf{x}_{k-1}$, the next state is normally distributed: $\mathbf{x}_k \mid \mathbf{x}_{k-1} \sim \mathcal{N}(f(\mathbf{x}_{k-1}), Q)$.
Just for extra detail: if we explicitly want to get our new distribution, we use the Chapman-Kolmogorov equation (only valid under the Markov assumption):

$$p(\mathbf{x}_k) = \int p(\mathbf{x}_k \mid \mathbf{x}_{k-1})\, p(\mathbf{x}_{k-1})\, d\mathbf{x}_{k-1}$$

This is a generalisation of the discrete case (the law of total probability):

$$P(\mathbf{x}_k) = \sum_{\mathbf{x}_{k-1}} P(\mathbf{x}_k \mid \mathbf{x}_{k-1})\, P(\mathbf{x}_{k-1})$$
We almost never actually use this integral directly. We either use it implicitly (as with the linear Kalman filter) or approximate it, directly or indirectly, with Monte-Carlo sampling as in the ensemble Kalman filter.
Observations
Don’t worry about this too much, but when we make observations of the state we write them as:

$$\mathbf{y}_k = H\mathbf{x}_k + \mathbf{v}_k, \qquad \mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, R)$$

The intuition for what this equation represents is similar to the model equation: for a fixed true state $\mathbf{x}_k$, the observation is distributed $\mathcal{N}(H\mathbf{x}_k, R)$. However, even though the state has some number of variables, we may only be able to observe some of them (or combinations of them), so the observation map $H$ takes a state vector to the space of observations and may be non-square.

The $\mathbf{v}_k$ term is the observation noise, with covariance $R$.
Kalman Filter
The Kalman filter is a subset of Bayesian data assimilation, but it covers both propagating our distribution through the model and the update step with Bayes’ theorem. It is enough on its own to work for the project. We could also look at particle filters, which replace Kalman filters, or 4D variational methods, which can be built on top of Kalman filters (although I don’t really know anything about these).
There are many variations on the Kalman filter: extended, unscented, ensemble, etc. They achieve the same thing as a regular Kalman filter but deal with more complex cases and different assumptions, e.g. non-linear models and/or non-Gaussian distributions, in varying ways.
Linear Kalman Filter
This is the original Kalman filter. It assumes that every distribution is Gaussian and that the model is linear.
Propagation
For our model equation we have the linear form

$$\mathbf{x}_k = M\mathbf{x}_{k-1} + \mathbf{w}_k, \qquad \mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, Q)$$

where $M$ is the state transition matrix.

The nice thing about every distribution being Gaussian is that if a random variable is modelled by a Gaussian, then any linear (or affine) transformation of that variable is also modelled by a Gaussian.

So if $\mathbf{x}_{k-1} \sim \mathcal{N}(\boldsymbol{\mu}_{k-1}, P_{k-1})$, then

$$\mathbf{x}_k \sim \mathcal{N}\left(M\boldsymbol{\mu}_{k-1},\; M P_{k-1} M^\top + Q\right)$$

(the extra $Q$ comes from adding the independent model noise $\mathbf{w}_k$).
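The propagation step can be sketched as follows (the constant-velocity toy model and the specific numbers are illustrative assumptions, not from the project):

```python
import numpy as np

def kalman_predict(mu, P, M, Q):
    """Propagate a Gaussian state estimate N(mu, P) through a linear model.

    Model: x_k = M x_{k-1} + w_k,  w_k ~ N(0, Q)
    """
    mu_new = M @ mu
    P_new = M @ P @ M.T + Q
    return mu_new, P_new

# Toy example: constant-velocity model with state (position, velocity)
dt = 0.1
M = np.array([[1.0, dt],
              [0.0, 1.0]])
Q = 1e-7 * np.eye(2) * dt        # small model noise, as suggested earlier

mu, P = np.array([0.0, 1.0]), np.eye(2)
mu, P = kalman_predict(mu, P, M, Q)
# The mean moves forward under the model; the covariance grows by Q
```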
Update
When we receive an observation we want to update our distribution with Bayes’ theorem in the following form:

$$p(\mathbf{x}_k \mid \mathbf{y}_{1:k}) = \frac{p(\mathbf{y}_k \mid \mathbf{x}_k)\; p(\mathbf{x}_k \mid \mathbf{y}_{1:k-1})}{p(\mathbf{y}_k \mid \mathbf{y}_{1:k-1})}$$

Here the colon in the subscript denotes a list, e.g. $\mathbf{y}_{1:k} = (\mathbf{y}_1, \dots, \mathbf{y}_k)$.

- Posterior - $p(\mathbf{x}_k \mid \mathbf{y}_{1:k})$. This is basically updating our distribution on the state to include the information gained by our measurements.
- Likelihood - $p(\mathbf{y}_k \mid \mathbf{x}_k)$. This is the probability of our measurement if $\mathbf{x}_k$ were the real state. It comes directly from the observation model/noise: since $\mathbf{y}_k = H\mathbf{x}_k + \mathbf{v}_k$, we have $\mathbf{y}_k \mid \mathbf{x}_k \sim \mathcal{N}(H\mathbf{x}_k, R)$.
- Prior - $p(\mathbf{x}_k \mid \mathbf{y}_{1:k-1})$. This is the probability distribution we had before seeing the current measurement. It was the result of propagating the posterior at $k-1$ through our model, as we saw earlier: our current belief, modelled by $\mathcal{N}(\boldsymbol{\mu}^f, P^f)$ (the forecast).
- Evidence - $p(\mathbf{y}_k \mid \mathbf{y}_{1:k-1})$. The probability of observing $\mathbf{y}_k$ under the model. In practice we often don’t compute/care about this.
Substituting in our distributions and ignoring the normalising evidence constant we get

$$p(\mathbf{x}_k \mid \mathbf{y}_{1:k}) \propto \mathcal{N}(\mathbf{y}_k;\; H\mathbf{x}_k,\, R)\;\; \mathcal{N}(\mathbf{x}_k;\; \boldsymbol{\mu}^f,\, P^f)$$
The following is a lot of algebra. You don’t really need to know it - we have an exact formula for the above expression, so we can easily find our posterior - but for completeness here is a sketch of the proof:
Also note that we are actually multiplying the probability density functions here; this is NOT the same as finding the distribution of the product of two normally distributed variables.
The product of two normal densities is proportional to a normal density, so using this fact we don’t have to compute the product directly; we can use the formula

$$\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_1, \Sigma_1)\; \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_2, \Sigma_2) \propto \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma)$$

with $\Sigma = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}$ and $\boldsymbol{\mu} = \Sigma\,(\Sigma_1^{-1}\boldsymbol{\mu}_1 + \Sigma_2^{-1}\boldsymbol{\mu}_2)$.

However our two normals are currently in different variables ($\mathbf{x}_k$ in the prior, $\mathbf{y}_k = H\mathbf{x}_k + \mathbf{v}_k$ in the likelihood), so we first have to map everything into the state space through $H$ before combining them.

Putting this together we get the update equations

$$K = P^f H^\top \left(H P^f H^\top + R\right)^{-1}$$
$$\boldsymbol{\mu}^a = \boldsymbol{\mu}^f + K\left(\mathbf{y}_k - H\boldsymbol{\mu}^f\right)$$
$$P^a = (I - KH)\,P^f$$

where $K$ is called the Kalman gain, and the superscripts $f$ and $a$ denote the forecast (prior) and analysis (posterior).
The innovation vector $\mathbf{d} = \mathbf{y}_k - H\boldsymbol{\mu}^f$ is effectively the error in our prediction: the difference between what we observed and what our prior mean says we should have observed.
This exact formula can also be derived without Bayes’ theorem, by first assuming that we have a linear update of the form $\boldsymbol{\mu}^a = \boldsymbol{\mu}^f + K(\mathbf{y}_k - H\boldsymbol{\mu}^f)$ and then choosing the gain $K$ that minimises the posterior variance.
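The update formulae can be sketched as follows (the toy state, observation map, and noise values are illustrative assumptions; np.linalg.solve is used instead of an explicit inverse, as recommended earlier):

```python
import numpy as np

def kalman_update(mu, P, y, H, R):
    """Linear Kalman update of a Gaussian prior N(mu, P) with observation y.

    Observation model: y = H x + v,  v ~ N(0, R)
    """
    d = y - H @ mu                       # innovation (prediction error)
    S = H @ P @ H.T + R                  # innovation covariance
    # Kalman gain K = P H^T S^-1, computed via solve:
    # (S^-1 H P)^T = P H^T S^-1 since S and P are symmetric
    K = np.linalg.solve(S, H @ P).T
    mu_new = mu + K @ d
    P_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, P_new

# Toy example: 2-variable state, but we observe only the first variable
mu = np.array([0.0, 1.0])
P = np.eye(2)
H = np.array([[1.0, 0.0]])       # non-square observation map
R = np.array([[0.5]])            # observation noise covariance
y = np.array([0.3])

mu_post, P_post = kalman_update(mu, P, y, H, R)
# The posterior mean moves part of the way towards the observation, and the
# observed component's variance shrinks; the unobserved one is untouched here
```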
Ensemble Kalman Filter
The ensemble Kalman filter is one way to use the Kalman filter we’ve just derived but with different assumptions. It works with non-linear models, and since non-linearities do not preserve Gaussians, our state distribution will generally be non-Gaussian. However, we still model the observation noise and model uncertainty as Gaussians; it’s just that our actual state estimate distribution can’t be assumed Gaussian.
Propagate
To get around this, instead of transforming a whole distribution we take random samples (called particles) from it. Now that we have an array of single fixed state vectors (particles), we apply our state transition to each of them individually to get a new set of particles.
But since we only have finitely many particles, we can only approximate the features of our distribution. Instead of having one mean that we can nicely propagate through the model like in the linear Kalman filter, we have an array of particles and we take their mean, which is simply the discrete arithmetic mean.
Update
For the update step we will use the linear Kalman filter like before by treating our distribution as a gaussian. To do this we simply compute the mean and covariance of the particles and then treat our distribution as though it was modelled by a gaussian with that mean and covariance. To actually perform this update we compute the Kalman gain with the exact same formula but using the covariance of the particle distribution as our prior covariance.
Then for each particle $\mathbf{x}_i$ we perform the Kalman update individually:

$$\mathbf{x}_i^a = \mathbf{x}_i^f + K\left(\mathbf{y}_k + \boldsymbol{\epsilon}_i - H\mathbf{x}_i^f\right), \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, R)$$

Here, although we have only one measurement $\mathbf{y}_k$, we give each particle its own freshly perturbed copy of it. Without this perturbation the updated ensemble would end up with too little spread (its covariance would be smaller than the correct posterior covariance); this is known as the “perturbed observations” variant of the ensemble Kalman filter.
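Putting this together, a minimal sketch of the perturbed-observations ensemble update (the ensemble size, toy system, and noise values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def enkf_update(particles, y, H, R):
    """Ensemble Kalman update with perturbed observations.

    particles: (N, n) array, one state vector per row.
    """
    N = particles.shape[0]
    P = np.cov(particles, rowvar=False)   # sample covariance as the prior

    S = H @ P @ H.T + R                   # innovation covariance
    K = np.linalg.solve(S, H @ P).T       # Kalman gain P H^T S^-1

    # Each particle gets its own perturbed copy of the single measurement y
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=N)
    innovations = y_pert - particles @ H.T
    return particles + innovations @ K.T

# Toy example: 2-variable state, observe only the first component
particles = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=500)
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
y = np.array([0.3])

updated = enkf_update(particles, y, H, R)
# The ensemble mean of the observed component moves towards y, and its
# spread shrinks, mirroring the linear Kalman update
```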