B.1.1 Ordinary and Simple Kriging#
Table of Contents:#
Read point data,
Set semivariogram model,
Set Ordinary Kriging and Simple Kriging models,
Predict values at unknown locations.
Introduction#
This tutorial will teach us how to perform spatial interpolation with Ordinary and Simple Kriging. We will compare a different number of ranges, and test outcomes of processing with the root mean squared error.
Ordinary and Simple Kriging is the simplest form of Kriging, but they’re still powerful techniques.
We use DEM data which is stored in a file samples/point_data/txt/pl_dem_epsg2180.txt
.
Import packages#
[1]:
from typing import List
import numpy as np
from pyinterpolate import read_txt, build_experimental_variogram, build_theoretical_variogram, kriging
1) Read point data#
[2]:
dem = read_txt('samples/point_data/txt/pl_dem_epsg2180.txt')
dem[: 3]
[2]:
array([[2.37685325e+05, 5.45416708e+05, 5.12545509e+01],
[2.37674140e+05, 5.45209671e+05, 4.89582825e+01],
[2.37449255e+05, 5.41045935e+05, 1.68178635e+01]])
In the beginning, we remove 70 % of our points from the dataset, and we will leave them as a test set to estimate how good our models are.
[3]:
def create_train_test(dataset: np.ndarray, training_set_ratio=0.3):
"""
Function divides base dataset into a training and a test set.
Parameters
----------
dataset : np.ndarray
training_set_ratio : float, default = 0.3
Returns
-------
training_set, test_set : List[np.ndarray]
"""
np.random.seed(101) # To ensure that we will get the same results every time
indexes_of_training_set = np.random.choice(range(len(dataset) - 1), int(training_set_ratio * len(dataset)), replace=False)
training_set = dataset[indexes_of_training_set]
validation_set = np.delete(dataset, indexes_of_training_set, 0)
return training_set, validation_set
train_set, test_set = create_train_test(dem)
We have removed a subset of points from a dataset to be sure that Kriging is working. In this scenario, 70% of available points are removed, but in real-world cases, you will probably have even fewer points to perform estimations, down to the 1% of known values.
Function create_train_test()
divides our dataset into two subsets:
training set used for semivariogram model derivation,
test set used for the model error calculation.
Points for each set are chosen randomly to avoid bias related to the geographical location. Let’s imagine we have a sorted list of Digital Elevation Model points. The western part of our measurements covers a mountain, and the eastern part is plain. When we use the part of the west for modeling and the east part for tests, we are going directly into a catastrophe! That’s why performing multiple random sampling and testing various realizations from our data is better. We will prepare only one realization with a fixed random seed, but you should be aware that in the real-world analysis, we must test more realizations (e.g., set multiple random seeds).
2) Set Semivariogram model#
In this step, we are going to create experimental and theoretical semivariograms.
[4]:
# Prepare experimental semivariogram
step_radius = 500 # meters
max_range = 10000 # meters
exp_semivar = build_experimental_variogram(input_array=train_set, step_size=step_radius, max_range=max_range)
[5]:
# Plot experimental semivariogram
exp_semivar.plot()
[7]:
# Fit data into a theoretical model
semivar = build_theoretical_variogram(experimental_variogram=exp_semivar,
model_name='linear',
sill=exp_semivar.variance,
rang=max_range)
[8]:
semivar.plot()
[9]:
print(semivar)
* Selected model: Linear model
* Nugget: 0.0
* Sill: 493.778772754868
* Range: 10000
* Spatial Dependency Strength is Unknown
* Mean Bias: 21.92700357030904
* Mean RMSE: 37.17324824914244
* Error-lag weighting method: None
+--------+--------------------+--------------------+---------------------+
| lag | theoretical | experimental | bias (y-y') |
+--------+--------------------+--------------------+---------------------+
| 500.0 | 24.6889386377434 | 15.614084077885668 | -9.074854559857734 |
| 1000.0 | 49.3778772754868 | 34.151521896711735 | -15.226355378775068 |
| 1500.0 | 74.0668159132302 | 56.61060460335831 | -17.45621130987189 |
| 2000.0 | 98.7557545509736 | 81.75250857008155 | -17.00324598089206 |
| 2500.0 | 123.444693188717 | 107.39147411240353 | -16.053219076313468 |
| 3000.0 | 148.1336318264604 | 138.01527174633642 | -10.118360080123978 |
| 3500.0 | 172.82257046420378 | 170.17760461535644 | -2.6449658488473347 |
| 4000.0 | 197.5115091019472 | 201.7643955690108 | 4.2528864670635755 |
| 4500.0 | 222.2004477396906 | 235.6737606045218 | 13.4733128648312 |
| 5000.0 | 246.889386377434 | 274.49115425599933 | 27.60176787856534 |
| 5500.0 | 271.5783250151774 | 304.81328096735956 | 33.23495595218213 |
| 6000.0 | 296.2672636529208 | 341.13970843945583 | 44.87244478653503 |
| 6500.0 | 320.9562022906642 | 367.8150566677254 | 46.85885437706122 |
| 7000.0 | 345.64514092840756 | 403.5663320998641 | 57.92119117145654 |
| 7500.0 | 370.334079566151 | 430.51175648403176 | 60.177676917880774 |
| 8000.0 | 395.0230182038944 | 456.5353159209564 | 61.51229771706198 |
| 8500.0 | 419.7119568416378 | 472.6792052919712 | 52.967248450333386 |
| 9000.0 | 444.4008954793812 | 496.87388696572737 | 52.47299148634619 |
| 9500.0 | 469.08983411712455 | 517.9344861183605 | 48.84465200123594 |
+--------+--------------------+--------------------+---------------------+
3) Set Ordinary Kriging and Simple Kriging models#
This is the essential step of our tutorial. We’ve set our semivariogram model and can now predict unknown values. We will “predict” a known and arbitrary point in the first run. It is a test of Kriging, which should work as an unbiased linear estimator. Thus it should return the exact value if we pass into it a point used for training. In the second step, we will try to guess values at unknown locations and calculate the Root Mean Squared Error (RMSE) of interpolated values.
We can use the same kriging()
function for both kriging types. It takes those arguments:
observations
: array with known points,theoretical_model
: fittedTheoreticalVariogram
model,points
: points to interpolate values,how
:ok
- ordinary kriging,sk
- simple kriging,neighbors_range
:None
orfloat
, the maximum distance where we search for point neighbors. IfNone
is given, the range is selected from thetheoretical_model
rang
attribute.no_neighbors
:int
, number of neighbors to estimate unknown value.use_all_neighbors_in_range
:bool
, default isFalse
.True
: if the number of neighbors within theneighbors_range
is greater than thenumber_of_neighbors
, then take all of them for modeling.sk_mean
:None
orfloat
, the mean value of a process over a study area. It should be known before processing, and that’s why Simple Kriging has a limited number of applications. You must have multiple samples and a well-known area to use this parameter.allow_approx_solutions
: Allows the approximation of kriging weights based on the OLS algorithm. Not recommended to set it toTrue
if you don’t know what you are doing! By default, it is set toFalse
,number_of_workers
: if we pass more than 10k points to interpolate, setting this parameter to the number of your CPU workers (or -1) will speed up calculations.
We use only the first four parameters and sk_mean
when we perform Simple Kriging. Let’s start! The first step is an interpolation of the value known by our model.
[10]:
# Select one known value
known_value = train_set[10]
known_value
[10]:
array([2.49304390e+05, 5.40766289e+05, 2.03649998e+01])
[11]:
# Predict with Ordinary Kriging
ok_interpolation = kriging(train_set, semivar, [known_value[:-1]])
ok_interpolation
100%|██████████| 1/1 [00:00<00:00, 372.13it/s]
[11]:
array([[2.03649998e+01, 0.00000000e+00, 2.49304390e+05, 5.40766289e+05]])
The first value is our prediction: it is precisely the same as the input in the training set! So far, the algorithm has worked well. The second value is prediction variance error, equal to zero - we are sure it is the exact value.
Simple Kriging is slightly different than Ordinary Kriging, and we must set the process mean to retrieve valid results. It is rarely the case. That’s why Ordinary Kriging is the first choice for many applications. We know the global mean because we have the whole dataset, but in the real-world scenario, we cannot divide the set into training and test sets and then get the mean from the entire dataset - it is an information leak from the test set into a model!
[12]:
sk_interpolation = kriging(train_set, semivar, [known_value[:-1]], how='sk', sk_mean=float(np.mean(dem)))
sk_interpolation
100%|██████████| 1/1 [00:00<00:00, 576.93it/s]
[12]:
array([[2.03649998e+01, 0.00000000e+00, 2.49304390e+05, 5.40766289e+05]])
The Simple Kriging algorithm returns the same output as the Ordinary Kriging: [prediction, error variance, pt x, pt y]
. And as with Ordinary Kriging, Simple Kriging has returned the same value as the actual value fed to the algorithm.
4) Predict values at unknown locations#
Using kriging for interpolation of known points values is pointless, and it is an excellent tool for a testing algorithm but nothing more. Here we will interpolate with Kriging, and we will interpolate values at unknown locations. Additionally, we will control the no_neighbors
parameter to check how it influences predictions.
In the first step, we will create a simple function to test our kriging results and calculate RMSE.
[13]:
def test_kriging(train_data, variogram_model, ktype, test_values, number_of_neighbors, sk_mean_value=None):
predictions = kriging(observations=train_data,
theoretical_model=variogram_model,
points=test_values[:, :-1],
how=ktype,
no_neighbors=number_of_neighbors,
number_of_workers=1,
sk_mean=sk_mean_value)
mse = np.mean((predictions[:, 0] - test_values[:, -1])**2)
rmse = np.sqrt(mse)
return rmse
[14]:
# Number of neighbors
no_of_n = [4, 8, 16, 32, 64, 128, 256]
print('Ordinary Kriging: tests')
print('')
for nn in no_of_n:
print('Number of neighbors:', nn)
rmse_pred = test_kriging(train_data=train_set, variogram_model=semivar, ktype='ok', test_values=test_set, number_of_neighbors=nn)
print('RMSE:', rmse_pred)
print('')
Ordinary Kriging: tests
Number of neighbors: 4
100%|██████████| 4826/4826 [00:01<00:00, 3356.06it/s]
RMSE: 3.404341396262182
Number of neighbors: 8
100%|██████████| 4826/4826 [00:01<00:00, 3344.89it/s]
RMSE: 3.2909488461696372
Number of neighbors: 16
100%|██████████| 4826/4826 [00:01<00:00, 3243.19it/s]
RMSE: 3.2674275543792795
Number of neighbors: 32
100%|██████████| 4826/4826 [00:19<00:00, 253.42it/s]
RMSE: 3.2608735889738663
Number of neighbors: 64
100%|██████████| 4826/4826 [00:55<00:00, 86.70it/s]
RMSE: 3.2565732386644757
Number of neighbors: 128
100%|██████████| 4826/4826 [01:05<00:00, 73.89it/s]
RMSE: 3.2547868359770336
Number of neighbors: 256
100%|██████████| 4826/4826 [00:42<00:00, 114.38it/s]
RMSE: 3.2550170696805005
[15]:
print('Simple Kriging: tests')
print('')
sk_mean = np.mean(dem[:, -1])
for nn in no_of_n:
print('Number of neighbors:', nn)
rmse_pred = test_kriging(train_data=train_set, variogram_model=semivar, ktype='sk', test_values=test_set, number_of_neighbors=nn, sk_mean_value=sk_mean)
print('RMSE:', rmse_pred)
print('')
Simple Kriging: tests
Number of neighbors: 4
100%|██████████| 4826/4826 [00:01<00:00, 3898.96it/s]
RMSE: 8.697161939348147
Number of neighbors: 8
100%|██████████| 4826/4826 [00:01<00:00, 3866.67it/s]
RMSE: 3.650066190084651
Number of neighbors: 16
100%|██████████| 4826/4826 [00:01<00:00, 3833.67it/s]
RMSE: 3.348632949010878
Number of neighbors: 32
100%|██████████| 4826/4826 [00:06<00:00, 729.77it/s]
RMSE: 3.2785837126433313
Number of neighbors: 64
100%|██████████| 4826/4826 [00:10<00:00, 477.52it/s]
RMSE: 3.261651162610457
Number of neighbors: 128
100%|██████████| 4826/4826 [00:09<00:00, 510.67it/s]
RMSE: 3.255919635781099
Number of neighbors: 256
100%|██████████| 4826/4826 [00:26<00:00, 185.51it/s]
RMSE: 3.254332343473019
Usually, Simple Kriging will give us worse results than Ordinary Kriging because we need to know the process mean to build a valid Simple Kriging model. Only if we know the process mean we can use Simple Kriging with a large number of neighbors. We see that the results of Simple Kriging are better when we pass more values into it. You shouldn’t be shocked. Simple Kriging discovers the global mean and utilizes this information when building a Kriging system.
On the other hand, Ordinary Kriging is a Swiss knife in our geostatistical toolbox. It works well even if we don’t know the process mean, and it should be our first-choice technique.
Let’s look into processing times with a growing number of neighbors. We can theoretically take more and more neighbors, but is it worth it?
Small dataset: with small datasets (up to 1000 points), we may consider using many neighbors. It shouldn’t be a problem for processing time.
Large dataset: more than 5000 points? Then we must take into account two scenarios:
A single report or only a one-time analysis - then use as many neighbors as it is possible (within range);
Near-real time and systematic analyses - use fewer neighbors, between 8 - 128. I usually use 32 neighbors for datasets with more than 1000 point pairs.
Where to go from here?#
B.1.2 Kriging Benchmarking
B.1.3 Outliers and Kriging Model
B.2.1 Directional Ordinary Kriging
C.1.1 Blocks to Points Interpolation with Ordinary Kriging
Changelog#
Date |
Change description |
Author |
---|---|---|
2023-08-22 |
The tutorial was refreshed and set along with the 0.5.0 version of the package |
@SimonMolinsky |
2023-04-15 |
Tutorial debugged and updated to the 0.4.1 version of the package |
@SimonMolinsky |
2022-11-05 |
Tutorial updated for the 0.3.5 version of the package |
@SimonMolinsky |
2022-08-20 |
The tutorial is updated to the version 0.3.0 of the package |
@SimonMolinsky |
2021-12-14 |
Sill selection was upgraded: now optimal sill is derived from the grid search within |
@SimonMolinsky |
2021-12-13 |
Changed behavior of |
@SimonMolinsky |
2021-12-11 |
Behavior of |
@SimonMolinsky |
2021-05-11 |
Refactored TheoreticalSemivariogram class |
@SimonMolinsky |
2021-04-03 |
Simple Kriging |
@SimonMolinsky |
2021-03-31 |
Update related to the change of semivariogram weighting. Updated point data source. |
@SimonMolinsky |
[ ]: