B.1.2 Kriging Benchmarking#
Table of Contents:#
Read point data,
Divide dataset into two sets: modeling and validation set,
Perform IDW and evaluate it,
Perform variogram modeling on the modeling set,
Validate Kriging and compare Kriging and IDW validation results,
Bonus scenario: only 2% of values are known!
Introduction#
In this tutorial, we will learn how to validate our Kriging model. We’ll compare it to the Inverse Distance Weighting function, where the unknown point value is interpolated as the weighted mean of its neighbors. Weights are inversely proportional to the distance between neighboring points, dropping faster with a higher power.
GENERAL FORM OF IDW
Where:
\(z(u)\): is the value at an unknown location,
\(i\): is an i-th known location,
\(z_{i}\): is a value at known location \(i\),
\(\lambda_{i}\): is a weight assigned to the known location \(i\).
WEIGHTING PARAMETER
Where:
\(d\): is a distance from the known point \(z_{i}\) to the unknown point \(z(u)\),
\(p\): is a hyperparameter that controls how strong a relationship is between a known and unknown point. You may set large \(p\) to establish a strong relationship between the closest points and the weak influence of distant points. On the other hand, you may set a small \(p\) to emphasize that the distance between points matters a little.
As you noticed IDW is a simple but powerful technique. Unfortunately, it has a significant drawback: we must set ``p`` - power - manually, which isn’t derived from the data and variogram. That’s why it can be used for other tasks. An example is to use IDW as a baseline for comparison to the different techniques.
Import packages#
[1]:
import numpy as np
from pyinterpolate import inverse_distance_weighting # function for idw
from pyinterpolate import read_txt
from pyinterpolate import build_experimental_variogram # experimental semivariogram
from pyinterpolate import build_theoretical_variogram, TheoreticalVariogram # theoretical models
from pyinterpolate import kriging # kriging models
1) Read point data#
[4]:
dem = read_txt('samples/point_data/txt/pl_dem_epsg2180.txt')
dem[:3]
[4]:
array([[2.37685325e+05, 5.45416708e+05, 5.12545509e+01],
[2.37674140e+05, 5.45209671e+05, 4.89582825e+01],
[2.37449255e+05, 5.41045935e+05, 1.68178635e+01]])
2) Divide the dataset into two sets: modeling and validation set#
In this step, we will divide our dataset into two sets:
modeling set (10%): points used for variogram modeling,
validation set (90%): points used for prediction and results validation.
The baseline dataset will be divided randomly.
[5]:
# Create modeling and validation sets
def create_model_validation_sets(dataset: np.array, training_set_ratio=0.5, rnd_seed=101):
np.random.seed(rnd_seed)
indexes_of_training_set = np.random.choice(range(len(dataset) - 1), int(training_set_ratio * len(dataset)), replace=False)
training_set = dataset[indexes_of_training_set]
validation_set = np.delete(dataset, indexes_of_training_set, 0)
return training_set, validation_set
known_points, unknown_points = create_model_validation_sets(dem)
3) Perform IDW and evaluate it#
Inverse Distance Weighting doesn’t require variogram modeling or other steps. We pass power that affects weights in a denominator. Things to remember are:
Large
power
-> closer neighbors are more important,power
which is close to the zero -> all neighbors are important, and we assume that the distant process has the same effect on our variable as the closest events.
[6]:
IDW_POWER = 2
NUMBER_OF_NEIGHBOURS = 128
idw_predictions = []
for pt in unknown_points:
idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, IDW_POWER)
idw_predictions.append(idw_result)
[7]:
# Evaluation
idw_rmse = np.mean(np.sqrt((unknown_points[:, -1] - np.array(idw_predictions))**2))
print(f'Root Mean Squared Error of prediction with IDW is {idw_rmse}')
Root Mean Squared Error of prediction with IDW is 2.5045203528088513
Clarification: Obtained Root Mean Squared Error could serve as a baseline for further model development. To build a better reference, we create four IDW models of powers:
0.5
1
2
4
[8]:
IDW_POWERS = [0.5, 1, 2, 4]
idw_rmse = {}
for pw in IDW_POWERS:
results = []
for pt in unknown_points:
idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, pw)
results.append(idw_result)
idw_rmse[pw] = np.mean(np.sqrt((unknown_points[:, -1] - np.array(results))**2))
[9]:
for pw in IDW_POWERS:
print(f'Root Mean Squared Error of prediction with IDW of power {pw} is {idw_rmse[pw]:.4f}')
Root Mean Squared Error of prediction with IDW of power 0.5 is 4.4434
Root Mean Squared Error of prediction with IDW of power 1 is 3.8513
Root Mean Squared Error of prediction with IDW of power 2 is 2.5045
Root Mean Squared Error of prediction with IDW of power 4 is 1.8198
4) Perform variogram modeling on the modeling set#
In this step, we will go through semivariogram modeling for Kriging interpolation.
[10]:
max_range = 10000
step_size = 500
exp_semivar = build_experimental_variogram(known_points, step_size=step_size, max_range=max_range)
[11]:
semivar = build_theoretical_variogram(exp_semivar, model_name='circular', sill=exp_semivar.variance, rang=max_range)
[12]:
semivar.plot()
5) Validate Kriging and compare Kriging and IDW validation results#
Lastly, we perform Kriging interpolation and compare results to the IDW models. We use all points to weight values at unknown locations and the semivariogram model we chose in the previous step.
[13]:
# Set Kriging model
predictions = kriging(observations=known_points, theoretical_model=semivar, points=unknown_points[:, :-1], no_neighbors=128)
100%|██████████| 3447/3447 [01:17<00:00, 44.31it/s]
[14]:
# Evaluation
kriging_rmse = np.mean(np.sqrt((unknown_points[:, -1] - np.array(predictions[:, 0]))**2))
print(f'Root Mean Squared Error of prediction with Kriging is {kriging_rmse}')
Root Mean Squared Error of prediction with Kriging is 1.6638061333290732
Clarification: Kriging is better than any of the IDW models, and we may assume that our modeling approach gives us better insights into the spatial process we are observing. But this is not the end. Let’s consider a more complex scenario!
6) Bonus scenario: only 2% of values are known!#
Data sampled from the real world is less good than the sample from the tutorial. It is too expensive to densely sample every location, and usually, we get only a tiny percent of the area covered by data. That’s why it is good to compare IDW vs. Kriging in this scenario! We repeat steps 1-5 with a change in the division for the modeling/validation set. (I encourage you to try it alone and compare your code and results with those in this notebook).
[15]:
# Data preparation
known_points, unknown_points = create_model_validation_sets(dem, 0.02, rnd_seed=112)
[16]:
IDW_POWERS = [0.5, 1, 2, 4, 6, 8]
idw_rmse = {}
for pw in IDW_POWERS:
results = []
for pt in unknown_points:
idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, pw)
results.append(idw_result)
idw_rmse[pw] = np.mean(np.sqrt((unknown_points[:, -1] - np.array(results))**2))
[17]:
for pw in IDW_POWERS:
print(f'Root Mean Squared Error of prediction with IDW of power {pw} is {idw_rmse[pw]:.4f}')
Root Mean Squared Error of prediction with IDW of power 0.5 is 17.1027
Root Mean Squared Error of prediction with IDW of power 1 is 13.8879
Root Mean Squared Error of prediction with IDW of power 2 is 7.1886
Root Mean Squared Error of prediction with IDW of power 4 is 5.1183
Root Mean Squared Error of prediction with IDW of power 6 is 5.2371
Root Mean Squared Error of prediction with IDW of power 8 is 5.3415
[18]:
# Variogram
max_range = 10000
step_size = 500
exp_semivar = build_experimental_variogram(known_points, step_size=step_size, max_range=max_range)
[19]:
semivar = TheoreticalVariogram()
semivar.autofit(exp_semivar, model_name='spherical')
semivar.plot()
[20]:
# Set Kriging model
predictions = kriging(observations=known_points, theoretical_model=semivar, points=unknown_points[:, :-1], no_neighbors=128, use_all_neighbors_in_range=False)
100%|██████████| 6756/6756 [00:58<00:00, 115.93it/s]
[21]:
# Evaluation
kriging_rmse = np.mean(np.sqrt((unknown_points[:, -1] - np.array(predictions[:, 0]))**2))
print(f'Root Mean Squared Error of prediction with Kriging is {kriging_rmse}')
Root Mean Squared Error of prediction with Kriging is 4.886647202272596
[22]:
# Comparison
for pw in IDW_POWERS:
print(f'Root Mean Squared Error of prediction with IDW of power {pw} is {idw_rmse[pw]:.4f}')
print(f'Root Mean Squared Error of prediction with Kriging is {kriging_rmse:.4f}')
Root Mean Squared Error of prediction with IDW of power 0.5 is 17.1027
Root Mean Squared Error of prediction with IDW of power 1 is 13.8879
Root Mean Squared Error of prediction with IDW of power 2 is 7.1886
Root Mean Squared Error of prediction with IDW of power 4 is 5.1183
Root Mean Squared Error of prediction with IDW of power 6 is 5.2371
Root Mean Squared Error of prediction with IDW of power 8 is 5.3415
Root Mean Squared Error of prediction with Kriging is 4.8866
Your results may be different (if you set another random seed), but in most cases, Kriging will be better than IDW or very close to the best results from IDW. Even more important is that for the single data source with a low number of samples, we don’t have the opportunity to perform the validation step, and we’re unable to guess how big the power parameter should be. With Kriging, we model variogram, and voila! - model works.
Where to go from here?#
B.1.3 Outliers and Kriging Model
B.2.1 Directional Ordinary Kriging
C.1.1 Blocks to Points Interpolation with Ordinary Kriging
C.1.2 Semivariogram Regularization
Changelog#
Date |
Change description |
Author |
---|---|---|
2023-08-23 |
The tutorial was refreshed and set along with the 0.5.0 version of the package |
@SimonMolinsky |
2023-04-15 |
Tutorial debugged and updated to the 0.4.1 version of the package |
@SimonMolinsky |
2022-11-05 |
Tutorial updated for the 0.3.5 version of the package |
@SimonMolinsky |
2022-08-23 |
The new version of tutorial for the 0.3.0 version of a package |
@SimonMolinsky |
2021-12-14 |
Sill selection was upgraded: now optimal sill is derived from the grid search within |
@SimonMolinsky |
2021-12-13 |
Changed behavior of |
@SimonMolinsky |
2021-12-08 |
Behavior of |
@SimonMolinsky |
2021-05-12 |
First version of tutorial |
@SimonMolinsky |
[ ]: