B.1.2 Kriging Benchmarking#

Table of Contents:#

  1. Read point data,

  2. Divide dataset into two sets: modeling and validation set,

  3. Perform IDW and evaluate it,

  4. Perform variogram modeling on the modeling set,

  5. Validate Kriging and compare Kriging and IDW validation results,

  6. Bonus scenario: only 2% of values are known!

Introduction#

In this tutorial, we will learn how to validate our Kriging model. We'll compare it to the Inverse Distance Weighting function, where the unknown point value is interpolated as the weighted mean of its neighbors. Weights are inversely proportional to the distance from the known points to the interpolated location, and they drop faster with a higher power.

  1. GENERAL FORM OF IDW

\[z(u) = \frac{\sum_{i}\lambda_{i} \cdot z_{i}}{\sum_{i}\lambda_{i}}\]

Where:

  • \(z(u)\): is the value at an unknown location,

  • \(i\): is the i-th known location,

  • \(z_{i}\): is a value at known location \(i\),

  • \(\lambda_{i}\): is a weight assigned to the known location \(i\).

  2. WEIGHTING PARAMETER

\[\lambda_{i} = \frac{1}{d^{p}_{i}}\]

Where:

  • \(d\): is the distance from the known point \(z_{i}\) to the unknown point \(z(u)\),

  • \(p\): is a hyperparameter that controls how quickly a known point's influence fades with distance. Set a large \(p\) to give the closest points a strong influence and distant points a negligible one; set a small \(p\) when distance should matter little and far-away points should still contribute.


As you may have noticed, IDW is a simple but powerful technique. Unfortunately, it has a significant drawback: we must set the power ``p`` manually; it is not derived from the data or a variogram. That's why IDW is better suited to other tasks, for example serving as a baseline for comparison with different techniques.
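To make the formulas above concrete, here is a minimal plain-NumPy sketch of IDW; the helper ``idw_estimate`` is ours for illustration only and is not part of pyinterpolate:

import numpy as np

def idw_estimate(known_points, u, power=2):
    # known_points: rows of [x, y, value]; u: [x, y] of the unknown location
    distances = np.sqrt(((known_points[:, :2] - u) ** 2).sum(axis=1))
    if np.any(distances == 0):
        # the unknown location coincides with a known point - return its value
        return known_points[np.argmin(distances), -1]
    weights = 1 / distances ** power
    # weighted mean of the known values, as in the general form above
    return (weights * known_points[:, -1]).sum() / weights.sum()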

Import packages#

[1]:
import numpy as np

from pyinterpolate import inverse_distance_weighting  # function for idw
from pyinterpolate import read_txt
from pyinterpolate import build_experimental_variogram  # experimental semivariogram
from pyinterpolate import build_theoretical_variogram, TheoreticalVariogram  # theoretical models
from pyinterpolate import kriging  # kriging models

1) Read point data#

[4]:
dem = read_txt('samples/point_data/txt/pl_dem_epsg2180.txt')
dem[:3]
[4]:
array([[2.37685325e+05, 5.45416708e+05, 5.12545509e+01],
       [2.37674140e+05, 5.45209671e+05, 4.89582825e+01],
       [2.37449255e+05, 5.41045935e+05, 1.68178635e+01]])
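Each row is a single point: the first two columns are the coordinates (EPSG:2180, as the file name suggests), and the last column is the elevation value that we will interpolate.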

2) Divide the dataset into two sets: modeling and validation set#

In this step, we will divide our dataset into two sets:

  • modeling set (50%): points used for variogram modeling,

  • validation set (50%): points used for prediction and results validation.

The baseline dataset will be divided randomly.

[5]:
# Create modeling and validation sets

def create_model_validation_sets(dataset: np.ndarray, training_set_ratio=0.5, rnd_seed=101):
    """Randomly split the dataset rows into a modeling (training) set and a validation set."""
    np.random.seed(rnd_seed)

    # Draw unique row indices for the modeling set
    indexes_of_training_set = np.random.choice(
        len(dataset), int(training_set_ratio * len(dataset)), replace=False
    )
    training_set = dataset[indexes_of_training_set]
    # Every row that was not drawn goes to the validation set
    validation_set = np.delete(dataset, indexes_of_training_set, axis=0)
    return training_set, validation_set

known_points, unknown_points = create_model_validation_sets(dem)
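As a quick sanity check, we can print the sizes of both sets; with the default 50% ratio they should be nearly equal:

print(len(known_points), len(unknown_points))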

3) Perform IDW and evaluate it#

Inverse Distance Weighting doesn't require variogram modeling or any other preparatory steps. We only pass the power, which shapes the weights in the denominator. Things to remember are (the short snippet after this list illustrates both effects):

  • a large power -> the closest neighbors dominate the prediction,

  • a power close to zero -> all neighbors matter almost equally, and we assume that distant points affect our variable as strongly as the closest ones.
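To build intuition, we can print the raw weights \(1/d^{p}\) for a few distances; this snippet is purely illustrative and not part of the tutorial's workflow:

distances = np.array([1.0, 2.0, 4.0, 8.0])
for p in [0.5, 2, 8]:
    # a larger p makes the weights of distant points collapse toward zero
    print(p, 1 / distances ** p)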

[6]:
IDW_POWER = 2
NUMBER_OF_NEIGHBOURS = 128

idw_predictions = []

for pt in unknown_points:
    # pt[:-1] holds the coordinates; the last column is the true value
    idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, IDW_POWER)
    idw_predictions.append(idw_result)
[7]:
# Evaluation: mean absolute difference between the true and predicted values

idw_mae = np.mean(np.abs(unknown_points[:, -1] - np.array(idw_predictions)))
print(f'Mean Absolute Error of prediction with IDW is {idw_mae}')
Mean Absolute Error of prediction with IDW is 2.5045203528088513

Clarification: the obtained Mean Absolute Error can serve as a baseline for further model development. To build a better reference, we create four IDW models with the powers:

  1. 0.5

  2. 1

  3. 2

  4. 4

[8]:
IDW_POWERS = [0.5, 1, 2, 4]
idw_mae = {}

for pw in IDW_POWERS:
    results = []
    for pt in unknown_points:
        idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, pw)
        results.append(idw_result)
    idw_mae[pw] = np.mean(np.abs(unknown_points[:, -1] - np.array(results)))
[9]:
for pw in IDW_POWERS:
    print(f'Mean Absolute Error of prediction with IDW of power {pw} is {idw_mae[pw]:.4f}')
Mean Absolute Error of prediction with IDW of power 0.5 is 4.4434
Mean Absolute Error of prediction with IDW of power 1 is 3.8513
Mean Absolute Error of prediction with IDW of power 2 is 2.5045
Mean Absolute Error of prediction with IDW of power 4 is 1.8198

4) Perform variogram modeling on the modeling set#

In this step, we will go through semivariogram modeling for Kriging interpolation.
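Under the hood, the experimental semivariogram averages squared differences between pairs of points separated by (approximately) a given lag \(h\):

\[\gamma(h) = \frac{1}{2N(h)}\sum_{i=1}^{N(h)}\left(z(x_{i}) - z(x_{i}+h)\right)^{2}\]

Where \(N(h)\) is the number of point pairs separated by the lag \(h\). The ``step_size`` parameter sets the width of the lag bins, and ``max_range`` is the largest lag we analyze.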

[10]:
# Lag settings: analyze semivariances up to 10 km in 500 m bins
max_range = 10000
step_size = 500

exp_semivar = build_experimental_variogram(known_points, step_size=step_size, max_range=max_range)
[11]:
# Fit a circular model, fixing the sill at the sample variance and the range at max_range
semivar = build_theoretical_variogram(exp_semivar, model_name='circular', sill=exp_semivar.variance, rang=max_range)
[12]:
semivar.plot()
[Plot: circular theoretical variogram model fitted to the experimental semivariogram]

5) Validate Kriging and compare Kriging and IDW validation results#

Lastly, we perform Kriging interpolation and compare the results to the IDW models. We predict values at unknown locations from their 128 nearest known neighbors, using the semivariogram model fitted in the previous step.
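For reference, Ordinary Kriging estimates \(\hat{z}(u) = \sum_{i}\lambda_{i}z_{i}\), with weights obtained from the fitted semivariogram \(\gamma\) by solving the system:

\[\sum_{j}\lambda_{j}\gamma(x_{i}, x_{j}) + \mu = \gamma(x_{i}, u)\ \ \text{for each}\ i, \qquad \sum_{j}\lambda_{j} = 1\]

Where \(\mu\) is a Lagrange multiplier enforcing the unbiasedness constraint. Unlike the IDW power, nothing here is tuned by hand: the weights follow from the data through the variogram.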

[13]:
# Set Kriging model

predictions = kriging(observations=known_points, theoretical_model=semivar, points=unknown_points[:, :-1], no_neighbors=128)
100%|██████████| 3447/3447 [01:17<00:00, 44.31it/s]
[14]:
# Evaluation: column 0 of the Kriging output holds the predicted values

kriging_mae = np.mean(np.abs(unknown_points[:, -1] - predictions[:, 0]))
print(f'Mean Absolute Error of prediction with Kriging is {kriging_mae}')
Mean Absolute Error of prediction with Kriging is 1.6638061333290732

Clarification: Kriging beats every one of the IDW models, so we may assume that our modeling approach gives us better insight into the spatial process we are observing. But this is not the end. Let's consider a more complex scenario!

6) Bonus scenario: only 2% of values are known!#

Real-world samples are rarely as good as the one in this tutorial. Densely sampling every location is too expensive, and usually only a tiny percentage of the area is covered by data. That's why it is worth comparing IDW and Kriging in this scenario too! We repeat steps 1-5, changing only the modeling/validation split. (I encourage you to try it yourself first and then compare your code and results with this notebook.)

[15]:
# Data preparation

known_points, unknown_points = create_model_validation_sets(dem, 0.02, rnd_seed=112)
[16]:
IDW_POWERS = [0.5, 1, 2, 4, 6, 8]
idw_mae = {}

for pw in IDW_POWERS:
    results = []
    for pt in unknown_points:
        idw_result = inverse_distance_weighting(known_points, pt[:-1], NUMBER_OF_NEIGHBOURS, pw)
        results.append(idw_result)
    idw_mae[pw] = np.mean(np.abs(unknown_points[:, -1] - np.array(results)))
[17]:
for pw in IDW_POWERS:
    print(f'Mean Absolute Error of prediction with IDW of power {pw} is {idw_mae[pw]:.4f}')
Mean Absolute Error of prediction with IDW of power 0.5 is 17.1027
Mean Absolute Error of prediction with IDW of power 1 is 13.8879
Mean Absolute Error of prediction with IDW of power 2 is 7.1886
Mean Absolute Error of prediction with IDW of power 4 is 5.1183
Mean Absolute Error of prediction with IDW of power 6 is 5.2371
Mean Absolute Error of prediction with IDW of power 8 is 5.3415
[18]:
# Variogram

max_range = 10000
step_size = 500

exp_semivar = build_experimental_variogram(known_points, step_size=step_size, max_range=max_range)
[19]:
# Automatically fit a spherical model to the experimental semivariogram
semivar = TheoreticalVariogram()
semivar.autofit(exp_semivar, model_name='spherical')
semivar.plot()
[Plot: spherical theoretical variogram model automatically fitted to the experimental semivariogram]
[20]:
# Set Kriging model

predictions = kriging(observations=known_points, theoretical_model=semivar, points=unknown_points[:, :-1], no_neighbors=128, use_all_neighbors_in_range=False)
100%|██████████| 6756/6756 [00:58<00:00, 115.93it/s]
[21]:
# Evaluation

kriging_mae = np.mean(np.abs(unknown_points[:, -1] - predictions[:, 0]))
print(f'Mean Absolute Error of prediction with Kriging is {kriging_mae}')
Mean Absolute Error of prediction with Kriging is 4.886647202272596
[22]:
# Comparison

for pw in IDW_POWERS:
    print(f'Mean Absolute Error of prediction with IDW of power {pw} is {idw_mae[pw]:.4f}')
print(f'Mean Absolute Error of prediction with Kriging is {kriging_mae:.4f}')
Mean Absolute Error of prediction with IDW of power 0.5 is 17.1027
Mean Absolute Error of prediction with IDW of power 1 is 13.8879
Mean Absolute Error of prediction with IDW of power 2 is 7.1886
Mean Absolute Error of prediction with IDW of power 4 is 5.1183
Mean Absolute Error of prediction with IDW of power 6 is 5.2371
Mean Absolute Error of prediction with IDW of power 8 is 5.3415
Mean Absolute Error of prediction with Kriging is 4.8866

Your results may differ (if you set another random seed), but in most cases Kriging will beat IDW or come very close to the best IDW result. Even more important: with a single, sparsely sampled data source we usually cannot afford a validation step, so there is no way to guess how large the power parameter should be. With Kriging, we model the variogram, and voila! - the model works.


Where to go from here?#

  • B.1.3 Outliers and Kriging Model

  • B.2.1 Directional Ordinary Kriging

  • C.1.1 Blocks to Points Interpolation with Ordinary Kriging

  • C.1.2 Semivariogram Regularization

Changelog#

Date | Change description | Author
2023-08-23 | The tutorial was refreshed and set along with the 0.5.0 version of the package | @SimonMolinsky
2023-04-15 | Tutorial debugged and updated to the 0.4.1 version of the package | @SimonMolinsky
2022-11-05 | Tutorial updated for the 0.3.5 version of the package | @SimonMolinsky
2022-08-23 | The new version of the tutorial for the 0.3.0 version of the package | @SimonMolinsky
2021-12-14 | Sill selection was upgraded: the optimal sill is now derived from a grid search within the TheoreticalSemivariogram class | @SimonMolinsky
2021-12-13 | Changed behavior of the select_values_in_range() function | @SimonMolinsky
2021-12-08 | Behavior of the prepare_kriging_data() function has changed | @SimonMolinsky
2021-05-12 | First version of the tutorial | @SimonMolinsky
