A.1.1 Semivariogram Estimation#

Table of Contents:#

  1. Read point data

  2. Create the experimental variogram

  3. Set manually different semivariogram models

  4. Set a semivariogram model automatically

  5. Export model

  6. Import model

Introduction#

In this tutorial, we will learn how to read and prepare data for semivariogram modeling, manually set semivariogram type, and do it automatically. We will compare different semivariogram models by visualizing the outcomes of the models.

Semivariogram modeling is an initial step before we can perform spatial interpolation of unknown values with Kriging. When you complete this tutorial, you may learn how to:

  • perform point Kriging (ordinary and simple),

  • regularize semivariogram of areal data.

We use DEM data which is stored in a file samples/point_data/txt/pl_dem.txt.

Import packages and modules#

[1]:
import numpy as np

import matplotlib.pyplot as plt

# IO - read text
from pyinterpolate import read_txt
# Experimental variogram
from pyinterpolate import build_experimental_variogram
# Theoretical Variogram
from pyinterpolate import TheoreticalVariogram, build_theoretical_variogram

1) Read point data#

[2]:
DATA = 'samples/point_data/txt/pl_dem_epsg2180.txt'
dem = read_txt(DATA)
[3]:
# Look into a first few lines of data

dem[-5:, :]
[3]:
array([[2.55032701e+05, 5.47047700e+05, 6.29624100e+01],
       [2.54975796e+05, 5.45920408e+05, 2.03861294e+01],
       [2.54751735e+05, 5.41480271e+05, 4.03093376e+01],
       [2.54682103e+05, 5.40099921e+05, 2.19432678e+01],
       [2.54521994e+05, 5.36925123e+05, 5.15251350e+01]])

2) Create the experimental variogram#

Checking spatial correlation in our dataset is the first required step. We use variograms - plots of the dissimilarity between point pairs (y-axis) along a rising distance (x-axis). There are two “types” of variograms.

  • Experimental (semi)variogram: directly showing differences between point pairs at specific distances - lags. It could be very messy. We use it to check if there is a spatial correlation.

  • Theoretical (semi)variogram: is a particular function applied to the experimental variogram. We fit a function from a pre-defined set to experimental semivariogram points and choose the model with the lowest error possible. We have a few models to try and only three hyperparameters for controlling their behavior.

We start our analysis by setting up the experimental model. We use the build_experimental_variogram() function, which creates for us an ExperimentalVariogram object. We pass to this function three parameters:

  1. input_array: numpy array with coordinates and observed values, for example: [[0, 0, 10], [0, 1, 20]],

  2. step_size: we must divide our area of analysis into discrete lags. The lags are intervals where we check if the point has a neighbor. For example, if we look into the lag 500, then we are going to compare one point with other points in a distance (0, 1000] from this point,

  3. max_range: this parameter represents the possible maximum range of spatial dependency. This parameter should be at most half of the area extent.

[4]:
# Create experimental semivariogram

step_radius = 500  # meters
max_range = 10000  # meters

experimental_variogram = build_experimental_variogram(input_array=dem, step_size=step_radius, max_range=max_range)
[5]:
# What is a type of experimental variogram?

type(experimental_variogram)
[5]:
pyinterpolate.variogram.empirical.experimental_variogram.ExperimentalVariogram

The ExperimentalVariogram object represents all information that we could estimate from our measurements:

  • semivariance: dissimilarity over a distance,

  • covariance: similarity over a distance,

  • variance: non-spatial variance of all points.

The class has two useful methods:

  1. We can print object statistics with the print() function,

  2. or we can plot semivariance, covariance, and variance and check how they behave with the .plot() method.

Let’s check both!

[6]:
print(experimental_variogram)
+--------+--------------------+---------------------+--------------------+
|  lag   |    semivariance    |      covariance     |    var_cov_diff    |
+--------+--------------------+---------------------+--------------------+
| 500.0  | 14.716058620972868 |  480.99904783812127 | 8.821278469608444  |
| 1000.0 | 34.85400868491261  |  464.2196256667607  | 25.600700640969023 |
| 1500.0 | 57.712760007769795 |  433.73651582698574 | 56.08381048074398  |
| 2000.0 | 82.49483539006093  |  403.5799288984297  | 86.24039740929999  |
| 2500.0 | 111.4449493651956  |  364.9336781724029  | 124.88664813532682 |
| 3000.0 | 143.4007980025356  |  325.6459001433955  | 164.1744261643342  |
| 3500.0 | 175.10704725708965 |  284.4093681493841  | 205.41095815834564 |
| 4000.0 | 209.3522343651705  |  245.59671700665646 | 244.22360930107325 |
| 4500.0 | 244.46785072173265 |  204.4305639103993  | 285.3897623973304  |
| 5000.0 | 281.0897575749926  |  159.61259251127132 | 330.2077337964584  |
| 5500.0 | 315.67549229122216 |  118.99201963546533 | 370.8283066722644  |
| 6000.0 | 349.5718260502048  |  81.90206675127321  | 407.9182595564565  |
| 6500.0 | 381.7274369750636  |  47.01829083567338  | 442.80203547205633 |
| 7000.0 | 413.81104972218697 |  15.768848642349267 | 474.05147766538045 |
| 7500.0 |  439.311684307325  |  -9.119116076399862 | 498.93944238412956 |
| 8000.0 | 461.6911951584344  | -29.173423730506535 | 518.9937500382363  |
| 8500.0 | 482.51366131979273 |  -44.75615663700621 |  534.576482944736  |
| 9000.0 | 501.63014408854025 | -58.013038178981994 | 547.8333644867117  |
| 9500.0 | 522.2111025684128  |  -72.92100663867586 | 562.7413329464056  |
+--------+--------------------+---------------------+--------------------+
  • Lag is a column with the lag center,

  • semivariance is a column with a dissimilarity metric,

  • covariance is a column with a similarity metric,

  • var_cov_diff is a difference between the variance and the covariance; ideally, it should equal the semivariance. But it is possible only if a spatial process is second-order stationary. It is a rare situation in real-world applications. Usually, this value is very close to the semivariance.

We can plot semivariance and covariance to understand their relation.

[7]:
experimental_variogram.plot(plot_semivariance=True, plot_covariance=True, plot_variance=True)
../../_images/usage_tutorials_A11-Semivariogram-Estimation_13_0.png

Our plot shows three objects:

  1. Circles represent semivariance,

  2. Plus signs represent covariance,

  3. The dashed line is a variance.

We see here that semivariance and covariance are mirrored. What does it mean? It is normal behavior, and we should expect it - semivariance and covariance have symmetrical trends (they differ in a sign). We can read it as:

  • with semivariance, the dissimilarity between point pairs over a distance increases,

  • covariance shows that the similarity between point pairs over a distance decreases.

In the best-case scenario (data is stationary, e.g., mean and variance don’t change with a distance), variance and covariance should be equal to semivariance. That’s why our ExperimentalVariogram object prints the difference between those two. We can quickly check if the process we measure is stationary.

With the experimental variogram, we can create a theoretical model of semivariance.

3. Set manually different semivariogram models#

With an experimental variogram, we can start modeling theoretical function that optimally describes observed data. Our role is to choose the modeling function and to set three hyperparameters: nugget, sill, and range. You can read more about semivariogram models here: Geostatistics: Theoretical Variogram Models.

Semivariogram has three basic properties:

  • nugget: the initial value at a zero distance. In most cases, it is zero, but sometimes it represents a bias in observations.

  • sill: a distance where the semivariogram flattens and reaches approximately 95% of dissimilarity. Sometimes we cannot find a sill; for example, if differences grow exponentially, but in pyinterpolate it is usually set close to the variance of data,

  • range: is a distance where a variogram reaches its sill. Larger distances are negligible for interpolation.

We are not forced to know all of them at the beginning. The package may easily derive them all with the class TheoriticalVariogram autofit() method.

We can create a theoretical model in two ways: 1. Manually, with the build_theoretical_variogram() function, 2. Semi-automatically, but here’s the catch: we must create a TheoreticalVariogram object first, and in the second step, we may use the .autofit() method. Why is that? The reason is simple - if we want to fit a semivariogram automatically, the algorithm (or creator) assumes we know what we are doing. And with knowledge, we can control multiple parameters of the class.

We start from the build_theoretical_variogram() function and look into different models fitted to our data.

Models#

We can choose from a set of predefined models:

  • circular,

  • cubic,

  • exponential,

  • gaussian,

  • linear,

  • power,

  • spherical.

We will set the following:

  • sill to the experimental variogram variance,

  • nugget to zero,

  • range to 8000 (we see in the experimental variogram plot that the semivariance flattens around this range, and it becomes close to the variance).

[8]:
sill = experimental_variogram.variance
nugget = 0
var_range = 8000
[9]:
# circular

circular_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                             model_name='circular',
                                             sill=sill,
                                             rang=var_range,
                                             nugget=nugget)
[10]:
print(circular_model)
* Selected model: Circular model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: -52.60629816538764
* Mean RMSE: 63.00280857591535
* Error-lag weighting method: None


+--------+--------------------+--------------------+---------------------+
|  lag   |    theoretical     |    experimental    |     bias (y-y')     |
+--------+--------------------+--------------------+---------------------+
| 500.0  | 38.953271455641215 | 14.716058620972868 | -24.237212834668348 |
| 1000.0 | 77.75383379923626  | 34.85400868491261  | -42.899825114323654 |
| 1500.0 | 116.24715805742842 | 57.712760007769795 |  -58.53439804965863 |
| 2000.0 | 154.2749647378938  | 82.49483539006093  |  -71.78012934783288 |
| 2500.0 | 191.6730553536057  | 111.4449493651956  |  -80.22810598841009 |
| 3000.0 | 228.26874220429437 | 143.4007980025356  |  -84.86794420175877 |
| 3500.0 | 263.87764398483665 | 175.10704725708965 |   -88.770596727747  |
| 4000.0 | 298.29949183176774 | 209.3522343651705  |  -88.94725746659725 |
| 4500.0 | 331.3123650670836  | 244.46785072173265 |  -86.84451434535094 |
| 5000.0 | 362.66434202230323 | 281.0897575749926  |  -81.57458444731066 |
| 5500.0 | 392.0606537618678  | 315.67549229122216 |  -76.38516147064564 |
| 6000.0 | 419.14238179486523 | 349.5718260502048  |  -69.57055574466045 |
| 6500.0 | 443.44740428898785 | 381.7274369750636  |  -61.71996731392426 |
| 7000.0 | 464.3273586391969  | 413.81104972218697 |  -50.51630891700995 |
| 7500.0 | 480.71958538405306 |  439.311684307325  |  -41.40790107672808 |
| 8000.0 | 489.8203263077297  | 461.6911951584344  |  -28.12913114929529 |
| 8500.0 | 489.8203263077297  | 482.51366131979273 |  -7.306664987936983 |
| 9000.0 | 489.8203263077297  | 501.63014408854025 |  11.809817780810533 |
| 9500.0 | 489.8203263077297  | 522.2111025684128  |  32.39077626068308  |
+--------+--------------------+--------------------+---------------------+
[11]:
circular_model.plot(experimental=True)
../../_images/usage_tutorials_A11-Semivariogram-Estimation_20_0.png
[12]:
# cubic

cubic_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                          model_name='cubic',
                                          sill=sill,
                                          rang=var_range,
                                          nugget=nugget)
[13]:
print(cubic_model)
* Selected model: Cubic model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: -77.44307814715333
* Mean RMSE: 100.93081804872125
* Error-lag weighting method: None


+--------+--------------------+--------------------+---------------------+
|  lag   |    theoretical     |    experimental    |     bias (y-y')     |
+--------+--------------------+--------------------+---------------------+
| 500.0  | 12.34878902539015  | 14.716058620972868 |  2.3672695955827177 |
| 1000.0 | 45.255288629599086 | 34.85400868491261  | -10.401279944686479 |
| 1500.0 | 92.68405535286398  | 57.712760007769795 | -34.971295345094184 |
| 2000.0 | 148.9805383955685  | 82.49483539006093  |  -66.48570300550756 |
| 2500.0 | 209.04428231682044 | 111.4449493651956  |  -97.59933295162485 |
| 3000.0 | 268.48143737523486 | 143.4007980025356  | -125.08063937269927 |
| 3500.0 | 323.7296800593239  | 175.10704725708965 | -148.62263280223428 |
| 4000.0 | 372.1486463548962  | 209.3522343651705  |  -162.7964119897257 |
| 4500.0 | 412.06898029686596 | 244.46785072173265 |  -167.6011295751333 |
| 5000.0 | 442.79310035287546 | 281.0897575749926  | -161.70334277788288 |
| 5500.0 | 464.5407861861308  | 315.67549229122216 | -148.86529389490863 |
| 6000.0 | 478.33268834485466 | 349.5718260502048  | -128.76086229464988 |
| 6500.0 | 485.8048634257556  | 381.7274369750636  | -104.07742645069203 |
| 7000.0 |  488.947437258918  | 413.81104972218697 |  -75.13638753673104 |
| 7500.0 | 489.7604986615125  |  439.311684307325  | -50.448814354187505 |
| 8000.0 | 489.8203263077297  | 461.6911951584344  |  -28.12913114929529 |
| 8500.0 | 489.8203263077297  | 482.51366131979273 |  -7.306664987936983 |
| 9000.0 | 489.8203263077297  | 501.63014408854025 |  11.809817780810533 |
| 9500.0 | 489.8203263077297  | 522.2111025684128  |  32.39077626068308  |
+--------+--------------------+--------------------+---------------------+
[14]:
cubic_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_23_0.png
[15]:
# Exponential model

exponential_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                                model_name='exponential',
                                                sill=sill,
                                                rang=var_range,
                                                nugget=nugget)
print(exponential_model)
* Selected model: Exponential model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: 62.87847144716388
* Mean RMSE: 97.41020702761479
* Error-lag weighting method: None


+--------+--------------------+--------------------+---------------------+
|  lag   |    theoretical     |    experimental    |     bias (y-y')     |
+--------+--------------------+--------------------+---------------------+
| 500.0  | 29.676713342689204 | 14.716058620972868 | -14.960654721716336 |
| 1000.0 | 57.555405518182404 | 34.85400868491261  | -22.701396833269797 |
| 1500.0 | 83.74501312199658  | 57.712760007769795 | -26.032253114226783 |
| 2000.0 | 108.34787261497875 | 82.49483539006093  | -25.853037224917827 |
| 2500.0 | 131.46012020525072 | 111.4449493651956  |  -20.01517084005512 |
| 3000.0 | 153.17206750253146 | 143.4007980025356  |  -9.771269499995867 |
| 3500.0 | 173.56855441271478 | 175.10704725708965 |  1.538492844374872  |
| 4000.0 | 192.72928065164504 | 209.3522343651705  |  16.622953713525447 |
| 4500.0 | 210.72911717348904 | 244.46785072173265 |   33.7387335482436  |
| 5000.0 | 227.63839873061636 | 281.0897575749926  |  53.451358844376216 |
| 5500.0 | 243.5231987081728  | 315.67549229122216 |  72.15229358304936  |
| 6000.0 | 258.44558730726845 | 349.5718260502048  |  91.12623874293632  |
| 6500.0 | 272.4638740856379  | 381.7274369750636  |  109.26356288942571 |
| 7000.0 | 285.63283580350344 | 413.81104972218697 |  128.17821391868353 |
| 7500.0 |  298.003930464957  |  439.311684307325  |  141.30775384236796 |
| 8000.0 | 309.6254983912286  | 461.6911951584344  |  152.06569676720585 |
| 8500.0 | 320.54295111154215 | 482.51366131979273 |  161.97071020825058 |
| 9000.0 | 330.79894880965327 | 501.63014408854025 |  170.83119527888698 |
| 9500.0 | 340.4335670194438  | 522.2111025684128  |   181.777535548969  |
+--------+--------------------+--------------------+---------------------+
[16]:
exponential_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_25_0.png
[17]:
# Gaussian model

gaussian_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                             model_name='gaussian',
                                             sill=sill,
                                             rang=var_range,
                                             nugget=nugget)
print(gaussian_model)
* Selected model: Gaussian model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: 106.76795548020137
* Mean RMSE: 116.3487709038113
* Error-lag weighting method: None


+--------+--------------------+--------------------+--------------------+
|  lag   |    theoretical     |    experimental    |    bias (y-y')     |
+--------+--------------------+--------------------+--------------------+
| 500.0  | 1.9096284783003161 | 14.716058620972868 | 12.806430142672552 |
| 1000.0 | 7.593960284943274  | 34.85400868491261  | 27.26004839996933  |
| 1500.0 | 16.92106251490172  | 57.712760007769795 | 40.791697492868074 |
| 2000.0 | 29.676713342689204 | 82.49483539006093  | 52.81812204737172  |
| 2500.0 | 45.57258050587133  | 111.4449493651956  | 65.87236885932427  |
| 3000.0 | 64.25705194799532  | 143.4007980025356  | 79.14374605454027  |
| 3500.0 | 85.32815073541408  | 175.10704725708965 | 89.77889652167556  |
| 4000.0 | 108.34787261497875 | 209.3522343651705  | 101.00436175019173 |
| 4500.0 | 132.85723424545685 | 244.46785072173265 | 111.6106164762758  |
| 5000.0 | 158.39131498993746 | 281.0897575749926  | 122.69844258505512 |
| 5500.0 | 184.49361349105533 | 315.67549229122216 | 131.18187880016683 |
| 6000.0 | 210.72911717348904 | 349.5718260502048  | 138.84270887671573 |
| 6500.0 | 236.69559082440182 | 381.7274369750636  | 145.03184615066178 |
| 7000.0 | 262.0327201487193  | 413.81104972218697 | 151.77832957346766 |
| 7500.0 | 286.42888738780334 |  439.311684307325  | 152.88279691952164 |
| 8000.0 | 309.6254983912286  | 461.6911951584344  | 152.06569676720585 |
| 8500.0 | 331.41891440687124 | 482.51366131979273 | 151.0947469129215  |
| 9000.0 | 351.66015926974364 | 501.63014408854025 | 149.9699848187966  |
| 9500.0 | 370.2526675939889  | 522.2111025684128  | 151.95843497442388 |
+--------+--------------------+--------------------+--------------------+
[18]:
gaussian_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_27_0.png
[19]:
# Linear model

linear_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                           model_name='linear',
                                           sill=sill,
                                           rang=var_range,
                                           nugget=nugget)
print(linear_model)
* Selected model: Linear model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: -21.58683474038296
* Mean RMSE: 28.218692107535404
* Error-lag weighting method: None


+--------+--------------------+--------------------+---------------------+
|  lag   |    theoretical     |    experimental    |     bias (y-y')     |
+--------+--------------------+--------------------+---------------------+
| 500.0  | 30.613770394233107 | 14.716058620972868 | -15.897711773260239 |
| 1000.0 | 61.227540788466214 | 34.85400868491261  | -26.373532103553607 |
| 1500.0 | 91.84131118269931  | 57.712760007769795 |  -34.12855117492952 |
| 2000.0 | 122.45508157693243 | 82.49483539006093  |  -39.9602461868715  |
| 2500.0 | 153.06885197116554 | 111.4449493651956  | -41.623902605969946 |
| 3000.0 | 183.68262236539863 | 143.4007980025356  | -40.281824362863034 |
| 3500.0 | 214.29639275963174 | 175.10704725708965 | -39.189345502542096 |
| 4000.0 | 244.91016315386486 | 209.3522343651705  |  -35.55792878869437 |
| 4500.0 | 275.52393354809794 | 244.46785072173265 | -31.056082826365298 |
| 5000.0 | 306.1377039423311  | 281.0897575749926  |  -25.04794636733851 |
| 5500.0 | 336.7514743365642  | 315.67549229122216 | -21.075982045342016 |
| 6000.0 | 367.36524473079726 | 349.5718260502048  |  -17.79341868059248 |
| 6500.0 | 397.9790151250304  | 381.7274369750636  | -16.251578149966804 |
| 7000.0 | 428.5927855192635  | 413.81104972218697 | -14.781735797076522 |
| 7500.0 | 459.20655591349663 |  439.311684307325  | -19.894871606171648 |
| 8000.0 | 489.8203263077297  | 461.6911951584344  |  -28.12913114929529 |
| 8500.0 | 489.8203263077297  | 482.51366131979273 |  -7.306664987936983 |
| 9000.0 | 489.8203263077297  | 501.63014408854025 |  11.809817780810533 |
| 9500.0 | 489.8203263077297  | 522.2111025684128  |  32.39077626068308  |
+--------+--------------------+--------------------+---------------------+
[20]:
linear_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_29_0.png
[21]:
# Power model

power_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                          model_name='power',
                                          sill=sill,
                                          rang=var_range,
                                          nugget=nugget)
print(power_model)
* Selected model: Power model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: 46.89133587829636
* Mean RMSE: 58.334699200244835
* Error-lag weighting method: None


+--------+--------------------+--------------------+--------------------+
|  lag   |    theoretical     |    experimental    |    bias (y-y')     |
+--------+--------------------+--------------------+--------------------+
| 500.0  | 1.9133606496395692 | 14.716058620972868 | 12.802697971333298 |
| 1000.0 | 7.653442598558277  | 34.85400868491261  | 27.20056608635433  |
| 1500.0 | 17.22024584675612  | 57.712760007769795 | 40.49251416101367  |
| 2000.0 | 30.613770394233107 | 82.49483539006093  | 51.88106499582782  |
| 2500.0 | 47.83401624098923  | 111.4449493651956  | 63.61093312420637  |
| 3000.0 | 68.88098338702449  | 143.4007980025356  | 74.51981461551111  |
| 3500.0 | 93.75467183233889  | 175.10704725708965 | 81.35237542475076  |
| 4000.0 | 122.45508157693243 | 209.3522343651705  | 86.89715278823806  |
| 4500.0 | 154.98221262080511 | 244.46785072173265 | 89.48563810092753  |
| 5000.0 | 191.33606496395691 | 281.0897575749926  | 89.75369261103566  |
| 5500.0 | 231.5166386063879  | 315.67549229122216 | 84.15885368483427  |
| 6000.0 | 275.52393354809794 | 349.5718260502048  | 74.04789250210683  |
| 6500.0 | 323.3579497890872  | 381.7274369750636  | 58.369487185976425 |
| 7000.0 | 375.01868732935554 | 413.81104972218697 | 38.79236239283142  |
| 7500.0 | 430.50614616890306 |  439.311684307325  | 8.805538138421923  |
| 8000.0 | 489.8203263077297  | 461.6911951584344  | -28.12913114929529 |
| 8500.0 | 489.8203263077297  | 482.51366131979273 | -7.306664987936983 |
| 9000.0 | 489.8203263077297  | 501.63014408854025 | 11.809817780810533 |
| 9500.0 | 489.8203263077297  | 522.2111025684128  | 32.39077626068308  |
+--------+--------------------+--------------------+--------------------+
[22]:
power_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_31_0.png
[23]:
# Spherical model

spherical_model = build_theoretical_variogram(experimental_variogram=experimental_variogram,
                                              model_name='spherical',
                                              sill=sill,
                                              rang=var_range,
                                              nugget=nugget)

print(spherical_model)
* Selected model: Spherical model
* Nugget: 0
* Sill: 489.8203263077297
* Range: 8000
* Spatial Dependency Strength is Unknown
* Mean Bias: -72.94546270439245
* Mean RMSE: 87.44243944688779
* Error-lag weighting method: None


+--------+--------------------+--------------------+---------------------+
|  lag   |    theoretical     |    experimental    |     bias (y-y')     |
+--------+--------------------+--------------------+---------------------+
| 500.0  | 45.86086307104843  | 14.716058620972868 |  -31.14480445007556 |
| 1000.0 | 91.36297102028944  | 34.85400868491261  |  -56.50896233537683 |
| 1500.0 | 136.1475687259156  | 57.712760007769795 |  -78.4348087181458  |
| 2000.0 | 179.85590106611951 | 82.49483539006093  |  -97.36106567605859 |
| 2500.0 | 222.12921291909373 | 111.4449493651956  | -110.68426355389813 |
| 3000.0 | 262.6087491630309  | 143.4007980025356  |  -119.2079511604953 |
| 3500.0 | 300.9357546761235  | 175.10704725708965 | -125.82870741903386 |
| 4000.0 | 336.7514743365642  | 209.3522343651705  | -127.39923997139368 |
| 4500.0 | 369.69715302254554 | 244.46785072173265 | -125.22930230081289 |
| 5000.0 | 399.4140356122601  | 281.0897575749926  |  -118.3242780372675 |
| 5500.0 | 425.54336698390046 | 315.67549229122216 |  -109.8678746926783 |
| 6000.0 | 447.7263920156592  | 349.5718260502048  |  -98.15456596545442 |
| 6500.0 | 465.6043555857289  | 381.7274369750636  |  -83.87691861066531 |
| 7000.0 | 478.8185025723022  | 413.81104972218697 |  -65.00745285011521 |
| 7500.0 | 487.0100778535716  |  439.311684307325  |  -47.69839354624662 |
| 8000.0 | 489.8203263077297  | 461.6911951584344  |  -28.12913114929529 |
| 8500.0 | 489.8203263077297  | 482.51366131979273 |  -7.306664987936983 |
| 9000.0 | 489.8203263077297  | 501.63014408854025 |  11.809817780810533 |
| 9500.0 | 489.8203263077297  | 522.2111025684128  |  32.39077626068308  |
+--------+--------------------+--------------------+---------------------+
[24]:
spherical_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_33_0.png

A quick look into plots, and we can bet that the best model is linear. We can read each table printed with the print() method, but instead, we will read Root Mean Squared Error of each model and select the model with the lowest value of RMSE.

[25]:
models = [
    circular_model, cubic_model, exponential_model, gaussian_model, linear_model, power_model, spherical_model
]

lowest_rmse = np.inf
chosen_model = ''

for _model in models:

    # Get attrs
    model_name = _model.name
    model_rmse = _model.rmse

    # Check error
    if model_rmse < lowest_rmse:
        lowest_rmse = model_rmse
        chosen_model = model_name

    # Print status
    msg = f'Model: {model_name}, RMSE: {model_rmse}'
    print(msg)

msg = f'The best model is {chosen_model} with RMSE {lowest_rmse}'
print(msg)
Model: circular, RMSE: 63.00280857591535
Model: cubic, RMSE: 100.93081804872125
Model: exponential, RMSE: 97.41020702761479
Model: gaussian, RMSE: 116.3487709038113
Model: linear, RMSE: 28.218692107535404
Model: power, RMSE: 58.334699200244835
Model: spherical, RMSE: 87.44243944688779
The best model is linear with RMSE 28.218692107535404

4) Set automatically semivariogram model#

We can find the optimal semivariogram model automatically, but we can’t do it with the build_theoretical_variogram() function. Instead, we use the TheoreticalVariogram class. Manual initialization allows us to find the best set of hyperparameters automatically.

This process has two steps:

  1. The TheoreticalVariogram class object initialization.

  2. Fitting the best model and finding the optimal set of parameters with the .autofit() method.

The autofit() method takes multiple parameters. The most important for us are:

  • experimental_variogram - the experimental variogram object (ExperimentalVariogram type),

  • model_name or model_types - the first argument is a name of the model. The second is a list with names to test - if you pass it, then algorithm will check every model from this list. Basic model_name parameters are:

    • circular,

    • cubic,

    • exponential,

    • gaussian,

    • linear,

    • power,

    • spherical,

      1. all: checks all available model types,

      1. safe: check linear, spherical, and power models.

If you want to test different set of models instead then use model_types with list of models, as for example: ['circular', 'power']. You may choose any model excluding all and safe keywords which refer to multiple models.

  • nugget - nugget (bias) of a variogram. If given, then the nugget is fixed to this value.

  • min_nugget - default = 0.0 - the minimal fraction of the variance at a distance 0 to search for the optimal nugget,

  • max_nugget - default = 0.5 - the maximum fraction of the variance at a distance 0 to search for the optimal nugget,

  • number_of_nuggets - default = 16 - how many equally spaced steps between min_nugget and max_nugget to check.

  • rang - if given, then the range is fixed to this value.

  • min_range - default = 0.1 - the minimal fraction of a study (maximum) extent to find the optimal range.

  • max_range - default = 0.5 - the maximum fraction of a study (maximum) extent to find the optimal range.

  • number_of_ranges - default = 16 - how many equally spaced ranges are tested between min_range and max_range.

  • sill - if given, it is fixed to this value.

  • min_sill - default = 0 - the minimal fraction of the dataset variance to find the sill.

  • max_sill - default = 1 - the maximum fraction of the dataset variance to find the sill. It should be lower or equal to 1, but it is possible to set it above. A warning is printed if max_sill is greater than 1.

  • number_of_sills - default = 16 - how many equally spaced sill values are tested between min_sill and max_sill.

  • error_estimator - default = ‘rmse’ - Error estimation to choose the best model. Available options are:

    • rmse: Root Mean Squared Error,

    • mae: Mean Absolute Error,

    • bias: Forecast Bias,

    • smape: Symmetric Mean Absolute Percentage Error.

In the first run, we will set nugget, sill, and range as fixed values, and we will see which model algorithm chooses:

[26]:
semivariogram_model = TheoreticalVariogram()
[27]:
fitted = semivariogram_model.autofit(
    experimental_variogram=experimental_variogram,
    model_name='all',
    nugget=0,
    rang=var_range,
    sill=sill)
[28]:
fitted
[28]:
{'model_name': 'linear',
 'nugget': 0,
 'sill': 489.8203263077297,
 'range': 8000,
 'fitted_model': array([[ 500.        ,   30.61377039],
        [1000.        ,   61.22754079],
        [1500.        ,   91.84131118],
        [2000.        ,  122.45508158],
        [2500.        ,  153.06885197],
        [3000.        ,  183.68262237],
        [3500.        ,  214.29639276],
        [4000.        ,  244.91016315],
        [4500.        ,  275.52393355],
        [5000.        ,  306.13770394],
        [5500.        ,  336.75147434],
        [6000.        ,  367.36524473],
        [6500.        ,  397.97901513],
        [7000.        ,  428.59278552],
        [7500.        ,  459.20655591],
        [8000.        ,  489.82032631],
        [8500.        ,  489.82032631],
        [9000.        ,  489.82032631],
        [9500.        ,  489.82032631]]),
 'rmse': 28.218692107535404,
 'bias': nan,
 'mae': nan,
 'smape': nan}

The chosen model is linear, automatically set as the class parameter. The algorithm performs the same steps as we did before. It has selected a model based on the RMSE. All other parameters were untouched.

Now, we let the algorithm find the optimal sill and range, and we check if our RMSE will be better with a different pair of sill and range:

[29]:
fitted = semivariogram_model.autofit(experimental_variogram=experimental_variogram, nugget=0)
[30]:
print(fitted)
{'model_name': 'linear', 'nugget': 0, 'sill': 489.8203263077297, 'range': 8990.27235812866, 'fitted_model': array([[ 500.        ,   27.2416845 ],
       [1000.        ,   54.483369  ],
       [1500.        ,   81.7250535 ],
       [2000.        ,  108.966738  ],
       [2500.        ,  136.2084225 ],
       [3000.        ,  163.450107  ],
       [3500.        ,  190.6917915 ],
       [4000.        ,  217.933476  ],
       [4500.        ,  245.1751605 ],
       [5000.        ,  272.41684501],
       [5500.        ,  299.65852951],
       [6000.        ,  326.90021401],
       [6500.        ,  354.14189851],
       [7000.        ,  381.38358301],
       [7500.        ,  408.62526751],
       [8000.        ,  435.86695201],
       [8500.        ,  463.10863651],
       [9000.        ,  489.82032631],
       [9500.        ,  489.82032631]]), 'rmse': 21.744673287932994, 'bias': nan, 'mae': nan, 'smape': nan}

After .autofit() our algorithm has chosen a different set of parameters and a model type! Let’s plot it against experimental variogram:

[31]:
semivariogram_model.plot()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_45_0.png

Compare this model to the linear model fitted before:

[32]:
_lags = semivariogram_model.lags

plt.figure(figsize=(20, 8))
plt.scatter(_lags, experimental_variogram.experimental_semivariances, color='#762a83')  # Experimental
plt.plot(_lags, linear_model.fitted_model[:, 1], ':p', color='#e7d4e8', mec='black')
plt.plot(_lags, semivariogram_model.fitted_model[:, 1], ':o', color='#1b7837', mec='black')
plt.title('Comparison of theoretical semivariance models')
plt.legend(['Experimental Semivariance',
            'Linear',
            'Cubic'])
plt.xlabel('Distance')
plt.ylabel('Semivariance')
plt.show()
../../_images/usage_tutorials_A11-Semivariogram-Estimation_47_0.png

5) Export model#

Models could be exported and used for other purposes. It is vital for the semivariogram regularization. Those calculations are computationally intensive, and in production, it is not a good idea to build a complete pipeline; we divide it into two steps.

Model can be exported to dict with .to_dict() method or to json with .to_json() method:

[33]:
# Set spherical model

dict_model = semivariogram_model.to_dict()

# Save to json
semivariogram_model.to_json('output/semivariogram_calculation_model.json')

6) Import model#

We can import a semivariogram model into a new TheoreticalSemivariogram class instance without passing the experimental semivariogram or actual data points. It is helpful for some applications focused on kriging, where we are sure that our semivariogram model fits data well.

We can import a model with two methods .from_dict() if we have our model parameters in the Python dictionary or .from_json() if model parameters are stored in a flat file:

[34]:
other_model_from_dict = TheoreticalVariogram()

print(other_model_from_dict)
Theoretical model is not calculated yet. Use fit() or autofit() methods to build or find a model or import model with from_dict() or from_json() methods.
[35]:
other_model_from_dict.from_dict(dict_model)
print(other_model_from_dict)
* Selected model: Linear model
* Nugget: 0.0
* Sill: 489.8203263077297
* Range: 8990.27235812866
* Spatial Dependency Strength is Unknown
* Mean Bias: 0.0
* Mean RMSE: 0.0
* Error-lag weighting method: None
[36]:
other_model_from_json = TheoreticalVariogram()
print(other_model_from_json)
Theoretical model is not calculated yet. Use fit() or autofit() methods to build or find a model or import model with from_dict() or from_json() methods.
[37]:
other_model_from_json.from_json('output/semivariogram_calculation_model.json')
[38]:
print(other_model_from_json)
* Selected model: Linear model
* Nugget: 0.0
* Sill: 489.8203263077297
* Range: 8990.27235812866
* Spatial Dependency Strength is Unknown
* Mean Bias: 0.0
* Mean RMSE: 0.0
* Error-lag weighting method: None


Where to go from here?

  • A.1.2 Theoretical Models

  • A.1.3 Spatial Dependence Index

  • A.2.1 Directional Semivariogram

  • A.2.2 Variogram Point Cloud

  • A.2.3 Experimental Variogram and Variogram Point Cloud Classes

  • B.1.1 Ordinary and Simple Kriging

Changelog#

Date

Change description

Author

2023-08-18

The tutorial was refreshed and set along with the 0.5.0 version of the package

@SimonMolinsky

2023-04-15

Tutorial debugged and updated to the 0.4.1 version of the package

@SimonMolinsky

2023-03-11

Spatial Dependence Index

@SimonMolinsky

2022-11-05

Tutorial updated for the 0.3.5 version of the package

@SimonMolinsky

2022-08-09

Tutorial updated to work with version 0.3.0

@SimonMolinsky

2021-12-14

Sill selection was upgraded: now optimal sill is derived from the grid search within the TheoreticalSemivariogram class

@SimonMolinsky

2021-12-13

Changed behavior of select_values_in_range() function

@SimonMolinsky

2021-10-13

Refactored TheoreticalSemivariogram (name change of class attribute)

@ethmtrgt

2021-05-28

Updated paths for input/output data

@SimonMolinsky

2021-05-11

Refactored TheoreticalSemivariogram class

@SimonMolinsky

2021-03-31

Update related to the change of semivariogram weighting

@SimonMolinsky

[ ]: