Experimental Variogram and Variogram Cloud classes#
Table of Contents:#
  - Create Experimental Variogram with the ExperimentalVariogram class.
  - Create Experimental Variogram Point Cloud with the VariogramCloud class.
Introduction#
Geostatistical analysis starts with dissimilarity estimation. We must know whether a process is spatially dependent and at which distances points tend to be related to each other. Thus, we start by estimating the experimental variogram. In this tutorial, we will look closely into the API and learn what can be done with two basic experimental semivariogram estimation tools. One is the ExperimentalVariogram class, and the other is the VariogramCloud class. The first is the foundation of every more complex function, from semivariogram modeling to Poisson Kriging. The latter is used to analyze relations between point pairs and their dispersion.
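Before diving into the API, it may help to see what is actually computed. The experimental semivariance at a lag h is half the average squared difference between values of point pairs separated (roughly) by h. Below is a brute-force numpy sketch of this classic Matheron estimator, for illustration only; pyinterpolate's implementation is vectorized and far faster, and the function name here is ours:

```python
import numpy as np

def matheron_semivariance(points, lag, step_size):
    """Brute-force Matheron estimator for a single lag.

    points: array of shape (n, 3) with columns (x, y, value).
    A pair contributes when its distance falls within
    [lag - step_size/2, lag + step_size/2).
    """
    coords = points[:, :2]
    values = points[:, -1]
    n = len(points)
    sq_diffs = []
    for i in range(n):
        for j in range(i + 1, n):
            distance = np.hypot(*(coords[i] - coords[j]))
            if lag - step_size / 2 <= distance < lag + step_size / 2:
                sq_diffs.append((values[i] - values[j]) ** 2)
    if not sq_diffs:
        return np.nan
    # Semivariance is HALF the mean squared difference
    return 0.5 * np.mean(sq_diffs)

pts = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 5.0],
])
print(matheron_semivariance(pts, lag=1, step_size=1))  # -> 2.0
```

For real datasets this O(n²) loop is impractical; it is only meant to make the definition concrete before we hand the work over to the library.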
We will use DEM data stored in the file samples/point_data/txt/pl_dem_epsg2180.txt.
Import packages & read data#
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyinterpolate import read_txt, ExperimentalVariogram, VariogramCloud
[2]:
dem = read_txt('samples/point_data/txt/pl_dem_epsg2180.txt')
# Take a sample
sample_size = int(0.05 * len(dem))
indices = np.random.choice(len(dem), sample_size, replace=False)
dem = dem[indices]
# Look into the first few lines of data
dem[:10, :]
[2]:
array([[2.39250338e+05, 5.48631524e+05, 8.84804077e+01],
[2.41201861e+05, 5.49103756e+05, 7.72773132e+01],
[2.50945606e+05, 5.48731991e+05, 8.15302811e+01],
[2.50384028e+05, 5.36743412e+05, 3.87784348e+01],
[2.53739451e+05, 5.44230034e+05, 2.12480278e+01],
[2.42744089e+05, 5.29666893e+05, 5.23056946e+01],
[2.38056136e+05, 5.34391921e+05, 2.95818558e+01],
[2.45023500e+05, 5.51069770e+05, 3.36740112e+01],
[2.44771460e+05, 5.32997525e+05, 4.72923431e+01],
[2.53090205e+05, 5.48829925e+05, 8.16635361e+01]])
1. Experimental Variogram class#
Let’s start with the ExperimentalVariogram class.
class ExperimentalVariogram:
def __init__(self,
input_array,
step_size,
max_range,
weights=None,
direction=None,
tolerance=1.0,
method='t',
is_semivariance=True,
is_covariance=True):
...
The class has three required parameters and seven optional ones:

- input_array: a list of coordinates and values in the form (x, y, value). Those are our observations.
- step_size: the distance between lags.
- max_range: the maximum range of analysis.
What do we know at the beginning? Our coordinate reference system is metric (EPSG:2180). We don’t know the limits of the region of interest needed to set the max_range parameter, so we must measure the size of our area first:
[3]:
# Get max distances for a region
minx = np.min(dem[:, 0])
maxx = np.max(dem[:, 0])
miny = np.min(dem[:, 1])
maxy = np.max(dem[:, 1])
x_max_distance = maxx - minx
y_max_distance = maxy - miny
print('Max lon distance:', x_max_distance)
print('Max lat distance:', y_max_distance)
Max lon distance: 17926.58479234294
Max lat distance: 25125.91187161021
We can assume that the max_range parameter shouldn’t be larger than the maximum distance along the shorter axis (~18,000 m). In practice, the max_range parameter tends to be close to half of this distance, and we can verify that on a variogram.

But we don’t know the step_size parameter yet. Let’s assume it is 100 meters and see whether we’ve estimated it correctly.
[4]:
max_range = 18000
step_size = 100
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range
)
We will visually inspect the semivariogram with the .plot() method of the ExperimentalVariogram class:
[10]:
experimental_variogram.plot(plot_covariance=False, plot_variance=False)
The variogram looks nice. We can see a trend, but the points oscillate around it. This tells us that step_size should be larger.

What about max_range? It is a tricky problem because the semivariogram seems right up to the end of its range. The rule of thumb is to set the maximum range to around half of the study extent, because the number of point pairs falls off beyond this distance.
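The half-extent rule of thumb can be written down explicitly. A tiny sketch with the extent values hardcoded to match those computed above (the variable name is ours, not the library's):

```python
# Extent distances computed earlier in the tutorial (in metres)
x_max_distance = 17926.6
y_max_distance = 25125.9

# Rule of thumb: cap the variogram range at ~half the shorter extent
suggested_max_range = 0.5 * min(x_max_distance, y_max_distance)
print(round(suggested_max_range))  # -> 8963
```

This is only a starting point; the point-pair counts below give a more data-driven answer.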
We can check how many point pairs we have for each distance to make a better decision. The ExperimentalVariogram class stores this information in its .experimental_semivariance_array attribute. It is a numpy array with three columns, [lag, semivariance, number of point pairs], and we will use the last column to decide where to cut max_range.
[7]:
_ds = experimental_variogram.experimental_semivariance_array
plt.figure()
plt.scatter(x=_ds[:, 0], y=_ds[:, -1])
plt.show()
Around the twelve-kilometer mark, the number of point pairs falls rapidly, so we can set max_range to a value between 7.5 and 12.5 kilometers. We will choose 10,000 meters (10 km) as our max_range.
[11]:
max_range = 10_000
step_size = 250
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range
)
experimental_variogram.plot(plot_covariance=False, plot_variance=False)
This time the variogram looks cleaner. However, the readings still oscillate. Let’s set step_size to 500 meters:
[12]:
step_size = 500
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range
)
experimental_variogram.plot(plot_covariance=False, plot_variance=False)
We can use those parameters! Next, we will look into the is_covariance parameter of the ExperimentalVariogram class. It is set to True by default. If we want to know the covariance of a dataset, we should keep the is_covariance parameter set to True. Covariance is a measure of spatial similarity, the opposite of semivariance; you will see in the plot that they mirror each other. We can also use covariance in kriging systems instead of semivariance.
The ExperimentalVariogram plotting method .plot() has additional parameters:

- plot_semivariance (boolean, True by default),
- plot_covariance (boolean, True by default),
- plot_variance (boolean, True by default).

We will plot all of those objects in a single figure.
[13]:
experimental_variogram.plot()
Two important things that we can read from this plot:

- Experimental covariance is a mirrored version of experimental semivariance, and we can use it to solve the kriging system instead of semivariance. It is a measure of similarity between point pairs: the highest similarity occurs at the first lags and decreases with growing distance.
- The variance calculated from the data is later used as the sill in kriging models, and it can be used to transform semivariances into covariances.
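The second point can be stated as a simple identity: for a second-order stationary process, covariance(h) = sill − semivariance(h), with the data variance serving as the sill estimate. A tiny numpy sketch with made-up values:

```python
import numpy as np

# Toy experimental semivariances per lag, plus the data variance (sill)
semivariances = np.array([50.0, 120.0, 180.0, 210.0])
variance = 220.0  # var(z), used as the sill estimate

# Under second-order stationarity: covariance(h) = sill - semivariance(h)
covariances = variance - semivariances
print(covariances)  # covariance falls as semivariance grows
```

This mirroring is exactly what the combined plot shows: where semivariance rises toward the sill, covariance decays toward zero.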
This plot summarizes the basic steps we can perform with the ExperimentalVariogram class. We left many parameters untouched, but only for a moment. Now we are going to explore the remaining capabilities of the class.
Optional parameter: weights#
We won’t use this parameter in this tutorial, but we should learn where it comes from. It is part of the Poisson Kriging workflow: Poisson Kriging algorithms use weights to weight the semivariance between blocks by the in-block population.
Optional parameters: direction, tolerance, method#
Those three parameters define a directional variogram. We will check how the variogram behaves along the West-East axis (0 degrees).
[14]:
direction = 0
tolerance = 0.1
method = 'e'
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range,
direction=direction,
tolerance=tolerance,
method=method
)
experimental_variogram.plot()
Let’s look into the North-South axis too:
[15]:
direction = 90
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range,
direction=direction,
tolerance=tolerance,
method=method
)
experimental_variogram.plot()
What can we learn from those plots? Variability along the W-E axis is lower than along the N-S axis. It may indicate continuous spatial structures along the W-E axis (for example, a river plain).
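Under the hood, a directional variogram keeps only the point pairs whose connecting vector lies close to the chosen azimuth. A simplified sketch of that selection step, using a plain angular cutoff in degrees for illustration (note this is an assumption for clarity: pyinterpolate's actual tolerance parameter is a 0-1 fraction that defines an ellipse, not an angle):

```python
import numpy as np

def pair_in_direction(p1, p2, direction_deg, tolerance_deg):
    """Return True when the vector between two points lies within
    +/- tolerance_deg of the given azimuth. Variogram pairs are
    unordered, so angles are folded into [0, 180)."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    angle = np.degrees(np.arctan2(dy, dx)) % 180.0
    target = direction_deg % 180.0
    diff = abs(angle - target)
    diff = min(diff, 180.0 - diff)  # handle wrap-around, e.g. 179 vs 1 deg
    return diff <= tolerance_deg

# A pair aligned with the x-axis matches direction 0 but not 90
print(pair_in_direction((0, 0), (1, 0), direction_deg=0, tolerance_deg=10))
print(pair_in_direction((0, 0), (1, 0), direction_deg=90, tolerance_deg=10))
```

A directional estimator then computes the usual per-lag averages over only the pairs that pass this filter, which is why directional variograms have fewer point pairs per lag than omnidirectional ones.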
At this point, we are ready for modeling. However, if we want to be precise, we should look into outliers and analyze the VariogramCloud.

2. Variogram Cloud class#

A classic variogram analysis tells us almost everything about spatial dependencies and dissimilarities in data. But there are more spatial investigation tools. The distribution of variances within a single lag is valuable information, and the semivariogram shows us only averaged semivariances. If we want to dig deeper, we should use the VariogramCloud class and plot… the variogram cloud.
The VariogramCloud class definition API is:
class VariogramCloud:
def __init__(self,
input_array,
step_size,
max_range,
direction=0,
tolerance=1,
calculate_on_creation=True):
pass
We pass the same parameters to the class as for the ExperimentalVariogram class; the only difference is the calculate_on_creation boolean. Variogram point cloud calculation is slow, which is why we can run it either during or after class initialization. If we set calculate_on_creation to False, we must run the .calculate_experimental_variogram() method later.
VariogramCloud has more methods than ExperimentalVariogram:

- calculate_experimental_variogram() to obtain the variogram (similar to the output from the ExperimentalVariogram class),
- describe() to get lag statistics,
- plot() to plot the variogram cloud,
- remove_outliers() to clean the variogram.
We will start with calculating and plotting the variogram cloud.
[16]:
variogram_cloud = VariogramCloud(
input_array=dem,
step_size=step_size,
max_range=max_range
)
We can plot three different kinds of distribution plots: 'scatter', 'box', and 'violin'. We will choose box because it shows the data distribution more clearly than scatter and is easier to interpret than a violin plot.
[17]:
variogram_cloud.plot(kind='box')
[17]:
True
The general shape is similar to the shape of the experimental variogram. We can see that semivariances between point pairs contain outliers, especially for the lags close to our maximum range. To be sure about outliers’ existence, let’s look into a more complex violin plot:
[18]:
variogram_cloud.plot(kind='violin')
[18]:
True
Extreme outliers are more visible now. Another thing to note is the difference between the mean and the median: the distribution is highly skewed toward large values. If a graphical representation is not enough, we can describe the statistics with the .describe() method:
[19]:
variogram_cloud.describe()
[19]:
{500: {'count': 234,
'avg semivariance': 15.48837895423851,
'std': 129.30689361933185,
'min': 0.0,
'max': 1143.5633926376795,
'25%': 0.3451266676565865,
'median': 1.3447192054627521,
'75%': 13.65846407461504,
'skewness': 6.865653366086163,
'kurtosis': 50.79024671616988,
'lag': 500},
1000: {'count': 606,
'avg semivariance': 25.151443803715708,
'std': 130.72724691457688,
'min': 1.3315911928657442e-06,
'max': 1077.959058797118,
'25%': 0.7740007161801259,
'median': 5.823250695397292,
'75%': 31.42736609817075,
'skewness': 4.567298658957882,
'kurtosis': 24.214435519746083,
'lag': 1000},
1500: {'count': 1008,
'avg semivariance': 57.835574935521514,
'std': 331.73215147893416,
'min': 7.597882358822972e-05,
'max': 3046.333026852517,
'25%': 0.968566885097971,
'median': 7.822932168508487,
'75%': 64.66452046956692,
'skewness': 5.608396965535479,
'kurtosis': 37.38797721538239,
'lag': 1500},
2000: {'count': 1348,
'avg semivariance': 79.75218552208693,
'std': 428.8461018481998,
'min': 3.025872865691781e-05,
'max': 3796.9629175248883,
'25%': 1.2690777453244664,
'median': 18.391886290972252,
'75%': 109.60415145155275,
'skewness': 4.869674455730696,
'kurtosis': 27.620361886790118,
'lag': 2000},
2500: {'count': 1698,
'avg semivariance': 145.95094727929936,
'std': 805.2460216899517,
'min': 4.243338480591774e-08,
'max': 5699.690698992359,
'25%': 1.7351306386444776,
'median': 22.872865243130946,
'75%': 162.32509036679767,
'skewness': 4.209560111010213,
'kurtosis': 18.750587407226615,
'lag': 2500},
3000: {'count': 1992,
'avg semivariance': 199.33218908929288,
'std': 911.2555224174007,
'min': 1.0608346201479435e-08,
'max': 5720.016503487772,
'25%': 2.4267591861680557,
'median': 32.15229412448389,
'75%': 249.17055305647318,
'skewness': 3.2787661143480946,
'kurtosis': 11.293592711474577,
'lag': 3000},
3500: {'count': 2154,
'avg semivariance': 209.00387222055934,
'std': 913.5411306035585,
'min': 6.0737496824003756e-05,
'max': 5516.191060331708,
'25%': 2.9810246527231357,
'median': 48.88620506064035,
'75%': 306.07525457920565,
'skewness': 3.132272141037236,
'kurtosis': 9.972795491912944,
'lag': 3500},
4000: {'count': 2572,
'avg semivariance': 268.3259773207153,
'std': 1056.4261950767657,
'min': 2.6099187380168587e-06,
'max': 5514.402312608567,
'25%': 2.6644885819441697,
'median': 61.834192301528674,
'75%': 381.5180078523699,
'skewness': 2.473310799225225,
'kurtosis': 5.473415218908773,
'lag': 4000},
4500: {'count': 2832,
'avg semivariance': 307.6134533281816,
'std': 1113.7493026553932,
'min': 4.211964551359415e-06,
'max': 5829.092762578512,
'25%': 5.008626806946268,
'median': 82.54248778484907,
'75%': 606.8352795074716,
'skewness': 2.3016213999312303,
'kurtosis': 4.838611890454055,
'lag': 4500},
5000: {'count': 2924,
'avg semivariance': 341.33197977216213,
'std': 1184.6820756378802,
'min': 7.017538882791996e-05,
'max': 5967.1652707402245,
'25%': 6.364002417831216,
'median': 105.43215119686101,
'75%': 714.8062540315768,
'skewness': 2.217432052486689,
'kurtosis': 4.458010177884344,
'lag': 5000},
5500: {'count': 3192,
'avg semivariance': 397.7678526465361,
'std': 1310.190495093556,
'min': 3.9138831198215485e-07,
'max': 7494.872101836896,
'25%': 7.306133981550374,
'median': 117.5330814906647,
'75%': 973.5924935907096,
'skewness': 2.048285104289412,
'kurtosis': 3.80948479120119,
'lag': 5500},
6000: {'count': 3188,
'avg semivariance': 441.4588322976255,
'std': 1341.9100279445179,
'min': 4.899957275483757e-07,
'max': 6701.4388024286745,
'25%': 9.152769157113653,
'median': 191.89130915285932,
'75%': 1115.490027037231,
'skewness': 1.8576607961887481,
'kurtosis': 2.8155666464376194,
'lag': 6000},
6500: {'count': 3480,
'avg semivariance': 457.2660981146536,
'std': 1364.1466464135244,
'min': 6.846037285868078e-05,
'max': 7663.801433086061,
'25%': 10.418562886014115,
'median': 204.0729240650271,
'75%': 1193.259201243348,
'skewness': 1.8383466986007484,
'kurtosis': 3.1069471071828714,
'lag': 6500},
7000: {'count': 3662,
'avg semivariance': 467.70408815019107,
'std': 1383.7114275988758,
'min': 2.620747545734048e-05,
'max': 7732.3306628377795,
'25%': 12.876975894745556,
'median': 302.43025554287306,
'75%': 1214.8304466051668,
'skewness': 1.9378885777226609,
'kurtosis': 3.5563617512199306,
'lag': 7000},
7500: {'count': 3634,
'avg semivariance': 516.9270183174606,
'std': 1389.344310228298,
'min': 1.1408701539039612e-06,
'max': 7813.73612767295,
'25%': 18.949909821152687,
'median': 359.12963762053187,
'75%': 1490.0087687096093,
'skewness': 1.554949463509357,
'kurtosis': 1.7478735159246037,
'lag': 7500},
8000: {'count': 3896,
'avg semivariance': 541.1530674487567,
'std': 1428.070471120106,
'min': 0.00011741125854314305,
'max': 7973.467257830838,
'25%': 19.034676041905186,
'median': 433.6325574710372,
'75%': 1563.9827408457786,
'skewness': 1.528514122993604,
'kurtosis': 1.7506732493933423,
'lag': 8000},
8500: {'count': 3858,
'avg semivariance': 547.5056312500515,
'std': 1468.6396045259646,
'min': 1.7320417100563645e-06,
'max': 7673.916123840958,
'25%': 23.704184239628376,
'median': 448.1990261501778,
'75%': 1405.225225292179,
'skewness': 1.5977571608298324,
'kurtosis': 1.8190966680505722,
'lag': 8500},
9000: {'count': 3948,
'avg semivariance': 570.1379938012599,
'std': 1447.8532805956781,
'min': 3.0405793950194493e-05,
'max': 6551.827118135992,
'25%': 37.25730361373644,
'median': 532.4391559513406,
'75%': 1588.2390081191297,
'skewness': 1.4628469824631245,
'kurtosis': 1.232784763520586,
'lag': 9000},
9500: {'count': 3730,
'avg semivariance': 585.3034449746383,
'std': 1511.2436330394091,
'min': 0.00023756933296681382,
'max': 7855.885675042067,
'25%': 39.582661400436336,
'median': 571.2002264822368,
'75%': 1571.7003516764962,
'skewness': 1.5794687444921045,
'kurtosis': 1.8317167345076042,
'lag': 9500}}
We get a Python dict as the output, but it is not especially readable; let’s convert it to a DataFrame:
[20]:
desc = pd.DataFrame(variogram_cloud.describe())
[21]:
desc
[21]:
500 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 | 4500 | 5000 | 5500 | 6000 | 6500 | 7000 | 7500 | 8000 | 8500 | 9000 | 9500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 234.000000 | 606.000000 | 1008.000000 | 1348.000000 | 1.698000e+03 | 1.992000e+03 | 2154.000000 | 2572.000000 | 2832.000000 | 2924.000000 | 3.192000e+03 | 3.188000e+03 | 3480.000000 | 3662.000000 | 3634.000000 | 3896.000000 | 3858.000000 | 3948.000000 | 3730.000000 |
avg semivariance | 15.488379 | 25.151444 | 57.835575 | 79.752186 | 1.459509e+02 | 1.993322e+02 | 209.003872 | 268.325977 | 307.613453 | 341.331980 | 3.977679e+02 | 4.414588e+02 | 457.266098 | 467.704088 | 516.927018 | 541.153067 | 547.505631 | 570.137994 | 585.303445 |
std | 129.306894 | 130.727247 | 331.732151 | 428.846102 | 8.052460e+02 | 9.112555e+02 | 913.541131 | 1056.426195 | 1113.749303 | 1184.682076 | 1.310190e+03 | 1.341910e+03 | 1364.146646 | 1383.711428 | 1389.344310 | 1428.070471 | 1468.639605 | 1447.853281 | 1511.243633 |
min | 0.000000 | 0.000001 | 0.000076 | 0.000030 | 4.243338e-08 | 1.060835e-08 | 0.000061 | 0.000003 | 0.000004 | 0.000070 | 3.913883e-07 | 4.899957e-07 | 0.000068 | 0.000026 | 0.000001 | 0.000117 | 0.000002 | 0.000030 | 0.000238 |
max | 1143.563393 | 1077.959059 | 3046.333027 | 3796.962918 | 5.699691e+03 | 5.720017e+03 | 5516.191060 | 5514.402313 | 5829.092763 | 5967.165271 | 7.494872e+03 | 6.701439e+03 | 7663.801433 | 7732.330663 | 7813.736128 | 7973.467258 | 7673.916124 | 6551.827118 | 7855.885675 |
25% | 0.345127 | 0.774001 | 0.968567 | 1.269078 | 1.735131e+00 | 2.426759e+00 | 2.981025 | 2.664489 | 5.008627 | 6.364002 | 7.306134e+00 | 9.152769e+00 | 10.418563 | 12.876976 | 18.949910 | 19.034676 | 23.704184 | 37.257304 | 39.582661 |
median | 1.344719 | 5.823251 | 7.822932 | 18.391886 | 2.287287e+01 | 3.215229e+01 | 48.886205 | 61.834192 | 82.542488 | 105.432151 | 1.175331e+02 | 1.918913e+02 | 204.072924 | 302.430256 | 359.129638 | 433.632557 | 448.199026 | 532.439156 | 571.200226 |
75% | 13.658464 | 31.427366 | 64.664520 | 109.604151 | 1.623251e+02 | 2.491706e+02 | 306.075255 | 381.518008 | 606.835280 | 714.806254 | 9.735925e+02 | 1.115490e+03 | 1193.259201 | 1214.830447 | 1490.008769 | 1563.982741 | 1405.225225 | 1588.239008 | 1571.700352 |
skewness | 6.865653 | 4.567299 | 5.608397 | 4.869674 | 4.209560e+00 | 3.278766e+00 | 3.132272 | 2.473311 | 2.301621 | 2.217432 | 2.048285e+00 | 1.857661e+00 | 1.838347 | 1.937889 | 1.554949 | 1.528514 | 1.597757 | 1.462847 | 1.579469 |
kurtosis | 50.790247 | 24.214436 | 37.387977 | 27.620362 | 1.875059e+01 | 1.129359e+01 | 9.972795 | 5.473415 | 4.838612 | 4.458010 | 3.809485e+00 | 2.815567e+00 | 3.106947 | 3.556362 | 1.747874 | 1.750673 | 1.819097 | 1.232785 | 1.831717 |
lag | 500.000000 | 1000.000000 | 1500.000000 | 2000.000000 | 2.500000e+03 | 3.000000e+03 | 3500.000000 | 4000.000000 | 4500.000000 | 5000.000000 | 5.500000e+03 | 6.000000e+03 | 6500.000000 | 7000.000000 | 7500.000000 | 8000.000000 | 8500.000000 | 9000.000000 | 9500.000000 |
With this detailed table, we can analyze the variogram in detail and decide:

- if lags are correctly placed,
- if there are enough point pairs per lag,
- if the maximum range is too low or too high,
- if the distribution is close to normal.

Based on the answers to those questions, we can adjust the semivariogram parameters.
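Such checks can also be scripted against the describe() output. Here is a sketch on a toy table shaped like the desc DataFrame above; the thresholds (300 pairs, skewness of 2) are arbitrary examples of ours, not recommendations from the library:

```python
import pandas as pd

# Toy table shaped like the describe() output: one column per lag
stats = pd.DataFrame({
    500: {'count': 234, 'skewness': 6.87},
    1000: {'count': 606, 'skewness': 4.57},
    9500: {'count': 3730, 'skewness': 1.58},
})

# Flag lags with too few point pairs to trust the averaged semivariance
low_count_lags = [int(lag) for lag in stats.columns
                  if stats[lag]['count'] < 300]

# Flag lags whose within-lag distribution is strongly right-skewed
skewed_lags = [int(lag) for lag in stats.columns
               if stats[lag]['skewness'] > 2]

print(low_count_lags)  # -> [500]
print(skewed_lags)     # -> [500, 1000]
```

Flagged lags are candidates for a larger step_size (to gather more pairs per bin) or for outlier removal before modeling.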
The most interesting row is avg semivariance. Let’s plot it against lags, together with the output from ExperimentalVariogram, in the same figure:
[22]:
# Re-calculate experimental variogram
max_range = 10000
step_size = 500
experimental_variogram = ExperimentalVariogram(
input_array=dem,
step_size=step_size,
max_range=max_range
)
[23]:
# Plot
plt.figure()
plt.plot(experimental_variogram.experimental_semivariance_array[:, 0],
experimental_variogram.experimental_semivariance_array[:, 1])
plt.plot(desc.loc['avg semivariance'])
plt.legend(['ExperimentalVariogram output', 'Average semivariance from VariogramCloud'])
plt.show()
The result is the same, and it shouldn’t be a surprise: the experimental variogram averages all semivariances per lag. With the variogram cloud, we simply step one stage back in the computations and look at all point pairs within a lag.
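That equivalence is easy to demonstrate in miniature: averaging the per-pair semivariances within each lag bin reproduces the experimental semivariogram value for that lag. A toy sketch:

```python
import numpy as np

# Toy variogram cloud: per-lag lists of individual pair semivariances
cloud = {500: [2.0, 4.0, 6.0], 1000: [10.0, 14.0]}

# Averaging each bin yields the experimental semivariogram
experimental = {lag: float(np.mean(vals)) for lag, vals in cloud.items()}
print(experimental)  # -> {500: 4.0, 1000: 12.0}
```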
There is one last method within the VariogramCloud class: .remove_outliers(). It cleans our variogram of anomalous readings. The default cleaning algorithm uses the z-score, but we can also use interquartile range (IQR) analysis. Both methods cut outliers from the highest or the lowest intervals. For data that deviates greatly from the normal distribution, it is better to use the interquartile range for cleaning:
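The IQR fence behind that cleaning mode can be sketched in plain numpy: values above Q3 + upper_limit * IQR or below Q1 - lower_limit * IQR are flagged as outliers. A simplified stand-alone version (pyinterpolate's exact implementation may differ):

```python
import numpy as np

def iqr_outlier_mask(values, lower_limit=1.5, upper_limit=1.5):
    """Boolean mask marking values outside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low_fence = q1 - lower_limit * iqr
    high_fence = q3 + upper_limit * iqr
    return (values < low_fence) | (values > high_fence)

semivariances = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(semivariances)
print(semivariances[~mask])  # extreme value dropped
```

Raising the limits keeps more of the tail; the tutorial's call below uses asymmetric limits, cutting the upper tail more aggressively than the lower one.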
[24]:
new_variogram_cloud = variogram_cloud.remove_outliers(method='iqr', iqr_lower_limit=3, iqr_upper_limit=2)
Now we can check what has happened with the variogram:
[25]:
new_variogram_cloud.plot(kind='violin')
[25]:
True
The maximum semivariances are much lower than before. But how does the averaged semivariogram look? Now we will use the .calculate_experimental_variogram() method to calculate the variogram directly and compare it to the experimental variogram retrieved from the ExperimentalVariogram class.
[26]:
# Plot
exp_var_from_point_cloud = new_variogram_cloud.calculate_experimental_variogram()
plt.figure()
plt.plot(experimental_variogram.experimental_semivariance_array[:, 0],
experimental_variogram.experimental_semivariance_array[:, 1])
plt.plot(exp_var_from_point_cloud[:, 0],
exp_var_from_point_cloud[:, 1])
plt.legend(['ExperimentalVariogram output', 'Average semivariance from VariogramCloud'])
plt.show()
Now we see that the shape of the variogram persisted, but the maximum values per lag are lower. This can be helpful for some models, especially when we want to eliminate outliers with extreme values.
Summary#
After all those steps, you have better insight into the ExperimentalVariogram and VariogramCloud classes. You may check your own dataset and compare the results to those in the tutorial.
The main takeaways from this tutorial are:
API of
ExperimentalVariogram
andVariogramCloud
classes,differences and similarities between
ExperimentalVariogram
andVariogramCloud
classes,full variogram analysis and preparation before spatial interpolation.
Where to go from here?#
B.1.1 Ordinary and Simple Kriging
B.3.1 Directional Ordinary Kriging
C.1.1 Outliers and Kriging Model
Changelog#
Date | Change description | Author
---|---|---
2023-08-22 | The tutorial was refreshed and set along with the 0.5.0 version of the package | @SimonMolinsky
2023-01-23 | The first release of the tutorial | @SimonMolinsky