Blocks to points with Ordinary Kriging#

This tutorial is based on the question of user @ikey from Discord channel of the package (02/04/2021). Thank you @ikey for your question!

Are we able to perform Semivariogram Regularization and Poisson Kriging if our point support represents only points without any values?

No, it’s not possible. But there is a hack that we can use - we can interpolate missing values from areal centroids with Ordinary and Simple Kriging just like we do with regular points. Why is this not possible with unknown points?

The idea behind semivariogram regularization of areal aggregates is to use semivariogram of point support. Point support must reflect an ongoing process. For example, in the case of epidemiology:

We have areal counts of infections divided by POPULATION over some areas (infection rates).
We take POPULATION blocks (points) with a number of inhabitants per block and use those as support for our map. Thus, we transform the choropleth map into the point map, taking only points where someone is living and might be infected. In other words, we use point values (population size) used in the denominator of the variable aggregated in choropleth map. We skip large parts of the map where the risk of infection is not present because no one is living there.
We obtain the population-at-risk map of points or smoothed choropleth map of infection risk without large visual bias where large/small areas seem to be more important, and infection risk is dispersed over the populated places.

This works when we have counted observations divided by another parameter (time, population, probability, volume, etc.). Without support values, e.g.: rate denominator, we are not able to perform semivariogram regularization.

What we want to achieve - transforming block data into point grid without any values - might be done using Ordinary Kriging, and semivariogram regularization is not required (and not possible when denominator is equal to zero).

In summary, this notebook shows how to transform blocks data into a pre-defined point grid, where points of the point grid don’t have any meaning.

Prerequisites#

Domain:
- semivariance and covariance functions
- kriging (basics)
- outliers removal
Package:
- ordinary_kriging()
- interpolate_points()
- VariogramCloud()
Programming:
- Python basics
- pandas basics

Table of contents#

Prepare data
Detect and remove outliers
Fit semivariogram model
Prepare canvas
Interpolate from blocks on a given canvas

1. Prepare data#

The block data represents the breast cancer rates in the Northeastern part of the United States. Rate is the number of new cases divided by the population in a county, and this fraction is multiplied by a constant factor of 100,000.

[1]:

import numpy as np
import geopandas as gpd

import matplotlib.pyplot as plt
from pyinterpolate import Blocks, VariogramCloud, TheoreticalVariogram, interpolate_points

[2]:

DATASET = '../data/blocks/cancer_data.gpkg'
CANVAS = '../data/blocks/regular_grid_points.shp'

POLYGON_LAYER = 'areas'
GEOMETRY_COL = 'geometry'
POLYGON_ID = 'FIPS'
POLYGON_VALUE = 'rate'

[3]:

AREAL_INPUT = Blocks(
    ds=gpd.read_file(DATASET, layer=POLYGON_LAYER),
    value_column_name=POLYGON_VALUE,
    index_column_name=POLYGON_ID
)
AREAL_INPUT.ds.head()

[3]:

	FIPS	rate	geometry	representative_points	lon	lat
0	25019	192.2	MULTIPOLYGON (((2115688.816 556471.24, 2115699...	POINT (2133348.745 558025.221)	2.133349e+06	558025.221156
1	36121	166.8	POLYGON ((1419423.43 564830.379, 1419729.721 5...	POINT (1442153.243 550673.936)	1.442153e+06	550673.935704
2	33001	157.4	MULTIPOLYGON (((1937530.728 779787.978, 193751...	POINT (1958207.252 766008.62)	1.958207e+06	766008.620010
3	25007	156.7	MULTIPOLYGON (((2074073.532 539159.504, 207411...	POINT (2084187.905 556436.412)	2.084188e+06	556436.412055
4	25001	155.3	MULTIPOLYGON (((2095343.207 637424.961, 209528...	POINT (2101100.939 600344.125)	2.101101e+06	600344.124709

[4]:

areal_centroids = AREAL_INPUT.ds[['lon', 'lat', 'rate']].to_numpy()
areal_centroids[0]

[4]:

array([2.13334874e+06, 5.58025221e+05, 1.92200000e+02])

2. Detect and remove outliers#

We will clean data before the semivariogram fitting and the Ordinary Kriging interpolation. Variogram Point Cloud is the best way to analyze data and detect outliers. We inspect |and compare different lags and step sizes and their respective point clouds. Then we will create multiple variograms and Kriging models.

[5]:

# Create analysis parameters

maximum_range = 300000

number_of_lags = [4, 8, 16]
step_sizes = [maximum_range / x for x in number_of_lags]

variogram_clouds = []

for step_size in step_sizes:
    max_range = maximum_range + step_size
    vc = VariogramCloud(values=areal_centroids[:, -1],
                        geometries=areal_centroids[:, :-1],
                        step_size=step_size,
                        max_range=max_range)
    variogram_clouds.append(vc)

[6]:

for vc in variogram_clouds:
    print(f'Lags per area: {len(vc.lags)}')
    print('')
    vc.plot('violin')
    print('\n###########\n')

Lags per area: 4

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_7_1.png


###########

Lags per area: 8

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_7_3.png


###########

Lags per area: 16

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_7_5.png


###########

What do plots tell us?

point to point errors per lag are skewed toward positive values (the mean is always greater than the median)
less lags == less variability
variability in the last case (16 lags) seems to be too high
variability in the first case (4 lags) seems to be too small
there are extreme semivariance values, and the largest extremities are present for the most distant lags

What can we do?

remove outliers (extremely high semivariances) in each lag
use a small number of neighbors and a range close to 200,000 to avoid unnecessary computations

[7]:

# Now remove outliers from each lag

_ = [
    vc.remove_outliers(
        method='iqr',
        iqr_lower_limit=3,
        iqr_upper_limit=1.5,
        inplace=True
    ) for vc in variogram_clouds
]

[8]:

# And show data without outliers

for vc in variogram_clouds:
    print(f'Lags per area: {len(vc.lags)}')
    print('')
    vc.plot('violin')
    print('\n###########\n')

Lags per area: 4

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_10_1.png


###########

Lags per area: 8

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_10_3.png


###########

Lags per area: 16

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_10_5.png


###########

If we compare the y axis before and after removing outliers, we may see that values are now smaller and densely populated. The variability for 16 lags, especially at the low ranges, is too high. Model build on top of this variogram might behave poorly.

3. Fit semivariogram model#

Semivariogram may be fitted manually or automatically. In this case, we fit it automatically - we test multiple models. We didn’t force any model to see how different results we would get.

[9]:

theoretical_semivariograms = []

for idx, vc in enumerate(variogram_clouds):

    print(f'Semivariance calculated for {len(vc.lags)} lags.')

    # Calculate experimental model
    exp_model = vc.experimental_semivariances()

    # Assign experimental model and data to TheoreticalSemivariogram
    theo_semi = TheoreticalVariogram()
    theo_semi.autofit(experimental_variogram=exp_model)
    theoretical_semivariograms.append(theo_semi)
    print('')
    print('Model parameters:')
    print('Model type:', theo_semi.name)
    print('Nugget:', theo_semi.nugget)
    print('Sill:', theo_semi.sill)
    print('Range:', theo_semi.rang)
    print('Model error:', theo_semi.rmse)
    print('')
    print('')

Semivariance calculated for 4 lags.

Model parameters:
Model type: spherical
Nugget: 36.47814412070759
Sill: 113.9843806262891
Range: 130635.90676540829
Model error: 6.013657676895086


Semivariance calculated for 8 lags.

Model parameters:
Model type: spherical
Nugget: 56.468678977272724
Sill: 91.20165827269054
Range: 130635.90676540829
Model error: 9.285171141696564


Semivariance calculated for 16 lags.

Model parameters:
Model type: spherical
Nugget: 75.54535714285717
Sill: 78.23809400314832
Range: 165472.14856951716
Model error: 17.306288625624866

Each case has a different model type, a different sill, and a different range! How do we choose the model parameters appropriately in this scenario? Error rises with the number of lags but is it a good indicator of the semivariogram fit? No, and we should be careful when choosing variograms with a few lags instead of variograms with multiple lags. We may miss some spatial patterns that will be averaged with a smaller number of lags. We will choose variogram with 8 lags, even if the error is not the best.

4. Prepare canvas#

Our variogram model is ready to load the point file as the GeoDataFrame object, which stores the geometry column and data features as tables.

[10]:

gdf_pts = gpd.read_file(CANVAS)
gdf_pts.crs == AREAL_INPUT.ds.crs

[10]:

True

[11]:

gdf_pts.set_index('id', inplace=True)
gdf_pts['x'] = gdf_pts.geometry.x
gdf_pts['y'] = gdf_pts.geometry.y
gdf_pts.head()

[11]:

	geometry	x	y
id
81.0	POINT (1277277.671 441124.507)	1.277278e+06	441124.5068
82.0	POINT (1277277.671 431124.507)	1.277278e+06	431124.5068
83.0	POINT (1277277.671 421124.507)	1.277278e+06	421124.5068
84.0	POINT (1277277.671 411124.507)	1.277278e+06	411124.5068
85.0	POINT (1277277.671 401124.507)	1.277278e+06	401124.5068

5. Interpolate#

We can model all values in a batch using function interpolate_points().

Based on the learning from a semivariogram modeling step, we will set a number of neighboring areas to 8 and range to 200000. We will append interpolated values and errors to an existing data frame and plot them to compare results.

[12]:

preds_cols = []
errs_cols = []

for semi_model in theoretical_semivariograms:
    kriged = interpolate_points(
        theoretical_model=semi_model,
        known_values=areal_centroids[:, -1],
        known_geometries=areal_centroids[:, :-1],
        unknown_locations=gdf_pts[['x', 'y']].to_numpy(),
        neighbors_range=200000,
        no_neighbors=16,
        use_all_neighbors_in_range=True
    )

    # Interpolate missing values and uncertainty
    pred_col_name = 'p ' + semi_model.name[:3] + ' ' + str(len(semi_model.lags))
    uncertainty_col_name = 'e ' + semi_model.name[:3] + ' ' + str(len(semi_model.lags))
    gdf_pts[pred_col_name] = kriged[:, 0]
    gdf_pts[uncertainty_col_name] = kriged[:, 1]
    preds_cols.append(pred_col_name)
    errs_cols.append(uncertainty_col_name)

100%|██████████| 5419/5419 [01:02<00:00, 86.91it/s]
100%|██████████| 5419/5419 [00:18<00:00, 285.90it/s]
100%|██████████| 5419/5419 [00:02<00:00, 2350.68it/s]

[13]:

gdf_pts.head()

[13]:

	geometry	x	y	p sph 4	e sph 4	p sph 8	e sph 8	p sph 16	e sph 16
id
81.0	POINT (1277277.671 441124.507)	1.277278e+06	441124.5068	131.300950	102.178639	131.300950	109.037233	133.190048	112.323151
82.0	POINT (1277277.671 431124.507)	1.277278e+06	431124.5068	129.939483	100.512382	129.939483	107.704020	130.831391	111.218534
83.0	POINT (1277277.671 421124.507)	1.277278e+06	421124.5068	129.052488	99.533481	129.052488	106.920779	129.964626	110.470425
84.0	POINT (1277277.671 411124.507)	1.277278e+06	411124.5068	128.396747	98.920356	128.396747	106.430202	129.282825	110.013474
85.0	POINT (1277277.671 401124.507)	1.277278e+06	401124.5068	128.288679	98.517309	128.288679	106.107714	129.215478	109.573660

Now we can save our GeoDataFrame as a geopackage or proceed with the analysis.

[14]:

# Save interpolation results

# gdf_pts.to_file('interpolation_results_areal_to_point.gpkg')

[15]:

# Now compare results to choropleth maps
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(16, 16), sharex='all', sharey='all')

base1 = AREAL_INPUT.ds.plot(ax=axes[0, 0], legend=True, edgecolor='black', color='white')
base2 = AREAL_INPUT.ds.plot(ax=axes[0, 1], legend=True, edgecolor='black', color='white')
base3 = AREAL_INPUT.ds.plot(ax=axes[0, 2], legend=True, edgecolor='black', color='white')
base4 = AREAL_INPUT.ds.plot(ax=axes[1, 0], legend=True, edgecolor='black', color='white')
base5 = AREAL_INPUT.ds.plot(ax=axes[1, 1], legend=True, edgecolor='black', color='white')
base6 = AREAL_INPUT.ds.plot(ax=axes[1, 2], legend=True, edgecolor='black', color='white')

gdf_pts.plot(ax=base1, column=preds_cols[0], cmap='Spectral_r', alpha=0.3)
gdf_pts.plot(ax=base2, column=preds_cols[1], cmap='Spectral_r', alpha=0.3)
gdf_pts.plot(ax=base3, column=preds_cols[2], cmap='Spectral_r', alpha=0.3)
gdf_pts.plot(ax=base4, column=errs_cols[0], cmap='Reds', alpha=0.3)
gdf_pts.plot(ax=base5, column=errs_cols[1], cmap='Reds', alpha=0.3)
gdf_pts.plot(ax=base6, column=errs_cols[2], cmap='Reds', alpha=0.3)

axes[0, 0].set_title(preds_cols[0])
axes[0, 1].set_title(preds_cols[1])
axes[0, 2].set_title(preds_cols[2])
axes[1, 0].set_title(errs_cols[0])
axes[1, 1].set_title(errs_cols[1])
axes[1, 2].set_title(errs_cols[2])

plt.suptitle('Comparison of different Kriging models outputs')
plt.show()

../../../_images/usage_tutorials_functional_3-5-blocks-to-points-ordinary-kriging_22_0.png

Are absolute errors between maps large or not? Let’s check it in the last step:

[16]:

for pcol1 in preds_cols:
    print('Column:', pcol1)
    for pcol2 in preds_cols:
        if pcol1 == pcol2:
            pass
        else:
            mad = gdf_pts[pcol1] - gdf_pts[pcol2]
            mad = np.abs(np.mean(mad))
            print(f'Mean Absolute Difference with {pcol2} is {mad:.9f}')
    print('')

Column: p sph 4
Mean Absolute Difference with p sph 8 is 0.000000000
Mean Absolute Difference with p sph 16 is 0.020515483

Column: p sph 8
Mean Absolute Difference with p sph 4 is 0.000000000
Mean Absolute Difference with p sph 16 is 0.020515483

Column: p sph 16
Mean Absolute Difference with p sph 4 is 0.020515483
Mean Absolute Difference with p sph 8 is 0.020515483

Model with 4 lags and model with 8 lags give almost the same results because range of both models is very close.

Changelog#

Date	Changes	Author
2025-11-07	Used values and geometries parameters instead of ds in the experimental variogram initialization, and known_values and known_geometries for kriging process (instead of known_locations parameter)	@SimonMolinsky (Szymon Moliński)
2025-05-18	Tutorial has been adapted to the 1.0 release	@SimonMolinsky (Szymon Moliński)