C.1.3 Poisson Kriging - centroid based approach#

Table of Contents:#

  1. Load areal and point data,

  2. Load semivariogram (regularized),

  3. Remove 40% of the areal dataset,

  4. Predict values at unknown locations,

  5. Analyze Forecast Bias and Root Mean Squared Error of prediction.

Introduction#

Before starting this tutorial, make sure you understand the concepts from the Ordinary and Simple Kriging and Semivariogram Regularization tutorials.

The Poisson Kriging technique is used to model spatial count data. We will analyze a particular case where values are counted over blocks, and those blocks may have irregular shapes and sizes. We will try to predict the breast cancer rates in the Northeastern counties of the U.S. We will use U.S. Census 2010 data to create point support for blocks.

The breast cancer rates data and the point-support population counts are stored in a GeoPackage file: samples/regularization/cancer_data.gpkg
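If you want to inspect this file before loading it, you can list its layers. The snippet below is a minimal sketch and assumes only that fiona (a geopandas dependency) is installed:

import fiona

# List the layers stored in the GeoPackage; this tutorial uses the 'areas' and 'points' layers
print(fiona.listlayers('samples/regularization/cancer_data.gpkg'))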

Even though our areal data has units with irregular shapes and sizes, in this tutorial we assume that each region may be represented by its centroid. This is a huge simplification, and we do it at our own risk. There are cases when we can use this method instead of Area-to-Area Kriging:

  • We must do our calculations fast, and the processing time is crucial,

  • or blocks have similar shapes and sizes.

Centroid-based Poisson Kriging is much faster than Area-to-Area or Area-to-Point Poisson Kriging, and for some applications this speed may be the decisive property.

The tutorial covers the following steps:

  1. Read and explore data,

  2. Load semivariogram model,

  3. Prepare training and test data,

  4. Predict values at unknown locations and calculate forecast bias and root mean squared error,

  5. Analyze error metrics.

1) Read and explore data#

[1]:
import numpy as np
import pandas as pd

from pyinterpolate import TheoreticalVariogram
from pyinterpolate import Blocks, PointSupport
from pyinterpolate import centroid_poisson_kriging
[2]:
DATASET = 'samples/regularization/cancer_data.gpkg'
OUTPUT = 'samples/regularization/regularized_variogram.json'
POLYGON_LAYER = 'areas'
POPULATION_LAYER = 'points'
POP10 = 'POP10'
GEOMETRY_COL = 'geometry'
POLYGON_ID = 'FIPS'
POLYGON_VALUE = 'rate'

blocks = Blocks()
blocks.from_file(DATASET, value_col=POLYGON_VALUE, index_col=POLYGON_ID, layer_name=POLYGON_LAYER)

point_support = PointSupport()
point_support.from_files(point_support_data_file=DATASET,
                         blocks_file=DATASET,
                         point_support_geometry_col=GEOMETRY_COL,
                         point_support_val_col=POP10,
                         blocks_geometry_col=GEOMETRY_COL,
                         blocks_index_col=POLYGON_ID,
                         use_point_support_crs=True,
                         point_support_layer_name=POPULATION_LAYER,
                         blocks_layer_name=POLYGON_LAYER)
[3]:
# Let's take a look at a map of areal counts

blocks.data.plot(column=blocks.value_column_name, cmap='Spectral_r', legend=True, figsize=(12, 8))
[3]:
<Axes: >
[Figure: choropleth map of breast cancer rates in Northeastern U.S. counties]

Clarification: It is a good idea to look at the spatial patterns in a dataset and to visually check that our data do not contain any NaN values. We use the geopandas GeoDataFrame.plot() function with a diverging color map based on the cancer incidence rates. The output choropleth map is not ideal, and it (probably) contains a few unreliable results, for example:

  • In counties with a small population, the ratio of the number of cases to the population size may be high even if the number of cases is low,

  • Counties that are very large and sparsely populated may draw more of our attention than densely populated counties,

  • Transitions of colors (rates) between counties may be too abrupt, even though we know that neighboring counties should have similar rates.

We can overcome those problems using Centroid-based Poisson Kriging for filtering and denoising. But we must be aware that this method introduces a bias: each polygon is collapsed into its centroid, and because the polygons' shapes and sizes differ, the centroid representation erases some spatial patterns.
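To see how much geometry the centroid representation discards, you can overlay the block centroids on the choropleth. This is a minimal sketch; it relies on the centroid_x and centroid_y columns of blocks.data that are also used later in this tutorial:

# Overlay block centroids (black dots) on the rate choropleth to visualize the simplification
ax = blocks.data.plot(column=blocks.value_column_name, cmap='Spectral_r', legend=True, figsize=(12, 8))
ax.scatter(blocks.data['centroid_x'], blocks.data['centroid_y'], color='black', s=5)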

2) Load semivariogram model#

We load a regularized semivariogram from the tutorial about Semivariogram Regularization. You can always perform semivariogram regularization along with the Poisson Kriging, but it is a very long process, and it is more convenient to separate those two steps.

[4]:
semivariogram = TheoreticalVariogram()  # Create TheoreticalVariogram object
semivariogram.from_json('output/regularized_model.json')  # Load the regularized semivariogram

3) Prepare training and test data#

We simply remove 40% of random IDs from the areal dataset (and the same for the point support) and create four arrays: two training arrays with areal and point geometries and values, and two test arrays with the same categories of data.

[5]:
# Remove 40% of rows (values) to test our model

def create_test_areal_set(areal_dataset: Blocks, points_dataset: PointSupport, frac=0.4):
    """

    Parameters
    ----------
    areal_dataset : Blocks
    points_dataset : PointSupport
    frac : float

    Returns
    -------
    : List
        [[training_areas, test_areas, training_points, test_points]]
            equal to:
        [np.ndarray, np.ndarray, np.ndarray, np.ndarray]
            where:
        - areas: [[block index, centroid x, centroid y, value]]
        - point support: [[block id, x, y, value]]
    """
    block_id_col = areal_dataset.index_column_name
    block_data = areal_dataset.data.copy()
    all_ids = block_data[block_id_col].unique()
    training_set_size = int(len(all_ids) * (1 - frac))

    training_ids = np.random.choice(all_ids,
                                    size=training_set_size,
                                    replace=False)

    training_areas = block_data[block_data[block_id_col].isin(training_ids)]
    training_areas = training_areas[[block_id_col, 'centroid_x', 'centroid_y', areal_dataset.value_column_name]].values
    test_areas = block_data[~block_data[block_id_col].isin(training_ids)]
    test_areas = test_areas[[block_id_col, 'centroid_x', 'centroid_y', areal_dataset.value_column_name]].values

    ps_data = points_dataset.point_support.copy()
    ps_ids = points_dataset.block_index_column

    training_points = ps_data[ps_data[ps_ids].isin(training_ids)]
    training_points = training_points[[ps_ids,
                                       points_dataset.x_col,
                                       points_dataset.y_col,
                                       points_dataset.value_column]].values
    test_points = ps_data[~ps_data[ps_ids].isin(training_ids)]
    test_points = test_points[[ps_ids,
                               points_dataset.x_col,
                               points_dataset.y_col,
                               points_dataset.value_column]].values


    output = [training_areas, test_areas, training_points, test_points]
    return output

ar_train, ar_test, pt_train, pt_test = create_test_areal_set(blocks, point_support)
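Before moving on, it may be worth checking that the split has the expected proportions; a minimal sketch using the arrays created above:

# Quick sanity check of the train/test split sizes
print(f'Training areas: {len(ar_train)}, test areas: {len(ar_test)}')
print(f'Training support points: {len(pt_train)}, test support points: {len(pt_test)}')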

4) Predict values at unknown locations and calculate forecast bias and root mean squared error#

With a fitted semivariogram model, you may start working on predictions. Centroid-based Poisson Kriging takes five arguments during the run:

  • semivariogram model (fitted semivariogram model),

  • known areas (training set),

  • known points (training set),

  • unknown block (without rate value),

  • unknown block’s point support.

Two additional parameters control the interpolation process:

  • number of observations (the most critical parameter: how many neighbors affect your area of analysis),

  • weighted by population (a bool parameter; if True, then distances are weighted by the population between points - the default is True for Poisson Kriging).

The algorithm in the cell below iteratively picks one area from the test set and performs prediction. Then, the forecast bias, the difference between the actual and the predicted value, is calculated. The forecast bias tells us whether our algorithm's predictions are too low or too high (under- or overestimation). The next metric is the Root Mean Squared Error. This measure tells us more about outliers and large differences between predictions and actual values. We will see it in the last part of the tutorial.
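As a minimal illustration with made-up numbers (not taken from the dataset), the two metrics for a single block would be computed like this:

# Illustrative values only: error metrics for one predicted block
actual, predicted = 120.0, 112.5
fb = actual - predicted                    # forecast bias: positive -> underestimation
rmse = np.sqrt((predicted - actual) ** 2)  # for a single observation this equals |predicted - actual|
print(fb, rmse)                            # 7.5 7.5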

Your work with Poisson Kriging (or Kriging in general) will usually follow the same pattern:

  • prepare training and test data,

  • use the training data to train the semivariogram model,

  • test different hyperparameters with a test set to find the optimal number of neighbors affecting your area of analysis,

  • predict values at unseen locations OR perform data smoothing.

[6]:
number_of_obs = 4

predslist = []
for unknown_area in ar_test:
    # Select the point support of the unknown block and drop the block id column
    upts = pt_test[pt_test[:, 0] == unknown_area[0]]
    upts = upts[:, 1:]
    try:
        kriging_preds = centroid_poisson_kriging(semivariogram_model=semivariogram,
                                                 blocks=ar_train,
                                                 point_support=pt_train,
                                                 unknown_block=unknown_area[:-1],
                                                 unknown_block_point_support=upts,
                                                 number_of_neighbors=number_of_obs,
                                                 is_weighted_by_point_support=True,
                                                 raise_when_negative_prediction=True,
                                                 raise_when_negative_error=True)
    except ValueError:
        # Prediction failed (e.g. a negative prediction or a negative error was raised)
        predslist.append([unknown_area[0], np.nan, np.nan, np.nan, np.nan])
    else:
        # For a single observation the RMSE reduces to the absolute error
        rmse = np.sqrt((kriging_preds[1] - unknown_area[-1])**2)
        fb = unknown_area[-1] - kriging_preds[1]
        kriging_preds.extend([rmse, fb])
        predslist.append(kriging_preds)
[7]:
# [unknown block index, prediction, error, rmse, fb]
predslist = np.array(predslist)
pred_df = pd.DataFrame(data=predslist[:, 1:],
                       index=predslist[:, 0],
                       columns=['predicted', 'err', 'rmse', 'fb'])  # Store results in DataFrame
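The same loop can also be used to test different numbers of neighbors, as mentioned in the workflow above. The sketch below is not part of the original notebook; it wraps the prediction loop in a hypothetical helper and compares the mean RMSE per setting:

# A minimal sketch (hypothetical helper): compare the mean RMSE for several neighbor counts
def mean_rmse_for_neighbors(n_neighbors):
    errors = []
    for unknown_area in ar_test:
        upts = pt_test[pt_test[:, 0] == unknown_area[0]][:, 1:]
        try:
            preds = centroid_poisson_kriging(semivariogram_model=semivariogram,
                                             blocks=ar_train,
                                             point_support=pt_train,
                                             unknown_block=unknown_area[:-1],
                                             unknown_block_point_support=upts,
                                             number_of_neighbors=n_neighbors,
                                             is_weighted_by_point_support=True,
                                             raise_when_negative_prediction=True,
                                             raise_when_negative_error=True)
        except ValueError:
            continue  # skip blocks that cannot be interpolated with this setting
        errors.append(np.sqrt((preds[1] - unknown_area[-1]) ** 2))
    return np.mean(errors)

for n_neighbors in (4, 8, 16):
    print(n_neighbors, mean_rmse_for_neighbors(n_neighbors))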

5) Analyze Forecast Bias and Root Mean Squared Error of prediction#

The analysis of errors is the last analysis step. We plot two histograms, forecast bias and root mean squared error, and then we calculate the basic statistics of the dataset.

[8]:
# Show histograms of errors

pred_df['fb'].plot.hist(bins=10)
[8]:
<Axes: ylabel='Frequency'>
[Figure: histogram of forecast bias]
[9]:
pred_df['rmse'].plot.hist(bins=10)
[9]:
<Axes: ylabel='Frequency'>
[Figure: histogram of root mean squared error]
[10]:
pred_df.describe()
[10]:
        predicted        err       rmse         fb
count   83.000000  83.000000  83.000000  83.000000
mean   132.867622   8.825592   9.745322  -1.017020
std     10.440986   2.751169   8.013841  12.621576
min    105.681428   1.156868   0.002419 -30.980696
25%    125.629770   7.266978   4.105316  -8.052315
50%    133.459074   8.893322   7.967837  -0.281428
75%    138.548781  10.717209  12.820111   7.643090
max    161.959004  14.974212  32.017309  32.017309

Clarification: Analyzing the Forecast Bias and the Root Mean Squared Error - their distributions and basic properties - can be a handy way to assess model performance. However, consider that the table above is a single test case (realization) and can be misleading. A good idea is to repeat the test dozens of times, with a different training/test division each time. After this, we average the results from multiple tests and get insight into our model's behavior.

Note 1: These results are not conclusive. Our sample has been selected randomly, and there is a chance that it is not a spatially representative sample (e.g., areas only from one region)! A good idea is to repeat the experiment multiple times with other samples and average the results to determine how well the model performs.
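A minimal sketch of such a repetition follows; run_single_experiment is a hypothetical wrapper around the split and prediction steps shown earlier, returning the mean forecast bias and mean RMSE for one random split:

# Repeat the experiment on many random splits and average the error metrics
n_repeats = 30
results = []
for _ in range(n_repeats):
    tr_areas, te_areas, tr_points, te_points = create_test_areal_set(blocks, point_support)
    # run_single_experiment is a hypothetical helper -> (mean_fb, mean_rmse)
    results.append(run_single_experiment(tr_areas, te_areas, tr_points, te_points))

results = np.array(results)
print('Average forecast bias over repeats:', results[:, 0].mean())
print('Average RMSE over repeats:', results[:, 1].mean())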

Note 2: If we analyze the errors' statistics, we should consider more than just the error's mean value. Let's look at different pieces of information:

  • Histograms clearly show how dispersed and how clustered the errors are, and, most importantly, we directly see the worst predictions and how many of them our model generates.

  • What is plotted on a histogram is summarized by the statistics. We may check the maximum and the minimum error, but the real power comes when we analyze the quartiles and the standard deviation.

  • The standard deviation is a good measure of our model's variance; the lower, the better.

  • Sometimes we must look at the quartiles. A very high mean but a relatively low median (or even 3rd quartile) indicates that only a few predictions are very wrong, and most of them are acceptable.


Where to go from here?#

  • C.1.4 Poisson Kriging Area to Area

  • C.1.5 Poisson Kriging Area to Point

Changelog#

Date       | Change description                                                                                                   | Author
2023-08-25 | The tutorial was refreshed and set along with the 0.5.0 version of the package                                       | @SimonMolinsky
2023-01-16 | New centroid-based PK algorithm                                                                                      | @SimonMolinsky
2022-10-21 | Tutorial updated for the 0.3.4 version of the package                                                                | @SimonMolinsky
2022-08-27 | Tutorial updated for the 0.3.0 version of the package                                                                | @SimonMolinsky
2021-12-14 | Sill selection was upgraded: now the optimal sill is derived from the grid search within TheoreticalSemivariogram class | @SimonMolinsky
2021-12-13 | Changed behavior of the select_values_in_range() function                                                            | @SimonMolinsky
2021-12-11 | Behavior of the prepare_kriging_data() function has been changed                                                     | @SimonMolinsky
2021-05-28 | Updated paths for input/output data                                                                                  | @SimonMolinsky
2021-05-11 | Refactored the TheoreticalSemivariogram class                                                                        | @SimonMolinsky
2021-03-31 | Update related to the change of semivariogram weighting. Updated cancer rates data.                                  | @SimonMolinsky
