Poisson Kriging - Area to Area Kriging#

Table of Contents:#

  1. Load areal and point data,

  2. Load semivariogram (regularized),

  3. Remove 40% of areal dataset,

  4. Predict values at unknown locations,

  5. Analyse Forecast Bias and Root Mean Squared Error of prediction.

Introduction#

Before starting this tutorial, be sure you have understood concepts in the Ordinary and Simple Kriging and Semivariogram Regularization tutorials.

The Poisson Kriging technique is used to model spatial count data. We will analyze a particular case where values are counted over blocks, and those blocks may have irregular shapes and sizes. We will try to predict the breast cancer rates in the Northeastern counties of the U.S. We will use U.S. Census 2010 data to create point support for blocks.

The breast cancer rates data and the point support population counts are located in the geopackage in a directory: samples/regularization/cancer_data.gpkg

In tutorial Poisson Kriging - Centroid based approach, we’ve assumed that all areas have the same size and shape. It is not a valid statement. We now use the Area-to-Area Kriging technique to predict and evaluate missing values.

Area-to-Area and Area-to-Point Poisson Kriging are slower than simplified Kriging with Centroids but give more reliable results because they are tuned to areal size and shape.

This tutorial covers the following steps:

  1. Read and explore data,

  2. Load semivariogram model,

  3. Prepare training and test data,

  4. Predict values of unknown locations and calculate forecast bias and root mean squared error,

  5. Analyze error metrics.

1) Read and explore data#

[1]:
import numpy as np
import pandas as pd

from pyinterpolate import TheoreticalVariogram
from pyinterpolate import Blocks, PointSupport
from pyinterpolate import area_to_area_pk
[2]:
DATASET = 'samples/regularization/cancer_data.gpkg'
OUTPUT = 'samples/regularization/regularized_variogram.json'
POLYGON_LAYER = 'areas'
POPULATION_LAYER = 'points'
POP10 = 'POP10'
GEOMETRY_COL = 'geometry'
POLYGON_ID = 'FIPS'
POLYGON_VALUE = 'rate'

blocks = Blocks()
blocks.from_file(DATASET, value_col=POLYGON_VALUE, index_col=POLYGON_ID, layer_name=POLYGON_LAYER)

point_support = PointSupport()
point_support.from_files(point_support_data_file=DATASET,
                         blocks_file=DATASET,
                         point_support_geometry_col=GEOMETRY_COL,
                         point_support_val_col=POP10,
                         blocks_geometry_col=GEOMETRY_COL,
                         blocks_index_col=POLYGON_ID,
                         use_point_support_crs=True,
                         point_support_layer_name=POPULATION_LAYER,
                         blocks_layer_name=POLYGON_LAYER)
ERROR 1: PROJ: proj_create_from_database: Open of /home/szymon/miniconda3/envs/pyinterpolate38/share/proj failed
[3]:
# Lets take a look into a map of areal counts

blocks.data.plot(column=blocks.value_column_name, cmap='Spectral_r', legend=True, figsize=(12, 8))
[3]:
<Axes: >
../../_images/usage_tutorials_C14-Poisson-Kriging-Area-to-Area_5_1.png

Clarification: It is a good idea to investigate the spatial patterns in a dataset and to visually check if our data do not have any NaN values. We use the geopandas GeoDataFrame.plot() function with a color map that diverges regions based on the cancer incidence rates. The output choropleth map is not ideal, and (probably) it has a few unreliable results, for example:

  • In counties with a small population, where the ratio number of cases to population size is high even if the number of cases is low,

  • Counties that are very big and sparsely populated may draw more of our attention than densely populated counties,

  • Transitions of colors (rates) between counties may be too abrupt, even if we know that neighboring counties should have closer results.

We can overcome those problems using Area-to-Area Kriging for filtering and denoising. But, in this tutorial, we will interpolate with Area-to-Area Kriging and analyze root mean squared errors of predictions and their bias.

2) Load a semivariogram model#

We load a regularized semivariogram from the tutorial about Semivariogram Regularization. You can always perform semivariogram regularization along with the Poisson Kriging, but it is a very long process, and it is more convenient to separate those two steps.

[4]:
semivariogram = TheoreticalVariogram()  # Create TheoreticalSemivariogram object
semivariogram.from_json('output/regularized_model.json')  # Load regularized semivariogram

3) Prepare training and test data.#

We simply remove 40% of random ID’s from the areal dataset (the same for points) and create four arrays: two training arrays with areal and point geometry and values and two test arrays with the same categories of data.

[5]:
# Remove 40% of rows (values) to test our model

def create_test_areal_set(areal_dataset: Blocks, points_dataset: PointSupport, frac=0.4):
    """

    Parameters
    ----------
    areal_dataset : Blocks
    points_dataset : PointSupport
    frac : float

    Returns
    -------
    : List
        [[training_areas, test_areas, training_points, test_points]]
            equal to:
        [np.ndarray, np.ndarray, np.ndarray, np.ndarray]
            where:
        - areas: [[block index, centroid x, centroid y, value]]
        - point support: [[block id, x, y, value]]
    """
    block_id_col = areal_dataset.index_column_name
    block_data = areal_dataset.data.copy()
    all_ids = block_data[block_id_col].unique()
    training_set_size = int(len(all_ids) * (1 - frac))

    training_ids = np.random.choice(all_ids,
                                    size=training_set_size,
                                    replace=False)

    training_areas = block_data[block_data[block_id_col].isin(training_ids)]
    training_areas = training_areas[[block_id_col, 'centroid_x', 'centroid_y', areal_dataset.value_column_name]].values
    test_areas = block_data[~block_data[block_id_col].isin(training_ids)]
    test_areas = test_areas[[block_id_col, 'centroid_x', 'centroid_y', areal_dataset.value_column_name]].values

    ps_data = points_dataset.point_support.copy()
    ps_ids = points_dataset.block_index_column

    training_points = ps_data[ps_data[ps_ids].isin(training_ids)]
    training_points = training_points[[ps_ids,
                                       points_dataset.x_col,
                                       points_dataset.y_col,
                                       points_dataset.value_column]].values
    test_points = ps_data[~ps_data[ps_ids].isin(training_ids)]
    test_points = test_points[[ps_ids,
                               points_dataset.x_col,
                               points_dataset.y_col,
                               points_dataset.value_column]].values


    output = [training_areas, test_areas, training_points, test_points]
    return output

ar_train, ar_test, pt_train, pt_test = create_test_areal_set(blocks, point_support)

4) Predict values at unknown locations and calculate forecast bias and root mean squared error.#

You may start to work with predictions with a fitted semivariogram model. Centroid-based Poisson Kriging takes five arguments during the run:

  • semivariogram model (fitted semivariogram model),

  • known areas (training set),

  • known points (training set),

  • unknown block (without rate value),

  • unknown block’s point support.

Additional two parameters control the process of interpolation:

  • number of observations (the most critical parameter - how many neighbors are affecting your area of analysis),

  • weighted by population (bool parameter, if True, then distances are weighted by population between points - default is True for Poisson Kriging).

The algorithm in the cell below iteratively picks one area from the test set and performs prediction. Then, forecast bias, the difference between actual and predicted value, is calculated. Forecast Bias clarifies if our algorithm predictions are too low or too high (under- or overestimation). The following parameter is the Root Mean Squared Error. This measure tells us more about outliers and huge differences between predictions and actual values. We will see it in the last part of the tutorial.

Your work with Poisson Kriging (or Kriging) will usually be the same: - Prepare training and test data, - use training data to train the semivariogram model, - Test different hyperparameters with a test set to find the optimal number of neighbors that are affecting your area of analysis, - predict values at unseen locations OR perform data smoothing.

[6]:
number_of_obs = 6

predslist = []
for unknown_area in ar_test:
    upts = pt_test[pt_test[:, 0] == unknown_area[0]]
    upts = upts[:, 1:]
    # [unknown block index, prediction, error]
    try:
        kriging_preds = area_to_area_pk(
            semivariogram_model=semivariogram,
            blocks=ar_train,
            point_support=pt_train,
            unknown_block=unknown_area[:-1],
            unknown_block_point_support=upts,
            number_of_neighbors=number_of_obs,
            raise_when_negative_prediction=True,
            raise_when_negative_error=True
        )
    except ValueError:
        predslist.append([unknown_area[0], np.nan, np.nan, np.nan, np.nan])
    else:
        rmse = np.sqrt((kriging_preds[1] - unknown_area[-1])**2)
        fb = unknown_area[-1] - kriging_preds[1]
        kriging_preds.extend([rmse, fb, unknown_area[-1]])
        predslist.append(kriging_preds)
[7]:
kriged_predictions = np.array(predslist)
pred_df = pd.DataFrame(data=kriged_predictions[:, 1:],
                       index=kriged_predictions[:, 0],
                       columns=['predicted', 'err', 'rmse', 'fb', 'real'])  # Store results in DataFrame

5) Analyze Forecast Bias and Root Mean Squared Error of prediction#

The last step is the analysis of errors. We plot two histograms: forecast bias and root mean squared error then we calculate the base statistics of a dataset.

[8]:
# Show histograms of errors

pred_df['fb'].plot.hist(bins=10)
[8]:
<Axes: ylabel='Frequency'>
../../_images/usage_tutorials_C14-Poisson-Kriging-Area-to-Area_16_1.png
[9]:
pred_df['rmse'].plot.hist(bins=10)
[9]:
<Axes: ylabel='Frequency'>
../../_images/usage_tutorials_C14-Poisson-Kriging-Area-to-Area_17_1.png
[10]:
pred_df.describe()
[10]:
predicted err rmse fb real
count 87.000000 87.000000 87.000000 87.000000 87.000000
mean 131.695080 9.173880 8.432769 -0.656000 131.039080
std 7.766279 1.247393 6.342679 10.570369 11.372287
min 118.399540 5.898166 0.117618 -30.138693 98.100000
25% 126.024124 8.471337 4.251209 -6.904058 122.100000
50% 131.509964 9.102989 6.860562 -0.349700 131.500000
75% 135.699627 9.913730 10.960136 6.406806 140.000000
max 157.267297 12.248617 30.138693 24.448021 155.300000

Clarification: Analysis of Forecast Bias and Root Mean Squared Error - their distribution and basic properties - could be a handy tool to analyze model performance. However, consider that the table above is a single test case (realization) and can be misleading. The good idea is to repeat the test dozens of times with a different training/test set division each time. After this, we average results from multiple tests and get insight into our model’s behavior.

Note 1: Those results are not decisive. Our sample has been selected randomly, and there is a chance that it is not a spatially representative sample! (E.g., areas only from one region). The good idea is to repeat the experiment multiple times with other samples and average results to determine how well the model performs.

Note 2: If we analyze errors’ statistics, we should consider not only an error’s mean value. Let’s look at different pieces of information:

  • Histograms clearly show us how dispersed and grouped errors are, and most importantly, we directly see the worst predictions and how many of them are generated by our model.

  • What is plotted on a histogram is described by statistics. We may check the max and the min error, but the true power comes when we analyze quartiles and standard deviation.

  • The standard deviation is a good measurement of our model’s variance; the less, the better.

  • Sometimes, we must look into quartiles. A very high mean but relatively low median (or even the 3rd quartile) indicates that we have only a few wrong predictions, most of which are acceptable.


Where to go from here?#

  • C.1.5 Poisson Kriging Area to Point

Changelog#

Date

Change description

Author

2023-08-25

The tutorial was refreshed and set along with the 0.5.0 version of the package

@SimonMolinsky

2023-01-14

New area-to-area PK algorithm

@SimonMolinsky

2022-10-21

The tutorial was updated for the 0.3.4 version of the package

@SimonMolinsky

2022-08-27

The tutorial was updated for the 0.3.0 version of the package

@SimonMolinsky

2021-12-14

Sill selection was upgraded: now optimal sill is derived from the grid search within TheoreticalSemivariogram class

@SimonMolinsky

2021-12-13

Changed behavior of select_values_in_range() function

@SimonMolinsky

2021-12-11

Behavior of prepare_kriging_data() function has benn changed

@SimonMolinsky

2021-05-28

Updated paths for input/output data

@SimonMolinsky

2021-05-11

Refactored TheoreticalSemivariogram class

@SimonMolinsky

2021-03-31

Update related to the change of semivariogram weighting. Updated cancer rates data.

@SimonMolinsky

[10]: