B.1.3 Outliers and Kriging#

Table of Contents:#

Read point data and take 10% of it as a sample for the further analysis (dataset A),
Check if outliers are present in a data and create additional dataset without outliers (dataset B),
Create the Variogram Point Cloud model for the datasets A and B,
Remove outliers from datasets A and B,
Create four Ordinary Kriging models and compare their performance.

Introduction#

Outliers may affect our analysis and the final interpolation results. In this tutorial, we learn about their influence on the final model and compare interpolation errors for different scenarios where data is cleaned differently.

We can remove too high or too low values at the preprocessing stage (check part 2 of the tutorial) or remove outliers directly from the variogram point cloud (part 4). Results from each type of preprocessing (and a raw dataset analysis) are different, and we will compare them.

We use:

DEM data, which is stored in a file samples/point_data/txt/pl_dem_epsg2180.txt.

Import packages#

[2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyinterpolate import calc_point_to_point_distance, read_txt, kriging, TheoreticalVariogram, VariogramCloud

1) Read point data and divide it into the training and test set#

[3]:

# Read data from file

dem = read_txt('samples/point_data/txt/pl_dem_epsg2180.txt')

[4]:

# Divide data into training and test set

def create_train_test(data, training_fraction):
    idxs = np.arange(0, len(data))
    number_of_training_samples = int(len(data) * training_fraction)
    training_idxs = np.random.choice(idxs, size=number_of_training_samples, replace=False)
    test_idxs = [i for i in idxs if i not in training_idxs]
    training_set = data[training_idxs, :]
    test_set = data[test_idxs, :]

    return training_set, test_set

[5]:

train, test = create_train_test(dem, 0.1)

[6]:

train[:3]

[6]:

array([[2.51238103e+05, 5.50262413e+05, 8.71362305e+01],
       [2.45382683e+05, 5.28767297e+05, 5.84672661e+01],
       [2.54519766e+05, 5.51709669e+05, 7.63842850e+01]])

2) Check outliers: analyze the distribution of the values#

We will inspect all values in the train set to determine if our dataset contains outliers. In the beginning, we plot data distribution with the violinplot.

[7]:

# Distribution plot

plt.figure(figsize=(10, 6))
plt.violinplot(train[:, -1])
plt.show()

../../_images/usage_tutorials_B13-Outliers-and-Kriging_10_0.png

NOTE: Your plot may differ from the tutorial’s. Why is that? Because we take a random sample of 10% of values, and after each iteration, the algorithm takes different points for the analysis.

Clarification:

Investigation of the plot tells us that our data is:

grouped around the lowest values, and most of the values are below 50 meters,
has (probably) three different distributions mixed, which can be a sign that the Digital Elevation Model covers three different types of elevation. One is grouped around 20 meters, the next roughly 50 meters and the furthest is visible around 70 meters.

Violinplot is suitable for the distribution analysis. Finding outliers in this plot may be challenging because we see mostly distributions. We should change a plot type to understand if outliers exist in a dataset. The excellent choice is the boxplot:

[8]:

# Boxplot

plt.figure(figsize=(10, 6))
plt.boxplot(train[:, -1])
plt.show()

../../_images/usage_tutorials_B13-Outliers-and-Kriging_12_0.png

Boxplot is a unique and handy data visualization tool. Let’s analyze this plot from the bottom up to the top.

NOTE: Boxplot represents values sorted in ascending order and their statistical properties: quartiles, median, and outliers.

The bottom whisker (horizontal line) represents the lower range of values in our dataset,
The box lower line is the first quartile of data or, in other words, 25% of values of our dataset are below this point. We name it Q1.
The middle line is the median of our dataset, and we name it Q2 or median.
The upper line is the third quartile of a data or, in other words, 75% of values are below this point,
The top whisker represents the upper range of values in our dataset. We name it Q3.
Individual points (if they exist, then we see them as points below the bottom whisker or above the top whisker) are considered outliers. They could be outliers in the upper range as well as the lower range of our data. The significant distance between Q1 and the bottom whisker or between Q3 and the top whisker indicates potential outliers. Points below or above this distance are treated as outliers. The outlier distance is calculated as the \(weight * (Q3 - Q1)\) where we can manually set the weight, but other parameters are read directly from the data.

We use this knowledge to remove outliers from the dataset, assuming that outliers are anomalies rather than unbiased readings. We will remove outliers from the data, assuming that outliers are placed one standard deviation from Q1 (down) and Q3 (up).

[9]:

# Create training set without outliers

q1 = np.quantile(train[:, -1], 0.25)
q3 = np.quantile(train[:, -1], 0.75)

top_limit = q3 + (q3 - q1)

train_without_outliers = train[train[:, -1] < top_limit]

[10]:

print('Length of the full training set is {} records'.format(len(train)))
print('Length of the pre-processed training set is {} records'.format(len(train_without_outliers)))

Length of the full training set is 689 records
Length of the pre-processed training set is 680 records

[11]:

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
ax[0].violinplot(train_without_outliers[:, -1])
ax[0].set_title('Distribution of the training set without outliers')
ax[1].boxplot(train_without_outliers[:, -1])
ax[1].set_title('Statistics of the training set without outliers')
plt.show()

../../_images/usage_tutorials_B13-Outliers-and-Kriging_16_0.png

Clarification: We have cut some records from the baseline training dataset. The distribution plot (violinplot) has a shorter tail and ends more abruptly upwards. The boxplot of the new data doesn’t have any outliers. The one crucial thing to notice is that the observations are still skewed, but this is not a problem for this concrete tutorial.

NOTE: if you are eager to know how to deal with skewed datasets, we recommend the article Transforming Skewed Data.

3) Create the Variogram Point Cloud model for datasets A and B#

Now, we are going one step further, and we will transform both datasets with- and without- outliers and calculate variogram point clouds. Then, we compare both clouds.

[12]:

def get_variogram_point_cloud(dataset, max_range, number_of_lags=8):
    step_size = max_range / number_of_lags
    cloud = VariogramCloud(input_array=dataset, step_size=step_size, max_range=max_range, calculate_on_creation=True)
    return cloud

[13]:

whole_distance = np.max(calc_point_to_point_distance(train[:, :-1]))

cloud_raw = get_variogram_point_cloud(train, whole_distance)
cloud_processed = get_variogram_point_cloud(train_without_outliers, whole_distance)

[14]:

# Show variogram cloud: initial training dataset

cloud_raw.plot('box')

../../_images/usage_tutorials_B13-Outliers-and-Kriging_21_0.png

[14]:

True

[15]:

# Show variogram cloud: pre-processed training dataset

cloud_processed.plot('box')

../../_images/usage_tutorials_B13-Outliers-and-Kriging_22_0.png

[15]:

True

Clarification: a quick look into the results shows that calculated semivariances are skewed into significant positive values for each lag, but most profoundly within middle lags (~15:21 km). The processed dataset has lower semivariances than the raw readings, and both variograms have a similar shape.

In the next cell, we will check a standard deviation of the per lag variances.

[16]:

for k, v in cloud_raw.experimental_point_cloud.items():
    print('Lag {:.2f}'.format(k))

    v_raw = int(np.std(v))
    v_pro = int(np.std(cloud_processed.experimental_point_cloud[k]))
    v_smape = 100 * (np.abs(v_raw - v_pro) / (0.5 * (v_raw + v_pro)))

    print('Standard Deviation raw dataset:', v_raw)
    print('Standard Deviation processed dataset:', v_pro)
    print('Symmetric Mean Absolute Percentage Error of Variances: {:.2f}'.format(v_smape))
    print('')

Lag 3602.70
Standard Deviation raw dataset: 562
Standard Deviation processed dataset: 559
Symmetric Mean Absolute Percentage Error of Variances: 0.54

Lag 7205.39
Standard Deviation raw dataset: 1078
Standard Deviation processed dataset: 1005
Symmetric Mean Absolute Percentage Error of Variances: 7.01

Lag 10808.09
Standard Deviation raw dataset: 1298
Standard Deviation processed dataset: 1185
Symmetric Mean Absolute Percentage Error of Variances: 9.10

Lag 14410.79
Standard Deviation raw dataset: 1345
Standard Deviation processed dataset: 1204
Symmetric Mean Absolute Percentage Error of Variances: 11.06

Lag 18013.49
Standard Deviation raw dataset: 1269
Standard Deviation processed dataset: 1136
Symmetric Mean Absolute Percentage Error of Variances: 11.06

Lag 21616.18
Standard Deviation raw dataset: 974
Standard Deviation processed dataset: 913
Symmetric Mean Absolute Percentage Error of Variances: 6.47

Lag 25218.88
Standard Deviation raw dataset: 747
Standard Deviation processed dataset: 712
Symmetric Mean Absolute Percentage Error of Variances: 4.80

Clarification: The differences (sMAPE) per lag vary greatly. We can see that the preprocessing of raw values introduces the information lost, and it is especially painful for the closest neighbors. It doesn’t mean that the preprocessing of raw observations is not recommended, but it is a good idea to include the spatial component in the outliers’ detection process. By spatial component, we mean working with variograms instead of the raw points.

The raw data cleaning has lowered the semivariances dispersion for the middle lags (where we have the largest number of point pairs for the analysis).

At this point, we cannot judge which dataset is better for the modeling. Instead, we will remove outliers from both variograms (instead of the raw data).

4) Remove outliers from the variograms#

In this step, we will use pyinterpolate’s function remove_outliers() to build two additional variogram point clouds from the raw and processed datasets. We delete the top part outliers of the semivariance values rather than the raw readings.

[17]:

raw_without_outliers = cloud_raw.remove_outliers(method='zscore', z_lower_limit=-2, z_upper_limit=2, inplace=False)
prep_without_outliers = cloud_processed.remove_outliers(method='zscore', z_lower_limit=-2, z_upper_limit=2, inplace=False)

[18]:

data_raw = [x for x in cloud_raw.experimental_point_cloud.values()]
data_raw_not_out = [x for x in raw_without_outliers.experimental_point_cloud.values()]
data_prep = [x for x in cloud_processed.experimental_point_cloud.values()]
data_prep_not_out = [x for x in prep_without_outliers.experimental_point_cloud.values()]

[19]:

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 14))

ax[0, 0].boxplot(data_raw)
ax[0, 0].set_title('Semivariances Distribution in the Raw Dataset')
ax[0, 0].set_xlabel('Lag number')
ax[0, 0].set_ylabel('Semivariance value')

ax[0, 1].boxplot(data_raw_not_out)
ax[0, 1].set_title('Raw Dataset after the Outliers Detection and Removal')
ax[0, 1].set_xlabel('Lag number')
ax[0, 1].set_ylabel('Semivariance value')

ax[1, 0].boxplot(data_prep)
ax[1, 0].set_title('Semivariances Distribution in the Preprocessed Dataset')
ax[1, 0].set_xlabel('Lag number')
ax[1, 0].set_ylabel('Semivariance value')

ax[1, 1].boxplot(data_prep_not_out)
ax[1, 1].set_title('Preprocessed Dataset after the Outliers Detection and Removal')
ax[1, 1].set_xlabel('Lag number')
ax[1, 1].set_ylabel('Semivariance value')

plt.show()

../../_images/usage_tutorials_B13-Outliers-and-Kriging_29_0.png

Clarification: Comparison of multiple variogram clouds could be challenging. We see that the largest semivariances are present in the raw data. Heavily processed data has the lowest number of outliers. The medians in each dataset are distributed over a similar pattern. How is it similar? We can check if we transform the variogram point cloud into the experimental semivariogram. Pyinterpolate has a method for it: .calculate_experimental_variogram(). We use it and compare four plots of semivariances to gain more insight into the transformations.

[20]:

raw_semivar = cloud_raw.calculate_experimental_variogram()
raw_semivar_not_out = raw_without_outliers.calculate_experimental_variogram()
prep_semivar = cloud_processed.calculate_experimental_variogram()
prep_semivar_not_out = prep_without_outliers.calculate_experimental_variogram()

[21]:

plt.figure(figsize=(14, 6))
plt.plot(raw_semivar[:, 1], c='#a6611a')
plt.plot(raw_semivar_not_out[:, 1], c='#dfc27d')
plt.plot(prep_semivar[:, 1], '--', c='#80cdc1')
plt.plot(prep_semivar_not_out[:, 1], '--', c='#018571')
plt.title('Comparison of experimental semivariograms created with the different data preprocessing techniques')
plt.ylabel('Semivariance')
plt.xlabel('Lag number')
plt.legend(['Raw', 'Raw | Filtered Semivariances', 'Preprocessed', 'Preprocessed | Filtered Semivariances'])
plt.show()

../../_images/usage_tutorials_B13-Outliers-and-Kriging_32_0.png

Clarification: An understanding of those plots is a challenging task. Let’s divide reasoning into multiple points:

Raw dataset and preprocessed raw dataset show a similar pattern. The differences are more pronounced for the distant lags than for the closest point pairs,
Datasets with the cleaned variograms are different from the raw data. The absolute semivariance values per lag are smaller, and the semivariogram pattern is slightly different. Interestingly, the possible two distributions within the dataset are more visible in the cleaned variograms (one with a peak around the 6th lag and the other with a peak around the 13th lag).
The differences between semivariograms are primarily visible at larger distances, and differences are minor for the closest point pairs.
Filtering semivariance sharply lowers the weights per lag, and removing outliers from the input data only slightly corrects the variogram.

Let’s check how different variograms work with real-world data.

5) Create Four Ordinary Kriging models based on the four Variogram Point Clouds and compare their performance#

[23]:

# Fit different semivariogram models into prepared datasets and variograms

# Raw
raw_theo = TheoreticalVariogram()
raw_theo.autofit(experimental_variogram=raw_semivar, return_params=False, nugget=0)

# Raw with cleaned variogram
raw_theo_no_out = TheoreticalVariogram()
raw_theo_no_out.autofit(experimental_variogram=raw_semivar_not_out, return_params=False, nugget=0)

# Preprocessed
prep_theo = TheoreticalVariogram()
prep_theo.autofit(experimental_variogram=prep_semivar, return_params=False, nugget=0)

# Preprocessed with cleaned variogram
prep_theo_no_out = TheoreticalVariogram()
prep_theo_no_out.autofit(experimental_variogram=prep_semivar_not_out, return_params=False, nugget=0)

[24]:

# Set Kriging models

# Raw
raw_model = kriging(observations=train, theoretical_model=raw_theo, points=test[:, :-1], allow_approx_solutions=True)

# Raw & cleaned
c_raw_model = kriging(observations=train, theoretical_model=raw_theo_no_out, points=test[:, :-1], allow_approx_solutions=True)

# Preprocessed
prep_model = kriging(observations=train, theoretical_model=prep_theo, points=test[:, :-1], allow_approx_solutions=True)

# Preprocessed & cleaned
c_prep_model = kriging(observations=train, theoretical_model=prep_theo_no_out, points=test[:, :-1], allow_approx_solutions=True)

  0%|          | 0/6204 [00:00<?, ?it/s]/home/szymon/Documents/Programming/Repositories/pyinterpolate-environment/pyinterpolate/pyinterpolate/kriging/utils/process.py:124: UserWarning: 'Kriging system solution is based on the approximate solution, output may be incorrect!'
  warnings.warn(LeastSquaresApproximationWarning().__str__())
100%|██████████| 6204/6204 [00:01<00:00, 4084.98it/s]
100%|██████████| 6204/6204 [00:01<00:00, 4905.61it/s]
100%|██████████| 6204/6204 [00:01<00:00, 5030.64it/s]
100%|██████████| 6204/6204 [00:01<00:00, 4967.68it/s]

[25]:

# Build test function

def calculate_model_deviation(modeled_values, test_set):
    mse = ((test_set[:, -1] - modeled_values[:, 0])**2)
    return mse

[26]:

r_test = calculate_model_deviation(raw_model, test)

[27]:

cr_test = calculate_model_deviation(c_raw_model, test)

[28]:

p_test = calculate_model_deviation(prep_model, test)

[29]:

cp_test = calculate_model_deviation(c_prep_model, test)

[30]:

df = pd.DataFrame(data=np.array([r_test, cr_test, p_test, cp_test]).transpose(),
                  columns=['Raw', 'Raw-cleaned', 'Preprocessed', 'Preprocessed-cleaned'])

We will use .describe() method from pandas to get the errors statistic:

[31]:

df.describe()

[31]:

	Raw	Raw-cleaned	Preprocessed	Preprocessed-cleaned
count	6.204000e+03	6.204000e+03	6.204000e+03	6.204000e+03
mean	4.112253e+03	1.125479e+03	1.616366e+03	9.951413e+03
std	1.496551e+05	2.407578e+04	4.314202e+04	7.176860e+05
min	7.560711e-08	3.693612e-07	4.051043e-07	2.012364e-07
25%	9.082049e-01	8.223928e-01	9.574270e-01	8.203482e-01
50%	7.449470e+00	7.218670e+00	7.740437e+00	7.111464e+00
75%	5.507547e+01	5.230695e+01	5.615120e+01	5.161227e+01
max	1.004389e+07	1.049016e+06	2.625928e+06	5.651596e+07

The view of this table may be different each time you run this tutorial (and it is related to the training-test division process). In some cases, you may get better results for the cleaned dataset; for others, results will be better for raw data. That’s why we should always have a validation set to compare different scenarios. The other idea is to use all of the models and create a distribution of predictions and errors per point - we can use it later for decisions.

NOTE: The example in this tutorial is related to the Digital Elevation Model, which the data provider preprocessed (Copernicus Land Monitoring Services). You shouldn’t get the impression that raw data preprocessing and filtering are not required for the analysis. There are cases where the sensor may produce unreliable and biased results, such as a satellite camera’s saturated pixel. Removing it with the specific noise-filtering algorithm before the variogram point cloud development is better.

Where to go from here?#

B.2.1 Directional Ordinary Kriging
C.1.1 Blocks to Points Interpolation with Ordinary Kriging
C.1.2 Semivariogram Regularization

Changelog#

Date	Change description	Author
2023-08-23	The tutorial was refreshed and set along with the 0.5.0 version of the package	@SimonMolinsky
2023-04-15	Tutorial debugged and updated to the 0.4.1 version of the package	@SimonMolinsky
2022-11-05	Tutorial updated for the 0.3.5 version of the package	@SimonMolinsky
2022-10-08	The tutorial updated to the version 0.3.2 of the package	@SimonMolinsky
2022-08-27	The tutorial updated to the version 0.3.0 of the package	@SimonMolinsky
2021-12-14	Sill selection was upgraded: now optimal sill is derived from the grid search within `TheoreticalSemivariogram` class	@SimonMolinsky
2021-12-13	Changed behavior of `select_values_in_range()` function	@SimonMolinsky
2021-12-11	Behavior of `prepare_kriging_data()` function has benn changed	@SimonMolinsky
2021-10-13	Refactored TheoreticalSemivariogram (name change of class attribute) and refactored `calc_semivariance_from_pt_cloud()` function to protect calculations from `NaN's`.	@ethmtrgt & @SimonMolinsky
2021-08-22	Initial release	@SimonMolinsky

[ ]: