A.1.3 Spatial Dependence Index#

Is dataset worth modeling?

Table of Contents:#

What is the spatial dependcy index?
Why do we use spatial dependency index?
Example: The comparison of different Spatial Dependence Index over the same extent but different elements.
API links.

Introduction#

In this tutorial, we will learn how to estimate spatial dependency index. Algorithm is based on the work:

[1] CAMBARDELLA, C.A.; MOORMAN, T.B.; PARKIN, T.B.; KARLEN, D.L.; NOVAK, J.M.; TURCO, R.F.; KONOPKA, A.E. Field-scale variability of soil properties in central Iowa soils. Soil Science Society of America Journal, v. 58, n. 5, p. 1501-1511, 1994.

1. What is the spatial dependency index?#

The spatial dependency index (SDI) measures the strength of a spatial process we are modeling. SDI is normalized to the interval between 0 and 1. Therefore, we can transform it into percentages and assign an order of spatial dependency from weak to strong.

The SDI is a ratio of the nugget to the total variance (sill) of a model:

\[SDI = \frac{nugget}{sill} * 100\]

Whenever we fit a theoretical variogram with the pyinterpolate package, SDI is calculated, and we will take advantage of it in the examples. Two values represent SDI:

numeric, a ratio of nugget and sill in percent,
categorical, a description of a spatial dependency strength.

There are four levels of spatial dependency.

Lower Limit (included)	Upper Limit (excluded)	Strength
0	25	strong
25	75	moderate
75	95	weak
95	inf	no spatial dependence

The lower the ratio, the most substantial spatial dependence. If the ratio is greater than 75 percent, we should be cautious with spatial modeling because spatial similarities may not explain the process.

2. Why do we use the spatial dependency index?#

In a world where Tobler’s Law can be applied to every spatial phenomenon, we might always use kriging without consideration. We know that spatial dependence exists, and close neighbors are always similar.

This world is not our world! Not every process follows Tobler’s Law. We can find examples sampled over the same area and scale, but their spatial dependence indexes differ. In [1] (SDI in pyinterpolate is based on this publication), multiple chemical compounds are sampled from the same field. Every compound has its spatial distribution and variogram. Some soil parameters are not spatially dependent (for example, Mg or Ca that are randomly distributed).

The spatial dependence index level marks the next decision on what to do with the data.

Strong: just krige it!
Moderate: there might be some other thing that explains process variation.
Weak: the other non-spatial process has more influence on data than spatial similarities.
No spatial dependence: the process is random, or spatial dependencies cannot explain variance.

Note: Be careful because the last two points are red flags, BUT sometimes processes with a low variogram variability at one scale may be explained with spatial relations at a changed scale. A practical example is a comparison of rental apartment prices per night: if you look at the scale of hundreds of kilometers, the spatial dependence may be very weak or none. On the other hand, the spatial similarity between prices of apartments close to each other (up to 10 kilometers, 6 miles) tends to show a classic variogram curve. The reason is simple: most managers and algorithms use information about the pricing of the closest neighbors, and neighborhood prices are affected by the same external objects or events.

We should look into the spatial dependence index to ensure our path leads us toward meaningful results.

Example: Spatial Dependence over the same study extent but for different elements#

We will compare the spatial dependence index of four elements: cadmium, copper, lead, and zinc. We use the meuse dataset. The dataset comes from:

Pebesma, Edzer. (2009). The meuse data set: a tutorial for the gstat R package -> [link to the publication](https://cran.r-project.org/web/packages/gstat/vignettes/gstat.pdf)

[1]:

import numpy as np
import pandas as pd
import pyinterpolate as ptp

[2]:

MEUSE_FILE = 'samples/point_data/csv/meuse/meuse.csv'

# Variogram parameters
STEP_SIZE = 100
MAX_RANGE = 1600
ALLOWED_MODELS = 'safe'


# Elements
ELEMENTS = ['cadmium', 'copper', 'zinc', 'lead']
COLS = ['x', 'y']
COLS.extend(ELEMENTS)

[3]:

df = pd.read_csv(MEUSE_FILE, usecols=COLS)
df.head()

[3]:

	x	y	cadmium	copper	lead	zinc
0	181072	333611	11.7	85	299	1022
1	181025	333558	8.6	81	277	1141
2	181165	333537	6.5	68	199	640
3	181298	333484	2.6	81	116	257
4	181307	333330	2.8	48	117	269

Data is skewed for every element in the table. Before we start variogram modeling, we must transform it into a distribution close to normality. We will use a logarithmic transform, pass data into an experimental variogram, and then create theoretical models with the autofit() method. We will use grid search to find the best possible nugget value.

[4]:

for element in ELEMENTS:
    # Prepare data
    ds = df[['x', 'y', element]].copy()
    ds[element] = np.log(ds[element])
    arr = ds.values

    # Get experimental variogram
    exp_var = ptp.build_experimental_variogram(arr, step_size=STEP_SIZE, max_range=MAX_RANGE)

    # Find optimal theoretical model
    optimal_model = ptp.TheoreticalVariogram()
    optimal_model.autofit(experimental_variogram=exp_var, model_name=ALLOWED_MODELS, max_nugget=0.8)
    print('Optimal model for element {} has {:.2f}% of nugget to sill ratio. Spatial Dependence is {}.'.format(
        element, optimal_model.spatial_dependency_ratio, optimal_model.spatial_dependency_strength
    ))
    optimal_model.plot()

    print('\n---\n')

Optimal model for element cadmium has 36.02% of nugget to sill ratio. Spatial Dependence is moderate.

../../_images/usage_tutorials_A13-Spatial-Dependence-Index_9_1.png


---
Optimal model for element copper has 20.31% of nugget to sill ratio. Spatial Dependence is strong.

../../_images/usage_tutorials_A13-Spatial-Dependence-Index_9_3.png


---
Optimal model for element zinc has 20.08% of nugget to sill ratio. Spatial Dependence is strong.

../../_images/usage_tutorials_A13-Spatial-Dependence-Index_9_5.png


---
Optimal model for element lead has 20.22% of nugget to sill ratio. Spatial Dependence is strong.

../../_images/usage_tutorials_A13-Spatial-Dependence-Index_9_7.png

---

The strength of spatial dependence differs mostly between Cadmium and other elements.

API#

The spatial dependence index may be calculated directly with the calculate_spatial_dependence_index() function that takes two parameters: nugget and sill. It returns Tuple with spatial dependence ratio and spatial dependence strength.

Another way is to calculate TheoreticalVariogram with a grid search option for nugget (just leave the nugget parameter as None and the algorithm will find the optimal nugget value).

Where to go from here?

A.2.1 Directional Semivariogram
A.2.2 Variogram Point Cloud
A.2.3 Experimental Variogram and Variogram Point Cloud Classes
B.1.1 Ordinary and Simple Kriging

Changelog#

Date	Change description	Author
2023-08-19	The tutorial was refreshed and set along with the 0.5.0 version of the package	@SimonMolinsky
2023-04-15	Tutorial debugged and updated to the 0.4.1 version of the package	@SimonMolinsky
2023-04-08	The first version of the tutorial	@SimonMolinsky

[ ]: