10  Geospatial correlation

In Part II and Part III of the book, we learned two important tools for efficient geospatial data science. We learned how transform pipelines can be used to prepare geospatial data for investigation, and how geospatial queries can be used to answer geoscientific questions. Before we can learn our third tool, we need to review the important concept of geospatial correlation:

Definition

Given \(n\) pairs of measurements \(\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\) from variables \(X\) and \(Y\) that are \(h\) units of distance apart, we define geospatial correlation as the sample Pearson correlation coefficient:

\[ cor_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

where \(\bar{x}\) and \(\bar{y}\) are the mean values.

Let’s consider the following synthetic image to illustrate the concept for different values of \(h\):

using GeoIO

img = GeoIO.load("data/gaussian.gslib")

img |> viewer
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

The hscatter plot can be used to visualize the scatter of pairs \(\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\) at a given lag \(h\). We can choose a variable \(X\) for the horizontal axis, a (possibly different) variable \(Y\) for the vertical axis, and the value of the lag \(h\). In order to reduce the computational costs associated with the plot, we will sample a subset of measurements from the image:

sample = img |> Sample(1000, replace=false)
1000×2 GeoTable over 1000 view(::CartesianGrid, [280, 2989, 7558, 9367, ..., 4652, 9512, 5431, 1190])
Z geometry
Continuous Quadrangle
[NoUnits] 🖈 Cartesian{NoDatum}
-0.682033 Quadrangle((x: 79.0 m, y: 2.0 m), ..., (x: 79.0 m, y: 3.0 m))
-0.524057 Quadrangle((x: 88.0 m, y: 29.0 m), ..., (x: 88.0 m, y: 30.0 m))
0.955714 Quadrangle((x: 57.0 m, y: 75.0 m), ..., (x: 57.0 m, y: 76.0 m))
-0.774632 Quadrangle((x: 66.0 m, y: 93.0 m), ..., (x: 66.0 m, y: 94.0 m))
-0.606988 Quadrangle((x: 56.0 m, y: 5.0 m), ..., (x: 56.0 m, y: 6.0 m))
0.897755 Quadrangle((x: 37.0 m, y: 32.0 m), ..., (x: 37.0 m, y: 33.0 m))
-0.722238 Quadrangle((x: 25.0 m, y: 53.0 m), ..., (x: 25.0 m, y: 54.0 m))
-0.841425 Quadrangle((x: 53.0 m, y: 84.0 m), ..., (x: 53.0 m, y: 85.0 m))
1.09665 Quadrangle((x: 59.0 m, y: 63.0 m), ..., (x: 59.0 m, y: 64.0 m))
0.184124 Quadrangle((x: 16.0 m, y: 99.0 m), ..., (x: 16.0 m, y: 100.0 m))

If we plot the values of the variable Z in the horizontal axis and the values of the same variable measured at lag \(h=0\) on the vertical axis, we get points along the identity line (i.e. no scatter):

hscatter(sample, :Z, :Z, lag=0.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

By increasing the value of the lag, we observe that the correlation is no longer equal to one, and that the linear fit through the points approaches the horizontal axis (i.e., zero correlation):

hscatter(sample, :Z, :Z, lag=3.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

hscatter(sample, :Z, :Z, lag=5.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

hscatter(sample, :Z, :Z, lag=10.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

hscatter(sample, :Z, :Z, lag=50.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

The Pearson correlation coefficient studied as a function of the lag \(h\) is known as the correlogram function. For example, consider the exponential correlogram function given by \(cor(h) = \exp(-h)\):

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

The terms auto-correlogram (\(X = Y\)) and cross-correlogram (\(X \ne Y\)) are also encountered in the literature to differentiate the various geospatial correlations in the multivariate case. Similarly, the terms auto-covariance and cross-covariance are encountered by replacing the correlation by the covariance (non-normalized correlation).

Even though the correlogram function is widely used in other scientific fields, we will review an alternative statistic of association that is most useful in geospatial data science.

10.1 Variography

Definition

The variogram function is a more general alternative to the correlogram and covariance functions that does not rely on the mean values \(\bar{x}\) and \(\bar{y}\). It is given by

\[ \gamma_x(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)^2 \]

where \(N(h) = \Big\{(i,j): i\underbrace{\longrightarrow}_{h \text{ units}} j\Big\}\) is the set of pairs of locations that are \(h\) units apart.

In the multivariate case, we can also define the cross-variogram:

\[ \gamma_{xy}(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)(y_i-y_j) \]

The value \(\gamma(h)\) measures the “spread” of the hscatter plot. Usually, at \(h=0\) there is no spread, and hence \(\gamma(0) = 0\). In most practical cases \(\gamma(h) \to \sigma^2\) as \(h \to \infty\) where \(\sigma^2\) is the maximum variance of the process. When this maximum variance exists, we can write the following relation:

\[ \gamma(h) = \sigma^2 - cov(h) \]

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

where \(cov(h)\) is the covariance function, a version of the correlogram function that is scaled by the standard deviations of X and Y:

\[ cor(h) = \frac{cov(h)}{\sigma_x \sigma_y} \]

Explaining why the variogram is more general than the covariance is out of scope for this book, but it has to do with the fact that variograms operate on the “difference process” \((x_i - x_j)\) as opposed to the centered process \((x_i - \bar{x})\). In particular, it does not require a finite maximum variance \(\sigma^2\).

Note

The theory of intrinsic random functions of order k (IRF-k) is an advanced concept from geostatistical theory that explains the generality of the variogram function (Chilès and Delfiner 2012).

Our main goal here is to gain intuition about the variogram function for interpolation purposes. It suffices to learn its four basic elements: range, sill, nugget and model.

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

10.1.1 range

The range (a.k.a. correlation length) of the variogram determines the average size of “blobs” in the image. Let’s consider two synthetic images with ranges 10 and 30, respectively:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

In the first image, we can clearly visualize the average size of yellow and blue blobs around 10 pixels (i.e. quadrangles). In the second image, the blobs have an average size of 30 pixels, which is greater than one of the sides of the grid (100x25 pixels).

10.1.2 sill

To understand the sill of the variogram, let’s consider a 1D grid as our domain, and let’s represent the values of the variable with height instead of color:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

The sill determines the maximum variance of the process. If the sill is \(\sigma^2\), then a process with mean \(\mu\) will oscillate within \(\mu\pm3\sigma\) with 99.7% probability. The vertical amplitude in the second plot is (3x) larger than that of the first plot. In both plots, we have \(\mu=0\).

10.1.3 nugget

The nugget can be used to insert additional variance at scales that are smaller than the scale of measurements (i.e. pixel). It is known in the image processing literature as salt-and-pepper noise:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

We can visualize the nugget effect in our 2D grid with colors as before:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

Note

The name “nugget” comes from gold nuggets in mining geostatistics. These are often much smaller than selective mining units (SMUs), and show as bright values in the 3D model of the mineral deposit.

10.1.4 model

Finally, the model of the variogram determines how the function increases near the origin. The GeoStats.jl framework provides dozens of such models of geospatial correlation. The most widely used are the GaussianVariogram, the SphericalVariogram and the ExponentialVariogram:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

The faster is the increase of the function near the origin, the more “erratic” is the process:

┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

All the four elements of the variogram function can be easily set at construction time:

γ = GaussianVariogram(range=10.0, sill=2.0, nugget=0.1)
GaussianVariogram
├─ sill: 2.0
├─ nugget: 0.1
├─ range: 10.0 m
└─ distance: Euclidean

And queried later with the corresponding functions:

range(γ), sill(γ), nugget(γ)
(10.0 m, 2.0, 0.1)

We can evaluate the variogram at any given lag:

γ(1.0)
0.15615445670336806

Or evaluate the variogram between any two points:

γ(Point(0, 0), Point(1, 0))
0.15615445670336806

In this case, the Euclidean metric is used by default to compute the lag. More generally, we can evaluate the variogram between any two geometries:

γ(Point(0, 0), Triangle((0, 0), (1, 0), (1, 1)))
0.13352136956222632
Note

The evaluation of the variogram function between two geometries is known as variogram regularization, and implemented in terms of numerical integration.

Remind that the variogram value \(\gamma(h)\) is a measure of spread in the hscatter plot. It tells how much variation is expected for a variable at a distance \(h\) from a reference point.

10.2 Fitting models

Given geospatial data, how do we fit an appropriate variogram model for it? This practical question is traditionally answered in two steps as follows.

10.2.1 Empirical estimate

Let’s recap the synthetic image from the beginning of the chapter:

img |> viewer
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

We can use the EmpiricalVariogram to estimate the function at specific lag values:

g = EmpiricalVariogram(img, :Z, maxlag = 50.0)
EmpiricalVariogram
├─ abscissas: [1.77383 m, 3.97958 m, 6.34839 m, ..., 43.762 m, 46.2475 m, 48.7361 m]
├─ ordinates: [0.0502087, 0.212007, 0.424966, ..., 0.874137, 0.880386, 0.886483]
├─ distance: Euclidean(0.0)
├─ estimator: MatheronEstimator()
└─ npairs: 24138788
varioplot(g)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

Note

The numbers and bars in the empirical variogram plot represent the number of pairs used to estimate the value of the variogram at the corresponding bin. The larger the number, the more confident we can be in the estimate.

The DirectionalVariogram can be used to estimate the function along specific directions:

gₕ = DirectionalVariogram((1.0, 0.0), img, :Z, maxlag = 50.0)
gᵥ = DirectionalVariogram((0.0, 1.0), img, :Z, maxlag = 50.0)

varioplot(gₕ, showhist = false, color = "maroon")
varioplot!(gᵥ, showhist = false, color = "slategray")
Mke.current_figure()
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

In this example, we observe that the blobs are elongated with a horizontal range of 30 pixels and a vertical range of 10 pixels. This is known as geometric anisotropy.

We can also estimate the variogram in all directions on a plane with the EmpiricalVarioplane:

gₚ = EmpiricalVarioplane(img, :Z, maxlag = 50.0)
EmpiricalVarioplane
  50 angles
  └─0.00°
  └─3.67°
  └─7.35°
  └─11.02°
  └─14.69°
  ⋮
  └─165.31°
  └─168.98°
  └─172.65°
  └─176.33°
  └─180.00°

The varioplane is usually plotted on a polar axis to highlight the different ranges as a function of the polar angle:

planeplot(gₚ)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227

Note

The book by Webster and Oliver (2007) and the article by Cressie and Hawkins (1980) are good resources to learn more about robust variogram estimation.

10.2.2 Least-squares fit

After empirical variogram estimation, the next step consists of fitting a theoretical model. This step is necessary for interpolation given that we need to be able to evaluate the variogram function at any lag \(h\), not just the specific lags of the empirical variogram.

Note

Another reason to fit theoretical models is to ensure that variances of linear combinations of variables are always non-negative as discussed in Myers (1992).

To fit a specific theoretical model, we can use the fit function with the model as the first argument:

GeoStatsFunctions.fit(SphericalVariogram, g)
SphericalVariogram
├─ sill: 0.886339
├─ nugget: 2.6816e-10
├─ range: 18.3506 m
└─ distance: Euclidean

We can also let the framework select the model with minimum weighted least-squares error by passing the generic Variogram model to the function:

γ = GeoStatsFunctions.fit(Variogram, g)
GaussianVariogram
├─ sill: 0.886432
├─ nugget: 0.0623571
├─ range: 14.8885 m
└─ distance: Euclidean

10.3 Remarks

This chapter is definitely one of the most challenging ones for those with little background in geostatistics. Let’s make a few important remarks to summarize what we learned:

  • Geospatial correlation can be represented with different functions, including the correlogram, the covariance and the variogram functions. Among these functions the variogram is the most general and easy to interpret as a measure of “spread” in the hscatter plot.
  • The variogram value \(\gamma(h)\) represents the expected variation of a variable that is \(h\) units of distance from a reference point. It usually starts at \(\gamma(0) = 0\), reaches a sill value \(\sigma^2\) near the range and stays at this value as \(h \to \infty\).
  • The selection of an appropriate theoretical variogram model for interpolation of geospatial data is often based on a two-step procedure. First, we estimate the EmpiricalVariogram, and then we fit a theoretical model. The most widely used models are the GaussianVariogram, the SphericalVariogram and the ExponentialVariogram.

In the next chapter, we will learn how to perform geospatial interpolation with the selected theoretical variogram model.