In Part II and Part III of the book, we learned two important tools for efficient geospatial data science. We learned how transform pipelines can be used to prepare geospatial data for investigation, and how geospatial queries can be used to answer geoscientific questions. Before we can learn our third tool, we need to review the important concept of geospatial correlation:
Definition
Given \(n\) pairs of measurements \(\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\) from variables \(X\) and \(Y\) that are \(h\) units of distance apart, we define geospatial correlation as the sample Pearson correlation coefficient:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
The hscatter plot can be used to visualize the scatter of pairs \(\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\) at a given lag \(h\). We can choose a variable \(X\) for the horizontal axis, a (possibly different) variable \(Y\) for the vertical axis, and the value of the lag \(h\). In order to reduce the computational costs associated with the plot, we will sample a subset of measurements from the image:
Quadrangle((x: 79.0 m, y: 2.0 m), ..., (x: 79.0 m, y: 3.0 m))
-0.524057
Quadrangle((x: 88.0 m, y: 29.0 m), ..., (x: 88.0 m, y: 30.0 m))
0.955714
Quadrangle((x: 57.0 m, y: 75.0 m), ..., (x: 57.0 m, y: 76.0 m))
-0.774632
Quadrangle((x: 66.0 m, y: 93.0 m), ..., (x: 66.0 m, y: 94.0 m))
-0.606988
Quadrangle((x: 56.0 m, y: 5.0 m), ..., (x: 56.0 m, y: 6.0 m))
0.897755
Quadrangle((x: 37.0 m, y: 32.0 m), ..., (x: 37.0 m, y: 33.0 m))
-0.722238
Quadrangle((x: 25.0 m, y: 53.0 m), ..., (x: 25.0 m, y: 54.0 m))
-0.841425
Quadrangle((x: 53.0 m, y: 84.0 m), ..., (x: 53.0 m, y: 85.0 m))
1.09665
Quadrangle((x: 59.0 m, y: 63.0 m), ..., (x: 59.0 m, y: 64.0 m))
0.184124
Quadrangle((x: 16.0 m, y: 99.0 m), ..., (x: 16.0 m, y: 100.0 m))
⋮
⋮
If we plot the values of the variable Z in the horizontal axis and the values of the same variable measured at lag \(h=0\) on the vertical axis, we get points along the identity line (i.e. no scatter):
hscatter(sample, :Z, :Z, lag=0.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
By increasing the value of the lag, we observe that the correlation is no longer equal to one, and that the linear fit through the points approaches the horizontal axis (i.e., zero correlation):
hscatter(sample, :Z, :Z, lag=3.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
hscatter(sample, :Z, :Z, lag=5.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
hscatter(sample, :Z, :Z, lag=10.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
hscatter(sample, :Z, :Z, lag=50.0)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
The Pearson correlation coefficient studied as a function of the lag \(h\) is known as the correlogram function. For example, consider the exponential correlogram function given by \(cor(h) = \exp(-h)\):
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
The correlogram \(cor(h)\) is often a non-increasing function
It coincides with the usual correlation in data science at \(h=0\)
\(cor(h) \to 0\) as \(h \to \infty\) in most practical cases
The terms auto-correlogram (\(X = Y\)) and cross-correlogram (\(X \ne Y\)) are also encountered in the literature to differentiate the various geospatial correlations in the multivariate case. Similarly, the terms auto-covariance and cross-covariance are encountered by replacing the correlation by the covariance (non-normalized correlation).
Even though the correlogram function is widely used in other scientific fields, we will review an alternative statistic of association that is most useful in geospatial data science.
10.1 Variography
Definition
The variogram function is a more general alternative to the correlogram and covariance functions that does not rely on the mean values \(\bar{x}\) and \(\bar{y}\). It is given by
The value \(\gamma(h)\) measures the “spread” of the hscatter plot. Usually, at \(h=0\) there is no spread, and hence \(\gamma(0) = 0\). In most practical cases \(\gamma(h) \to \sigma^2\) as \(h \to \infty\) where \(\sigma^2\) is the maximum variance of the process. When this maximum variance exists, we can write the following relation:
\[
\gamma(h) = \sigma^2 - cov(h)
\]
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
where \(cov(h)\) is the covariance function, a version of the correlogram function that is scaled by the standard deviations of X and Y:
\[
cor(h) = \frac{cov(h)}{\sigma_x \sigma_y}
\]
Explaining why the variogram is more general than the covariance is out of scope for this book, but it has to do with the fact that variograms operate on the “difference process” \((x_i - x_j)\) as opposed to the centered process \((x_i - \bar{x})\). In particular, it does not require a finite maximum variance \(\sigma^2\).
Note
The theory of intrinsic random functions of order k (IRF-k) is an advanced concept from geostatistical theory that explains the generality of the variogram function (Chilès and Delfiner 2012).
Our main goal here is to gain intuition about the variogram function for interpolation purposes. It suffices to learn its four basic elements: range, sill, nugget and model.
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
10.1.1 range
The range (a.k.a. correlation length) of the variogram determines the average size of “blobs” in the image. Let’s consider two synthetic images with ranges 10 and 30, respectively:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
In the first image, we can clearly visualize the average size of yellow and blue blobs around 10 pixels (i.e. quadrangles). In the second image, the blobs have an average size of 30 pixels, which is greater than one of the sides of the grid (100x25 pixels).
10.1.2 sill
To understand the sill of the variogram, let’s consider a 1D grid as our domain, and let’s represent the values of the variable with height instead of color:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
The sill determines the maximum variance of the process. If the sill is \(\sigma^2\), then a process with mean \(\mu\) will oscillate within \(\mu\pm3\sigma\) with 99.7% probability. The vertical amplitude in the second plot is (3x) larger than that of the first plot. In both plots, we have \(\mu=0\).
10.1.3 nugget
The nugget can be used to insert additional variance at scales that are smaller than the scale of measurements (i.e. pixel). It is known in the image processing literature as salt-and-pepper noise:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
We can visualize the nugget effect in our 2D grid with colors as before:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
Note
The name “nugget” comes from gold nuggets in mining geostatistics. These are often much smaller than selective mining units (SMUs), and show as bright values in the 3D model of the mineral deposit.
10.1.4 model
Finally, the model of the variogram determines how the function increases near the origin. The GeoStats.jl framework provides dozens of such models of geospatial correlation. The most widely used are the GaussianVariogram, the SphericalVariogram and the ExponentialVariogram:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
The faster is the increase of the function near the origin, the more “erratic” is the process:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
All the four elements of the variogram function can be easily set at construction time:
And queried later with the corresponding functions:
range(γ), sill(γ), nugget(γ)
(10.0 m, 2.0, 0.1)
We can evaluate the variogram at any given lag:
γ(1.0)
0.15615445670336806
Or evaluate the variogram between any two points:
γ(Point(0, 0), Point(1, 0))
0.15615445670336806
In this case, the Euclidean metric is used by default to compute the lag. More generally, we can evaluate the variogram between any two geometries:
γ(Point(0, 0), Triangle((0, 0), (1, 0), (1, 1)))
0.13352136956222632
Note
The evaluation of the variogram function between two geometries is known as variogram regularization, and implemented in terms of numerical integration.
Remind that the variogram value \(\gamma(h)\) is a measure of spread in the hscatter plot. It tells how much variation is expected for a variable at a distance \(h\) from a reference point.
10.2 Fitting models
Given geospatial data, how do we fit an appropriate variogram model for it? This practical question is traditionally answered in two steps as follows.
10.2.1 Empirical estimate
Let’s recap the synthetic image from the beginning of the chapter:
img |> viewer
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
We can use the EmpiricalVariogram to estimate the function at specific lag values:
g =EmpiricalVariogram(img, :Z, maxlag =50.0)
EmpiricalVariogram
├─ abscissas: [1.77383 m, 3.97958 m, 6.34839 m, ..., 43.762 m, 46.2475 m, 48.7361 m]
├─ ordinates: [0.0502087, 0.212007, 0.424966, ..., 0.874137, 0.880386, 0.886483]
├─ distance: Euclidean(0.0)
├─ estimator: MatheronEstimator()
└─ npairs: 24138788
varioplot(g)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
Note
The numbers and bars in the empirical variogram plot represent the number of pairs used to estimate the value of the variogram at the corresponding bin. The larger the number, the more confident we can be in the estimate.
The DirectionalVariogram can be used to estimate the function along specific directions:
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
In this example, we observe that the blobs are elongated with a horizontal range of 30 pixels and a vertical range of 10 pixels. This is known as geometric anisotropy.
We can also estimate the variogram in all directions on a plane with the EmpiricalVarioplane:
The varioplane is usually plotted on a polar axis to highlight the different ranges as a function of the polar angle:
planeplot(gₚ)
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/We6MY/src/scenes.jl:227
Note
The book by Webster and Oliver (2007) and the article by Cressie and Hawkins (1980) are good resources to learn more about robust variogram estimation.
10.2.2 Least-squares fit
After empirical variogram estimation, the next step consists of fitting a theoretical model. This step is necessary for interpolation given that we need to be able to evaluate the variogram function at any lag \(h\), not just the specific lags of the empirical variogram.
Note
Another reason to fit theoretical models is to ensure that variances of linear combinations of variables are always non-negative as discussed in Myers (1992).
To fit a specific theoretical model, we can use the fit function with the model as the first argument:
This chapter is definitely one of the most challenging ones for those with little background in geostatistics. Let’s make a few important remarks to summarize what we learned:
Geospatial correlation can be represented with different functions, including the correlogram, the covariance and the variogram functions. Among these functions the variogram is the most general and easy to interpret as a measure of “spread” in the hscatter plot.
The variogram value \(\gamma(h)\) represents the expected variation of a variable that is \(h\) units of distance from a reference point. It usually starts at \(\gamma(0) = 0\), reaches a sill value \(\sigma^2\) near the range and stays at this value as \(h \to \infty\).
The selection of an appropriate theoretical variogram model for interpolation of geospatial data is often based on a two-step procedure. First, we estimate the EmpiricalVariogram, and then we fit a theoretical model. The most widely used models are the GaussianVariogram, the SphericalVariogram and the ExponentialVariogram.
In the next chapter, we will learn how to perform geospatial interpolation with the selected theoretical variogram model.
Cressie, Noel, and Douglas M. Hawkins. 1980. “Robust Estimation of the Variogram: i.”Journal of the International Association for Mathematical Geology 12 (2): 115–25. https://doi.org/10.1007/bf01035243.
Myers, Donald E. 1992. “Kriging, Cokriging, Radial Basis Functions and the Role of Positive Definiteness.”Computers & Mathematics with Applications 24 (12): 139–48. https://doi.org/https://doi.org/10.1016/0898-1221(92)90176-I.