Clustering
Overview
We provide various geostatistical clustering methods to divide geospatial data into regions with homogeneous features. These methods can consider the values of the geotable (the classical approach), or both the values and the domain (the geostatistical approach).
Consider the following geotable for illustration purposes:
gtb = georef((z=[10sin(i/10) + j for i in 1:4:100, j in 1:4:100],))
gtb |> viewer
Classical
Geostatistical
Unlike classical clustering methods in machine learning, geostatistical clustering (a.k.a. domaining) methods consider both the values and the domain of the data.
GeoStatsTransforms.GHC - Type
GHC(k, λ; nmax=2000, kern=:epanechnikov, link=:ward, as=:cluster)
A transform for partitioning geospatial data into k clusters according to a range λ using Geostatistical Hierarchical Clustering (GHC). The larger the range, the more connected nearby samples are.
Parameters
- k - Approximate number of clusters
- λ - Approximate range of kernel function in length units
Options
- nmax - Maximum number of observations to use in dissimilarity matrix
- kern - Kernel function (:uniform, :triangular or :epanechnikov)
- link - Linkage function (:single, :average, :complete, :ward or :ward_presquared)
- as - Variable name used to store clustering results
References
- Fouedjio, F. 2016. A hierarchical clustering method for multivariate geostatistical data
Notes
- The range parameter controls the sparsity pattern of the pairwise distances, which can greatly affect the computational performance of the GHC algorithm. We recommend choosing a range that is just large enough to connect nearby samples. For example, clustering data over a 100x100 Cartesian grid with unit spacing is possible with λ=1.0 or λ=2.0, but the problem starts to become computationally infeasible around λ=10.0 due to the density of points.
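To get a feel for why the range matters so much, the following sketch counts how many grid neighbors fall within a distance λ of an interior point on a unit-spacing grid. It is only a rough proxy for the density of the pairwise distances and is not how GHC builds its dissimilarity matrix; the helper name `neighbors_within` is hypothetical and not part of GeoStatsTransforms.

```julia
# Count lattice offsets (Δi, Δj) ≠ (0, 0) with Δi² + Δj² ≤ λ²,
# i.e. the neighbors of an interior point on a unit-spacing grid
# that lie within distance λ. Hypothetical helper for illustration.
function neighbors_within(λ)
    r = floor(Int, λ)
    # subtract 1 to exclude the point itself (Δi = Δj = 0)
    count(di^2 + dj^2 ≤ λ^2 for di in -r:r, dj in -r:r) - 1
end

for λ in (1.0, 2.0, 10.0)
    println("λ = ", λ, " => ", neighbors_within(λ), " neighbors per point")
end
```

For an interior point this gives 4, 12 and 316 neighbors at λ = 1.0, 2.0 and 10.0 respectively, so the number of nonzero pairs grows roughly with λ².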
ctb = gtb |> GHC(20, 1.0)
ctb |> viewer
GeoStatsTransforms.GSC - Type
GSC(k, m; σ=1.0, tol=1e-4, maxiter=10, weights=nothing, as=:cluster)
A transform for partitioning geospatial data into k clusters using Geostatistical Spectral Clustering (GSC).
Parameters
- k - Desired number of clusters
- m - Multiplicative factor for adjacent weights
Options
- σ - Standard deviation for exponential model (defaults to 1.0)
- tol - Tolerance of k-means algorithm (defaults to 1e-4)
- maxiter - Maximum number of iterations (defaults to 10)
- weights - Dictionary with weights for each attribute (defaults to nothing)
- as - Variable name used to store clustering results
References
- Romary et al. 2015. Unsupervised classification of multivariate geostatistical data: Two algorithms
Notes
- The algorithm implemented here differs slightly from the algorithm described in Romary et al. 2015. Instead of setting Wᵢⱼ = 0 when i and j are not adjacent (i <-/-> j), we simply magnify the weight by a multiplicative factor, Wᵢⱼ *= m, when i and j are adjacent (i <-> j). This leads to dense matrices but also to better results in practice.
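The weight adjustment described in this note can be sketched on a toy example. The 3x3 weight matrix, the adjacency list and the helper name `magnify!` below are all hypothetical and chosen only for illustration; they do not reflect the internals of GSC.

```julia
# Multiply the weights of adjacent pairs by m, leaving all other
# weights untouched (instead of zeroing non-adjacent pairs).
# Hypothetical helper for illustration only.
function magnify!(W, adjacent, m)
    for (i, j) in adjacent
        W[i, j] *= m
        W[j, i] *= m  # keep the matrix symmetric
    end
    return W
end

# Toy symmetric weights for three samples (made-up values)
W = [1.0 0.5 0.2;
     0.5 1.0 0.5;
     0.2 0.5 1.0]

# Assume samples 1-2 and 2-3 are adjacent, and 1-3 are not
magnify!(W, [(1, 2), (2, 3)], 2.0)
```

After the call, the weights of the adjacent pairs are doubled while the weight between samples 1 and 3 stays at 0.2 rather than dropping to zero, which is why the matrix remains dense.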
ctb = gtb |> GSC(50, 2.0)
ctb |> viewer
GeoStatsTransforms.SLIC - Type
SLIC(k, m; tol=1e-4, maxiter=10, weights=nothing, as=:cluster)
A transform for clustering geospatial data into approximately k clusters using Simple Linear Iterative Clustering (SLIC).
The transform produces clusters of samples that are spatially connected based on a distance dₛ and that, at the same time, are similar in terms of their attribute values based on a distance dᵥ. The tradeoff is controlled with a hyperparameter m in an additive model dₜ = √(dᵥ² + m²(dₛ/s)²).
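The additive model can be written out as a small function to see how m shifts the balance between the two distances. This is a minimal sketch and not the package's implementation; the helper name `slicdist` is hypothetical, and s is assumed here to be a spatial normalization scale (the spacing between initial cluster centers in Achanta et al. 2011), which the text above does not define.

```julia
# Combined SLIC distance dₜ = √(dᵥ² + m²(dₛ/s)²).
# dv: distance between attribute values, ds: spatial distance,
# m: tradeoff hyperparameter, s: assumed spatial normalization scale.
slicdist(dv, ds; m, s) = sqrt(dv^2 + m^2 * (ds / s)^2)

# With m = 0 the spatial term vanishes and only dᵥ matters
d0 = slicdist(3.0, 4.0; m=0.0, s=1.0)   # 3.0

# With m = 1 and s = 1 both terms contribute equally
d1 = slicdist(3.0, 4.0; m=1.0, s=1.0)   # 5.0
```

Small values of m, such as the 0.01 used in the example, let the attribute values dominate, while larger values favor spatially compact clusters.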
Parameters
- k - Approximate number of clusters
- m - Hyperparameter of SLIC model
Options
- tol - Tolerance of k-means algorithm (defaults to 1e-4)
- maxiter - Maximum number of iterations (defaults to 10)
- weights - Dictionary with weights for each attribute (defaults to nothing)
- as - Variable name used to store clustering results
References
- Achanta et al. 2011. SLIC superpixels compared to state-of-the-art superpixel methods
ctb = gtb |> SLIC(50, 0.01)
ctb |> viewer