Quickstart
A geostatistical workflow often consists of four steps:
- Definition of geospatial data
- Manipulation of geospatial data
- Geostatistical modeling
- Scientific visualization
In this section, we walk through these steps to illustrate some of the features of the project. In the case of geostatistical modeling, we will specifically explore geostatistical learning models. If you prefer learning from video, check out the recording of our JuliaEO 2023 workshop.
We assume that the following packages are loaded throughout the code examples:
using GeoStats
import CairoMakie as Mke
The GeoStats.jl package exports the full stack for geospatial data science and geostatistical modeling. The CairoMakie.jl package is one of the possible visualization backends from the Makie.jl project.
If you are new to Julia and have never heard of Makie.jl before, here are a few tips to help you choose between the different backends:
- WGLMakie.jl is the preferred backend for interactive visualization on the web browser. It integrates well with Pluto.jl notebooks and other web-based applications.
- GLMakie.jl is the preferred backend for interactive high-performance visualization. It leverages native graphical resources and doesn't require a web browser to function.
- CairoMakie.jl is the preferred backend for publication-quality static visualization. It requires less computing power and is therefore recommended for those users with modest laptops.
We recommend importing the Makie.jl backend as Mke to avoid polluting the Julia session with names from the visualization stack.
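For example, if we later decide that we want interactive plots instead of static ones, we can swap the backend with a single import in a fresh Julia session (a quick sketch, assuming GLMakie.jl is installed in the current environment):

# swap the static CairoMakie.jl backend for the interactive GLMakie.jl backend
import GLMakie as Mke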
Loading/creating data
Loading data
The Julia ecosystem for loading geospatial data comprises several low-level packages such as Shapefile.jl and GeoJSON.jl, which define their own very basic geometry types. Instead of requiring users to learn the so-called GeoInterface.jl to handle these types, we provide the high-level GeoIO.jl package to load any file with geospatial data into well-tested geometries from the Meshes.jl submodule:
using GeoIO
zone = GeoIO.load("data/zone.shp")
PERIMETER | ACRES | MACROZONA | Hectares | area_m2 | geometry |
---|---|---|---|---|---|
Continuous | Continuous | Categorical | Continuous | Continuous | MultiPolygon |
[NoUnits] | [NoUnits] | [NoUnits] | [NoUnits] | [NoUnits] | 🖈 GeodeticLatLon{SAD69} |
5.8508e6 | 3.23145e7 | Estuario | 1.30772e7 | 1.30772e11 | Multi(21×PolyArea) |
9.53947e6 | 2.50594e8 | Fronteiras Antigas | 1.01412e8 | 1.01412e12 | Multi(1×PolyArea) |
1.01743e7 | 2.75528e8 | Fronteiras Intermediarias | 1.11502e8 | 1.11502e12 | Multi(1×PolyArea) |
7.09612e6 | 1.61293e8 | Fronteiras Novas | 6.5273e7 | 6.5273e11 | Multi(2×PolyArea) |
Various functions are defined over these geometries, for instance:
sum(area, zone.geometry)
2.903786892991541e12 m^2
sum(perimeter, zone.geometry)
3.2680213406922124e7 m
Please check the Meshes.jl documentation for more details.
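Many other Meshes.jl utilities work with these geometries; for example, we could compute the bounding box of the whole zone (a small sketch using the boundingbox function, which also appears later in this section):

# axis-aligned box enclosing all multi-polygons in the zone
boundingbox(zone.geometry)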
We highly recommend using Meshes.jl geometries in geospatial workflows as they were carefully designed to accommodate advanced features of the GeoStats.jl framework. Any other geometry type will likely fail with our geostatistical algorithms and pipelines.
Creating data
Geospatial data can be derived from other Julia variables. For example, given a Julia array, which is not attached to any coordinate system, we can georeference the array using the georef function:
Z = [10sin(i/10) + 2j for i in 1:50, j in 1:50]
Ω = georef((Z=Z,))
Z | geometry |
---|---|
Continuous | Quadrangle |
[NoUnits] | 🖈 Cartesian{NoDatum} |
2.99833 | Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)) |
3.98669 | Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m)) |
4.9552 | Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m)) |
5.89418 | Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m)) |
6.79426 | Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m)) |
7.64642 | Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m)) |
8.44218 | Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m)) |
9.17356 | Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m)) |
9.83327 | Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m)) |
10.4147 | Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m)) |
⋮ | ⋮ |
Default coordinates are assigned based on the size of the array, and different configurations can be obtained with different methods (see Data).
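For instance, we could place the same array over a grid with a custom origin and spacing (a sketch, assuming the georef method that accepts an explicit domain and the CartesianGrid constructor with origin and spacing; the numbers are illustrative):

# 50x50 grid with origin (100.0, 100.0) and spacing of 10.0 in each direction
grid = CartesianGrid((50, 50), (100.0, 100.0), (10.0, 10.0))

# georeference the flattened array over the custom grid
Ω₂ = georef((Z=vec(Z),), grid)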
Geospatial data can be visualized with the viz recipe function:
viz(Ω.geometry, color = Ω.Z)
Alternatively, we provide a basic scientific viewer to visualize all viewable variables in the data with a colorbar and other interactive elements (see Visualization):
viewer(Ω)
Julia supports unicode symbols with $\LaTeX$ syntax. We can type \Omega and press TAB to get the symbol Ω in the example above. This autocompletion works in various text editors, including the VSCode editor with the Julia extension.
Manipulating data
Table interface
Our geospatial data implements the Tables.jl interface, which means that it can be accessed as if it were a table with samples in the rows and variables in the columns. In this case, a special column named geometry is created on the fly, row by row, containing Meshes.jl geometries.
For those familiar with the productive DataFrames.jl interface, there is nothing new:
Ω[1,:]
(Z = 2.9983341664682817, geometry = Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)))
Ω[1,:geometry]
Quadrangle
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 1.0 m)
└─ Point(x: 0.0 m, y: 1.0 m)
Ω.Z
2500-element Vector{Float64}:
2.9983341664682817
3.9866933079506124
4.955202066613396
5.894183423086505
6.79425538604203
7.6464247339503535
8.44217687237691
9.173560908995228
9.833269096274833
10.414709848078965
⋮
91.28424227586412
90.83834063250545
90.48397926110484
90.22469882334903
90.06308996366536
90.000767424359
90.0383539116416
90.17547387375667
90.41075725336862
However, notice that our implementation performs some clever optimizations behind the scenes to avoid expensive creation of geometries:
Ω.geometry
50×50 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 50.0 m, y: 50.0 m)
└─ spacing: (1.0 m, 1.0 m)
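Individual geometries are materialized lazily when we request them, for example by indexing the underlying grid (a quick illustration):

# materialize a single element of the lazy grid as a Quadrangle
Ω.geometry[1]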
We can always retrieve the table of attributes (or features) with the function values and the underlying geospatial domain with the function domain. This can be useful for writing algorithms that are type-stable and depend purely on the feature values:
values(Ω)
(Z = [2.9983341664682817, 3.9866933079506124, 4.955202066613396, 5.894183423086505, 6.79425538604203, 7.6464247339503535, 8.44217687237691, 9.173560908995228, 9.833269096274833, 10.414709848078965 … 91.8172288893559, 91.28424227586412, 90.83834063250545, 90.48397926110484, 90.22469882334903, 90.06308996366536, 90.000767424359, 90.0383539116416, 90.17547387375667, 90.41075725336862],)
or on the geometries:
domain(Ω)
50×50 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 50.0 m, y: 50.0 m)
└─ spacing: (1.0 m, 1.0 m)
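The original geospatial data can be reconstructed from these two objects (a minimal sketch):

# georeference the table of attributes over the domain again
georef(values(Ω), domain(Ω))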
Geospatial transforms
It is easy to design advanced geospatial pipelines that operate on both the table of features and the underlying geospatial domain:
# pipeline with table transforms
pipe = Quantile() → StdCoords()
# feed geospatial data to pipeline
Ω̂ = pipe(Ω)
# plot distribution before and after pipeline
fig = Mke.Figure(size = (800, 400))
Mke.hist(fig[1,1], Ω.Z, color = :gray)
Mke.hist(fig[2,1], Ω̂.Z, color = :gray)
fig
Coordinates before pipeline:
boundingbox(Ω.geometry)
Box
├─ min: Point(x: 0.0 m, y: 0.0 m)
└─ max: Point(x: 50.0 m, y: 50.0 m)
and after pipeline:
boundingbox(Ω̂.geometry)
Box
├─ min: Point(x: -0.5 m, y: -0.5 m)
└─ max: Point(x: 0.5 m, y: 0.5 m)
These pipelines are revertible, meaning that we can transform the data, perform geostatistical modeling, and revert the pipelines to obtain estimates in the original sample space (see Transforms).
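In practice, this is done with the apply and revert functions of the transforms API (a sketch of the pattern, not part of the original example):

# forward transform, saving a cache for later reversion
Ω̂, cache = apply(pipe, Ω)

# ... geostatistical modeling with Ω̂ ...

# revert the pipeline to obtain values in the original sample space
revert(pipe, Ω̂, cache)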
Geospatial queries
We provide three macros @groupby, @transform, and @combine for powerful geospatial split-apply-combine patterns, as well as the function geojoin for advanced geospatial joins.
@transform(Ω, :W = 2 * :Z * area(:geometry))
Z | W | geometry |
---|---|---|
Continuous | Continuous | Quadrangle |
[NoUnits] | [m^2] | 🖈 Cartesian{NoDatum} |
2.99833 | 5.99667 m^2 | Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)) |
3.98669 | 7.97339 m^2 | Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m)) |
4.9552 | 9.9104 m^2 | Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m)) |
5.89418 | 11.7884 m^2 | Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m)) |
6.79426 | 13.5885 m^2 | Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m)) |
7.64642 | 15.2928 m^2 | Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m)) |
8.44218 | 16.8844 m^2 | Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m)) |
9.17356 | 18.3471 m^2 | Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m)) |
9.83327 | 19.6665 m^2 | Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m)) |
10.4147 | 20.8294 m^2 | Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m)) |
⋮ | ⋮ | ⋮ |
These are very useful for geospatial data science as they hide the complexity of the geometry column. For more information, check the Queries section of the documentation.
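As a small illustration of the other two macros, we could group the elements by a derived categorical column and summarize each group (a sketch; the :above column is introduced here just for the example):

# flag elements with Z above 50, group by the flag, and sum Z within each group
Ωf = @transform(Ω, :above = :Z > 50)
groups = @groupby(Ωf, :above)
@combine(groups, :Ztotal = sum(:Z))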
Geostatistical modeling
Having defined the geospatial data objects, we proceed to define the geostatistical learning model. Let's assume that we have geospatial data with some variable that we want to predict in a supervised learning setting. We load the data from a CSV file, and inspect the available columns:
using CSV
using DataFrames
table = CSV.File("data/agriculture.csv") |> DataFrame
first(table, 5)
Row | band1 | band2 | band3 | band4 | crop | polygon | x | y |
---|---|---|---|---|---|---|---|---|
 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Float64 |
1 | 121 | 91 | 70 | 384 | 1 | 222.0 | 746.0 | 717.0 |
2 | 112 | 82 | 58 | 336 | 2 | 264.0 | 953.0 | 291.0 |
3 | 120 | 86 | 55 | 389 | 1 | 363.0 | 1050.0 | 796.0 |
4 | 125 | 92 | 78 | 255 | 2 | 295.0 | 787.0 | 921.0 |
5 | 117 | 77 | 59 | 308 | 1 | 347.0 | 947.0 | 758.0 |
Columns band1, band2, ..., band4 represent four satellite bands for different locations (x, y) in this region. The column crop has the crop type for each location, which was labeled manually for the purpose of fitting a learning model.
We can now georeference the table and plot some of the variables:
Ω = georef(table, (:x, :y))
fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω.geometry, color = Ω.band4)
viz(fig[1,2], Ω.geometry, color = Ω.crop)
fig
Similar to a generic statistical learning workflow, we split the data into "train" and "test" sets. The main difference here is that our geospatial geosplit function accepts a separating plane specified by its normal direction (1, -1):
Ωs, Ωt = geosplit(Ω, 0.2, (1.0, -1.0))
viz(Ωs.geometry)
viz!(Ωt.geometry, color = :gray90)
Mke.current_figure()
We can visualize the domain of the "train" (or source) set Ωs in blue, and the domain of the "test" (or target) set Ωt in gray. We reserved 20% of the samples for Ωs and 80% for Ωt. Internally, the geosplit function is implemented in terms of efficient geospatial partitions.
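As a rough illustration of that relationship, a similar split could be written with the partition function and a bisection method (a sketch, assuming the BisectFractionPartition(normal, fraction) constructor from Meshes.jl):

# bisect the data with a plane of normal (1, -1), keeping ~20% of the elements on the first side
Πs, Πt = partition(Ω, BisectFractionPartition((1.0, -1.0), 0.2))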
Let's define our geostatistical learning model to predict the crop type based on the four satellite bands. We will use the DecisionTreeClassifier model, which is suitable for the task we want to perform. Any model from the StatsLearnModels.jl package is supported, including all models from ScikitLearn.jl:
feats = [:band1, :band2, :band3, :band4]
label = :crop
model = DecisionTreeClassifier()
DecisionTreeClassifier
max_depth: -1
min_samples_leaf: 1
min_samples_split: 2
min_purity_increase: 0.0
pruning_purity_threshold: 1.0
n_subfeatures: 0
classes: nothing
root: nothing
We will fit the model in Ωs, where the features and labels are available, and predict in Ωt, where the features are available. The Learn transform automatically fits the model to the data:
learn = Learn(Ωs, model, feats => label)
Learn transform
├─ model: FittedModel{DecisionTreeClassifier}
└─ input: [:band1, :band2, :band3, :band4]
The transform can be called with new data to generate predictions:
Ω̂t = learn(Ωt)
crop | geometry |
---|---|
Categorical | Point |
[NoUnits] | 🖈 Cartesian{NoDatum} |
1 | (x: 746.0 m, y: 717.0 m) |
2 | (x: 953.0 m, y: 291.0 m) |
2 | (x: 1050.0 m, y: 796.0 m) |
2 | (x: 787.0 m, y: 921.0 m) |
2 | (x: 947.0 m, y: 758.0 m) |
2 | (x: 949.0 m, y: 769.0 m) |
2 | (x: 773.0 m, y: 899.0 m) |
2 | (x: 851.0 m, y: 895.0 m) |
3 | (x: 1018.0 m, y: 748.0 m) |
1 | (x: 427.0 m, y: 743.0 m) |
⋮ | ⋮ |
Scientific visualization
We note that the prediction of a geostatistical learning model is a geospatial data object, and we can inspect it with the same methods already described above. This also means that we can visualize the prediction directly, side by side with the true label in this synthetic example:
fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω̂t.geometry, color = Ω̂t.crop)
viz(fig[1,2], Ωt.geometry, color = Ωt.crop)
fig
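Since both geotables share the same domain, we can also compute a simple agreement score between predicted and true labels (a quick sanity check, not part of the original workflow):

# fraction of locations where the predicted crop type matches the true crop type
count(Ω̂t.crop .== Ωt.crop) / length(Ωt.crop)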
With this example we conclude the basic workflow. To get familiar with other features of the project, please check the reference guide.