Quickstart

A geostatistical workflow often consists of four steps:

  1. Definition of geospatial data
  2. Manipulation of geospatial data
  3. Geostatistical modeling
  4. Scientific visualization

In this section, we walk through these steps to illustrate some of the features of the project. In the case of geostatistical modeling, we will specifically explore geostatistical learning models. If you prefer learning from video, check out the recording of our JuliaEO 2023 workshop.

We assume that the following packages are loaded throughout the code examples:

using GeoStats
import CairoMakie as Mke

The GeoStats.jl package exports the full stack for geospatial data science and geostatistical modeling. The CairoMakie.jl package is one of the possible visualization backends from the Makie.jl project.

If you are new to Julia and have never heard of Makie.jl before, here are a few tips to help you choose between the different backends:

  • WGLMakie.jl is the preferred backend for interactive visualization on the web browser. It integrates well with Pluto.jl notebooks and other web-based applications.

  • GLMakie.jl is the preferred backend for interactive high-performance visualization. It leverages native graphical resources and doesn't require a web browser to function.

  • CairoMakie.jl is the preferred backend for publication-quality static visualization. It requires less computing power and is therefore recommended for those users with modest laptops.
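Switching between backends only requires changing the import line; thanks to the common alias, the rest of the code stays the same. A minimal sketch:

```julia
# choose exactly one backend; the alias Mke keeps all plotting code unchanged
import CairoMakie as Mke   # static, publication-quality output
# import GLMakie as Mke    # native interactive windows
# import WGLMakie as Mke   # interactive plots in the web browser
```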

Note

We recommend importing the Makie.jl backend as Mke to avoid polluting the Julia session with names from the visualization stack.

Loading/creating data

Loading data

The Julia ecosystem for loading geospatial data comprises several low-level packages, such as Shapefile.jl and GeoJSON.jl, each defining its own very basic geometry types. Instead of requiring users to learn the so-called GeoInterface.jl to handle these types, we provide the high-level GeoIO.jl package to load any file with geospatial data into well-tested geometries from the Meshes.jl submodule:

using GeoIO

zone = GeoIO.load("data/zone.shp")
path = GeoIO.load("data/path.shp")

viz(zone.geometry)
viz!(path.geometry, color = :gray90)
Mke.current_figure()
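GeoIO.jl can also save geotables, with the file extension selecting the format. A minimal sketch, assuming we want to convert the loaded shapefile to GeoJSON (the output path is hypothetical):

```julia
using GeoIO

# load geometries from a shapefile and save them to GeoJSON;
# the file extension selects the driver in both directions
zone = GeoIO.load("data/zone.shp")
GeoIO.save("zone.geojson", zone)
```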

Various functions are defined over these geometries, for instance:

sum(area, zone.geometry)
238.0147015754351
sum(perimeter, zone.geometry)
295.25311339990634

Please check the Meshes.jl documentation for more details.

Note

We highly recommend using Meshes.jl geometries in geospatial workflows as they were carefully designed to accommodate advanced features of the GeoStats.jl framework. Any other geometry type will likely fail with our geostatistical algorithms and pipelines.

Creating data

Geospatial data can be derived from other Julia variables. For example, given a Julia array, which is not attached to any coordinate system, we can georeference the array using the georef function:

Z = [10sin(i/10) + 2j for i in 1:50, j in 1:50]

Ω = georef((Z=Z,))
2500×2 GeoTable over 50×50 CartesianGrid{2,Float64}
      Z                        geometry
 Continuous                  Quadrangle
 [NoUnits]
  2.99833    Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
  3.98669    Quadrangle((1.0, 0.0), ..., (1.0, 1.0))
  4.9552     Quadrangle((2.0, 0.0), ..., (2.0, 1.0))
  5.89418    Quadrangle((3.0, 0.0), ..., (3.0, 1.0))
  6.79426    Quadrangle((4.0, 0.0), ..., (4.0, 1.0))
  7.64642    Quadrangle((5.0, 0.0), ..., (5.0, 1.0))
  8.44218    Quadrangle((6.0, 0.0), ..., (6.0, 1.0))
  9.17356    Quadrangle((7.0, 0.0), ..., (7.0, 1.0))
  9.83327    Quadrangle((8.0, 0.0), ..., (8.0, 1.0))
 10.4147     Quadrangle((9.0, 0.0), ..., (9.0, 1.0))

Default coordinates are assigned based on the size of the array, and different configurations can be obtained with different methods (see Data).
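For example, assuming we want 2.0 × 2.0 cells starting at an origin of (100.0, 100.0) instead of the default unit grid, we could pass an explicit domain as the second argument (a sketch; the grid parameters are hypothetical):

```julia
using GeoStats

# array of values as before
Z = [10sin(i/10) + 2j for i in 1:50, j in 1:50]

# 50×50 grid with custom origin and custom cell spacing
grid = CartesianGrid((50, 50), (100.0, 100.0), (2.0, 2.0))

# georeference the array over the custom grid
Ω₂ = georef((Z=Z,), grid)
```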

Geospatial data can be visualized with the viz recipe function:

viz(Ω.geometry, color = Ω.Z)

Alternatively, we provide a basic scientific viewer to visualize all viewable variables in the data with a colorbar and other interactive elements (see Visualization):

viewer(Ω)

Note

Julia supports Unicode symbols typed with $\LaTeX$-like syntax. We can type \Omega and press TAB to get the symbol Ω in the example above. This autocompletion works in various text editors, including the VSCode editor with the Julia extension.

Manipulating data

Table interface

Our geospatial data implements the Tables.jl interface, which means that it can be accessed as if it were a table with samples in the rows and variables in the columns. In this case, a special column named geometry is created on the fly, row by row, containing Meshes.jl geometries.

For those familiar with the productive DataFrames.jl interface, there is nothing new:

Ω[1,:]
(Z = 2.9983341664682817, geometry = Quadrangle((0.0, 0.0), ..., (0.0, 1.0)))
Ω[1,:geometry]
Quadrangle{2,Float64}
├─ Point(0.0, 0.0)
├─ Point(1.0, 0.0)
├─ Point(1.0, 1.0)
└─ Point(0.0, 1.0)
Ω.Z
2500-element Vector{Float64}:
  2.9983341664682817
  3.9866933079506124
  4.955202066613396
  5.894183423086505
  6.79425538604203
  7.6464247339503535
  8.44217687237691
  9.173560908995228
  9.833269096274833
 10.414709848078965
  ⋮
 91.28424227586412
 90.83834063250545
 90.48397926110484
 90.22469882334903
 90.06308996366536
 90.000767424359
 90.0383539116416
 90.17547387375667
 90.41075725336862

However, notice that our implementation performs some clever optimizations behind the scenes to avoid expensive creation of geometries:

Ω.geometry
50×50 CartesianGrid{2,Float64}
  minimum: Point(0.0, 0.0)
  maximum: Point(50.0, 50.0)
  spacing: (1.0, 1.0)

We can always retrieve the table of attributes (or features) with the function values and the underlying geospatial domain with the function domain. This can be useful for writing algorithms that are type-stable and depend purely on the feature values:

values(Ω)
(Z = [2.9983341664682817, 3.9866933079506124, 4.955202066613396, 5.894183423086505, 6.79425538604203, 7.6464247339503535, 8.44217687237691, 9.173560908995228, 9.833269096274833, 10.414709848078965  …  91.8172288893559, 91.28424227586412, 90.83834063250545, 90.48397926110484, 90.22469882334903, 90.06308996366536, 90.000767424359, 90.0383539116416, 90.17547387375667, 90.41075725336862],)

or on the geometries:

domain(Ω)
50×50 CartesianGrid{2,Float64}
  minimum: Point(0.0, 0.0)
  maximum: Point(50.0, 50.0)
  spacing: (1.0, 1.0)
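In fact, values and domain are the two halves of a geotable, and georef puts them back together. A small sketch of this round trip:

```julia
# reassemble the geotable from its attribute table and its domain
Ω′ = georef(values(Ω), domain(Ω))
```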

Geospatial transforms

It is easy to design advanced geospatial pipelines that operate on both the table of features and the underlying geospatial domain:

# pipeline with table transforms
pipe = Quantile() → StdCoords()

# feed geospatial data to pipeline
Ω̂ = pipe(Ω)

# plot distribution before and after pipeline
fig = Mke.Figure(size = (800, 400))
Mke.hist(fig[1,1], Ω.Z, color = :gray)
Mke.hist(fig[2,1], Ω̂.Z, color = :gray)
fig

Coordinates before pipeline:

boundingbox(Ω.geometry)
Box{2,Float64}
├─ min: Point(0.0, 0.0)
└─ max: Point(50.0, 50.0)

and after pipeline:

boundingbox(Ω̂.geometry)
Box{2,Float64}
├─ min: Point(-0.5, -0.5)
└─ max: Point(0.5, 0.5)

These pipelines are revertible, meaning that we can transform the data, perform geostatistical modeling, and revert the pipelines to obtain estimates in the original sample space (see Transforms).
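A sketch of this revert mechanism, using the apply/revert functions from the transforms API (the cache holds the information needed to undo the pipeline):

```julia
# apply the pipeline, keeping a cache for later reversion
Ω̂, cache = apply(pipe, Ω)

# ... geostatistical modeling with Ω̂ would happen here ...

# revert the pipeline to recover the original sample space
Ωr = revert(pipe, Ω̂, cache)
```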

Geospatial queries

We provide three macros @groupby, @transform and @combine for powerful geospatial split-apply-combine patterns, as well as the function geojoin for advanced geospatial joins.

@transform(Ω, :W = 2 * :Z * area(:geometry))
2500×3 GeoTable over 50×50 CartesianGrid{2,Float64}
      Z           W                        geometry
 Continuous  Continuous                  Quadrangle
 [NoUnits]   [NoUnits]
  2.99833     5.99667    Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
  3.98669     7.97339    Quadrangle((1.0, 0.0), ..., (1.0, 1.0))
  4.9552      9.9104     Quadrangle((2.0, 0.0), ..., (2.0, 1.0))
  5.89418    11.7884     Quadrangle((3.0, 0.0), ..., (3.0, 1.0))
  6.79426    13.5885     Quadrangle((4.0, 0.0), ..., (4.0, 1.0))
  7.64642    15.2928     Quadrangle((5.0, 0.0), ..., (5.0, 1.0))
  8.44218    16.8844     Quadrangle((6.0, 0.0), ..., (6.0, 1.0))
  9.17356    18.3471     Quadrangle((7.0, 0.0), ..., (7.0, 1.0))
  9.83327    19.6665     Quadrangle((8.0, 0.0), ..., (8.0, 1.0))
 10.4147     20.8294     Quadrangle((9.0, 0.0), ..., (9.0, 1.0))

These are very useful for geospatial data science as they hide the complexity of the geometry column. For more information, check the Queries section of the documentation.
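As another sketch, @combine can reduce whole columns; here we summarize Z with its mean (the column name :Z_mean is a hypothetical choice):

```julia
using Statistics

# reduce the Z column over the whole domain; combined with
# @groupby, the same reduction runs once per group
Ωc = @combine(Ω, :Z_mean = mean(:Z))
```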

Geostatistical modeling

Having defined the geospatial data objects, we proceed and define the geostatistical learning model. Let's assume that we have geospatial data with some variable that we want to predict in a supervised learning setting. We load the data from a CSV file, and inspect the available columns:

using CSV
using DataFrames

table = CSV.File("data/agriculture.csv") |> DataFrame

first(table, 5)
5×8 DataFrame
 Row │ band1  band2  band3  band4  crop   polygon  x        y
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────
   1 │   121     91     70    384      1    222.0    746.0    717.0
   2 │   112     82     58    336      2    264.0    953.0    291.0
   3 │   120     86     55    389      1    363.0   1050.0    796.0
   4 │   125     92     78    255      2    295.0    787.0    921.0
   5 │   117     77     59    308      1    347.0    947.0    758.0

Columns band1, band2, ..., band4 represent four satellite bands at different locations (x, y) in a region. The column crop holds the crop type at each location, labeled manually for the purpose of fitting a learning model.

We can now georeference the table and plot some of the variables:

Ω = georef(table, (:x, :y))

fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω.geometry, color = Ω.band4)
viz(fig[1,2], Ω.geometry, color = Ω.crop)
fig

Similar to a generic statistical learning workflow, we split the data into "train" and "test" sets. The main difference here is that the geosplit function performs a geospatial split: it accepts a separating plane specified by its normal direction (1, -1):

Ωs, Ωt = geosplit(Ω, 0.2, (1.0, -1.0))

viz(Ωs.geometry)
viz!(Ωt.geometry, color = :gray90)
Mke.current_figure()

We can visualize the domain of the "train" (or source) set Ωs in blue, and the domain of the "test" (or target) set Ωt in gray. We reserved 20% of the samples for Ωs and 80% for Ωt. Internally, geosplit is implemented in terms of efficient geospatial partitions.
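Under the assumption that the split is backed by a bisection partition, the same result could be sketched with an explicit partition method from the Meshes.jl submodule (the exact constructor signature is an assumption):

```julia
# hypothetical equivalent: bisect the domain along the direction
# normal to (1.0, -1.0), keeping a fraction of 20% on one side
method = BisectFractionPartition((1.0, -1.0), 0.2)

# partition the data and take the two resulting subsets
Π = partition(Ω, method)
Ωs, Ωt = Π[1], Π[2]
```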

Let's define our geostatistical learning model to predict the crop type based on the four satellite bands. We will use the DecisionTreeClassifier model, which is suitable for this classification task. Any model from StatsLearnModels.jl is supported, including all models from ScikitLearn.jl:

feats = [:band1, :band2, :band3, :band4]
label = :crop

model = DecisionTreeClassifier()
DecisionTreeClassifier
max_depth:                -1
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

We will fit the model in Ωs where the features and labels are available and predict in Ωt where the features are available. The Learn transform automatically fits the model to the data:

learn = Learn(Ωs, model, feats => label)
Learn transform
├─ model = FittedModel{DecisionTreeClassifier}
└─ input = [:band1, :band2, :band3, :band4]

The transform can be called with new data to generate predictions:

Ω̂t = learn(Ωt)
60039×2 GeoTable over 60039 view(::PointSet{2,Float64}, [1, 2, 3, 4, ..., 74997, 74998, 74999, 75000])
    crop           geometry
 Categorical        Point2
  [NoUnits]
      1        (746.0, 717.0)
      2        (953.0, 291.0)
      2        (1050.0, 796.0)
      2        (787.0, 921.0)
      2        (947.0, 758.0)
      2        (949.0, 769.0)
      2        (773.0, 899.0)
      2        (851.0, 895.0)
      3        (1018.0, 748.0)
      1        (427.0, 743.0)

Scientific visualization

We note that the prediction of a geostatistical learning model is a geospatial data object, which we can inspect with the same methods described above. This also means that we can visualize the prediction directly, side by side with the true labels in this synthetic example:

fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω̂t.geometry, color = Ω̂t.crop)
viz(fig[1,2], Ωt.geometry, color = Ωt.crop)
fig
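Because the predictions and the true labels live on the same domain, we could also quantify the agreement with a simple accuracy score (a sketch; GeoStats.jl provides dedicated validation methods for rigorous error estimation):

```julia
using Statistics

# fraction of test locations where the predicted crop type
# matches the manually labeled crop type
accuracy = mean(Ω̂t.crop .== Ωt.crop)
```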

With this example we conclude the basic workflow. To get familiar with other features of the project, please check the reference guide.