Quickstart
A geostatistical workflow often consists of four steps:
- Definition of geospatial data
- Manipulation of geospatial data
- Geostatistical modeling
- Scientific visualization
In this section, we walk through these steps to illustrate some of the features of the project. In the case of geostatistical modeling, we will specifically explore geostatistical learning models. If you prefer learning from video, check out the recording of our JuliaEO 2023 workshop:
We assume that the following packages are loaded throughout the code examples:
using GeoStats
import CairoMakie as Mke
The GeoStats.jl package exports the full stack for geospatial data science and geostatistical modeling. The CairoMakie.jl package is one of the possible visualization backends from the Makie.jl project.
If you are new to Julia and have never heard of Makie.jl before, here are a few tips to help you choose between the different backends:
WGLMakie.jl is the preferred backend for interactive visualization on the web browser. It integrates well with Pluto.jl notebooks and other web-based applications.
GLMakie.jl is the preferred backend for interactive high-performance visualization. It leverages native graphical resources and doesn't require a web browser to function.
CairoMakie.jl is the preferred backend for publication-quality static visualization. It requires less computing power and is therefore recommended for those users with modest laptops.
We recommend importing the Makie.jl backend as Mke
to avoid polluting the Julia session with names from the visualization stack.
Loading/creating data
Loading data
The Julia ecosystem for loading geospatial data is comprised of several low-level packages such as Shapefile.jl and GeoJSON.jl, which define their own very basic geometry types. Instead of requesting users to learn the so called GeoInterface.jl to handle these types, we provide the high-level GeoIO.jl package to load any file with geospatial data into well-tested geometries from the Meshes.jl submodule:
using GeoIO
zone = GeoIO.load("data/zone.shp")
PERIMETER | ACRES | MACROZONA | Hectares | area_m2 | geometry |
---|---|---|---|---|---|
Continuous | Continuous | Categorical | Continuous | Continuous | MultiPolygon |
[NoUnits] | [NoUnits] | [NoUnits] | [NoUnits] | [NoUnits] | 🖈 GeodeticLatLon{SAD69} |
5.8508e6 | 3.23145e7 | Estuario | 1.30772e7 | 1.30772e11 | Multi(21×PolyArea) |
9.53947e6 | 2.50594e8 | Fronteiras Antigas | 1.01412e8 | 1.01412e12 | Multi(1×PolyArea) |
1.01743e7 | 2.75528e8 | Fronteiras Intermediarias | 1.11502e8 | 1.11502e12 | Multi(1×PolyArea) |
7.09612e6 | 1.61293e8 | Fronteiras Novas | 6.5273e7 | 6.5273e11 | Multi(2×PolyArea) |
Various functions are defined over these geometries, for instance:
sum(area, zone.geometry)
2.9037868929915405e12 m^2
sum(perimeter, zone.geometry)
3.2680213406922124e7 m
Please check the Meshes.jl documentation for more details.
We highly recommend using Meshes.jl geometries in geospatial workflows as they were carefully designed to accomodate advanced features of the GeoStats.jl framework. Any other geometry type will likely fail with our geostatistical algorithms and pipelines.
Creating data
Geospatial data can be derived from other Julia variables. For example, given a Julia array, which is not attached to any coordinate system, we can georeference the array using the georef
function:
Z = [10sin(i/10) + 2j for i in 1:50, j in 1:50]
Ω = georef((Z=Z,))
Z | geometry |
---|---|
Continuous | Quadrangle |
[NoUnits] | 🖈 Cartesian{NoDatum} |
2.99833 | Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)) |
3.98669 | Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m)) |
4.9552 | Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m)) |
5.89418 | Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m)) |
6.79426 | Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m)) |
7.64642 | Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m)) |
8.44218 | Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m)) |
9.17356 | Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m)) |
9.83327 | Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m)) |
10.4147 | Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m)) |
⋮ | ⋮ |
Default coordinates are assigned based on the size of the array, and different configurations can be obtained with different methods (see Data).
Geospatial data can be visualized with the viz
recipe function:
viz(Ω.geometry, color = Ω.Z)
Alternatively, we provide a basic scientific viewer
to visualize all viewable variables in the data with a colorbar and other interactive elements (see Visualization):
viewer(Ω)
Julia supports unicode symbols with $\LaTeX$ syntax. We can type \Omega
and press TAB
to get the symbol Ω
in the example above. This autocompletion works in various text editors, including the VSCode editor with the Julia extension.
Manipulating data
Table interface
Our geospatial data implements the Tables.jl interface, which means that they can be accessed as if they were tables with samples in the rows and variables in the columns. In this case, a special column named geometry
is created on the fly, row by row, containing Meshes.jl geometries.
For those familiar with the productive DataFrames.jl interface, there is nothing new:
Ω[1,:]
(Z = 2.9983341664682817, geometry = Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)))
Ω[1,:geometry]
Quadrangle
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 1.0 m)
└─ Point(x: 0.0 m, y: 1.0 m)
Ω.Z
2500-element Vector{Float64}:
2.9983341664682817
3.9866933079506124
4.955202066613396
5.894183423086505
6.79425538604203
7.6464247339503535
8.44217687237691
9.173560908995228
9.833269096274833
10.414709848078965
⋮
91.28424227586412
90.83834063250545
90.48397926110484
90.22469882334903
90.06308996366536
90.000767424359
90.0383539116416
90.17547387375667
90.41075725336862
However, notice that our implementation performs some clever optimizations behind the scenes to avoid expensive creation of geometries:
Ω.geometry
50×50 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 50.0 m, y: 50.0 m)
└─ spacing: (1.0 m, 1.0 m)
We can always retrieve the table of attributes (or features) with the function values
and the underlying geospatial domain with the function domain
. This can be useful for writing algorithms that are type-stable and depend purely on the feature values:
values(Ω)
(Z = [2.9983341664682817, 3.9866933079506124, 4.955202066613396, 5.894183423086505, 6.79425538604203, 7.6464247339503535, 8.44217687237691, 9.173560908995228, 9.833269096274833, 10.414709848078965 … 91.8172288893559, 91.28424227586412, 90.83834063250545, 90.48397926110484, 90.22469882334903, 90.06308996366536, 90.000767424359, 90.0383539116416, 90.17547387375667, 90.41075725336862],)
or on the geometries:
domain(Ω)
50×50 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 50.0 m, y: 50.0 m)
└─ spacing: (1.0 m, 1.0 m)
Geospatial transforms
It is easy to design advanced geospatial pipelines that operate on both the table of features and the underlying geospatial domain:
# pipeline with table transforms
pipe = Quantile() → StdCoords()
# feed geospatial data to pipeline
Ω̂ = pipe(Ω)
# plot distribution before and after pipeline
fig = Mke.Figure(size = (800, 400))
Mke.hist(fig[1,1], Ω.Z, color = :gray)
Mke.hist(fig[2,1], Ω̂.Z, color = :gray)
fig
Coordinates before pipeline:
boundingbox(Ω.geometry)
Box
├─ min: Point(x: 0.0 m, y: 0.0 m)
└─ max: Point(x: 50.0 m, y: 50.0 m)
and after pipeline:
boundingbox(Ω̂.geometry)
Box
├─ min: Point(x: -0.5 m, y: -0.5 m)
└─ max: Point(x: 0.5 m, y: 0.5 m)
These pipelines are revertible meaning that we can transform the data, perform geostatistical modeling, and revert the pipelines to obtain estimates in the original sample space (see Transforms).
Geospatial queries
We provide three macros @groupby
, @transform
and @combine
for powerful geospatial split-apply-combine patterns, as well as the function geojoin
for advanced geospatial joins.
@transform(Ω, :W = 2 * :Z * area(:geometry))
Z | W | geometry |
---|---|---|
Continuous | Continuous | Quadrangle |
[NoUnits] | [m^2] | 🖈 Cartesian{NoDatum} |
2.99833 | 5.99667 m^2 | Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)) |
3.98669 | 7.97339 m^2 | Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m)) |
4.9552 | 9.9104 m^2 | Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m)) |
5.89418 | 11.7884 m^2 | Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m)) |
6.79426 | 13.5885 m^2 | Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m)) |
7.64642 | 15.2928 m^2 | Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m)) |
8.44218 | 16.8844 m^2 | Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m)) |
9.17356 | 18.3471 m^2 | Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m)) |
9.83327 | 19.6665 m^2 | Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m)) |
10.4147 | 20.8294 m^2 | Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m)) |
⋮ | ⋮ | ⋮ |
These are very useful for geospatial data science as they hide the complexity of the geometry
column. For more information, check the Queries section of the documentation.
Geostatistical modeling
Having defined the geospatial data objects, we proceed and define the geostatistical learning model. Let's assume that we have geopatial data with some variable that we want to predict in a supervised learning setting. We load the data from a CSV file, and inspect the available columns:
using CSV
using DataFrames
table = CSV.File("data/agriculture.csv") |> DataFrame
first(table, 5)
Row | band1 | band2 | band3 | band4 | crop | polygon | x | y |
---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Float64 | |
1 | 121 | 91 | 70 | 384 | 1 | 222.0 | 746.0 | 717.0 |
2 | 112 | 82 | 58 | 336 | 2 | 264.0 | 953.0 | 291.0 |
3 | 120 | 86 | 55 | 389 | 1 | 363.0 | 1050.0 | 796.0 |
4 | 125 | 92 | 78 | 255 | 2 | 295.0 | 787.0 | 921.0 |
5 | 117 | 77 | 59 | 308 | 1 | 347.0 | 947.0 | 758.0 |
Columns band1
, band2
, ..., band4
represent four satellite bands for different locations (x, y)
in this region. The column crop
has the crop type for each location that was labeled manually with the purpose of fitting a learning model.
We can now georeference the table and plot some of the variables:
Ω = georef(table, (:x, :y))
fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω.geometry, color = Ω.band4)
viz(fig[1,2], Ω.geometry, color = Ω.crop)
fig
Similar to a generic statistical learning workflow, we split the data into "train" and "test" sets. The main difference here is that our geospatial geosplit
function accepts a separating plane specified by its normal direction (1,-1)
:
Ωs, Ωt = geosplit(Ω, 0.2, (1.0, -1.0))
viz(Ωs.geometry)
viz!(Ωt.geometry, color = :gray90)
Mke.current_figure()
We can visualize the domain of the "train" (or source) set Ωs in blue, and the domain of the "test" (or target) set Ωt in gray. We reserved 20% of the samples to Ωs and 80% to Ωt. Internally, this geospatial geosplit
function is implemented in terms of efficient geospatial partitions.
Let's define our geostatistical learning model to predict the crop type based on the four satellite bands. We will use the DecisionTreeClassifier
model, which is suitable for the task we want to perform. Any model from the StatsLeanModels.jl model is supported, including all models from ScikitLearn.jl:
feats = [:band1, :band2, :band3, :band4]
label = :crop
model = DecisionTreeClassifier()
DecisionTreeClassifier
max_depth: -1
min_samples_leaf: 1
min_samples_split: 2
min_purity_increase: 0.0
pruning_purity_threshold: 1.0
n_subfeatures: 0
classes: nothing
root: nothing
We will fit the model in Ωs where the features and labels are available and predict in Ωt where the features are available. The Learn
transform automatically fits the model to the data:
learn = Learn(Ωs, model, feats => label)
Learn transform
├─ model: FittedModel{DecisionTreeClassifier}
└─ input: [:band1, :band2, :band3, :band4]
The transform can be called with new data to generate predictions:
Ω̂t = learn(Ωt)
crop | geometry |
---|---|
Categorical | Point |
[NoUnits] | 🖈 Cartesian{NoDatum} |
1 | (x: 746.0 m, y: 717.0 m) |
2 | (x: 953.0 m, y: 291.0 m) |
2 | (x: 1050.0 m, y: 796.0 m) |
2 | (x: 787.0 m, y: 921.0 m) |
2 | (x: 947.0 m, y: 758.0 m) |
2 | (x: 949.0 m, y: 769.0 m) |
2 | (x: 773.0 m, y: 899.0 m) |
2 | (x: 851.0 m, y: 895.0 m) |
3 | (x: 1018.0 m, y: 748.0 m) |
1 | (x: 427.0 m, y: 743.0 m) |
⋮ | ⋮ |
Scientific visualization
We note that the prediction of a geostatistical learning model is a geospatial data object, and we can inspect it with the same methods already described above. This also means that we can visualize the prediction directly, side by side with the true label in this synthetic example:
fig = Mke.Figure(size = (800, 400))
viz(fig[1,1], Ω̂t.geometry, color = Ω̂t.crop)
viz(fig[1,2], Ωt.geometry, color = Ωt.crop)
fig
With this example we conclude the basic workflow. To get familiar with other features of the project, please check the the reference guide.