8  Split-apply-combine

In Part II of the book, we introduced transform pipelines to process the values and the domain of geotables using high-level abstractions that preserve geospatial information. In this chapter, we start to introduce other tools to query geotables after they have been pre-processed by pipelines.

8.1 Motivation

In geospatial data science, geoscientific questions are often posed in terms of both the values and the domain of a geotable. For example:

  1. Where are the areas with high probability of landslide?
  2. What is the average rainfall per watershed over the last year?
  3. How much lithium will be mined from each geological unit?
  4. What is the variation of log-permeability per depositional facies?

The word “where” is often present in these questions to indicate that answers must be georeferenced. If the variable of interest is already present in the geotable, then we can effectively answer “where” questions using the viewer and the Filter transform from previous chapters:

geotable |> Filter(row -> row.probability > 0.9) |> viewer

If the variable of interest is not present in the geotable, or if the “where” word is not present in the original question, then there will be some reference to “geospatial units” on which geostatistics must be computed. These questions can be answered with a geospatial version of the split-apply-combine strategy (Wickham 2011) from data science. Our framework provides the @groupby, @transform and @combine macros to split-apply-combine geotables.

8.2 Bonnie data set

We will use the Bonnie data set to illustrate our geospatial split-apply-combine:

The Bonnie Project Example is under copyright of Transmin Metallurgical Consultants, 2019. It is issued under the Creative Commons Attribution-ShareAlike 4.0 International Public License.

using CSV

gt = georef(CSV.File("data/bonnie.csv"), (:EAST, :NORTH, :RL))
19618×9 GeoTable over 19618 PointSet{3,Float64}
Auppm Agppm Cuppm Asppm Sper CODE OX ISBD geometry
Continuous Continuous Continuous Continuous Continuous Categorical Categorical Continuous Point3
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits]
0.407 2.335 1311.0 27.0 0.01 C1 TR1 2.42 (243940.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 0.01 C1 TR1 1.98 (243945.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 0.01 C1 TR1 2.068 (243950.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 0.0 C1 TR1 2.222 (243965.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 0.0 C1 TR1 2.156 (243970.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 0.0 C1 OX1 2.08 (243960.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 0.0 C1 OX1 2.06 (243965.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 0.0 C1 OX1 1.9 (243970.0, 1.00227e6, 1855.0)
0.42152 1.422 852.36 22.85 2.4078 C1 TR1 2.024 (243930.0, 1.00222e6, 1860.0)
0.41747 1.429 832.36 21.87 1.9388 C1 TR1 2.112 (243935.0, 1.00222e6, 1860.0)

It represents a 3D mineral deposit with grades in parts per million (ppm), sulfur contaminant in percent, and other categorical variables with geological and lithological units:

names(gt)
9-element Vector{String}:
 "Auppm"
 "Agppm"
 "Cuppm"
 "Asppm"
 "Sper"
 "CODE"
 "OX"
 "ISBD"
 "geometry"

The “EAST”, “NORTH” and “RL” coordinates used to georef the CSV file represent the centroids of the mining blocks. The sides of these blocks are provided as metadata (e.g., 5x5x5):

gt |> viewer
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/ND0gA/src/scenes.jl:220

Let’s clean the geotable using what we learned in previous chapters. We will reject the column with sulfur, will drop missing values, and will rename the variables for greater readability:

clean = Reject("Sper") 
        DropMissing() 
        Rename("Auppm" => "Au", "Agppm" => "Ag",
               "Cuppm" => "Cu", "Asppm" => "As",
               "CODE" => "geo", "OX" => "litho",
               "ISBD" => "ρ")

gt = gt |> clean
19548×8 GeoTable over 19548 view(::PointSet{3,Float64}, [1, 2, 3, 4, ..., 19615, 19616, 19617, 19618])
Au Ag Cu As geo litho ρ geometry
Continuous Continuous Continuous Continuous Categorical Categorical Continuous Point3
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits]
0.407 2.335 1311.0 27.0 C1 TR1 2.42 (243940.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 C1 TR1 1.98 (243945.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 C1 TR1 2.068 (243950.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 TR1 2.222 (243965.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 TR1 2.156 (243970.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 2.08 (243960.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 2.06 (243965.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 1.9 (243970.0, 1.00227e6, 1855.0)
0.42152 1.422 852.36 22.85 C1 TR1 2.024 (243930.0, 1.00222e6, 1860.0)
0.41747 1.429 832.36 21.87 C1 TR1 2.112 (243935.0, 1.00222e6, 1860.0)

That is a lot better! Let’s assume that we want to answer the following business question:

What is the total mass of gold that will be mined from each geological unit?

8.3 Splitting geotables

We can split geotables into lazy geospatial partitions based on values stored in a column. In this case, we want to split the mineral deposit in geological units stored in the geo column:

groups = @groupby(gt, "geo")
2 Partition
├─ 17153×8 SubGeoTable over 17153 view(::PointSet{3,Float64}, [1, 2, 3, 4, ..., 17203, 17204, 17205, 17206])
└─ 2395×8 SubGeoTable over 2395 view(::PointSet{3,Float64}, [17207, 17208, 17209, 17210, ..., 19615, 19616, 19617, 19618])
metadata: rows, names

There are two geological units in this deposit, represented as SubGeoTable. We can access these units by indexing into the geospatial partition:

groups[1]
17153×8 SubGeoTable over 17153 view(::PointSet{3,Float64}, [1, 2, 3, 4, ..., 17203, 17204, 17205, 17206])
Au Ag Cu As geo litho ρ geometry
Continuous Continuous Continuous Continuous Categorical Categorical Continuous Point3
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits]
0.407 2.335 1311.0 27.0 C1 TR1 2.42 (243940.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 C1 TR1 1.98 (243945.0, 1.00223e6, 1855.0)
0.407 2.335 1311.0 27.0 C1 TR1 2.068 (243950.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 TR1 2.222 (243965.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 TR1 2.156 (243970.0, 1.00226e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 2.08 (243960.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 2.06 (243965.0, 1.00227e6, 1855.0)
0.385 2.1 572.0 0.0 C1 OX1 1.9 (243970.0, 1.00227e6, 1855.0)
0.42152 1.422 852.36 22.85 C1 TR1 2.024 (243930.0, 1.00222e6, 1860.0)
0.41747 1.429 832.36 21.87 C1 TR1 2.112 (243935.0, 1.00222e6, 1860.0)
viz(groups[1].geometry, color = "teal")
viz!(groups[2].geometry, color = "slategray3")
Mke.current_figure()
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/ND0gA/src/scenes.jl:220

8.4 Applying expressions

The mass of gold in a mining block is a function of the gold grade (Au), the rock density (ρ), and the volume of the block:

\[ m = Au \times \rho \times V \]

Let’s use an auxiliary function to convert the Points in the geometry column into Boxes of sides 5x5x5. The function takes a centroid point as input and produces a box centered at the point with corners that are 2.5x2.5x2.5 units away in both directions:

box(point) = Box(point - Vec(2.5, 2.5, 2.5), point + Vec(2.5, 2.5, 2.5))
box (generic function with 1 method)
box.(gt.geometry)
19548-element Vector{Box{3, Float64}}:
 Box(min: (2.43938e5, 1.00223e6, 1852.5), max: (2.43942e5, 1.00223e6, 1857.5))
 Box(min: (2.43942e5, 1.00223e6, 1852.5), max: (2.43948e5, 1.00223e6, 1857.5))
 Box(min: (2.43948e5, 1.00226e6, 1852.5), max: (2.43952e5, 1.00226e6, 1857.5))
 Box(min: (2.43962e5, 1.00226e6, 1852.5), max: (2.43968e5, 1.00226e6, 1857.5))
 Box(min: (2.43968e5, 1.00226e6, 1852.5), max: (2.43972e5, 1.00226e6, 1857.5))
 Box(min: (2.43958e5, 1.00227e6, 1852.5), max: (2.43962e5, 1.00227e6, 1857.5))
 Box(min: (2.43962e5, 1.00227e6, 1852.5), max: (2.43968e5, 1.00227e6, 1857.5))
 Box(min: (2.43968e5, 1.00227e6, 1852.5), max: (2.43972e5, 1.00227e6, 1857.5))
 Box(min: (2.43928e5, 1.00222e6, 1857.5), max: (2.43932e5, 1.00222e6, 1862.5))
 Box(min: (2.43932e5, 1.00222e6, 1857.5), max: (2.43938e5, 1.00222e6, 1862.5))
 Box(min: (2.43938e5, 1.00222e6, 1857.5), max: (2.43942e5, 1.00222e6, 1862.5))
 Box(min: (2.43942e5, 1.00222e6, 1857.5), max: (2.43948e5, 1.00222e6, 1862.5))
 Box(min: (2.43912e5, 1.00223e6, 1857.5), max: (2.43918e5, 1.00223e6, 1862.5))
 ⋮
 Box(min: (2.43782e5, 1.0022e6, 2027.5), max: (2.43788e5, 1.0022e6, 2032.5))
 Box(min: (2.43788e5, 1.0022e6, 2027.5), max: (2.43792e5, 1.0022e6, 2032.5))
 Box(min: (2.43792e5, 1.0022e6, 2027.5), max: (2.43798e5, 1.0022e6, 2032.5))
 Box(min: (2.43752e5, 1.00221e6, 2027.5), max: (2.43758e5, 1.00221e6, 2032.5))
 Box(min: (2.43758e5, 1.00221e6, 2027.5), max: (2.43762e5, 1.00221e6, 2032.5))
 Box(min: (2.43762e5, 1.00221e6, 2027.5), max: (2.43768e5, 1.00221e6, 2032.5))
 Box(min: (2.43768e5, 1.00221e6, 2027.5), max: (2.43772e5, 1.00221e6, 2032.5))
 Box(min: (2.43772e5, 1.00221e6, 2027.5), max: (2.43778e5, 1.00221e6, 2032.5))
 Box(min: (2.43778e5, 1.00221e6, 2027.5), max: (2.43782e5, 1.00221e6, 2032.5))
 Box(min: (2.43782e5, 1.00221e6, 2027.5), max: (2.43788e5, 1.00221e6, 2032.5))
 Box(min: (2.43788e5, 1.00221e6, 2027.5), max: (2.43792e5, 1.00221e6, 2032.5))
 Box(min: (2.43792e5, 1.00221e6, 2027.5), max: (2.43798e5, 1.00221e6, 2032.5))

The @transform macro modifies or creates new columns in the geotable based on expressions with existing column names. In this case, we want to replace the geometry column by calling the auxiliary function above:

gt = @transform(gt, :geometry = box(:geometry))
19548×8 GeoTable over 19548 GeometrySet{3,Float64}
Au Ag Cu As geo litho ρ geometry
Continuous Continuous Continuous Continuous Categorical Categorical Continuous Box
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits]
0.407 2.335 1311.0 27.0 C1 TR1 2.42 Box(min: (2.43938e5, 1.00223e6, 1852.5), max: (2.43942e5, 1.00223e6, 1857.5))
0.407 2.335 1311.0 27.0 C1 TR1 1.98 Box(min: (2.43942e5, 1.00223e6, 1852.5), max: (2.43948e5, 1.00223e6, 1857.5))
0.407 2.335 1311.0 27.0 C1 TR1 2.068 Box(min: (2.43948e5, 1.00226e6, 1852.5), max: (2.43952e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 TR1 2.222 Box(min: (2.43962e5, 1.00226e6, 1852.5), max: (2.43968e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 TR1 2.156 Box(min: (2.43968e5, 1.00226e6, 1852.5), max: (2.43972e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 2.08 Box(min: (2.43958e5, 1.00227e6, 1852.5), max: (2.43962e5, 1.00227e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 2.06 Box(min: (2.43962e5, 1.00227e6, 1852.5), max: (2.43968e5, 1.00227e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 1.9 Box(min: (2.43968e5, 1.00227e6, 1852.5), max: (2.43972e5, 1.00227e6, 1857.5))
0.42152 1.422 852.36 22.85 C1 TR1 2.024 Box(min: (2.43928e5, 1.00222e6, 1857.5), max: (2.43932e5, 1.00222e6, 1862.5))
0.41747 1.429 832.36 21.87 C1 TR1 2.112 Box(min: (2.43932e5, 1.00222e6, 1857.5), max: (2.43938e5, 1.00222e6, 1862.5))

The macro will broadcast the expression to all rows of the geotable.

Note

We used the :geometry symbol to refer to the geometry column instead of the usual "geometry" string. The @transform macro understands that strings can also appear as valid values in the right-hand-side of the expression, which are not columns in the geotable. To mark a string as a column name, we need to use curly braces:

@transform(gt, {"geometry"} = box({"geometry"}))
Note

Alternatively, we could have used the following syntax to update the geometry column:

gt.geometry = box.(gt.geometry)

Currently, this syntax does not work with other columns.

Tip for all users

The use of symbols to represent column names is preferred in macros.

The mass of gold on each mining block can be computed now that the geometries have volume:

@transform(gt, :m = :Au * :ρ * volume(:geometry))
19548×9 GeoTable over 19548 GeometrySet{3,Float64}
Au Ag Cu As geo litho ρ m geometry
Continuous Continuous Continuous Continuous Categorical Categorical Continuous Continuous Box
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits]
0.407 2.335 1311.0 27.0 C1 TR1 2.42 123.117 Box(min: (2.43938e5, 1.00223e6, 1852.5), max: (2.43942e5, 1.00223e6, 1857.5))
0.407 2.335 1311.0 27.0 C1 TR1 1.98 100.732 Box(min: (2.43942e5, 1.00223e6, 1852.5), max: (2.43948e5, 1.00223e6, 1857.5))
0.407 2.335 1311.0 27.0 C1 TR1 2.068 105.209 Box(min: (2.43948e5, 1.00226e6, 1852.5), max: (2.43952e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 TR1 2.222 106.934 Box(min: (2.43962e5, 1.00226e6, 1852.5), max: (2.43968e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 TR1 2.156 103.758 Box(min: (2.43968e5, 1.00226e6, 1852.5), max: (2.43972e5, 1.00226e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 2.08 100.1 Box(min: (2.43958e5, 1.00227e6, 1852.5), max: (2.43962e5, 1.00227e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 2.06 99.1375 Box(min: (2.43962e5, 1.00227e6, 1852.5), max: (2.43968e5, 1.00227e6, 1857.5))
0.385 2.1 572.0 0.0 C1 OX1 1.9 91.4375 Box(min: (2.43968e5, 1.00227e6, 1852.5), max: (2.43972e5, 1.00227e6, 1857.5))
0.42152 1.422 852.36 22.85 C1 TR1 2.024 106.645 Box(min: (2.43928e5, 1.00222e6, 1857.5), max: (2.43932e5, 1.00222e6, 1862.5))
0.41747 1.429 832.36 21.87 C1 TR1 2.112 110.212 Box(min: (2.43932e5, 1.00222e6, 1857.5), max: (2.43938e5, 1.00222e6, 1862.5))
Note

The @transform macro can be used with both geotables and geospatial partitions.

8.5 Combining results

We can use the @combine macro to reduce columns of geotables in a geospatial partition obtained with the @groupby macro. The macro is similar to the @transform macro, but expects valid reduction functions such as sum, mean, std. Reduction functions take a vector of values as input and produce a single scalar as output:

groups = @groupby(gt, :geo)

@combine(groups, :μ = mean(:Au), :σ = std(:Au))
2×4 GeoTable over 2 GeometrySet{3,Float64}
geo μ σ geometry
Categorical Continuous Continuous Multi
[NoUnits] [NoUnits] [NoUnits]
C1 0.581937 0.136264 Multi(17153×Box)
C2 0.484622 0.095799 Multi(2395×Box)

Note that the macro reduces the geometry column and produces a new complex Multi geometry with all the Boxes that are inside each geological unit. This is a very advanced feature of the framework that cannot be represented with the simple features standard. It is also possible to use a custom reduction function for the geometry column:

groups = @groupby(gt, :geo)

@combine(groups, :μ = mean(:Au), :σ = std(:Au), :geometry = first(:geometry))
2×4 GeoTable over 2 GeometrySet{3,Float64}
geo μ σ geometry
Categorical Continuous Continuous Box
[NoUnits] [NoUnits] [NoUnits]
C1 0.581937 0.136264 Box(min: (2.43938e5, 1.00223e6, 1852.5), max: (2.43942e5, 1.00223e6, 1857.5))
C2 0.484622 0.095799 Box(min: (2.43792e5, 1.00214e6, 1872.5), max: (2.43798e5, 1.00214e6, 1877.5))

8.6 Answering questions

Let’s recall our original business question:

What is the total mass of gold that will be mined from each geological unit?

We can now answer this question with three lines of code:

groups = @groupby(gt, :geo)

mass = @transform(groups, :m = :Au * :ρ * volume(:geometry))

answer = @combine(mass, :m = sum(:m))
2×3 GeoTable over 2 GeometrySet{3,Float64}
geo m geometry
Categorical Continuous Multi
[NoUnits] [NoUnits]
C1 2.53302e6 Multi(17153×Box)
C2 3.29241e5 Multi(2395×Box)
Tip for all users

We can simplify the code further using the @chain macro. It forwards the resulting geotable or geospatial partition to the next macro:

@chain gt begin
  @groupby(:geo)
  @transform(:m = :Au * :ρ * volume(:geometry))
  @combine(:m = sum(:m))
end
2×3 GeoTable over 2 GeometrySet{3,Float64}
geo m geometry
Categorical Continuous Multi
[NoUnits] [NoUnits]
C1 2.53302e6 Multi(17153×Box)
C2 3.29241e5 Multi(2395×Box)