7  Building pipelines

In previous chapters, we learned a large number of transforms for manipulating and processing geotables. In all those code examples, we used Julia’s pipe operator |> to apply the transform and send the resulting geotable to the next transform:

geotable |> transform1 |> transform2 |> ... |> viewer

In this chapter, we will learn two new powerful operators → and ⊔ provided by the framework to combine transforms into pipelines that can be optimized and reused with different geotables.

7.1 Motivation

The pipe operator |> in Julia is very convenient for sequential application of functions. Given an input x, we can type x |> f1 |> f2 to apply functions f1 and f2 in sequence, in a way that is equivalent to f2(f1(x)) or, alternatively, to the function composition (f2 ∘ f1)(x). Its syntax can drastically improve code readability when the number of functions is large. However, the operator has a major limitation in the context of geospatial data science: it evaluates all intermediate results as soon as the data is inserted in the pipe. This is known in computer science as eager evaluation.

Taking the expression above as an example, the operator will first evaluate f1(x) and store the result in a variable y. After f1 is completed, the operator evaluates f2(y) and produces the final (desired) result. If y requires a lot of computer memory, as is usually the case with large geotables, the application of the pipeline will be slow.
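
As a small illustration of eager evaluation in plain Julia, the sketch below uses two placeholder functions f1 and f2 (arbitrary choices made for this example, not part of the framework). It shows that the three spellings agree, and that the intermediate result y is fully materialized before f2 runs:

```julia
# Placeholder functions chosen only for illustration
f1(x) = x .^ 2      # squares every element, allocating a new vector
f2(x) = sum(x)      # reduces the vector to a scalar

x = [1.0, 2.0, 3.0]

# The three spellings below are equivalent
r1 = x |> f1 |> f2
r2 = f2(f1(x))
r3 = (f2 ∘ f1)(x)

# Eager evaluation: the intermediate vector y is allocated in full
# before f2 starts, and it is as large as the input x
y = f1(x)
r4 = f2(y)
```

With a large geotable in place of x, every intermediate result like y would be a full copy of the data.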

Another evaluation strategy, known as lazy evaluation, consists of building the entire pipeline without the data in it. The major advantage of this strategy is that it can analyze the functions, and potentially simplify the code before evaluation. For example, the pipeline cos → acos can be replaced by the much simpler pipeline identity for some values of the input x.
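
The cos → acos example can be checked numerically in plain Julia. The sketch below builds the composition with Base’s ∘ operator before any data exists (only a rough analogy to the framework’s operators), and then compares it with identity on a few inputs. The test values are arbitrary; the replacement holds for inputs in the interval [0, π] and fails outside of it:

```julia
# Build the pipeline without any data; nothing is evaluated here
pipeline = acos ∘ cos

# Inside [0, π] the pipeline behaves like `identity`
inside = [0.0, 0.5, 1.0, 3.0]
matches = all(isapprox(pipeline(x), x; atol=1e-12) for x in inside)

# Outside [0, π] it does not: acos(cos(4.0)) ≈ 2π - 4.0, not 4.0
outside = pipeline(4.0)
```

This is why the simplification is only valid "for some values" of x: an optimizer needs to know the range of the input before it can swap the pipeline for identity.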

7.2 Operator →

In our framework, the operator → (\to) can be used in place of the pipe operator to build lazy sequential pipelines of transforms. Consider the synthetic data from previous chapters:

N = 10000
a = [2randn(N÷2) .+ 6; randn(N÷2)]
b = [3randn(N÷2); 2randn(N÷2)]
c = randn(N)
d = c .+ 0.6randn(N)

table = (; a, b, c, d)

gt = georef(table, CartesianGrid(100, 100))
10000×5 GeoTable over 100×100 CartesianGrid
a b c d geometry
Continuous Continuous Continuous Continuous Quadrangle
[NoUnits] [NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
9.44817 4.40422 1.47458 0.577574 Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
3.73836 1.30623 -0.177942 -0.913577 Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
8.54674 1.59586 -0.808086 -2.06557 Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
3.5269 -8.23821 0.336019 -1.03262 Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
4.57009 -1.27253 -0.759468 -0.543913 Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
4.44248 -3.18144 -1.30153 -1.53231 Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
7.24252 1.68168 -0.351576 -0.985976 Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
2.12661 -7.07115 -1.01656 -1.1899 Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
7.14357 -0.519449 0.506755 1.15134 Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
2.28069 4.2503 1.13103 0.478313 Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮ ⋮ ⋮

And suppose that we are interested in converting the columns “a”, “b” and “c” of the geotable with the Quantile transform. Instead of creating the intermediate geotable with the Select transform, and then sending the result to the Quantile transform, we can create the entire pipeline without reference to the data:

pipeline = Select("a", "b", "c") → Quantile()
SequentialTransform
├─ Select(selector: [:a, :b, :c], newnames: nothing)
└─ Quantile(selector: all, dist: Distributions.Normal{Float64}(μ=0.0, σ=1.0))

The operator → creates a special SequentialTransform, which can be applied like any other transform in the framework:

gt |> pipeline
10000×4 GeoTable over 100×100 CartesianGrid
a b c geometry
Continuous Continuous Continuous Quadrangle
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
2.03551 1.75185 1.47953 Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
0.159341 0.538546 -0.174847 Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
1.60999 0.657593 -0.82495 Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
0.132233 -2.80703 0.335033 Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
0.293421 -0.527855 -0.771856 Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
0.26683 -1.2652 -1.33768 Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
1.09162 0.691263 -0.348054 Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
0.0127842 -2.5364 -1.03858 Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
1.05287 -0.21265 0.494717 Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
0.0255704 1.69015 1.14069 Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮ ⋮

The SequentialTransform will perform optimizations whenever possible. For instance, we know a priori that adding the Identity transform anywhere in the pipeline doesn’t have any effect, so it is dropped:

pipeline → Identity()
SequentialTransform
├─ Select(selector: [:a, :b, :c], newnames: nothing)
└─ Quantile(selector: all, dist: Distributions.Normal{Float64}(μ=0.0, σ=1.0))
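
To make this kind of simplification concrete, here is a minimal sketch in plain Julia. The names LazyPipeline and then are hypothetical and do not reflect how the framework implements its operators; the sketch only shows that a pipeline stored as a list of steps can drop a no-op step before any data is processed:

```julia
# Hypothetical lazy pipeline: it stores the steps and runs nothing
# until it is called with data
struct LazyPipeline
    steps::Vector{Function}
end

# Append a step, skipping `identity` because it never changes the result
then(p::LazyPipeline, f::Function) =
    f === identity ? p : LazyPipeline(push!(copy(p.steps), f))

# Calling the pipeline feeds the data through every stored step in order
(p::LazyPipeline)(x) = foldl((acc, f) -> f(acc), p.steps; init=x)

p = then(then(LazyPipeline(Function[]), abs), sqrt)
q = then(p, identity)   # same two steps as p, the no-op was dropped

result = q(-16.0)       # abs first, then sqrt
```

Because the steps are plain data until the pipeline is called, this inspection costs nothing at evaluation time.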

7.3 Operator ⊔

The operator ⊔ (\sqcup) can be used to create lazy parallel transforms. There is no equivalent in Julia as this operator is very specific to tables. It combines the geotables produced by two or more pipelines into a single geotable with the disjoint union of all columns.

Let’s illustrate this concept with two pipelines:

pipeline1 = Select("a") → Indicator("a", k=3)
SequentialTransform
├─ Select(selector: [:a], newnames: nothing)
└─ Indicator(selector: :a, k: 3, scale: :quantile, categ: false)
pipeline2 = Select("b", "c", "d") → PCA(maxdim=2)
SequentialTransform
├─ Select(selector: [:b, :c, :d], newnames: nothing)
├─ ZScore(selector: all)
└─ EigenAnalysis(proj: :V, maxdim: 2, pratio: 1.0)

The first pipeline creates 3 indicator variables from variable “a”:

gt |> pipeline1
10000×4 GeoTable over 100×100 CartesianGrid
a_1 a_2 a_3 geometry
Categorical Categorical Categorical Quadrangle
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
false false true Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
false true true Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
false false true Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
false true true Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
false true true Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
false true true Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
false false true Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
false true true Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
false false true Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
false true true Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮ ⋮

The second pipeline runs principal component analysis with variables “b”, “c” and “d” and produces 2 principal components:

gt |> pipeline2
10000×3 GeoTable over 100×100 CartesianGrid
PC1 PC2 geometry
Continuous Continuous Quadrangle
[NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
1.40936 1.71802 Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
-0.688262 0.525308 Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
-1.83925 0.650102 Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
-0.432553 -3.21449 Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
-0.885181 -0.482846 Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
-1.88198 -1.2205 Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
-0.854437 0.673923 Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
-1.48507 -2.74755 Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
1.05385 -0.206995 Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
1.10403 1.66065 Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮

We can combine the two pipelines into a single pipeline that executes in parallel:

pipeline = pipeline1 ⊔ pipeline2
ParallelTableTransform
├─ SequentialTransform
│  ├─ Select(selector: [:a], newnames: nothing)
│  └─ Indicator(selector: :a, k: 3, scale: :quantile, categ: false)
└─ SequentialTransform
   ├─ Select(selector: [:b, :c, :d], newnames: nothing)
   ├─ ZScore(selector: all)
   └─ EigenAnalysis(proj: :V, maxdim: 2, pratio: 1.0)
gt |> pipeline
10000×6 GeoTable over 100×100 CartesianGrid
a_1 a_2 a_3 PC1 PC2 geometry
Categorical Categorical Categorical Continuous Continuous Quadrangle
[NoUnits] [NoUnits] [NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
false false true 1.40936 1.71802 Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
false true true -0.688262 0.525308 Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
false false true -1.83925 0.650102 Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
false true true -0.432553 -3.21449 Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
false true true -0.885181 -0.482846 Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
false true true -1.88198 -1.2205 Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
false false true -0.854437 0.673923 Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
false true true -1.48507 -2.74755 Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
false false true 1.05385 -0.206995 Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
false true true 1.10403 1.66065 Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮ ⋮ ⋮ ⋮

All 5 columns are present in the final geotable.
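
Conceptually, the disjoint union of columns resembles merging two column tables that have different column names. The sketch below uses plain named tuples and Base’s merge as a stand-in; the functions branch1, branch2 and parallel are hypothetical names written for this example and say nothing about the framework’s internals:

```julia
# A tiny column table represented as a named tuple of vectors
table = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Two hypothetical branches that produce disjoint sets of columns
branch1(t) = (a_pos = t.a .> 1.5,)
branch2(t) = (b_double = 2 .* t.b,)

# Apply every branch to the same input and merge the resulting columns;
# the branches do not depend on each other, so they could run concurrently
parallel(branches...) = t -> merge((f(t) for f in branches)...)

combined = parallel(branch1, branch2)(table)
```

Each branch only sees the original table, which is what distinguishes this from a sequential pipeline where each step sees the output of the previous one.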

7.4 Revertibility

An important concept related to pipelines that is very useful in geospatial data science is revertibility. The concept is useful whenever we need to answer geoscientific questions in terms of variables that have been transformed for geostatistical analysis.

Let’s illustrate the concept with the following geotable and pipeline:

a = [-1.0, 4.0, 1.6, 3.4]
b = [1.6, 3.4, -1.0, 4.0]
c = [3.4, 2.0, 3.6, -1.0]
table = (; a, b, c)

geotable = georef(table, [(0, 0), (1, 0), (1, 1), (0, 1)])
4×4 GeoTable over 4 PointSet
a b c geometry
Continuous Continuous Continuous Point
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
-1.0 1.6 3.4 (x: 0.0 m, y: 0.0 m)
4.0 3.4 2.0 (x: 1.0 m, y: 0.0 m)
1.6 -1.0 3.6 (x: 1.0 m, y: 1.0 m)
3.4 4.0 -1.0 (x: 0.0 m, y: 1.0 m)
pipeline = Center()
Center transform
└─ selector: all

We saw that our pipelines can be evaluated with Julia’s pipe operator:

geotable |> pipeline
4×4 GeoTable over 4 PointSet
a b c geometry
Continuous Continuous Continuous Point
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
-3.0 -0.4 1.4 (x: 0.0 m, y: 0.0 m)
2.0 1.4 0.0 (x: 1.0 m, y: 0.0 m)
-0.4 -3.0 1.6 (x: 1.0 m, y: 1.0 m)
1.4 2.0 -3.0 (x: 0.0 m, y: 1.0 m)

In order to revert a pipeline, however, we need to save the auxiliary constants that were used to transform the data (e.g., the mean of the selected columns). The apply function serves this purpose:

newtable, cache = apply(pipeline, geotable)

newtable
4×4 GeoTable over 4 PointSet
a b c geometry
Continuous Continuous Continuous Point
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
-3.0 -0.4 1.4 (x: 0.0 m, y: 0.0 m)
2.0 1.4 0.0 (x: 1.0 m, y: 0.0 m)
-0.4 -3.0 1.6 (x: 1.0 m, y: 1.0 m)
1.4 2.0 -3.0 (x: 0.0 m, y: 1.0 m)

The function produces the new geotable as usual and an additional cache with all the information needed to revert the transforms in the pipeline. We say that a pipeline is revertible (isrevertible) if there is an efficient way to revert its transforms starting from any geotable that has the same schema as the geotable produced by the apply function:

isrevertible(pipeline)
true
revert(pipeline, newtable, cache)
4×4 GeoTable over 4 PointSet
a b c geometry
Continuous Continuous Continuous Point
[NoUnits] [NoUnits] [NoUnits] 🖈 Cartesian{NoDatum}
-1.0 1.6 3.4 (x: 0.0 m, y: 0.0 m)
4.0 3.4 2.0 (x: 1.0 m, y: 0.0 m)
1.6 -1.0 3.6 (x: 1.0 m, y: 1.0 m)
3.4 4.0 -1.0 (x: 0.0 m, y: 1.0 m)
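
The role of the cache can be sketched in plain Julia for a single column. The functions center_apply and center_revert below are hypothetical stand-ins written for illustration, not the framework’s apply and revert; for centering, the only auxiliary constant that needs saving is the mean:

```julia
using Statistics: mean

# Center a column and return the mean as the cache needed to undo it
function center_apply(x)
    μ = mean(x)
    return x .- μ, μ
end

# Undo the centering for any column in the centered space
center_revert(y, cache) = y .+ cache

a = [-1.0, 4.0, 1.6, 3.4]        # column "a" from the geotable above
newa, cache = center_apply(a)     # cache holds the mean of a
restored = center_revert(newa, cache)

# The same cache also reverts other values in the centered space,
# e.g. a made-up model prediction of 0.5
prediction = center_revert([0.5], cache)
```

The last line is the reason the cache is kept separate from the transformed data: it can revert results that were never part of the original table, as long as they have the same schema.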

A very common workflow in geospatial data science consists of:

  1. Transforming the data to an appropriate sample space for geostatistical analysis
  2. Doing additional modeling to predict variables in new geospatial locations
  3. Reverting the modeling results with the saved pipeline and cache

We will see examples of this workflow in Part V of the book.

7.5 Congratulations!

Congratulations on finishing Part II of the book. Let’s quickly review what we learned so far:

  • Transforms and pipelines are powerful tools to achieve reproducible geospatial data science.
  • The operators → and ⊔ can be used to build lazy pipelines. After a pipeline is built, it can be applied to different geotables, which may have different types of geospatial domain.
  • Lazy pipelines can be optimized for computational performance, and the Julia language excels at dispatching the appropriate optimizations when they are available.
  • Map projections are specific types of coordinate transforms. They can be combined with many other transforms in the framework to produce advanced geostatistical visualizations.

There is a long journey until the technology reaches its full potential. The good news is that Julia code is easy to read and modify, and you can become an active contributor after just a few weeks working with the language. We invite you to contribute new transforms and optimizations as soon as you feel comfortable with the framework.