9  Geospatial joins

Another important tool in geospatial data science is the geospatial join. We will introduce the concept with a practical example, and will explain how it is related to the standard join of two tables.

9.1 Motivation

The split-apply-combine pattern that we learned in the previous chapter requires a single geotable with all the relevant information in it. However, many questions in geospatial data science can only be answered with information that is spread across multiple geotables. Hence, the need to join these geotables before attempting to @groupby, @transform and @combine the information.

Let’s consider a simple example where we are given two geotables, one containing people who shared their latitude and longitude coordinates:

table = (
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34.0, 12.0, 23.0, 39.0, 28.0]u"yr",
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72]u"m",
  LATITUDE=[-22.96710361241228, 37.42773662442142, -27.486220858775997, 39.90358408375064, -3.847311538763359],
  LONGITUDE=[-43.17891118844475, -122.17007072663823, 153.04380578036657, 116.40764745941036, -32.411372812211226]
)

people = georef(table, (:LONGITUDE, :LATITUDE))
5×4 GeoTable over 5 PointSet{2,Float64}
NAME AGE HEIGHT geometry
Categorical Continuous Continuous Point2
[NoUnits] [yr] [m]
John 34.0 yr 1.78 m (-43.1789, -22.9671)
Mary 12.0 yr 1.56 m (-122.17, 37.4277)
Paul 23.0 yr 1.7 m (153.044, -27.4862)
Anne 39.0 yr 1.8 m (116.408, 39.9036)
Kate 28.0 yr 1.72 m (-32.4114, -3.84731)

And another containing countries in the world according to the Natural Earth project:

using GeoIO

countries = GeoIO.load("data/countries.geojson", numbertype = Float64)
177×3 GeoTable over 177 GeometrySet{2,Float64}
COUNTRY REGION geometry
Categorical Categorical MultiPolygon
[NoUnits] [NoUnits]
Fiji Melanesia Multi(3×PolyArea)
Tanzania Eastern Africa Multi(1×PolyArea)
W. Sahara Northern Africa Multi(1×PolyArea)
Canada Northern America Multi(30×PolyArea)
United States of America Northern America Multi(10×PolyArea)
Kazakhstan Central Asia Multi(1×PolyArea)
Uzbekistan Central Asia Multi(1×PolyArea)
Papua New Guinea Melanesia Multi(4×PolyArea)
Indonesia South-Eastern Asia Multi(13×PolyArea)
Argentina South America Multi(2×PolyArea)
⋮ ⋮ ⋮
Note

The georef function has a method that accepts the names of columns with geospatial coordinates. In this example, the table already has the LATITUDE and LONGITUDE coordinates.

Let’s visualize the geometries of these two geotables:

fig = Mke.Figure()
ax = Mke.Axis(fig[1,1], title = "People and countries",
              xlabel = "longitude", ylabel = "latitude")
viz!(countries.geometry)
viz!(people.geometry, color = "teal", pointsize = 10)
fig
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/ND0gA/src/scenes.jl:220

Our goal in this example is to attach the “COUNTRY” and “REGION” information to each individual based on their location, represented as a point. In other words, we want to find the countries that contain the location, and then copy the columns to the people.

9.2 Joining geotables

The geojoin function can be used to join two geotables using a geometric predicate function:

geojoin(people, countries, pred = ∈)
5×6 GeoTable over 5 PointSet{2,Float64}
NAME AGE HEIGHT COUNTRY REGION geometry
Categorical Continuous Continuous Categorical Categorical Point2
[NoUnits] [yr] [m] [NoUnits] [NoUnits]
John 34.0 yr 1.78 m Brazil South America (-43.1789, -22.9671)
Mary 12.0 yr 1.56 m United States of America Northern America (-122.17, 37.4277)
Paul 23.0 yr 1.7 m Australia Australia and New Zealand (153.044, -27.4862)
Anne 39.0 yr 1.8 m China Eastern Asia (116.408, 39.9036)
Kate 28.0 yr 1.72 m missing missing (-32.4114, -3.84731)

In the example above, we used the ∈ predicate to check if a Point in the people geotable is in a MultiPolygon in the countries geotable. The default predicate in geojoin is the intersects function that we covered in previous chapters.

Notice that Kate’s “COUNTRY” and “REGION” are missing. That is because Kate lives in the Fernando de Noronha island, which is not present in the countries geotable. To retain just those people for which the geometric predicate evaluates true, we can use a different kind of geojoin known as inner:

geojoin(people, countries, kind = :inner)
4×6 GeoTable over 4 view(::PointSet{2,Float64}, [1, 2, 3, 4])
NAME AGE HEIGHT COUNTRY REGION geometry
Categorical Continuous Continuous Categorical Categorical Point2
[NoUnits] [yr] [m] [NoUnits] [NoUnits]
John 34.0 yr 1.78 m Brazil South America (-43.1789, -22.9671)
Mary 12.0 yr 1.56 m United States of America Northern America (-122.17, 37.4277)
Paul 23.0 yr 1.7 m Australia Australia and New Zealand (153.044, -27.4862)
Anne 39.0 yr 1.8 m China Eastern Asia (116.408, 39.9036)

By default, the left kind is used. Like in standard join, the geojoin is not commutative. If we swap the order of the geotables, the result will be different:

geojoin(countries, people, kind = :inner)
4×6 GeoTable over 4 view(::GeometrySet{2,Float64}, [5, 30, 138, 140])
COUNTRY REGION NAME AGE HEIGHT geometry
Categorical Categorical Categorical Continuous Continuous MultiPolygon
[NoUnits] [NoUnits] [NoUnits] [yr] [m]
United States of America Northern America Mary 12.0 yr 1.56 m Multi(10×PolyArea)
Brazil South America John 34.0 yr 1.78 m Multi(1×PolyArea)
Australia Australia and New Zealand Paul 23.0 yr 1.7 m Multi(2×PolyArea)
China Eastern Asia Anne 39.0 yr 1.8 m Multi(2×PolyArea)

To learn about the different kinds of join, check DataFrames.jl’s documentation on joins.

Note

In database-style join, the predicate function isequal is applied to arbitrary columns of tables. In geojoin, geometric predicate functions are applied to the geometry column of geotables.

The tablejoin function can be used for database-style join between a geotable and a simple table (e.g., DataFrame). The result will be a new geotable over a subset of geometries from the first argument. Please check the documentation for more details.

Tip for all users

Besides geojoin, the framework also provides vertical and horizontal concatenation of geotables. Vertical concatenation is achieved with vcat as follows:

gt1 = georef((a=[1,2,3], b=[4,5,6]))
gt2 = georef((b=[4,5,6], c=[7,8,9]))

vcat(gt1, gt2)
6×4 GeoTable over 6 GeometrySet{1,Float64}
a b c geometry
Categorical Categorical Categorical Segment
[NoUnits] [NoUnits] [NoUnits]
1 4 missing Segment((0.0,), (1.0,))
2 5 missing Segment((1.0,), (2.0,))
3 6 missing Segment((2.0,), (3.0,))
missing 4 7 Segment((0.0,), (1.0,))
missing 5 8 Segment((1.0,), (2.0,))
missing 6 9 Segment((2.0,), (3.0,))

All columns are preserved by default, which corresponds to the option kind = :union. Absent columns are filled with missing values. The option kind = :intersect can be used to retain only the columns that are present in all geotables:

vcat(gt1, gt2, kind = :intersect)
6×2 GeoTable over 6 GeometrySet{1,Float64}
b geometry
Categorical Segment
[NoUnits]
4 Segment((0.0,), (1.0,))
5 Segment((1.0,), (2.0,))
6 Segment((2.0,), (3.0,))
4 Segment((0.0,), (1.0,))
5 Segment((1.0,), (2.0,))
6 Segment((2.0,), (3.0,))

Horizontal concatenation is achieved with hcat as follows:

hcat(gt1, gt2)
3×5 GeoTable over 3 CartesianGrid{1,Float64}
a b b_ c geometry
Categorical Categorical Categorical Categorical Segment
[NoUnits] [NoUnits] [NoUnits] [NoUnits]
1 4 4 7 Segment((0.0,), (1.0,))
2 5 5 8 Segment((1.0,), (2.0,))
3 6 6 9 Segment((2.0,), (3.0,))

Unique column names are produced with underscore suffixes whenever a column appears in multiple geotables. Julia provides convenient syntax for vcat and hcat:

[gt1
 gt2]
6×4 GeoTable over 6 GeometrySet{1,Float64}
a b c geometry
Categorical Categorical Categorical Segment
[NoUnits] [NoUnits] [NoUnits]
1 4 missing Segment((0.0,), (1.0,))
2 5 missing Segment((1.0,), (2.0,))
3 6 missing Segment((2.0,), (3.0,))
missing 4 7 Segment((0.0,), (1.0,))
missing 5 8 Segment((1.0,), (2.0,))
missing 6 9 Segment((2.0,), (3.0,))
[gt1 gt2]
3×5 GeoTable over 3 CartesianGrid{1,Float64}
a b b_ c geometry
Categorical Categorical Categorical Categorical Segment
[NoUnits] [NoUnits] [NoUnits] [NoUnits]
1 4 4 7 Segment((0.0,), (1.0,))
2 5 5 8 Segment((1.0,), (2.0,))
3 6 6 9 Segment((2.0,), (3.0,))

9.3 Common predicates

In geospatial data science, the most common geometric predicates used in geojoin are ∈, ⊆, ==, ≈ and intersects as illustrated in Table 9.1. Specific applications may require custom predicates, which can be easily defined in pure Julia with the Meshes.jl module. For example, it is sometimes convenient to define geometric predicates in terms of a distance and a threshold:

pred(g1, g2) = evaluate(Euclidean(), centroid(g1), centroid(g2)) ≤ 1u"m"`
Table 9.1: Common geometric predicates in geospatial join
PREDICATE EXAMPLE
∈
⊆
==
≈
intersects

9.4 Multiple matches

Sometimes the predicate function will evaluate true for multiple rows of the geotables. In this case, we need to decide which value will be copied to the resulting geotable. The geojoin accepts reduction functions for each column. For example, suppose that Kate decided to move to the continent:

table = (
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34.0, 12.0, 23.0, 39.0, 28.0]u"yr",
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72]u"m",
  LATITUDE=[-22.96710361241228, 37.42773662442142, -27.486220858775997, 39.90358408375064, -9.66628224039543],
  LONGITUDE=[-43.17891118844475, -122.17007072663823, 153.04380578036657, 116.40764745941036, -35.71261407423411]
)

people = georef(table, (:LONGITUDE, :LATITUDE))
5×4 GeoTable over 5 PointSet{2,Float64}
NAME AGE HEIGHT geometry
Categorical Continuous Continuous Point2
[NoUnits] [yr] [m]
John 34.0 yr 1.78 m (-43.1789, -22.9671)
Mary 12.0 yr 1.56 m (-122.17, 37.4277)
Paul 23.0 yr 1.7 m (153.044, -27.4862)
Anne 39.0 yr 1.8 m (116.408, 39.9036)
Kate 28.0 yr 1.72 m (-35.7126, -9.66628)

Now, both John and Kate live in Brazil according the countries geotable:

geojoin(people, countries)
5×6 GeoTable over 5 PointSet{2,Float64}
NAME AGE HEIGHT COUNTRY REGION geometry
Categorical Continuous Continuous Categorical Categorical Point2
[NoUnits] [yr] [m] [NoUnits] [NoUnits]
John 34.0 yr 1.78 m Brazil South America (-43.1789, -22.9671)
Mary 12.0 yr 1.56 m United States of America Northern America (-122.17, 37.4277)
Paul 23.0 yr 1.7 m Australia Australia and New Zealand (153.044, -27.4862)
Anne 39.0 yr 1.8 m China Eastern Asia (116.408, 39.9036)
Kate 28.0 yr 1.72 m Brazil South America (-35.7126, -9.66628)

If we swap the order of the geotables in the geojoin, we need to decide how to reduce the multiple values of “NAME”, “AGE” and “HEIGHT” in the resulting geotable. By default, the mean function is used to reduce Continuous variables, and the first function is used otherwise:

geojoin(countries, people, kind = :inner)
4×6 GeoTable over 4 view(::GeometrySet{2,Float64}, [5, 30, 138, 140])
COUNTRY REGION NAME AGE HEIGHT geometry
Categorical Categorical Categorical Continuous Continuous MultiPolygon
[NoUnits] [NoUnits] [NoUnits] [yr] [m]
United States of America Northern America Mary 12.0 yr 1.56 m Multi(10×PolyArea)
Brazil South America John 31.0 yr 1.75 m Multi(1×PolyArea)
Australia Australia and New Zealand Paul 23.0 yr 1.7 m Multi(2×PolyArea)
China Eastern Asia Anne 39.0 yr 1.8 m Multi(2×PolyArea)

We can specify custom reduction functions using the following syntax:

geojoin(countries, people, "AGE" => maximum, "HEIGHT" => mean, kind = :inner)
4×6 GeoTable over 4 view(::GeometrySet{2,Float64}, [5, 30, 138, 140])
COUNTRY REGION NAME AGE HEIGHT geometry
Categorical Categorical Categorical Continuous Continuous MultiPolygon
[NoUnits] [NoUnits] [NoUnits] [yr] [m]
United States of America Northern America Mary 12.0 yr 1.56 m Multi(10×PolyArea)
Brazil South America John 34.0 yr 1.75 m Multi(1×PolyArea)
Australia Australia and New Zealand Paul 23.0 yr 1.7 m Multi(2×PolyArea)
China Eastern Asia Anne 39.0 yr 1.8 m Multi(2×PolyArea)
Note

The keyword arguments kind and pred must appear at the end of the geojoin:

geojoin(gt1, gt2, var1 => red1, ..., varn => redn; kind = :left, pred = intersects)

9.5 Congratulations!

Congratulations on finishing Part III of the book. Let’s quickly review what we learned so far:

  • In order to answer geoscientific questions with @groupby, @transform and @combine, we need a single geotable with all the relevant information. This single geotable is often the result of a geojoin with multiple geotables from different sources of information.
  • There are various kinds of geojoin such as inner and left, which match the behavior of standard join in databases. We recommend the DataFrames.jl documentation to learn more.
  • The geojoin produces matches with geometric predicate functions. The most common are illustrated in Table 9.1, but custom predicates can be easily defined in pure Julia using functions defined in the Meshes.jl module.

In the next part of the book, you will learn a final important tool that is missing in our toolkit for advanced geospatial data science. The concepts in the following chapters are extremely important.