1  What is geospatial data?

Welcome to Part I of the book. Before we can start our journey in geospatial data science, we need to introduce important concepts, which will be the foundations of Part II, Part III and Part IV.

In this chapter, we define geospatial data and introduce a universal representation for it which is ideal for geostatistical analysis. Unlike other representations in the literature (e.g., “raster”, “vector” data), the proposed representation is suitable for encoding geospatial data over 3D unstructured meshes, 2D images embedded in 3D space, and other types of complex geospatial domains.

1.1 Definition

Definition

(Discrete) geospatial data is the combination of a table of attributes (or features) with a discretization of a geospatial domain. Each row (or measurement) in the table corresponds to an element (or geometry) in the discretization of the geospatial domain.

The definition depends on two other definitions that we clarify next.

1.1.1 Table

In data science the most natural data structure for working with data is the table. Generally speaking, a table is any object that can be structured into rows containing measurements and columns representing variables. For example, Table 1.1 has 5 measurements of 4 variables:

Table 1.1: Example of table
NAME AGE HEIGHT GENDER
John 34 1.78m male
Mary 12 1.56m female
Paul 23 1.70m male
Anne 39 1.80m female
Kate 28 1.72m female

In Julia, the concept of table is formalized in Tables.jl by Quinn et al. (2023). The definition is independent of the machine representation, and various representations can co-exist in the language.

1.1.2 Domain

The second definition that we need is that of a geospatial domain. In geosciences, questions are often formulated within a physical region of interest. This physical region can cover a small area of the surface of the Earth, the entire Earth surface, or any region of finite measure that can be discretized into smaller geometries (a.k.a. elements):

(a) Coast line in Islay Province, Peru. View on Google Maps
(b) Synthetic carbonate reservoir model by Correia et al. (2015). See UNISIM-II for more details
Figure 1.1: Example of geospatial domains

Figure 1.1 illustrates two very different examples of geospatial domains. The domain in Figure 1.1 (a) is widely studied in GIS books. It is a 2D domain that contemplates a small area near Islay Province, Peru. The domain in Figure 1.1 (b) on the other hand is not considered in traditional GIS literature. It is a 3D domain that has been discretized into hexahedron geometries.

The concept of geospatial domain is formalized in Meshes.jl by Hoffimann et al. (2021).

1.1.3 Remarks

  • Images like the one depicted in Figure 1.1 (a) are often implemented in terms of the array data structure. GIS books call it “raster data”, but we will avoid this term in our framework in order to obtain a more general set of tools.

  • According to our definition of geospatial data, “raster data” is simply a table with colors as variables (e.g., RGB values) combined with a Cartesian grid of quadrangle geometries. We illustrate this concept in Figure 1.2 by zooming in a satellite image of the Lena delta:

    (a) “Raster data” of Lena delta
    (b) Zoom reveals quadrangle geometries
    Figure 1.2: Quadrangle geometries in “raster data”
  • There are no constraints on the geometries used in the discretization of the geospatial domain. In Figure 1.3, Brazil is discretized into complex polygonal geometries that represent country states:

    Figure 1.3: Brazil’s states represented with complex polygonal geometries. View on Google Maps.
  • GIS books call it “vector data” because the geometries are stored as vectors of coordinates in memory. We will also avoid this term in our framework given that it only highlights an implementation detail.

Before we start discussing machine representation with actual Julia code, let’s make a final (pedantic) distinction between the words geospatial and spatial. These words mean different things in different communities:

  • In geosciences, an object is geospatial if it lives in physical space.
  • In statistics, a model is spatial if it exploits the vicinity of samples in the sample space.

Given that geospatial data science deals with both concepts, we must use these words carefully.

Note

In Geostatistical Learning, models can exploit both spaces to improve prediction performance, but that is out of the scope for this book.

1.2 Representation

Based on the definition of geospatial data given in the previous section, we are now ready to proceed and discuss an efficient machine representation for it with actual Julia code.

1.2.1 Table

The Julia language comes with two built-in table representations:

  1. Named tuple of vectors
  2. Vector of named tuples

The first representation focuses on the columns of the table:

coltable = (
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34, 12, 23, 39, 28],
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
  GENDER=["male", "female", "male", "female", "female"]
)
(NAME = ["John", "Mary", "Paul", "Anne", "Kate"], AGE = [34, 12, 23, 39, 28], HEIGHT = [1.78, 1.56, 1.7, 1.8, 1.72], GENDER = ["male", "female", "male", "female", "female"])

Given that data science is often performed with entire columns, this column-major representation of a table is very convenient. The second representation focuses on the rows of the table:

rowtable = [
  (NAME="John", AGE=34, HEIGHT=1.78, GENDER="male"),
  (NAME="Mary", AGE=12, HEIGHT=1.56, GENDER="female"),
  (NAME="Paul", AGE=23, HEIGHT=1.70, GENDER="male"),
  (NAME="Anne", AGE=39, HEIGHT=1.80, GENDER="female"),
  (NAME="Kate", AGE=28, HEIGHT=1.72, GENDER="female")
]
5-element Vector{@NamedTuple{NAME::String, AGE::Int64, HEIGHT::Float64, GENDER::String}}:
 (NAME = "John", AGE = 34, HEIGHT = 1.78, GENDER = "male")
 (NAME = "Mary", AGE = 12, HEIGHT = 1.56, GENDER = "female")
 (NAME = "Paul", AGE = 23, HEIGHT = 1.7, GENDER = "male")
 (NAME = "Anne", AGE = 39, HEIGHT = 1.8, GENDER = "female")
 (NAME = "Kate", AGE = 28, HEIGHT = 1.72, GENDER = "female")

The row-major representation can be useful to process data that is potentially larger than the available computer memory, or infinite streams of data.

Although these two representations come built-in with Julia, they lack basic functionality for data science. The most widely used table representation for data science in Julia is available in DataFrames.jl by Bouchet-Valat and Kamiński (2023).

using DataFrames

df = DataFrame(
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34, 12, 23, 39, 28],
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
  GENDER=["male", "female", "male", "female", "female"]
)
5×4 DataFrame
Row NAME AGE HEIGHT GENDER
String Int64 Float64 String
1 John 34 1.78 male
2 Mary 12 1.56 female
3 Paul 23 1.7 male
4 Anne 39 1.8 female
5 Kate 28 1.72 female

This representation provides additional syntax for accessing rows and columns of the table:

df[1,:]
DataFrameRow (4 columns)
Row NAME AGE HEIGHT GENDER
String Int64 Float64 String
1 John 34 1.78 male
df[:,"NAME"]
5-element Vector{String}:
 "John"
 "Mary"
 "Paul"
 "Anne"
 "Kate"
df[1:3,["NAME","AGE"]]
3×2 DataFrame
Row NAME AGE
String Int64
1 John 34
2 Mary 12
3 Paul 23
df.HEIGHT
5-element Vector{Float64}:
 1.78
 1.56
 1.7
 1.8
 1.72
df."HEIGHT"
5-element Vector{Float64}:
 1.78
 1.56
 1.7
 1.8
 1.72
Note

Unlike other languages, Julia makes a distinction between the the symbol :HEIGHT and the string "HEIGHT". The DataFrame representation supports both types for column names, but that is not always the case with other table representations.

Other popular table representations in Julia are associated with specific file formats:

As an example, we consider the table representation of a table.csv file stored on disk:

using CSV

CSV.File("data/table.csv")
5-element CSV.File:
 CSV.Row: (NAME = String7("John"), AGE = 34, HEIGHT = 1.78, GENDER = String7("male"))
 CSV.Row: (NAME = String7("Mary"), AGE = 12, HEIGHT = 1.56, GENDER = String7("female"))
 CSV.Row: (NAME = String7("Paul"), AGE = 23, HEIGHT = 1.7, GENDER = String7("male"))
 CSV.Row: (NAME = String7("Anne"), AGE = 39, HEIGHT = 1.8, GENDER = String7("female"))
 CSV.Row: (NAME = String7("Kate"), AGE = 28, HEIGHT = 1.72, GENDER = String7("female"))

The choice of table representation is a function of the application.

1.2.2 Domain

All available domain representations come from the Meshes.jl module. Let’s start illustrating the most common domains in GIS, which are collections of disconnected 2D geometries:

using GeoStats

p = Point(1, 2)
s = Segment((0, 2), (1, 3))
t = Triangle((0, 0), (1, 0), (1, 1))
b = Ball((2, 2), 1)

geoms = [p, s, t, b]
4-element Vector{Geometry{2, Float64}}:
 Point(1.0, 2.0)
 Segment((0.0, 2.0), (1.0, 3.0))
 Triangle((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
 Ball(center: (2.0, 2.0), radius: 1.0)

Because these geometries are unaware of each other, we place them into a GeometrySet, informally known in computational geometry as the “soup of geometries” data structure:

gset = GeometrySet(geoms)
4 GeometrySet{2,Float64}
├─ Point(1.0, 2.0)
├─ Segment((0.0, 2.0), (1.0, 3.0))
├─ Triangle((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
└─ Ball(center: (2.0, 2.0), radius: 1.0)

No advanced knowledge is required to start working with these geometries. For example, we can compute the length of the Segment, the area of the Triangle and the area of the Ball with:

length(s), area(t), area(b)
(1.4142135623730951, 0.5, 3.141592653589793)

More generally, we can compute the measure of the geometries in the domain:

[measure(g) for g in gset]
4-element Vector{Float64}:
 0.0
 1.4142135623730951
 0.5
 3.141592653589793

In the example above, we iterated over the domain to apply the function of interest, but we could have used Julia’s dot syntax for broadcasting the function over the geometries:

measure.(gset)
4-element Vector{Float64}:
 0.0
 1.4142135623730951
 0.5
 3.141592653589793

The list of supported geometries is very comprehensive. It encompasses all geometries from the simple features standard and more. We will see more examples in the following chapters.

One of the main limitations of GIS software today is the lack of explicit representation of topology. A GeometrySet does not provide efficient topological relations (Floriani and Hui 2007), yet advanced geospatial data science requires the definition of geospatial domains where geometries are aware of their neighbors. Let’s illustrate this concept with the CartesianGrid domain:

grid = CartesianGrid(10, 10)
10×10 CartesianGrid{2,Float64}
  minimum: Point(0.0, 0.0)
  maximum: Point(10.0, 10.0)
  spacing: (1.0, 1.0)

We can access the individual geometries of the domain as before:

grid[1]
Quadrangle{2,Float64}
├─ Point(0.0, 0.0)
├─ Point(1.0, 0.0)
├─ Point(1.0, 1.0)
└─ Point(0.0, 1.0)

And even though we can manipulate this domain as if it was a “soup of geometries”, the major advantage in this abstraction is the underlying topology:

topo = topology(grid)
10×10 GridTopology(aperiodic, aperiodic)

This data structure can be used by advanced users who wish to design algorithms with neighborhood information. We will cover this topic in a separate chapter. For now, keep in mind that working with the entire domain as opposed to with a vector or “soup of geometries” has major benefits.

Note

The CartesianGrid domain is lazy, meaning it only stores the start and end points of the grid together with the spacing between the elements. Therefore, we can easily create large 3D grids of Hexahedron geometries without consuming all available memory:

grid = CartesianGrid(10000, 10000, 10000)
10000×10000×10000 CartesianGrid{3,Float64}
  minimum: Point(0.0, 0.0, 0.0)
  maximum: Point(10000.0, 10000.0, 10000.0)
  spacing: (1.0, 1.0, 1.0)
grid[1]
Hexahedron{3,Float64}
├─ Point(0.0, 0.0, 0.0)
├─ Point(1.0, 0.0, 0.0)
├─ Point(1.0, 1.0, 0.0)
├─ Point(0.0, 1.0, 0.0)
├─ Point(0.0, 0.0, 1.0)
├─ Point(1.0, 0.0, 1.0)
├─ Point(1.0, 1.0, 1.0)
└─ Point(0.0, 1.0, 1.0)

In computational geometry, a CartesianGrid is a specific type of mesh. It can only represent “flat” domains sampled regularly along each dimension (e.g., images). To represent domains with curvature such as terrain elevation models or complex 3D domains like the one in Figure 1.1 (b), we can use the SimpleMesh domain:

# global vector of 2D points
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.25, 0.5), (0.75, 0.5)]

# connect the points into N-gons
connec = connect.([(1, 2, 6, 5), (2, 4, 6), (4, 3, 5, 6), (3, 1, 5)])

# 2D mesh made of N-gon elements
mesh = SimpleMesh(points, connec)
4 SimpleMesh{2,Float64}
  6 vertices
  ├─ Point(0.0, 0.0)
  ├─ Point(1.0, 0.0)
  ├─ Point(0.0, 1.0)
  ├─ Point(1.0, 1.0)
  ├─ Point(0.25, 0.5)
  └─ Point(0.75, 0.5)
  4 elements
  ├─ Quadrangle(1, 2, 6, 5)
  ├─ Triangle(2, 4, 6)
  ├─ Quadrangle(4, 3, 5, 6)
  └─ Triangle(3, 1, 5)

The connect function takes a tuple of indices and a geometry type, and produces a connectivity object. The geometry type can be omitted, in which case it is assumed to be a Ngon, i.e., a polygon with N sides:

c = connect((1, 2, 3))
Triangle(1, 2, 3)

This connectivity object can be materialized into an actual geometry with a vector of points:

materialize(c, [Point(0, 0), Point(1, 0), Point(1, 1)])
Triangle{2,Float64}
├─ Point(0.0, 0.0)
├─ Point(1.0, 0.0)
└─ Point(1.0, 1.0)

The SimpleMesh uses the materialize function above to construct geometries on the fly, similar to what we have seen with the CartesianGrid:

mesh[1]
Quadrangle{2,Float64}
├─ Point(0.0, 0.0)
├─ Point(1.0, 0.0)
├─ Point(0.75, 0.5)
└─ Point(0.25, 0.5)

Don’t worry if you feel overwhelmed by these concepts. We are only sharing them here to give you an idea of how complex 3D domains are represented in the framework. You can do geospatial data science without ever having to operate with these concepts explicitly.

Let’s make a few important remarks:

  • Flexibility comes with a price. To construct a SimpleMesh of connected geometries we need to explicitly create a vector of vertices, and connect these vertices into geometries using their indices in the vector.
  • Geometries in a SimpleMesh can be of different type. In the example, we have both Triangle and Quadrangle geometries in the domain. This is similar to what we had with GeometrySet, but now the geometries are connected.
  • SimpleMesh are rarely constructed by hand. They are often the result of a sophisticated geometric processing pipeline that is already stored in a file on disk.

The last missing piece of the puzzle is the combination of tables with domains into geospatial data, which we discuss next.

1.2.3 Data

Wouldn’t it be nice if we had a representation of geospatial data that behaved like a table as discussed in the Tables section, but preserved topological information as discussed in the Domains section? In the GeoStats.jl framework, this is precisely what we get with the georef function:

using GeoStats

df = DataFrame(
  NAME=["John", "Mary", "Paul", "Anne"],
  AGE=[34.0, 12.0, 23.0, 39.0]u"yr",
  HEIGHT=[1.78, 1.56, 1.70, 1.80]u"m",
  GENDER=["male", "female", "male", "female"]
)

grid = CartesianGrid(2, 2)

geotable = georef(df, grid)
4×5 GeoTable over 2×2 CartesianGrid{2,Float64}
NAME AGE HEIGHT GENDER geometry
Categorical Continuous Continuous Categorical Quadrangle
[NoUnits] [yr] [m] [NoUnits]
John 34.0 yr 1.78 m male Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
Mary 12.0 yr 1.56 m female Quadrangle((1.0, 0.0), ..., (1.0, 1.0))
Paul 23.0 yr 1.7 m male Quadrangle((0.0, 1.0), ..., (0.0, 2.0))
Anne 39.0 yr 1.8 m female Quadrangle((1.0, 1.0), ..., (1.0, 2.0))
Tip for all users

The framework is integrated with the Unitful.jl module. To add the unit “meter” to the numeric value 1.0, we can write 1.0u"m". Similarly, we can add units to vectors of values:

[1.0, 2.0, 3.0]u"m"
3-element Vector{Quantity{Float64, 𝐋, Unitful.FreeUnits{(m,), 𝐋, nothing}}}:
 1.0 m
 2.0 m
 3.0 m

The function combines any table with any domain into a geospatial data representation that adheres to the Tables.jl interface. We call this representation a GeoTable to distinguish it from a standard table. Besides the original columns, the GeoTable has a special geometry column with the underlying domain:

names(geotable)
5-element Vector{String}:
 "NAME"
 "AGE"
 "HEIGHT"
 "GENDER"
 "geometry"

Unlike a standard table, the GeoTable creates geometries on the fly depending on the data access pattern. For example, we can request the first measurement of the GeoTable and it will automatically construct the corresponding Quadrangle:

geotable[1,:]
(NAME = "John", AGE = 34.0 yr, HEIGHT = 1.78 m, GENDER = "male", geometry = Quadrangle((0.0, 0.0), ..., (0.0, 1.0)))

If we request a subset of measurements, the GeoTable will avoid unnecessary creation of geometries, and will instead return a view into the original data:

geotable[1:3,["NAME","AGE"]]
3×3 GeoTable over 3 view(::CartesianGrid{2,Float64}, 1:3)
NAME AGE geometry
Categorical Continuous Quadrangle
[NoUnits] [yr]
John 34.0 yr Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
Mary 12.0 yr Quadrangle((1.0, 0.0), ..., (1.0, 1.0))
Paul 23.0 yr Quadrangle((0.0, 1.0), ..., (0.0, 2.0))

Finally, if we request the entire geometry column, we get back the original domain:

geotable[:,"geometry"]
2×2 CartesianGrid{2,Float64}
  minimum: Point(0.0, 0.0)
  maximum: Point(2.0, 2.0)
  spacing: (1.0, 1.0)

Besides the data access patterns of the DataFrame, the GeoTable also provides an advanced method for retrieving all rows that intersect with a given geometry:

geotable[Segment((0, 0), (2, 0)), :]
2×5 GeoTable over 2 view(::CartesianGrid{2,Float64}, [1, 2])
NAME AGE HEIGHT GENDER geometry
Categorical Continuous Continuous Categorical Quadrangle
[NoUnits] [yr] [m] [NoUnits]
John 34.0 yr 1.78 m male Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
Mary 12.0 yr 1.56 m female Quadrangle((1.0, 0.0), ..., (1.0, 1.0))

This method is very useful to narrow the region of interest and quickly discard all measurements that are outside of it. For instance, it is common to discard all “pixels” outside of a polygon before exporting the geotable to a file on disk.

Notice that the GeoTable representation is general enough to accommodate both “raster data” and “vector data” in traditional GIS. We can create very large rasters because the CartesianGrid is lazy:

georef(
  (
    R=rand(1000000),
    G=rand(1000000),
    B=rand(1000000)
  ),
  CartesianGrid(1000, 1000)
)
1000000×4 GeoTable over 1000×1000 CartesianGrid{2,Float64}
R G B geometry
Continuous Continuous Continuous Quadrangle
[NoUnits] [NoUnits] [NoUnits]
0.597682 0.908811 0.614286 Quadrangle((0.0, 0.0), ..., (0.0, 1.0))
0.603924 0.659289 0.899297 Quadrangle((1.0, 0.0), ..., (1.0, 1.0))
0.646975 0.421632 0.0126513 Quadrangle((2.0, 0.0), ..., (2.0, 1.0))
0.553914 0.842626 0.540499 Quadrangle((3.0, 0.0), ..., (3.0, 1.0))
0.189256 0.773223 0.607486 Quadrangle((4.0, 0.0), ..., (4.0, 1.0))
0.492625 0.433846 0.322602 Quadrangle((5.0, 0.0), ..., (5.0, 1.0))
0.243811 0.836498 0.743303 Quadrangle((6.0, 0.0), ..., (6.0, 1.0))
0.704833 0.0608002 0.191424 Quadrangle((7.0, 0.0), ..., (7.0, 1.0))
0.236498 0.373837 0.949686 Quadrangle((8.0, 0.0), ..., (8.0, 1.0))
0.0952209 0.485773 0.415937 Quadrangle((9.0, 0.0), ..., (9.0, 1.0))

And can load vector geometries from files that store simple features using the GeoIO.jl module:

using GeoIO

GeoIO.load("data/geotable.geojson")
4×2 GeoTable over 4 GeometrySet{2,Float32}
NAME geometry
Categorical MultiPolygon
[NoUnits]
Estuario Multi(21×PolyArea)
Fronteiras Antigas Multi(1×PolyArea)
Fronteiras Intermediarias Multi(1×PolyArea)
Fronteiras Novas Multi(2×PolyArea)

We will see more examples of “vector data” in the chapter Interfacing with GIS, and will explain why file formats like Shapefile.jl and GeoJSON.jl are not enough for advanced geospatial data science.

Tip for all users

The GeoStats.jl module reexports the full stack of modules for geospatial data science in Julia. There is no need to import modules like Meshes.jl explicitly. You are all set if you start your script with

using GeoStats
Tip for advanced users

In Julia, a function is type-stable if the return type is known at compile time. Since the GeoTable has columns from the original table (e.g., numbers) and an additional special geometry column, the access to the data with the DataFrame syntax is not type stable. If you need to write type-stable code, use the functions values and domain instead:

values(geotable)
4×4 DataFrame
Row NAME AGE HEIGHT GENDER
String Quantity… Quantity… String
1 John 34.0 yr 1.78 m male
2 Mary 12.0 yr 1.56 m female
3 Paul 23.0 yr 1.7 m male
4 Anne 39.0 yr 1.8 m female
domain(geotable)
2×2 CartesianGrid{2,Float64}
  minimum: Point(0.0, 0.0)
  maximum: Point(2.0, 2.0)
  spacing: (1.0, 1.0)

1.3 Remarks

What did we learn in this chapter?

  • Geospatial data can be efficiently represented as a GeoTable. This representation is universal, meaning it combines the “raster” and “vector” representations found in traditional GIS.
  • A GeoTable provides all the data access patterns of a DataFrame. In particular, it supports the syntax geotable[rows,cols] without expensive copies of geometries.
  • Working with geospatial domains as opposed to working with vectors of disconnected geometries has major benefits. In particular, it preserves topological information.

It is very convenient to manipulate a GeoTable as if it was a DataFrame. Nevertheless, we will learn that advanced geospatial data science requires higher-level constructs to preserve geospatial information. We will cover these constructs in Part II and Part III of the book.