Learning

Overview

We formalized the concept of geostatistical learning in Hoffimann et al. 2021. Geostatistical Learning: Challenges and Opportunities. The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.

We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia:

StatsLearnModels.Learn — Type

Learn(table; [model])

Perform supervised learning with labeled table and statistical learning model.

Uses KNNClassifier(1) or KNNRegressor(1) model by default depending on the scientific type of the labels stored in the table.

Examples

Learn(label(table, "y"))
Learn(label(table, ["y1", "y2"]))
Learn(label(table, 3), model=KNNClassifier(5))

Models

Nearest neighbor models

StatsLearnModels.KNNClassifier — Type

KNNClassifier(k, metric=Euclidean(); leafsize=10, reorder=true)

K-nearest neighbor classification model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.

Generalized linear models

StatsLearnModels.LinearRegressor — Type

LinearRegressor(; kwargs...)

Linear regression model.

The kwargs are forwarded to the GLM.lm function from GLM.jl.

Decision tree models

DecisionTree.DecisionTreeClassifier — Type

DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)

Decision tree classifier. See DecisionTree.jl's documentation

Hyperparameters:

pruning_purity_threshold: (post-pruning) merge leaves having >=thresh combined purity (default: no pruning)
max_depth: maximum depth of the decision tree (default: no maximum)
min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
min_samples_split: the minimum number of samples in needed for a split (default: 2)
min_purity_increase: minimum purity needed for a split (default: 0.0)
n_subfeatures: number of features to select at random (default: keep all)
rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, predict_proba, get_classes

source

DecisionTree.DecisionTreeRegressor — Type

DecisionTreeRegressor(; pruning_purity_threshold=0.0,
                      max_depth::Int-1,
                      min_samples_leaf::Int=5,
                      min_samples_split::Int=2,
                      min_purity_increase::Float=0.0,
                      n_subfeatures::Int=0,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)

Decision tree regression. See DecisionTree.jl's documentation

Hyperparameters:

pruning_purity_threshold: (post-pruning) merge leaves having >=thresh combined purity (default: no pruning). This accuracy-based method may not be appropriate for regression tree.
max_depth: maximum depth of the decision tree (default: no maximum)
min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
min_samples_split: the minimum number of samples in needed for a split (default: 2)
min_purity_increase: minimum purity needed for a split (default: 0.0)
n_subfeatures: number of features to select at random (default: keep all)
rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, get_classes

source

Random forest models

DecisionTree.RandomForestClassifier — Type

RandomForestClassifier(; n_subfeatures::Int=-1,
                       n_trees::Int=10,
                       partial_sampling::Float=0.7,
                       max_depth::Int=-1,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)

Random forest classification. See DecisionTree.jl's documentation

Hyperparameters:

n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
n_trees: number of trees to train (default: 10)
partial_sampling: fraction of samples to train each tree on (default: 0.7)
max_depth: maximum depth of the decision trees (default: no maximum)
min_samples_leaf: the minimum number of samples each leaf needs to have
min_samples_split: the minimum number of samples in needed for a split
min_purity_increase: minimum purity needed for a split
rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int
impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, predict_proba, get_classes

source

DecisionTree.RandomForestRegressor — Type

RandomForestRegressor(; n_subfeatures::Int=-1,
                      n_trees::Int=10,
                      partial_sampling::Float=0.7,
                      max_depth::Int=-1,
                      min_samples_leaf::Int=5,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)

Random forest regression. See DecisionTree.jl's documentation

Hyperparameters:

n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
n_trees: number of trees to train (default: 10)
partial_sampling: fraction of samples to train each tree on (default: 0.7)
max_depth: maximum depth of the decision trees (default: no maximum)
min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
min_samples_split: the minimum number of samples in needed for a split
min_purity_increase: minimum purity needed for a split
rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int
impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.

Implements fit!, predict, get_classes

source

Adaptive boosting models

DecisionTree.AdaBoostStumpClassifier — Type

AdaBoostStumpClassifier(; n_iterations::Int=10,
                        rng=Random.GLOBAL_RNG)

Adaboosted decision tree stumps. See DecisionTree.jl's documentation

Hyperparameters:

n_iterations: number of iterations of AdaBoost
rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.

Implements fit!, predict, predict_proba, get_classes

source