Learning

Overview

We formalized the concept of geostatistical learning in Hoffimann et al. (2021), "Geostatistical Learning: Challenges and Opportunities". The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.

We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia:

StatsLearnModels.Learn (Type)
Learn(table; [model])

Perform supervised learning with labeled table and statistical learning model.

Uses the KNNClassifier(1) or KNNRegressor(1) model by default, depending on the scientific type of the labels stored in the table.

Examples

Learn(label(table, "y"))
Learn(label(table, ["y1", "y2"]))
Learn(label(table, 3), model=KNNClassifier(5))

See also label.


The transform takes a labeled table as input, i.e. a table wrapped with the label function to mark the target column(s).
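A minimal usage sketch, assuming hypothetical geotables train and test that share the same feature columns, with the target stored in column "y" of train, and the usual transform pipe syntax:

# fit the default model to the labeled table
learn = Learn(label(train, "y"))

# predict the label column on unseen data
pred = test |> learn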

For model validation, including cross-validation error estimates, please check the Validation section.

Models

Nearest neighbor models

Generalized linear models

Decision tree models

DecisionTree.DecisionTreeClassifier (Type)
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float64=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)

Decision tree classifier. See DecisionTree.jl's documentation.

Hyperparameters:

  • pruning_purity_threshold: (post-pruning) merge leaves having combined purity >= this threshold (default: no pruning)
  • max_depth: maximum depth of the decision tree (default: no maximum)
  • min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
  • min_samples_split: the minimum number of samples needed for a split (default: 2)
  • min_purity_increase: minimum purity needed for a split (default: 0.0)
  • n_subfeatures: number of features to select at random (default: keep all)
  • rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
  • impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, predict_proba, get_classes

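For example, a usage sketch with the Learn transform, where the geotable train and the label column "class" are hypothetical names:

Learn(label(train, "class"), model=DecisionTreeClassifier(max_depth=5, min_samples_leaf=3))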
DecisionTree.DecisionTreeRegressor (Type)
DecisionTreeRegressor(; pruning_purity_threshold=0.0,
                      max_depth::Int=-1,
                      min_samples_leaf::Int=5,
                      min_samples_split::Int=2,
                      min_purity_increase::Float64=0.0,
                      n_subfeatures::Int=0,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)

Decision tree regression. See DecisionTree.jl's documentation.

Hyperparameters:

  • pruning_purity_threshold: (post-pruning) merge leaves having combined purity >= this threshold (default: no pruning). This accuracy-based method may not be appropriate for regression trees.
  • max_depth: maximum depth of the decision tree (default: no maximum)
  • min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
  • min_samples_split: the minimum number of samples needed for a split (default: 2)
  • min_purity_increase: minimum purity needed for a split (default: 0.0)
  • n_subfeatures: number of features to select at random (default: keep all)
  • rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
  • impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, get_classes

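For example, a sketch assuming a hypothetical geotable train with a continuous label column "z":

Learn(label(train, "z"), model=DecisionTreeRegressor(min_samples_leaf=10))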

Random forest models

DecisionTree.RandomForestClassifier (Type)
RandomForestClassifier(; n_subfeatures::Int=-1,
                       n_trees::Int=10,
                       partial_sampling::Float64=0.7,
                       max_depth::Int=-1,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)

Random forest classification. See DecisionTree.jl's documentation.

Hyperparameters:

  • n_subfeatures: number of features to consider at random per split (default: -1, which uses sqrt(number of features))
  • n_trees: number of trees to train (default: 10)
  • partial_sampling: fraction of samples to train each tree on (default: 0.7)
  • max_depth: maximum depth of the decision trees (default: no maximum)
  • min_samples_leaf: the minimum number of samples each leaf needs to have
  • min_samples_split: the minimum number of samples needed for a split
  • min_purity_increase: minimum purity needed for a split
  • rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
  • impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance

Implements fit!, predict, predict_proba, get_classes

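For example, a sketch with the same hypothetical train and "class" names; the Int passed as rng seeds the forest, as required for multi-threaded training:

Learn(label(train, "class"), model=RandomForestClassifier(n_trees=100, rng=42))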
DecisionTree.RandomForestRegressor (Type)
RandomForestRegressor(; n_subfeatures::Int=-1,
                      n_trees::Int=10,
                      partial_sampling::Float64=0.7,
                      max_depth::Int=-1,
                      min_samples_leaf::Int=5,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)

Random forest regression. See DecisionTree.jl's documentation.

Hyperparameters:

  • n_subfeatures: number of features to consider at random per split (default: -1, which uses sqrt(number of features))
  • n_trees: number of trees to train (default: 10)
  • partial_sampling: fraction of samples to train each tree on (default: 0.7)
  • max_depth: maximum depth of the decision trees (default: no maximum)
  • min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
  • min_samples_split: the minimum number of samples needed for a split
  • min_purity_increase: minimum purity needed for a split
  • rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
  • impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.

Implements fit!, predict, get_classes

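For example, a sketch assuming the same hypothetical train and "z" names as above:

Learn(label(train, "z"), model=RandomForestRegressor(n_trees=50, partial_sampling=0.8))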

Adaptive boosting models

DecisionTree.AdaBoostStumpClassifier (Type)
AdaBoostStumpClassifier(; n_iterations::Int=10,
                        rng=Random.GLOBAL_RNG)

Boosted decision tree stumps (AdaBoost). See DecisionTree.jl's documentation.

Hyperparameters:

  • n_iterations: number of iterations of AdaBoost
  • rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.

Implements fit!, predict, predict_proba, get_classes

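For example, a sketch with the hypothetical train and "class" names used above:

Learn(label(train, "class"), model=AdaBoostStumpClassifier(n_iterations=20))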