Learning
Overview
We formalized the concept of geostatistical learning in Hoffimann et al. 2021. Geostatistical Learning: Challenges and Opportunities. The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.
We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia. Besides the models listed below, we support all models from the ScikitLearn.jl and MLJ.jl packages.
StatsLearnModels.Learn — Type
Learn(train, model, features => targets)
Fits the statistical learning model to the train table, using the selectors of features and targets.
Examples
Learn(train, model, [1, 2, 3] => "d")
Learn(train, model, [:a, :b, :c] => :d)
Learn(train, model, ["a", "b", "c"] => 4)
Learn(train, model, [1, 2, 3] => [:d, :e])
Learn(train, model, r"[abc]" => ["d", "e"])
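Below is a minimal end-to-end sketch of the transform. It assumes plain Tables.jl tables for brevity; the tables train and test, their column names, and the choice of KNNClassifier are hypothetical.
using StatsLearnModels

# hypothetical train/test tables with features :a, :b and target :c
train = (a=rand(100), b=rand(100), c=rand(["yes", "no"], 100))
test  = (a=rand(50), b=rand(50))

# fit the model to the train table using the feature/target selectors
transform = Learn(train, KNNClassifier(5), [:a, :b] => :c)

# apply the fitted transform to predict :c on the test table
pred = transform(test)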
We highly recommend the native Julia models listed below for maximum performance and reproducibility. The integration with external models from ScikitLearn.jl requires a local Python installation, which can be hard to reproduce across different machines.
For model validation, including cross-validation error estimates, please check the Validation section.
Models
Nearest neighbor models
StatsLearnModels.KNNClassifier — Type
KNNClassifier(k, metric=Euclidean(); leafsize=10, reorder=true)
K-nearest neighbor classification model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNRegressor.
StatsLearnModels.KNNRegressor — Type
KNNRegressor(k, metric=Euclidean(); leafsize=10, reorder=true)
K-nearest neighbor regression model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNClassifier.
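For illustration, a short sketch of constructing these models; the number of neighbors, the Chebyshev metric, and the leafsize value below are arbitrary choices, and any metric from Distances.jl can be used.
using StatsLearnModels, Distances

# classifier with 5 neighbors and the default Euclidean metric
classifier = KNNClassifier(5)

# regressor with 10 neighbors, a Chebyshev metric, and custom tree options
regressor = KNNRegressor(10, Chebyshev(), leafsize=25)

# either model can then be passed to the Learn transform, e.g.
# Learn(train, classifier, [:a, :b] => :c)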
Generalized linear models
StatsLearnModels.LinearRegressor — Type
LinearRegressor(; kwargs...)
Linear regression model. The kwargs are forwarded to the GLM.lm function from GLM.jl.
See also GeneralizedLinearRegressor.
StatsLearnModels.GeneralizedLinearRegressor — Type
GeneralizedLinearRegressor(dist, link; kwargs...)
Generalized linear regression model with distribution dist from Distributions.jl and link function. The kwargs are forwarded to the GLM.glm function from GLM.jl.
See also LinearRegressor.
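As a sketch, assuming Distributions.jl for the distribution and GLM.jl for the link function; the specific distributions and links below are only examples.
using StatsLearnModels, Distributions, GLM

# ordinary least squares; kwargs are forwarded to GLM.lm
ols = LinearRegressor()

# logistic regression: Bernoulli distribution with a logit link
logistic = GeneralizedLinearRegressor(Bernoulli(), LogitLink())

# Poisson regression with a log link
poisson = GeneralizedLinearRegressor(Poisson(), LogLink())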
Decision tree models
DecisionTree.DecisionTreeClassifier — Type
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
max_depth::Int=-1,
min_samples_leaf::Int=1,
min_samples_split::Int=2,
min_purity_increase::Float=0.0,
n_subfeatures::Int=0,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Decision tree classifier. See DecisionTree.jl's documentation
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >= thresh combined purity (default: no pruning)
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity increase needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
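A minimal sketch of this fit!/predict interface; the feature matrix X and labels y below are hypothetical.
using DecisionTree

# hypothetical data: 100 observations with 3 features and two classes
X = rand(100, 3)
y = rand(["a", "b"], 100)

model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=2)
fit!(model, X, y)

predict(model, rand(10, 3))        # predicted labels for new samples
predict_proba(model, rand(10, 3))  # class probabilities for new samples
In geospatial workflows, the model is typically not fitted by hand; it is passed directly to the Learn transform described in the Overview.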
DecisionTree.DecisionTreeRegressor — Type
DecisionTreeRegressor(; pruning_purity_threshold=0.0,
max_depth::Int=-1,
min_samples_leaf::Int=5,
min_samples_split::Int=2,
min_purity_increase::Float=0.0,
n_subfeatures::Int=0,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Decision tree regression. See DecisionTree.jl's documentation
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >= thresh combined purity (default: no pruning). This accuracy-based method may not be appropriate for regression trees.
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity increase needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
DecisionTree.impurity_importance — Function
impurity_importance(tree; normalize::Bool=false)
impurity_importance(forest)
impurity_importance(adaboost, coeffs)
Return a vector of feature importance calculated by Mean Decrease in Impurity (MDI).
Feature importance is computed as follows:
- Single tree: For each feature, the associated importance is the sum, over all splits based on that feature, of the impurity decreases for that split (the node impurity minus the sum of the child impurities) divided by the total number of training observations. When normalize is true, the feature importances are normalized by the sum of feature importances.
More explicitly, the impurity decrease for node i is:
Δimpurityᵢ = nᵢ × lossᵢ − nₗ × lossₗ − nᵣ × lossᵣ
where n is the number of observations, loss is the impurity measure (entropy, Gini index, or another), subscript i refers to node i, subscript l to its left child, and subscript r to its right child.
- Forests: The importance for a given feature is the average over trees in the forest of the normalized tree importances for that feature.
- AdaBoost models: The feature importance is the same as split_importance.
For forests and AdaBoost models, feature importance is normalized before averaging over trees, so the keyword argument normalize has no effect. Whether to normalize or not is controversial, but the current implementation is identical to scikit-learn's RandomForestClassifier, RandomForestRegressor, and AdaBoostClassifier, which differs from the feature importances described in G. Louppe, "Understanding Random Forests: From Theory to Practice", PhD Thesis, U. of Liege, 2014 (https://arxiv.org/abs/1407.7502). See this PR for a detailed discussion.
If impurity_importance was set to false when building the tree, this function returns an empty vector.
Warning: The importance might be misleading because MDI is a biased method. See Beware Default Random Forest Importances for more discussion.
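A minimal sketch using DecisionTree.jl's native build_tree/build_forest API; the data X and y below are hypothetical.
using DecisionTree

# hypothetical data: 100 observations, 3 features, two classes
X = rand(100, 3)
y = rand(["a", "b"], 100)

tree = build_tree(y, X)          # importances are stored unless impurity_importance=false
impurity_importance(tree)        # one MDI value per feature

forest = build_forest(y, X)
impurity_importance(forest)      # average of normalized per-tree importances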
Random forest models
DecisionTree.RandomForestClassifier — Type
RandomForestClassifier(; n_subfeatures::Int=-1,
n_trees::Int=10,
partial_sampling::Float=0.7,
max_depth::Int=-1,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Random forest classification. See DecisionTree.jl's documentation
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity increase needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
DecisionTree.RandomForestRegressor — Type
RandomForestRegressor(; n_subfeatures::Int=-1,
n_trees::Int=10,
partial_sampling::Float=0.7,
max_depth::Int=-1,
min_samples_leaf::Int=5,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Random forest regression. See DecisionTree.jl's documentation
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity increase needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
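As a sketch of how a random forest plugs into the Learn transform from the Overview; the tables train and test, their column names, and the hyperparameter values below are hypothetical.
using StatsLearnModels, DecisionTree

# hypothetical train/test tables with features :a, :b and numeric target :c
train = (a=rand(200), b=rand(200), c=rand(200))
test  = (a=rand(50), b=rand(50))

# seed with an Int for reproducible forests
forest = RandomForestRegressor(n_trees=50, partial_sampling=0.7, rng=42)

transform = Learn(train, forest, [:a, :b] => :c)
pred = transform(test)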
Adaptive boosting models
DecisionTree.AdaBoostStumpClassifier — Type
AdaBoostStumpClassifier(; n_iterations::Int=10,
rng=Random.GLOBAL_RNG)
Adaboosted decision tree stumps. See DecisionTree.jl's documentation
Hyperparameters:
- n_iterations: number of iterations of AdaBoost
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
Implements fit!, predict, predict_proba, get_classes
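A brief sketch of the fit!/predict interface for boosted stumps; the data X and y below are hypothetical.
using DecisionTree

X = rand(100, 3)
y = rand(["a", "b"], 100)

model = AdaBoostStumpClassifier(n_iterations=20, rng=1234)
fit!(model, X, y)

get_classes(model)                # class labels seen during training
predict(model, rand(5, 3))        # predicted labels for new samples
predict_proba(model, rand(5, 3))  # per-class probabilities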