Learning
Overview
We formalized the concept of geostatistical learning in Hoffimann et al. 2021. Geostatistical Learning: Challenges and Opportunities. The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.
We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia. Besides the models listed below, we support all models from the ScikitLearn.jl and MLJ.jl packages.
StatsLearnModels.Learn — Type
Learn(train, model, features => targets)
Fits the statistical learning model to the train table, using the selectors of features and targets.
Examples
Learn(train, model, [1, 2, 3] => "d")
Learn(train, model, [:a, :b, :c] => :d)
Learn(train, model, ["a", "b", "c"] => 4)
Learn(train, model, [1, 2, 3] => [:d, :e])
Learn(train, model, r"[abc]" => ["d", "e"])
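Below is a minimal end-to-end sketch of the transform. It assumes plain Tables.jl tables for brevity; the tables train and test, their column names, and the choice of KNNClassifier are hypothetical.
using StatsLearnModels

# hypothetical train/test tables with features :a, :b and target :c
train = (a=rand(100), b=rand(100), c=rand(["yes", "no"], 100))
test  = (a=rand(50), b=rand(50))

# fit the model to the train table using the feature/target selectors
transform = Learn(train, KNNClassifier(5), [:a, :b] => :c)

# apply the fitted transform to predict :c on the test table
pred = transform(test)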
We highly recommend the native Julia models listed below for maximum performance and reproducibility. The integration with external models from ScikitLearn.jl requires a local Python installation, which can be hard to reproduce across different machines.
For model validation, including cross-validation error estimates, please check the Validation section.
Models
Nearest neighbor models
StatsLearnModels.KNNClassifier — Type
KNNClassifier(k, metric=Euclidean(); leafsize=10, reorder=true)
K-nearest neighbor classification model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNRegressor.
StatsLearnModels.KNNRegressor — Type
KNNRegressor(k, metric=Euclidean(); leafsize=10, reorder=true)
K-nearest neighbor regression model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNClassifier.
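For illustration, a short sketch of constructing these models; the number of neighbors, the Chebyshev metric, and the leafsize value below are arbitrary choices, and any metric from Distances.jl can be used.
using StatsLearnModels, Distances

# classifier with 5 neighbors and the default Euclidean metric
classifier = KNNClassifier(5)

# regressor with 10 neighbors, a Chebyshev metric, and custom tree options
regressor = KNNRegressor(10, Chebyshev(), leafsize=25)

# either model can then be passed to the Learn transform, e.g.
# Learn(train, classifier, [:a, :b] => :c)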
Generalized linear models
StatsLearnModels.LinearRegressor — Type
LinearRegressor(; kwargs...)
Linear regression model. The kwargs are forwarded to the GLM.lm function from GLM.jl.
See also GeneralizedLinearRegressor.
StatsLearnModels.GeneralizedLinearRegressor — Type
GeneralizedLinearRegressor(dist, link; kwargs...)
Generalized linear regression model with distribution dist from Distributions.jl and link function. The kwargs are forwarded to the GLM.glm function from GLM.jl.
See also LinearRegressor.
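As a sketch, assuming Distributions.jl for the distribution and GLM.jl for the link function; the specific distributions and links below are only examples.
using StatsLearnModels, Distributions, GLM

# ordinary least squares; kwargs are forwarded to GLM.lm
ols = LinearRegressor()

# logistic regression: Bernoulli distribution with a logit link
logistic = GeneralizedLinearRegressor(Bernoulli(), LogitLink())

# Poisson regression with a log link
poisson = GeneralizedLinearRegressor(Poisson(), LogLink())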
Decision tree models
DecisionTree.DecisionTreeClassifier — Type
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
max_depth::Int=-1,
min_samples_leaf::Int=1,
min_samples_split::Int=2,
min_purity_increase::Float=0.0,
n_subfeatures::Int=0,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Decision tree classifier. See DecisionTree.jl's documentation
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >= thresh combined purity (default: no pruning)
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity increase needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
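A minimal sketch of this fit!/predict interface; the feature matrix X and labels y below are hypothetical.
using DecisionTree

# hypothetical data: 100 observations with 3 features and two classes
X = rand(100, 3)
y = rand(["a", "b"], 100)

model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=2)
fit!(model, X, y)

predict(model, rand(10, 3))        # predicted labels for new samples
predict_proba(model, rand(10, 3))  # class probabilities for new samples
In geospatial workflows, the model is typically not fitted by hand; it is passed directly to the Learn transform described in the Overview.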
DecisionTree.DecisionTreeRegressor — Type
DecisionTreeRegressor(; pruning_purity_threshold=0.0,
max_depth::Int=-1,
min_samples_leaf::Int=5,
min_samples_split::Int=2,
min_purity_increase::Float=0.0,
n_subfeatures::Int=0,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Decision tree regression. See DecisionTree.jl's documentation
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >= thresh combined purity (default: no pruning). This accuracy-based method may not be appropriate for regression trees.
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity increase needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
DecisionTree.impurity_importance — Function
impurity_importance(tree; normalize::Bool=false)
impurity_importance(forest)
impurity_importance(adaboost, coeffs)
Return a vector of feature importance calculated by Mean Decrease in Impurity (MDI).
Feature importance is computed as follows:
- Single tree: For each feature, the associated importance is the sum, over all splits based on that feature, of the impurity decreases for that split (the node impurity minus the sum of the child impurities) divided by the total number of training observations. When normalize is true, the feature importances are normalized by the sum of feature importances.
More explicitly, the impurity decrease for node i is:
Δimpurityᵢ = nᵢ × lossᵢ − nₗ × lossₗ − nᵣ × lossᵣ
where n is the number of observations, loss is the impurity measure (entropy, Gini index, or another), subscript i refers to node i, subscript l to its left child, and subscript r to its right child.
- Forests: The importance for a given feature is the average over trees in the forest of the normalized tree importances for that feature.
- AdaBoost models: The feature importance is the same as split_importance.
For forests and AdaBoost models, feature importance is normalized before averaging over trees, so the keyword argument normalize has no effect. Whether to normalize or not is controversial, but the current implementation is identical to scikit-learn's RandomForestClassifier, RandomForestRegressor, and AdaBoostClassifier, which differs from the feature importances described in G. Louppe, "Understanding Random Forests: From Theory to Practice", PhD Thesis, U. of Liege, 2014 (https://arxiv.org/abs/1407.7502). See this PR for a detailed discussion.
If impurity_importance was set to false when building the tree, this function returns an empty vector.
Warning: The importance might be misleading because MDI is a biased method. See Beware Default Random Forest Importances for more discussion.
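A minimal sketch using DecisionTree.jl's native build_tree/build_forest API; the data X and y below are hypothetical.
using DecisionTree

# hypothetical data: 100 observations, 3 features, two classes
X = rand(100, 3)
y = rand(["a", "b"], 100)

tree = build_tree(y, X)          # importances are stored unless impurity_importance=false
impurity_importance(tree)        # one MDI value per feature

forest = build_forest(y, X)
impurity_importance(forest)      # average of normalized per-tree importances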
Random forest models
DecisionTree.RandomForestClassifier — Type
RandomForestClassifier(; n_subfeatures::Int=-1,
n_trees::Int=10,
partial_sampling::Float=0.7,
max_depth::Int=-1,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Random forest classification. See DecisionTree.jl's documentation
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity increase needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
DecisionTree.RandomForestRegressor — Type
RandomForestRegressor(; n_subfeatures::Int=-1,
n_trees::Int=10,
partial_sampling::Float=0.7,
max_depth::Int=-1,
min_samples_leaf::Int=5,
rng=Random.GLOBAL_RNG,
impurity_importance::Bool=true)
Random forest regression. See DecisionTree.jl's documentation
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity increase needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
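As a sketch of how a random forest plugs into the Learn transform from the Overview; the tables train and test, their column names, and the hyperparameter values below are hypothetical.
using StatsLearnModels, DecisionTree

# hypothetical train/test tables with features :a, :b and numeric target :c
train = (a=rand(200), b=rand(200), c=rand(200))
test  = (a=rand(50), b=rand(50))

# seed with an Int for reproducible forests
forest = RandomForestRegressor(n_trees=50, partial_sampling=0.7, rng=42)

transform = Learn(train, forest, [:a, :b] => :c)
pred = transform(test)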
Adaptive boosting models
DecisionTree.AdaBoostStumpClassifier — Type
AdaBoostStumpClassifier(; n_iterations::Int=10,
rng=Random.GLOBAL_RNG)
Adaboosted decision tree stumps. See DecisionTree.jl's documentation
Hyperparameters:
- n_iterations: number of iterations of AdaBoost
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
Implements fit!, predict, predict_proba, get_classes
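A brief sketch of the fit!/predict interface for boosted stumps; the data X and y below are hypothetical.
using DecisionTree

X = rand(100, 3)
y = rand(["a", "b"], 100)

model = AdaBoostStumpClassifier(n_iterations=20, rng=1234)
fit!(model, X, y)

get_classes(model)                # class labels seen during training
predict(model, rand(5, 3))        # predicted labels for new samples
predict_proba(model, rand(5, 3))  # per-class probabilities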