Learning
Overview
We formalized the concept of geostatistical learning in Hoffimann et al. 2021. Geostatistical Learning: Challenges and Opportunities. The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.
We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia:
StatsLearnModels.Learn — Type

Learn(table; [model])

Perform supervised learning with labeled table and statistical learning model.
Uses a KNNClassifier(1) or KNNRegressor(1) model by default, depending on the scientific type of the labels stored in the table.
Examples
Learn(label(table, "y"))
Learn(label(table, ["y1", "y2"]))
Learn(label(table, 3), model=KNNClassifier(5))

See also label.
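For instance, here is a minimal end-to-end sketch, assuming that the fitted transform can be applied to a table of features with the pipe operator, as with other TableTransforms.jl transforms (train and test are hypothetical tables):

using StatsLearnModels

# hypothetical training and test tables with features x1, x2 and label y
train = (x1=rand(100), x2=rand(100), y=rand(100))
test  = (x1=rand(20), x2=rand(20))

learn = Learn(label(train, "y"), model=KNNRegressor(3))  # fit on the labeled table
pred  = test |> learn                                    # predict labels for test features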
The transform takes a labeled table as input:
StatsLearnModels.label — Function

label(table, names)

Create a LabeledTable from table using names as label columns.
For model validation, including cross-validation error estimates, please check the Validation section.
Models
Nearest neighbor models
StatsLearnModels.KNNClassifier — Type

KNNClassifier(k, metric=Euclidean(); leafsize=10, reorder=true)

K-nearest neighbor classification model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNRegressor.
StatsLearnModels.KNNRegressor — Type

KNNRegressor(k, metric=Euclidean(); leafsize=10, reorder=true)

K-nearest neighbor regression model with k neighbors and metric from Distances.jl. Optionally, specify the leafsize and reorder options for the underlying trees in NearestNeighbors.jl.
See also KNNClassifier.
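For example, a sketch of possible constructions, assuming Cityblock is loaded from Distances.jl:

using StatsLearnModels, Distances

KNNClassifier(5)                           # 5 neighbors with the default Euclidean metric
KNNRegressor(3, Cityblock(), leafsize=25)  # custom metric and custom tree leaf size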
Generalized linear models
StatsLearnModels.LinearRegressor — Type

LinearRegressor(; kwargs...)

Linear regression model.
The kwargs are forwarded to the GLM.lm function from GLM.jl.
See also GeneralizedLinearRegressor.
StatsLearnModels.GeneralizedLinearRegressor — Type

GeneralizedLinearRegressor(dist, link; kwargs...)

Generalized linear regression model with distribution dist from Distributions.jl and link function.
The kwargs are forwarded to the GLM.glm function from GLM.jl.
See also LinearRegressor.
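For example, a sketch of a Poisson regression for count data, assuming the Poisson distribution from Distributions.jl and the LogLink function from GLM.jl:

using StatsLearnModels, Distributions, GLM

LinearRegressor()                                 # ordinary least squares
GeneralizedLinearRegressor(Poisson(), LogLink())  # Poisson regression with log link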
Decision tree models
DecisionTree.DecisionTreeClassifier — Type

DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                         max_depth::Int=-1,
                         min_samples_leaf::Int=1,
                         min_samples_split::Int=2,
                         min_purity_increase::Float=0.0,
                         n_subfeatures::Int=0,
                         rng=Random.GLOBAL_RNG,
                         impurity_importance::Bool=true)

Decision tree classifier. See DecisionTree.jl's documentation.
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >=thresh combined purity (default: no pruning)
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
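For example, a minimal sketch with DecisionTree.jl's fit!/predict interface, assuming a feature matrix and a label vector:

using DecisionTree

features = rand(100, 3)            # 100 samples with 3 features each
labels   = rand(["a", "b"], 100)   # categorical labels

model = DecisionTreeClassifier(max_depth=3)
fit!(model, features, labels)      # train the tree in place
predict(model, rand(10, 3))        # predicted labels for new samples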
DecisionTree.DecisionTreeRegressor — Type

DecisionTreeRegressor(; pruning_purity_threshold=0.0,
                        max_depth::Int=-1,
                        min_samples_leaf::Int=5,
                        min_samples_split::Int=2,
                        min_purity_increase::Float=0.0,
                        n_subfeatures::Int=0,
                        rng=Random.GLOBAL_RNG,
                        impurity_importance::Bool=true)

Decision tree regression. See DecisionTree.jl's documentation.
Hyperparameters:
- pruning_purity_threshold: (post-pruning) merge leaves having >=thresh combined purity (default: no pruning). This accuracy-based method may not be appropriate for regression trees.
- max_depth: maximum depth of the decision tree (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split (default: 2)
- min_purity_increase: minimum purity needed for a split (default: 0.0)
- n_subfeatures: number of features to select at random (default: keep all)
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
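A regression counterpart of the previous sketch, additionally querying MDI feature importances (assuming impurity_importance accepts the fitted model, as the docstring suggests):

using DecisionTree

X = rand(100, 3)
y = 2 .* X[:, 1] .+ 0.1 .* randn(100)   # continuous target

model = DecisionTreeRegressor(min_samples_leaf=5)
fit!(model, X, y)
predict(model, rand(10, 3))             # predicted values for new samples
impurity_importance(model)              # MDI importance of each feature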
Random forest models
DecisionTree.RandomForestClassifier — Type

RandomForestClassifier(; n_subfeatures::Int=-1,
                         n_trees::Int=10,
                         partial_sampling::Float=0.7,
                         max_depth::Int=-1,
                         rng=Random.GLOBAL_RNG,
                         impurity_importance::Bool=true)

Random forest classification. See DecisionTree.jl's documentation.
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, predict_proba, get_classes
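Any of these models can be passed to the Learn transform. For example, a hypothetical classification setup with a labeled table named table:

using StatsLearnModels

# assuming table has a categorical label column "y"
Learn(label(table, "y"), model=RandomForestClassifier(n_trees=100))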
DecisionTree.RandomForestRegressor — Type

RandomForestRegressor(; n_subfeatures::Int=-1,
                        n_trees::Int=10,
                        partial_sampling::Float=0.7,
                        max_depth::Int=-1,
                        min_samples_leaf::Int=5,
                        rng=Random.GLOBAL_RNG,
                        impurity_importance::Bool=true)

Random forest regression. See DecisionTree.jl's documentation.
Hyperparameters:
- n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
- n_trees: number of trees to train (default: 10)
- partial_sampling: fraction of samples to train each tree on (default: 0.7)
- max_depth: maximum depth of the decision trees (default: no maximum)
- min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
- min_samples_split: the minimum number of samples needed for a split
- min_purity_increase: minimum purity needed for a split
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an Int.
- impurity_importance: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See DecisionTree.impurity_importance.
Implements fit!, predict, get_classes
Adaptive boosting models
DecisionTree.AdaBoostStumpClassifier — Type

AdaBoostStumpClassifier(; n_iterations::Int=10,
                          rng=Random.GLOBAL_RNG)

Adaboosted decision tree stumps. See DecisionTree.jl's documentation.
Hyperparameters:
- n_iterations: number of iterations of AdaBoost
- rng: the random number generator to use. Can be an Int, which will be used to seed and create a new random number generator.
Implements fit!, predict, predict_proba, get_classes
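For example, a sketch of boosted stumps returning class probabilities, assuming random features and labels:

using DecisionTree

X = rand(100, 2)
y = rand(["yes", "no"], 100)

model = AdaBoostStumpClassifier(n_iterations=20)
fit!(model, X, y)
predict_proba(model, rand(5, 2))   # class probabilities for new samples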