Learning
Overview
We formalized the concept of geostatistical learning in Hoffimann et al. 2021, *Geostatistical Learning: Challenges and Opportunities*. The main difference compared to classical learning theory lies in the underlying assumptions used to derive learning models.
We provide the Learn transform for supervised learning with geospatial data, and support various learning models written in native Julia:
StatsLearnModels.Learn — Type
```julia
Learn(table; [model])
```

Perform supervised learning with labeled `table` and statistical learning `model`.

Uses the `KNNClassifier(1)` or `KNNRegressor(1)` model by default, depending on the scientific type of the labels stored in the table.
Examples
```julia
Learn(label(table, "y"))
Learn(label(table, ["y1", "y2"]))
Learn(label(table, 3), model=KNNClassifier(5))
```

See also `label`.
The transform takes a labeled table as input:
StatsLearnModels.label — Function
```julia
label(table, names)
```

Creates a `LabeledTable` from `table` using `names` as label columns.
For model validation, including cross-validation error estimates, please check the Validation section.
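Before moving on to the models, a minimal end-to-end sketch may help. The geotable construction, column names, and pipe application below are hypothetical, assuming the transform follows the usual TableTransforms.jl syntax (`table |> transform`):

```julia
using GeoStats

# hypothetical labeled training geotable and unlabeled target geotable
train  = georef((x=rand(100), crop=rand(["soy", "corn"], 100)), CartesianGrid(10, 10))
target = georef((x=rand(100),), CartesianGrid(10, 10))

# learn "crop" from the labeled training table and predict it on the target
pred = target |> Learn(label(train, "crop"), model=KNNClassifier(5))
```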
Models
Nearest neighbor models
StatsLearnModels.KNNClassifier — Type
```julia
KNNClassifier(k, metric=Euclidean(); leafsize=10, reorder=true)
```

K-nearest neighbor classification model with `k` neighbors and `metric` from Distances.jl. Optionally, specify the `leafsize` and `reorder` options for the underlying trees in NearestNeighbors.jl.
See also KNNRegressor.
StatsLearnModels.KNNRegressor — Type
```julia
KNNRegressor(k, metric=Euclidean(); leafsize=10, reorder=true)
```

K-nearest neighbor regression model with `k` neighbors and `metric` from Distances.jl. Optionally, specify the `leafsize` and `reorder` options for the underlying trees in NearestNeighbors.jl.
See also KNNClassifier.
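These models follow the `fit`/`predict` interface of StatsLearnModels.jl. A short sketch with made-up tables (any Tables.jl-compatible source works, including NamedTuples of vectors; column names are hypothetical):

```julia
using StatsLearnModels
using Distances

# hypothetical feature table and continuous labels
input  = (a = rand(100), b = rand(100))
output = (y = rand(100),)

# fit a 5-nearest-neighbor regressor with the default Euclidean metric
fmodel = fit(KNNRegressor(5, Euclidean()), input, output)

# predictions are returned as a table with the same label column
pred = predict(fmodel, input)
```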
Generalized linear models
StatsLearnModels.LinearRegressor — Type
```julia
LinearRegressor(; kwargs...)
```

Linear regression model.

The `kwargs` are forwarded to the `GLM.lm` function from GLM.jl.
See also GeneralizedLinearRegressor.
StatsLearnModels.GeneralizedLinearRegressor — Type
```julia
GeneralizedLinearRegressor(dist, link; kwargs...)
```

Generalized linear regression model with distribution `dist` from Distributions.jl and `link` function.

The `kwargs` are forwarded to the `GLM.glm` function from GLM.jl.
See also LinearRegressor.
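As an illustration, a Poisson regression with a log link on synthetic count data (the `Poisson` distribution comes from Distributions.jl and `LogLink` from GLM.jl; the data and column names are made up):

```julia
using StatsLearnModels
using Distributions, GLM

# synthetic counts drawn from a log-linear model
x = rand(100)
input  = (x = x,)
output = (y = [rand(Poisson(exp(1 + 2xi))) for xi in x],)

# generalized linear regression with Poisson distribution and log link
fmodel = fit(GeneralizedLinearRegressor(Poisson(), LogLink()), input, output)
pred   = predict(fmodel, input)
```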
Decision tree models
DecisionTree.DecisionTreeClassifier — Type
```julia
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)
```

Decision tree classifier. See DecisionTree.jl's documentation.

Hyperparameters:

- `pruning_purity_threshold`: (post-pruning) merge leaves having `>=thresh` combined purity (default: no pruning)
- `max_depth`: maximum depth of the decision tree (default: no maximum)
- `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 1)
- `min_samples_split`: the minimum number of samples needed for a split (default: 2)
- `min_purity_increase`: minimum purity increase needed for a split (default: 0.0)
- `n_subfeatures`: number of features to select at random (default: keep all)
- `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.
- `impurity_importance`: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See `DecisionTree.impurity_importance`.
Implements fit!, predict, predict_proba, get_classes
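DecisionTree.jl's scikit-learn-style API can also be used directly. A sketch on toy data:

```julia
using DecisionTree

# toy features and labels
features = rand(100, 2)
labels   = [x + y > 1 ? "high" : "low" for (x, y) in eachrow(features)]

# shallow tree to avoid overfitting the toy data
model = DecisionTreeClassifier(max_depth=3)
fit!(model, features, labels)

pred  = predict(model, features)        # predicted class labels
proba = predict_proba(model, features)  # one probability column per class in get_classes(model)
```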
DecisionTree.DecisionTreeRegressor — Type
```julia
DecisionTreeRegressor(; pruning_purity_threshold=0.0,
                      max_depth::Int=-1,
                      min_samples_leaf::Int=5,
                      min_samples_split::Int=2,
                      min_purity_increase::Float=0.0,
                      n_subfeatures::Int=0,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)
```

Decision tree regression. See DecisionTree.jl's documentation.

Hyperparameters:

- `pruning_purity_threshold`: (post-pruning) merge leaves having `>=thresh` combined purity (default: no pruning). This accuracy-based method may not be appropriate for regression trees.
- `max_depth`: maximum depth of the decision tree (default: no maximum)
- `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 5)
- `min_samples_split`: the minimum number of samples needed for a split (default: 2)
- `min_purity_increase`: minimum purity increase needed for a split (default: 0.0)
- `n_subfeatures`: number of features to select at random (default: keep all)
- `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.
- `impurity_importance`: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See `DecisionTree.impurity_importance`.
Implements fit!, predict, get_classes
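The regressor works the same way, but with continuous targets; note its larger `min_samples_leaf` default of 5. A sketch on toy data:

```julia
using DecisionTree

# toy regression data: target depends on the first feature plus noise
features = rand(100, 2)
targets  = features[:, 1] .+ 0.1 .* randn(100)

model = DecisionTreeRegressor(min_samples_leaf=5)
fit!(model, features, targets)
pred = predict(model, features)
```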
Random forest models
DecisionTree.RandomForestClassifier — Type
```julia
RandomForestClassifier(; n_subfeatures::Int=-1,
                       n_trees::Int=10,
                       partial_sampling::Float=0.7,
                       max_depth::Int=-1,
                       rng=Random.GLOBAL_RNG,
                       impurity_importance::Bool=true)
```

Random forest classification. See DecisionTree.jl's documentation.

Hyperparameters:

- `n_subfeatures`: number of features to consider at random per split (default: -1, sqrt(# features))
- `n_trees`: number of trees to train (default: 10)
- `partial_sampling`: fraction of samples to train each tree on (default: 0.7)
- `max_depth`: maximum depth of the decision trees (default: no maximum)
- `min_samples_leaf`: the minimum number of samples each leaf needs to have
- `min_samples_split`: the minimum number of samples needed for a split
- `min_purity_increase`: minimum purity increase needed for a split
- `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an `Int`.
- `impurity_importance`: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See `DecisionTree.impurity_importance`.
Implements fit!, predict, predict_proba, get_classes
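A short sketch on toy data; seeding `rng` with an `Int` keeps multi-threaded training reproducible:

```julia
using DecisionTree

# toy classification data
features = rand(200, 4)
labels   = [sum(row) > 2 ? "a" : "b" for row in eachrow(features)]

# 50 trees, each trained on 70% of the samples
model = RandomForestClassifier(n_trees=50, partial_sampling=0.7, rng=42)
fit!(model, features, labels)
pred = predict(model, features)
```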
DecisionTree.RandomForestRegressor — Type
```julia
RandomForestRegressor(; n_subfeatures::Int=-1,
                      n_trees::Int=10,
                      partial_sampling::Float=0.7,
                      max_depth::Int=-1,
                      min_samples_leaf::Int=5,
                      rng=Random.GLOBAL_RNG,
                      impurity_importance::Bool=true)
```

Random forest regression. See DecisionTree.jl's documentation.

Hyperparameters:

- `n_subfeatures`: number of features to consider at random per split (default: -1, sqrt(# features))
- `n_trees`: number of trees to train (default: 10)
- `partial_sampling`: fraction of samples to train each tree on (default: 0.7)
- `max_depth`: maximum depth of the decision trees (default: no maximum)
- `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 5)
- `min_samples_split`: the minimum number of samples needed for a split
- `min_purity_increase`: minimum purity increase needed for a split
- `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an `Int`.
- `impurity_importance`: whether to calculate feature importances using Mean Decrease in Impurity (MDI). See `DecisionTree.impurity_importance`.
Implements fit!, predict, get_classes
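Analogously for regression, here restricting each split to 2 random features out of 3 (toy data, made-up coefficients):

```julia
using DecisionTree

# toy linear data with noise
features = rand(200, 3)
targets  = features * [1.0, -2.0, 0.5] .+ 0.05 .* randn(200)

model = RandomForestRegressor(n_trees=100, n_subfeatures=2)
fit!(model, features, targets)
pred = predict(model, features)
```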
Adaptive boosting models
DecisionTree.AdaBoostStumpClassifier — Type
```julia
AdaBoostStumpClassifier(; n_iterations::Int=10,
                        rng=Random.GLOBAL_RNG)
```

Adaboosted decision tree stumps. See DecisionTree.jl's documentation.

Hyperparameters:

- `n_iterations`: number of iterations of AdaBoost
- `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.
Implements fit!, predict, predict_proba, get_classes
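A short sketch on toy data (features, labels, and iteration count are made up):

```julia
using DecisionTree

# toy data: class depends on the first feature only
features = rand(150, 2)
labels   = [x > 0.5 ? "pos" : "neg" for x in features[:, 1]]

# 20 boosting iterations over decision stumps
model = AdaBoostStumpClassifier(n_iterations=20)
fit!(model, features, labels)
pred = predict(model, features)
```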