\n",
"
Caution!
\n",
"
For the sake of clarity, no cross-validation will be used to estimate the\n",
- "testing error. We are only showing the effect of the parameters\n",
- "on the validation set of what should be the inner cross-validation.
\n",
+ "variability of the testing error. We are only showing the effect of the\n",
+ "parameters on the validation set of what should be the inner loop of a nested\n",
+ "cross-validation.\n",
"
\n",
"\n",
- "## Random forest\n",
- "\n",
- "The main parameter to tune for random forest is the `n_estimators` parameter.\n",
- "In general, the more trees in the forest, the better the generalization\n",
- "performance will be. However, it will slow down the fitting and prediction\n",
- "time. The goal is to balance computing time and generalization performance when\n",
- "setting the number of estimators when putting such learner in production.\n",
- "\n",
- "Then, we could also tune a parameter that controls the depth of each tree in\n",
- "the forest. Two parameters are important for this: `max_depth` and\n",
- "`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
- "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n",
- "`max_leaf_nodes` does not impose such constraint.\n",
- "\n",
- "Be aware that with random forest, trees are generally deep since we are\n",
- "seeking to overfit each tree on each bootstrap sample because this will be\n",
- "mitigated by combining them altogether. Assembling underfitted trees (i.e.\n",
- "shallow trees) might also lead to an underfitted forest."
+ "We will start by loading the california housing dataset."
]
},
{
@@ -56,6 +40,71 @@
" data, target, random_state=0)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Random forest\n",
+ "\n",
+ "The main parameter to select in random forest is the `n_estimators` parameter.\n",
+ "In general, the more trees in the forest, the better the generalization\n",
+ "performance will be. However, it will slow down the fitting and prediction\n",
+ "time. The goal is to balance computing time and generalization performance\n",
+ "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n",
+ "is already the default value.\n",
+ "\n",
+ "\n",
+ "
Caution!
\n",
+ "
Tuning the n_estimators for random forests generally result in a waste of\n",
+ "computer power. We just need to ensure that it is large enough so that doubling\n",
+ "its value does not lead to a significant improvement of the validation error.
\n",
+ "
\n",
+ "\n",
+ "Instead, we can tune the hyperparameter `max_features`, which controls the\n",
+ "size of the random subset of features to consider when looking for the best\n",
+ "split when growing the trees: smaller values for `max_features` will lead to\n",
+ "more random trees with hopefully more uncorrelated prediction errors. However\n",
+ "if `max_features` is too small, predictions can be too random, even after\n",
+ "averaging with the trees in the ensemble.\n",
+ "\n",
+ "If `max_features` is set to `None`, then this is equivalent to setting\n",
+ "`max_features=n_features` which means that the only source of randomness in\n",
+ "the random forest is the bagging procedure."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"In this case, n_features={len(data.columns)}\")"
+ ]
+ },
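+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an illustrative sketch (not a definitive procedure), we can apply the rule\n",
+ "of thumb from the caution above: cross-validate the forest on the training set\n",
+ "and check that doubling `n_estimators` does not significantly improve the\n",
+ "validation score. Here we assume that the variables `data_train` and\n",
+ "`target_train` were created by the `train_test_split` call above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "\n",
+ "# `data_train` and `target_train` are assumed to come from the\n",
+ "# train_test_split call earlier in this notebook.\n",
+ "for n_estimators in (100, 200):\n",
+ "    forest = RandomForestRegressor(\n",
+ "        n_estimators=n_estimators, n_jobs=2, random_state=0)\n",
+ "    scores = cross_val_score(forest, data_train, target_train, cv=3, n_jobs=2)\n",
+ "    print(f\"n_estimators={n_estimators}: mean validation R2 = \"\n",
+ "          f\"{scores.mean():.3f} +/- {scores.std():.3f}\")"
+ ]
+ },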
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also tune the different parameters that control the depth of each tree\n",
+ "in the forest. Two parameters are important for this: `max_depth` and\n",
+ "`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
+ "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n",
+ "`max_leaf_nodes` does not impose such constraint. If `max_leaf_nodes=None`\n",
+ "then the number of leaf nodes is unlimited.\n",
+ "\n",
+ "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n",
+ "required to be at a leaf node. This means that a split point (at any depth) is\n",
+ "only done if it leaves at least `min_samples_leaf` training samples in each of\n",
+ "the left and right branches. A small value for `min_samples_leaf` means that\n",
+ "some samples can become isolated when a tree is deep, promoting overfitting. A\n",
+ "large value would prevent deep trees, which can lead to underfitting.\n",
+ "\n",
+ "Be aware that with random forest, trees are expected to be deep since we are\n",
+ "seeking to overfit each tree on each bootstrap sample. Overfitting is\n",
+ "mitigated when combining the trees altogether, whereas assembling underfitted\n",
+ "trees (i.e. shallow trees) might also lead to an underfitted forest."
+ ]
+ },
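+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick illustrative sketch of this over- vs. underfitting trade-off, we\n",
+ "can compare a forest of deep trees (the default `min_samples_leaf=1`) with a\n",
+ "forest of shallow trees (`min_samples_leaf=100`) on the training and testing\n",
+ "sets. Here we assume that `data_train`, `data_test`, `target_train` and\n",
+ "`target_test` were created by the `train_test_split` call above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "\n",
+ "# `data_train`, `data_test`, `target_train` and `target_test` are assumed to\n",
+ "# come from the train_test_split call earlier in this notebook.\n",
+ "for min_samples_leaf in (1, 100):\n",
+ "    forest = RandomForestRegressor(\n",
+ "        min_samples_leaf=min_samples_leaf, n_jobs=2, random_state=0)\n",
+ "    forest.fit(data_train, target_train)\n",
+ "    print(f\"min_samples_leaf={min_samples_leaf}: \"\n",
+ "          f\"train R2 = {forest.score(data_train, target_train):.3f}, \"\n",
+ "          f\"test R2 = {forest.score(data_test, target_test):.3f}\")"
+ ]
+ },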
{
"cell_type": "code",
"execution_count": null,
@@ -67,8 +116,9 @@
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"param_distributions = {\n",
- " \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n",
- " \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n",
+ " \"max_features\": [1, 2, 3, 5, None],\n",
+ " \"max_leaf_nodes\": [10, 100, 1000, None],\n",
+ " \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n",
"}\n",
"search_cv = RandomizedSearchCV(\n",
" RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n",
@@ -88,15 +138,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can observe in our search that we are required to have a large\n",
- "number of leaves and thus deep trees. This parameter seems particularly\n",
- "impactful in comparison to the number of trees for this particular dataset:\n",
- "with at least 50 trees, the generalization performance will be driven by the\n",
- "number of leaves.\n",
- "\n",
- "Now we will estimate the generalization performance of the best model by\n",
- "refitting it with the full training set and using the test set for scoring on\n",
- "unseen data. This is done by default when calling the `.fit` method."
+ "We can observe in our search that we are required to have a large number of\n",
+ "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n",
+ "impactful with respect to the other tuning parameters, but large values of\n",
+ "`min_samples_leaf` seem to reduce the performance of the model.\n",
+ "\n",
+ "In practice, more iterations of random search would be necessary to precisely\n",
+ "assert the role of each parameters. Using `n_iter=10` is good enough to\n",
+ "quickly inspect the hyperparameter combinations that yield models that work\n",
+ "well enough without spending too much computational resources. Feel free to\n",
+ "try more interations on your own.\n",
+ "\n",
+ "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n",
+ "uses them to refit the model using the full training set. To estimate the\n",
+ "generalization performance of the best model it suffices to call `.score` on\n",
+ "the unseen data."
]
},
{
@@ -180,8 +236,8 @@
"\n",
"\n",
"
Caution!
\n",
- "
Here, we tune the n_estimators but be aware that using early-stopping as\n",
- "in the previous exercise will be better.
\n",
+ "
Here, we tune the n_estimators but be aware that is better to use\n",
+ "early_stopping as done in the Exercise M6.04.
\n",
"
\n",
"\n",
"In this search, we see that the `learning_rate` is required to be large\n",
@@ -196,8 +252,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now we estimate the generalization performance of the best model\n",
- "using the test set."
+ "Now we estimate the generalization performance of the best model using the\n",
+ "test set."
]
},
{
@@ -216,8 +272,8 @@
"source": [
"The mean test score in the held-out test set is slightly better than the score\n",
"of the best model. The reason is that the final model is refitted on the whole\n",
- "training set and therefore, on more data than the inner cross-validated models\n",
- "of the grid search procedure."
+ "training set and therefore, on more data than the cross-validated models of\n",
+ "the grid search procedure."
]
}
],