diff --git a/notebooks/03_categorical_pipeline_sol_02.ipynb b/notebooks/03_categorical_pipeline_sol_02.ipynb index f23f265c9..569836225 100644 --- a/notebooks/03_categorical_pipeline_sol_02.ipynb +++ b/notebooks/03_categorical_pipeline_sol_02.ipynb @@ -250,7 +250,7 @@ "
\n", "

Important

\n", "

Which encoder should I use?

\n", - "\n", + "
\n", "\n", "\n", "\n", diff --git a/notebooks/03_categorical_pipeline_visualization.ipynb b/notebooks/03_categorical_pipeline_visualization.ipynb index 302866d49..29e29e213 100644 --- a/notebooks/03_categorical_pipeline_visualization.ipynb +++ b/notebooks/03_categorical_pipeline_visualization.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# How to define a scikit-learn pipeline and visualize it" + "# Visualizing scikit-learn pipelines in Jupyter" ] }, { @@ -22,7 +22,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### First we load the dataset" + "## First we load the dataset" ] }, { @@ -86,7 +86,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Then we create the pipeline" + "## Then we create the pipeline" ] }, { @@ -176,7 +176,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Finally we score the model" + "## Finally we score the model" ] }, { diff --git a/notebooks/ensemble_hyperparameters.ipynb b/notebooks/ensemble_hyperparameters.ipynb index df8eb0cee..998c0bd02 100644 --- a/notebooks/ensemble_hyperparameters.ipynb +++ b/notebooks/ensemble_hyperparameters.ipynb @@ -17,28 +17,12 @@ "
\n", "

Caution!

\n", "

For the sake of clarity, no cross-validation will be used to estimate the\n", - "testing error. We are only showing the effect of the parameters\n", - "on the validation set of what should be the inner cross-validation.

\n", + "variability of the testing error. We are only showing the effect of the\n", + "parameters on the validation set of what should be the inner loop of a nested\n", + "cross-validation.

\n", "
\n", "\n", - "## Random forest\n", - "\n", - "The main parameter to tune for random forest is the `n_estimators` parameter.\n", - "In general, the more trees in the forest, the better the generalization\n", - "performance will be. However, it will slow down the fitting and prediction\n", - "time. The goal is to balance computing time and generalization performance when\n", - "setting the number of estimators when putting such learner in production.\n", - "\n", - "Then, we could also tune a parameter that controls the depth of each tree in\n", - "the forest. Two parameters are important for this: `max_depth` and\n", - "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", - "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", - "`max_leaf_nodes` does not impose such constraint.\n", - "\n", - "Be aware that with random forest, trees are generally deep since we are\n", - "seeking to overfit each tree on each bootstrap sample because this will be\n", - "mitigated by combining them altogether. Assembling underfitted trees (i.e.\n", - "shallow trees) might also lead to an underfitted forest." + "We will start by loading the california housing dataset." ] }, { @@ -56,6 +40,71 @@ " data, target, random_state=0)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Random forest\n", + "\n", + "The main parameter to select in random forest is the `n_estimators` parameter.\n", + "In general, the more trees in the forest, the better the generalization\n", + "performance will be. However, it will slow down the fitting and prediction\n", + "time. The goal is to balance computing time and generalization performance\n", + "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n", + "is already the default value.\n", + "\n", + "
\n", + "

Caution!

\n", + "

Tuning the n_estimators for random forests generally results in a waste of\n", + "computer power. We just need to ensure that it is large enough so that doubling\n", + "its value does not lead to a significant improvement of the validation error.

\n", + "
\n", + "\n", + "Instead, we can tune the hyperparameter `max_features`, which controls the\n", + "size of the random subset of features to consider when looking for the best\n", + "split when growing the trees: smaller values for `max_features` will lead to\n", + "more random trees with hopefully more uncorrelated prediction errors. However\n", + "if `max_features` is too small, predictions can be too random, even after\n", + "averaging with the trees in the ensemble.\n", + "\n", + "If `max_features` is set to `None`, then this is equivalent to setting\n", + "`max_features=n_features` which means that the only source of randomness in\n", + "the random forest is the bagging procedure." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"In this case, n_features={len(data.columns)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also tune the different parameters that control the depth of each tree\n", + "in the forest. Two parameters are important for this: `max_depth` and\n", + "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", + "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", + "`max_leaf_nodes` does not impose such constraint. If `max_leaf_nodes=None`\n", + "then the number of leaf nodes is unlimited.\n", + "\n", + "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n", + "required to be at a leaf node. This means that a split point (at any depth) is\n", + "only done if it leaves at least `min_samples_leaf` training samples in each of\n", + "the left and right branches. A small value for `min_samples_leaf` means that\n", + "some samples can become isolated when a tree is deep, promoting overfitting. A\n", + "large value would prevent deep trees, which can lead to underfitting.\n", + "\n", + "Be aware that with random forest, trees are expected to be deep since we are\n", + "seeking to overfit each tree on each bootstrap sample. Overfitting is\n", + "mitigated when combining the trees altogether, whereas assembling underfitted\n", + "trees (i.e. shallow trees) might also lead to an underfitted forest." + ] + }, { "cell_type": "code", "execution_count": null, @@ -67,8 +116,9 @@ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "param_distributions = {\n", - " \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n", - " \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n", + " \"max_features\": [1, 2, 3, 5, None],\n", + " \"max_leaf_nodes\": [10, 100, 1000, None],\n", + " \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n", "}\n", "search_cv = RandomizedSearchCV(\n", " RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n", @@ -88,15 +138,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can observe in our search that we are required to have a large\n", - "number of leaves and thus deep trees. This parameter seems particularly\n", - "impactful in comparison to the number of trees for this particular dataset:\n", - "with at least 50 trees, the generalization performance will be driven by the\n", - "number of leaves.\n", - "\n", - "Now we will estimate the generalization performance of the best model by\n", - "refitting it with the full training set and using the test set for scoring on\n", - "unseen data. This is done by default when calling the `.fit` method." 
+ "We can observe in our search that we are required to have a large value of\n", + "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n", + "impactful with respect to the other tuning parameters, but large values of\n", + "`min_samples_leaf` seem to reduce the performance of the model.\n", + "\n", + "In practice, more iterations of random search would be necessary to precisely\n", + "assess the role of each parameter. Using `n_iter=10` is good enough to\n", + "quickly inspect the hyperparameter combinations that yield models that work\n", + "well enough without spending too many computational resources. Feel free to\n", + "try more iterations on your own.\n", + "\n", + "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n", + "uses them to refit the model using the full training set. To estimate the\n", + "generalization performance of the best model, it suffices to call `.score` on\n", + "the unseen data." ] }, { @@ -180,8 +236,8 @@ "\n", "
\n", "

Caution!

\n", - "

Here, we tune the n_estimators but be aware that using early-stopping as\n", - "in the previous exercise will be better.

\n", + "

Here, we tune the n_estimators but be aware that it is better to use\n", + "early_stopping as done in Exercise M6.04.

\n", "
\n", "\n", "In this search, we see that the `learning_rate` is required to be large\n", @@ -196,8 +252,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we estimate the generalization performance of the best model\n", - "using the test set." + "Now we estimate the generalization performance of the best model using the\n", + "test set." ] }, { @@ -216,8 +272,8 @@ "source": [ "The mean test score in the held-out test set is slightly better than the score\n", "of the best model. The reason is that the final model is refitted on the whole\n", - "training set and therefore, on more data than the inner cross-validated models\n", - "of the grid search procedure." + "training set and therefore, on more data than the cross-validated models of\n", + "the grid search procedure." ] } ],
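The caution note above recommends early stopping over tuning `n_estimators` without showing what that looks like. Below is a minimal sketch of that alternative, assuming `HistGradientBoostingRegressor` and the same California housing split as in the notebook; the exact estimator and settings used in Exercise M6.04 are not shown in this diff, so treat them as assumptions.

```python
# Sketch (assumption): let early stopping choose the number of boosting
# iterations instead of searching over `n_estimators`.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

hgbdt = HistGradientBoostingRegressor(
    max_iter=1_000,           # generous upper bound on boosting iterations
    early_stopping=True,      # stop once the validation score stops improving
    validation_fraction=0.1,  # internal split used to monitor improvement
    n_iter_no_change=10,
    random_state=0,
)
hgbdt.fit(data_train, target_train)

print(f"Boosting stopped after {hgbdt.n_iter_} iterations")
print(f"Test R2 score: {hgbdt.score(data_test, target_test):.3f}")
```

With `early_stopping=True`, `max_iter` only needs to be a generous upper bound; the fitted `n_iter_` attribute reports how many boosting iterations were actually used.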
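The first caution note describes the search as the inner loop of a nested cross-validation without spelling it out. A sketch of how the randomized search from this notebook could be wrapped in an outer loop is shown below; the parameter grid and base estimator are taken from the notebook, while the fold counts, `random_state`, and default R2 scoring are assumptions.

```python
# Sketch (assumption): wrap the randomized search in an outer cross-validation
# loop so that the reported error also reflects its variability.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

data, target = fetch_california_housing(return_X_y=True, as_frame=True)

param_distributions = {
    "max_features": [1, 2, 3, 5, None],
    "max_leaf_nodes": [10, 100, 1000, None],
    "min_samples_leaf": [1, 2, 5, 10, 20, 50, 100],
}
# Inner loop: hyperparameter tuning by randomized search.
inner_search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    random_state=0,
)
# Outer loop: estimate the generalization performance of the tuned model.
# Note: this fits many forests and can take a while on a laptop.
outer_scores = cross_val_score(inner_search, data, target, cv=5)
print(
    "R2 with nested cross-validation: "
    f"{outer_scores.mean():.3f} +/- {outer_scores.std():.3f}"
)
```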