{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":8552606,"sourceType":"datasetVersion","datasetId":4965286}],"dockerImageVersionId":30732,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"**Ensemble learning in regression** involves combining **multiple models** to improve **overall performance**. This technique helps **balance bias and variance**, two critical sources of error in machine learning models. *Bias refers to errors due to overly simplistic models*, while *variance refers to errors due to overly complex models that are too sensitive to the training data*. By aggregating the predictions of diverse models, ensemble learning can **reduce both bias and variance**. \n* **bagging** (e.g., Random Forests) **reduces variance** by averaging the predictions of models trained on different bootstrap samples of the data.  \n* **boosting** (e.g., AdaBoost) **reduces bias** by focusing on correcting the errors of weak models sequentially.   \n* **stacking** involves training a meta-model to combine the predictions of several base models, further **enhancing the robustness and accuracy** of regression tasks.  
\n\nThus, ensemble learning effectively improves regression performance by mitigating both bias and variance.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"markdown","source":"### Read the Dataset","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nimport numpy as np\n\nX_data = pd.read_csv('/kaggle/input/house-price-ready-to-modeling/X_HousePrice.csv')\nX_data = X_data.set_index(['Unnamed: 0', 'Unnamed: 1'])\nX_data.info()\n\nX_train = X_data.loc['train']\nX_test = X_data.loc['test']\n\ny_data = pd.read_csv('/kaggle/input/house-price-ready-to-modeling/y_HousePrice.csv')\ny_data = y_data.set_index(['Unnamed: 0', 'Id'])\ny_data.info()\n\ny_train = y_data.loc['train']\ny_test = y_data.loc['test']\n\ny_train_log = np.log(y_train.copy()).squeeze()\ny_test_log = np.log(y_test.copy()).squeeze()","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:39:36.942075Z","iopub.execute_input":"2024-07-06T18:39:36.943198Z","iopub.status.idle":"2024-07-06T18:39:37.593111Z","shell.execute_reply.started":"2024-07-06T18:39:36.943148Z","shell.execute_reply":"2024-07-06T18:39:37.591836Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nMultiIndex: 1439 entries, ('train', 0) to ('test', 437)\nData columns (total 42 columns):\n #   Column                         Non-Null Count  Dtype  \n---  ------                         --------------  -----  \n 0   num__LotFrontage               1439 non-null   float64\n 1   num__LotArea                   1439 non-null   float64\n 2   num__OverallQual               1439 non-null   float64\n 3   num__OverallCond               1439 non-null   float64\n 4   num__BsmtUnfSF                 1439 non-null   float64\n 5   num__TotalBsmtSF               1439 non-null   float64\n 6   num__1stFlrSF                  1439 non-null   float64\n 7   num__GrLivArea                 1439 non-null   
float64\n 8   num__BsmtFullBath              1439 non-null   float64\n 9   num__HalfBath                  1439 non-null   float64\n 10  num__TotRmsAbvGrd              1439 non-null   float64\n 11  num__Fireplaces                1439 non-null   float64\n 12  num__GarageCars                1439 non-null   float64\n 13  num__MoSold                    1439 non-null   float64\n 14  num__AgeBuilt                  1439 non-null   float64\n 15  num__AgeRemodAdd               1439 non-null   float64\n 16  num__AgeGarageBlt              1439 non-null   float64\n 17  num__AgeSold                   1439 non-null   float64\n 18  nom__MSZoning_RM               1439 non-null   float64\n 19  nom__LotShape_Reg              1439 non-null   float64\n 20  nom__LandContour_Not Lvl       1439 non-null   float64\n 21  nom__LandSlope_Not Gtl         1439 non-null   float64\n 22  nom__Neighborhood_CollgCr      1439 non-null   float64\n 23  nom__Neighborhood_Sawyer       1439 non-null   float64\n 24  nom__RoofStyle_Not Gable       1439 non-null   float64\n 25  nom__Exterior2nd_Stucco        1439 non-null   float64\n 26  nom__GarageType_Attchd         1439 non-null   float64\n 27  nom__GarageType_Basment        1439 non-null   float64\n 28  nom__Fence_non-existent        1439 non-null   float64\n 29  nom__SaleCondition_Not Normal  1439 non-null   float64\n 30  ord__MSSubClass                1439 non-null   float64\n 31  ord__ExterQual                 1439 non-null   float64\n 32  ord__BsmtQual                  1439 non-null   float64\n 33  ord__BsmtExposure              1439 non-null   float64\n 34  ord__BsmtFinType1              1439 non-null   float64\n 35  ord__BsmtFinSF1                1439 non-null   float64\n 36  ord__BsmtFinType2              1439 non-null   float64\n 37  ord__KitchenQual               1439 non-null   float64\n 38  ord__FireplaceQu               1439 non-null   float64\n 39  ord__GarageArea                1439 non-null   float64\n 40  ord__GarageCond                
1439 non-null   float64\n 41  ord__OpenPorchSF               1439 non-null   float64\ndtypes: float64(42)\nmemory usage: 516.7+ KB\n<class 'pandas.core.frame.DataFrame'>\nMultiIndex: 1439 entries, ('train', 708) to ('test', 1121)\nData columns (total 1 columns):\n #   Column     Non-Null Count  Dtype\n---  ------     --------------  -----\n 0   SalePrice  1439 non-null   int64\ndtypes: int64(1)\nmemory usage: 59.2+ KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Bagging  \nBagging, short for **bootstrap aggregating**, is an ensemble learning technique used to improve the accuracy and robustness of regression models. It works by generating **multiple versions of a training dataset** through random sampling with replacement. Each version is then used to train a separate model. The predictions from these individual models are combined, typically by **averaging predictions** for regression tasks. Bagging helps to **reduce variance** by averaging out the errors of the individual models, making it *particularly effective for high-variance algorithms like decision trees*. 
A prominent example of bagging in regression is the **Random Forest algorithm**, which builds many regression trees and averages their predictions to achieve more accurate and stable results.","metadata":{}},{"cell_type":"markdown","source":"### Fit the Linear Regression Model","metadata":{}},{"cell_type":"code","source":"!pip install --upgrade scikit-learn","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:36:15.714757Z","iopub.execute_input":"2024-07-06T18:36:15.715276Z","iopub.status.idle":"2024-07-06T18:36:37.517648Z","shell.execute_reply.started":"2024-07-06T18:36:15.715230Z","shell.execute_reply":"2024-07-06T18:36:37.516303Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stdout","text":"Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (1.2.2)\nCollecting scikit-learn\n  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\nRequirement already satisfied: numpy>=1.19.5 in /opt/conda/lib/python3.10/site-packages (from scikit-learn) (1.26.4)\nRequirement already satisfied: scipy>=1.6.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn) (1.11.4)\nRequirement already satisfied: joblib>=1.2.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn) (1.4.2)\nRequirement already satisfied: threadpoolctl>=3.1.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn) (3.2.0)\nDownloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)\n\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.4/13.4 MB\u001b[0m \u001b[31m65.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m:01\u001b[0m\n\u001b[?25hInstalling collected packages: scikit-learn\n  Attempting uninstall: scikit-learn\n    Found existing installation: scikit-learn 1.2.2\n    Uninstalling scikit-learn-1.2.2:\n      Successfully uninstalled scikit-learn-1.2.2\n\u001b[31mERROR: pip's 
dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\nspopt 0.6.0 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.\u001b[0m\u001b[31m\n\u001b[0mSuccessfully installed scikit-learn-1.5.1\n","output_type":"stream"}]},{"cell_type":"code","source":"from sklearn.linear_model import LinearRegression\nfrom sklearn.metrics import r2_score, root_mean_squared_error\n\nlr_model = LinearRegression(fit_intercept=True)\nlr_model.fit(X_train, y_train_log)\n\ny_pred_train = lr_model.predict(X_train)\ny_pred_test = lr_model.predict(X_test)\n\n\n# R-squared (R2) score for training set\nr2_train = r2_score(y_train_log, y_pred_train)\n\n# Root Mean Squared Error (RMSE) for training set\nrmse_train = root_mean_squared_error(y_train_log, y_pred_train)\n\n# Mean Absolute Percentage Error (MAPE) for training set\n# Avoiding division by zero\ny_train_values = y_train_log.to_numpy()\ny_train_values = y_train_values.reshape(y_train_values.shape[0])\nmape_train = np.mean(np.abs((y_train_values - y_pred_train) / np.where(y_train_values == 0, 1, y_train_values))) * 100\n\n# Correlation between predicted and real values for training set\ncorr_train = np.corrcoef(y_train_values, y_pred_train.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Training Set:\", r2_train)\nprint(\"Root Mean Squared Error (RMSE) for Training Set:\", rmse_train)\nprint(\"Mean Absolute Percentage Error (MAPE) for Training Set:\", mape_train)\nprint(\"Correlation between predicted and real values for Training Set:\", corr_train)\n\nprint(\"#\"*100)\n#------------------------------------------------------------------------\n\n# R-squared (R2) score\nr2 = r2_score(y_test_log, y_pred_test)\n\n# Root Mean Squared Error (RMSE)\nrmse = root_mean_squared_error(y_test_log, y_pred_test)\n\n# Mean Absolute Percentage Error (MAPE)\n# Avoiding division by zero\ny_test_values = 
y_test_log.to_numpy()\ny_test_values = y_test_values.reshape(y_test_values.shape[0])\nmape = np.mean(np.abs((y_test_values - y_pred_test) / np.where(y_test_values == 0, 1, y_test_values))) * 100\n\n# Correlation between predicted and real values\ncorr = np.corrcoef(y_test_values, y_pred_test.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Test Set:\", r2)\nprint(\"Root Mean Squared Error (RMSE) for Test Set:\", rmse)\nprint(\"Mean Absolute Percentage Error (MAPE) for Test Set:\", mape)\nprint(\"Correlation between predicted and real values for Test Set:\", corr)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:40:40.085708Z","iopub.execute_input":"2024-07-06T18:40:40.086144Z","iopub.status.idle":"2024-07-06T18:40:40.847381Z","shell.execute_reply.started":"2024-07-06T18:40:40.086107Z","shell.execute_reply":"2024-07-06T18:40:40.845850Z"},"trusted":true},"execution_count":2,"outputs":[{"name":"stdout","text":"R-squared (R2) score for Training Set: 0.9120896462653965\nRoot Mean Squared Error (RMSE) for Training Set: 0.11608348587873038\nMean Absolute Percentage Error (MAPE) for Training Set: 0.7060576966900067\nCorrelation between predicted and real values for Training Set: 0.9550338456124973\n####################################################################################################\nR-squared (R2) score for Test Set: 0.8709783052657468\nRoot Mean Squared Error (RMSE) for Test Set: 0.14114253503861884\nMean Absolute Percentage Error (MAPE) for Test Set: 0.9121491775550751\nCorrelation between predicted and real values for Test Set: 0.9489769733781047\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Fit the Bagging Linear Regression Model","metadata":{}},{"cell_type":"markdown","source":"### Parameters\n​\n#### `estimator` object, default=None\nThe base estimator to fit on random subsets of the dataset. 
If None, then the base estimator is a `DecisionTreeRegressor`.\n\n#### `n_estimators` int, default=10\nThe number of base estimators in the ensemble.\n\n#### `max_samples` int or float, default=1.0\nThe number of samples to draw from `X` to train each base estimator (with replacement by default, see `bootstrap` for more details).\n\n- If int, then draw `max_samples` samples.\n- If float, then draw `max_samples * X.shape[0]` samples.\n\n#### `max_features` int or float, default=1.0\nThe number of features to draw from `X` to train each base estimator (without replacement by default, see `bootstrap_features` for more details).\n\n- If int, then draw `max_features` features.\n- If float, then draw `max(1, int(max_features * n_features_in_))` features.\n\n#### `bootstrap` bool, default=True\nWhether samples are drawn with replacement. If False, sampling without replacement is performed.\n\n#### `bootstrap_features` bool, default=False\nWhether features are drawn with replacement.\n\n#### `n_jobs` int, default=None\nThe number of jobs to run in parallel for both `fit` and `predict`. None means 1. -1 means using all processors. \n\n#### `random_state` int, RandomState instance, or None, default=None\nControls the random resampling of the original dataset (sample wise and feature wise). If the base estimator accepts a `random_state` attribute, a different seed is generated for each instance in the ensemble. 
Pass an int for reproducible output across multiple function calls.","metadata":{}},{"cell_type":"code","source":"from sklearn.ensemble import BaggingRegressor\nfrom sklearn.metrics import r2_score, root_mean_squared_error\n\nbag_model = BaggingRegressor(estimator=lr_model, n_estimators=100, max_samples=1.0, max_features=1.0,\n                            bootstrap=True, bootstrap_features=False, n_jobs=-1, random_state=111)\n\nbag_model.fit(X_train, y_train_log)\n\ny_pred_train = bag_model.predict(X_train)\ny_pred_test = bag_model.predict(X_test)\n\n\n# R-squared (R2) score for training set\nr2_train = r2_score(y_train_log, y_pred_train)\n\n# Root Mean Squared Error (RMSE) for training set\nrmse_train = root_mean_squared_error(y_train_log, y_pred_train)\n\n# Mean Absolute Percentage Error (MAPE) for training set\n# Avoiding division by zero\ny_train_values = y_train_log.to_numpy()\ny_train_values = y_train_values.reshape(y_train_values.shape[0])\nmape_train = np.mean(np.abs((y_train_values - y_pred_train) / np.where(y_train_values == 0, 1, y_train_values))) * 100\n\n# Correlation between predicted and real values for training set\ncorr_train = np.corrcoef(y_train_values, y_pred_train.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Training Set:\", r2_train)\nprint(\"Root Mean Squared Error (RMSE) for Training Set:\", rmse_train)\nprint(\"Mean Absolute Percentage Error (MAPE) for Training Set:\", mape_train)\nprint(\"Correlation between predicted and real values for Training Set:\", corr_train)\n\nprint(\"#\"*100)\n#------------------------------------------------------------------------\n\n# R-squared (R2) score\nr2 = r2_score(y_test_log, y_pred_test)\n\n# Root Mean Squared Error (RMSE)\nrmse = root_mean_squared_error(y_test_log, y_pred_test)\n\n# Mean Absolute Percentage Error (MAPE)\n# Avoiding division by zero\ny_test_values = y_test_log.to_numpy()\ny_test_values = y_test_values.reshape(y_test_values.shape[0])\nmape = np.mean(np.abs((y_test_values - 
y_pred_test) / np.where(y_test_values == 0, 1, y_test_values))) * 100\n\n# Correlation between predicted and real values\ncorr = np.corrcoef(y_test_values, y_pred_test.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Test Set:\", r2)\nprint(\"Root Mean Squared Error (RMSE) for Test Set:\", rmse)\nprint(\"Mean Absolute Percentage Error (MAPE) for Test Set:\", mape)\nprint(\"Correlation between predicted and real values for Test Set:\", corr)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:41:57.375815Z","iopub.execute_input":"2024-07-06T18:41:57.376891Z","iopub.status.idle":"2024-07-06T18:41:59.500148Z","shell.execute_reply.started":"2024-07-06T18:41:57.376847Z","shell.execute_reply":"2024-07-06T18:41:59.498616Z"},"trusted":true},"execution_count":3,"outputs":[{"name":"stdout","text":"R-squared (R2) score for Training Set: 0.9120487925513798\nRoot Mean Squared Error (RMSE) for Training Set: 0.11611045591401994\nMean Absolute Percentage Error (MAPE) for Training Set: 0.7060770066682254\nCorrelation between predicted and real values for Training Set: 0.9550126466168957\n####################################################################################################\nR-squared (R2) score for Test Set: 0.8679541897644729\nRoot Mean Squared Error (RMSE) for Test Set: 0.1427870612283776\nMean Absolute Percentage Error (MAPE) for Test Set: 0.9244292869137098\nCorrelation between predicted and real values for Test Set: 0.9491859599588365\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Fit the Random Forest Regressor","metadata":{}},{"cell_type":"markdown","source":"### Parameters\n\n#### `n_estimators` int, default=100\nThe number of trees in the forest.\n\n#### `criterion` {“squared_error”, “absolute_error”, “friedman_mse”, “poisson”}, default=“squared_error”\nThe function to measure the quality of a split. Supported criteria are “squared_error” for the mean squared error, “friedman_mse”, which uses mean squared error with Friedman's improvement score for potential splits, “absolute_error” for the mean absolute error, and “poisson”, which uses reduction in Poisson deviance to find splits. 
Note: This parameter is tree-specific.\n\n#### `max_depth` int, default=None\nThe maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.\n\n#### `min_samples_split` int or float, default=2\nThe minimum number of samples required to split an internal node:\n\n- If int, then consider `min_samples_split` as the minimum number.\n- If float, then `min_samples_split` is a fraction and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split.\n\n#### `min_samples_leaf` int or float, default=1\nThe minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.\n\n- If int, then consider `min_samples_leaf` as the minimum number.\n- If float, then `min_samples_leaf` is a fraction and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node.\n\n#### `max_features` {\"sqrt\", \"log2\", None}, int or float, default=1.0\nThe number of features to consider when looking for the best split:\n\n- If int, then consider `max_features` features at each split.\n- If float, then `max_features` is a fraction and `max(1, int(max_features * n_features_in_))` features are considered at each split.\n- If “sqrt”, then `max_features=sqrt(n_features)`.\n- If “log2”, then `max_features=log2(n_features)`.\n- If None or 1.0, then `max_features=n_features`.\n\n\n#### `bootstrap` bool, default=True\nWhether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.\n\n#### `n_jobs` int, default=None\nThe number of jobs to run in parallel. `fit`, `predict`, `decision_path`, and `apply` are all parallelized over the trees. None means 1 unless in a `joblib.parallel_backend` context. 
-1 means using all processors. \n\n#### `random_state` int, RandomState instance, or None, default=None\nControls both the randomness of the bootstrapping of the samples used when building trees (if `bootstrap=True`) and the sampling of the features to consider when looking for the best split at each node (if `max_features < n_features`).\n\n\n#### `class_weight`\nNot accepted by `RandomForestRegressor`. This parameter belongs to `RandomForestClassifier`, where it assigns a weight to each class label, so it does not apply to the regression model fitted here.\n\n\n#### `max_samples` int or float, default=None\nIf `bootstrap` is True, the number of samples to draw from `X` to train each base estimator.\n\n- If None (default), then draw `X.shape[0]` samples.\n- If int, then draw `max_samples` samples.\n- If float, then draw `max(round(n_samples * max_samples), 1)` samples. 
Thus, `max_samples` should be in the interval (0.0, 1.0].","metadata":{}},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import r2_score, root_mean_squared_error\n\nrf_model = RandomForestRegressor(n_estimators=200, criterion='friedman_mse', max_depth=15, min_samples_split=15,\n                                min_samples_leaf=7, max_features='sqrt', bootstrap=True, n_jobs=-1,\n                                random_state=111, max_samples=None)\n\nrf_model.fit(X_train, y_train_log)\n\ny_pred_train = rf_model.predict(X_train)\ny_pred_test = rf_model.predict(X_test)\n\n\n# R-squared (R2) score for training set\nr2_train = r2_score(y_train_log, y_pred_train)\n\n# Root Mean Squared Error (RMSE) for training set\nrmse_train = root_mean_squared_error(y_train_log, y_pred_train)\n\n# Mean Absolute Percentage Error (MAPE) for training set\n# Avoiding division by zero\ny_train_values = y_train_log.to_numpy()\ny_train_values = y_train_values.reshape(y_train_values.shape[0])\nmape_train = np.mean(np.abs((y_train_values - y_pred_train) / np.where(y_train_values == 0, 1, y_train_values))) * 100\n\n# Correlation between predicted and real values for training set\ncorr_train = np.corrcoef(y_train_values, y_pred_train.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Training Set:\", r2_train)\nprint(\"Root Mean Squared Error (RMSE) for Training Set:\", rmse_train)\nprint(\"Mean Absolute Percentage Error (MAPE) for Training Set:\", mape_train)\nprint(\"Correlation between predicted and real values for Training Set:\", corr_train)\n\nprint(\"#\"*100)\n#------------------------------------------------------------------------\n\n# R-squared (R2) score\nr2 = r2_score(y_test_log, y_pred_test)\n\n# Root Mean Squared Error (RMSE)\nrmse = root_mean_squared_error(y_test_log, y_pred_test)\n\n# Mean Absolute Percentage Error (MAPE)\n# Avoiding division by zero\ny_test_values = y_test_log.to_numpy()\ny_test_values = 
y_test_values.reshape(y_test_values.shape[0])\nmape = np.mean(np.abs((y_test_values - y_pred_test) / np.where(y_test_values == 0, 1, y_test_values))) * 100\n\n# Correlation between predicted and real values\ncorr = np.corrcoef(y_test_values, y_pred_test.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Test Set:\", r2)\nprint(\"Root Mean Squared Error (RMSE) for Test Set:\", rmse)\nprint(\"Mean Absolute Percentage Error (MAPE) for Test Set:\", mape)\nprint(\"Correlation between predicted and real values for Test Set:\", corr)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:43:28.702417Z","iopub.execute_input":"2024-07-06T18:43:28.704066Z","iopub.status.idle":"2024-07-06T18:43:29.445233Z","shell.execute_reply.started":"2024-07-06T18:43:28.703955Z","shell.execute_reply":"2024-07-06T18:43:29.444096Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"R-squared (R2) score for Training Set: 0.9035799391018557\nRoot Mean Squared Error (RMSE) for Training Set: 0.12157215861240718\nMean Absolute Percentage Error (MAPE) for Training Set: 0.6983178023099436\nCorrelation between predicted and real values for Training Set: 0.956750402984286\n####################################################################################################\nR-squared (R2) score for Test Set: 0.7421862332633917\nRoot Mean Squared Error (RMSE) for Test Set: 0.19951685686147078\nMean Absolute Percentage Error (MAPE) for Test Set: 1.279564652184512\nCorrelation between predicted and real values for Test Set: 0.8945146241046652\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Boosting  \nBoosting is an ensemble learning technique used in regression to improve the predictive performance by **combining multiple weak learners**, typically decision trees, to form a strong learner. 
Unlike bagging, which builds models independently, boosting builds models **sequentially**, with *each new model focusing on correcting the errors made by the previous ones*. The process gives more weight to instances that have larger prediction error, ensuring that subsequent models address these errors more effectively. Popular boosting algorithms include **AdaBoost**, Gradient Boosting, and XGBoost. By iteratively refining the model, boosting **reduces bias**, leading to higher accuracy in regression tasks.","metadata":{}},{"cell_type":"markdown","source":"### Fit the AdaBoost Regressor","metadata":{}},{"cell_type":"markdown","source":"### Parameters\n\n#### `estimator` object, default=None\nThe base estimator from which the boosted ensemble is built. If None, then the base estimator is DecisionTreeRegressor initialized with max_depth=3.\n\n#### `n_estimators` int, default=50\nThe maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early. Values must be in the range [1, inf).\n\n#### `learning_rate` float, default=1.0\nWeight applied to each regressor at each boosting iteration. A higher learning rate increases the contribution of each regressor. There is a trade-off between the learning_rate and n_estimators parameters. Values must be in the range (0.0, inf).\n\n#### `loss` {‘linear’, ‘square’, ‘exponential’}, default=’linear’\nThe loss function to use when updating the weights after each boosting iteration.\n\n#### `random_state` int, RandomState instance or None, default=None\nControls the random seed given at each estimator at each boosting iteration. Thus, it is only used when `estimator` exposes a `random_state`. 
Pass an int for reproducible output across multiple function calls.\n​","metadata":{}},{"cell_type":"code","source":"from sklearn.ensemble import AdaBoostRegressor\nfrom sklearn.metrics import r2_score, root_mean_squared_error\n\nadabo_model = AdaBoostRegressor(estimator=lr_model, n_estimators=50, learning_rate=0.01, loss='square', random_state=111)\nadabo_model.fit(X_train, y_train_log)\n\ny_pred_train = adabo_model.predict(X_train)\ny_pred_test = adabo_model.predict(X_test)\n\n\n# R-squared (R2) score for training set\nr2_train = r2_score(y_train_log, y_pred_train)\n\n# Root Mean Squared Error (RMSE) for training set\nrmse_train = root_mean_squared_error(y_train_log, y_pred_train)\n\n# Mean Absolute Percentage Error (MAPE) for training set\n# Avoiding division by zero\ny_train_values = y_train_log.to_numpy()\ny_train_values = y_train_values.reshape(y_train_values.shape[0])\nmape_train = np.mean(np.abs((y_train_values - y_pred_train) / np.where(y_train_values == 0, 1, y_train_values))) * 100\n\n# Correlation between predicted and real values for training set\ncorr_train = np.corrcoef(y_train_values, y_pred_train.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Training Set:\", r2_train)\nprint(\"Root Mean Squared Error (RMSE) for Training Set:\", rmse_train)\nprint(\"Mean Absolute Percentage Error (MAPE) for Training Set:\", mape_train)\nprint(\"Correlation between predicted and real values for Training Set:\", corr_train)\n\nprint(\"#\"*100)\n#------------------------------------------------------------------------\n\n# R-squared (R2) score\nr2 = r2_score(y_test_log, y_pred_test)\n\n# Root Mean Squared Error (RMSE)\nrmse = root_mean_squared_error(y_test_log, y_pred_test)\n\n# Mean Absolute Percentage Error (MAPE)\n# Avoiding division by zero\ny_test_values = y_test_log.to_numpy()\ny_test_values = y_test_values.reshape(y_test_values.shape[0])\nmape = np.mean(np.abs((y_test_values - y_pred_test) / np.where(y_test_values == 0, 1, y_test_values))) * 100\n\n# 
Correlation between predicted and real values\ncorr = np.corrcoef(y_test_values, y_pred_test.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Test Set:\", r2)\nprint(\"Root Mean Squared Error (RMSE) for Test Set:\", rmse)\nprint(\"Mean Absolute Percentage Error (MAPE) for Test Set:\", mape)\nprint(\"Correlation between predicted and real values for Test Set:\", corr)\n\n","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:45:44.122549Z","iopub.execute_input":"2024-07-06T18:45:44.123099Z","iopub.status.idle":"2024-07-06T18:45:45.136204Z","shell.execute_reply.started":"2024-07-06T18:45:44.123039Z","shell.execute_reply":"2024-07-06T18:45:45.134325Z"},"trusted":true},"execution_count":5,"outputs":[{"name":"stdout","text":"R-squared (R2) score for Training Set: 0.9116731092064391\nRoot Mean Squared Error (RMSE) for Training Set: 0.11635817441301274\nMean Absolute Percentage Error (MAPE) for Training Set: 0.7112956135838954\nCorrelation between predicted and real values for Training Set: 0.9548425171026094\n####################################################################################################\nR-squared (R2) score for Test Set: 0.8711011152478806\nRoot Mean Squared Error (RMSE) for Test Set: 0.14107534541208827\nMean Absolute Percentage Error (MAPE) for Test Set: 0.9094921626020038\nCorrelation between predicted and real values for Test Set: 0.9474595860219389\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Stacking  \nStacking in regression is an ensemble learning technique that **combines multiple base regressors** to improve **prediction accuracy**. Unlike traditional ensemble methods like bagging and boosting, *stacking involves training a meta-model that learns how to best combine the predictions of the base regressors*. 
The process typically follows a two-stage approach: in the first stage, diverse base regressors are trained on the dataset; in the second stage, a meta-model is trained using the predictions from these base regressors as inputs. This **meta-model** then makes the final prediction. Stacking helps in **reducing bias and variance** by leveraging the strengths of different models and can lead to improved performance over individual regressors. However, it requires careful tuning and validation to avoid overfitting and maximize its benefits in regression tasks.","metadata":{}},{"cell_type":"markdown","source":"### Fit the Base Regressors","metadata":{}},{"cell_type":"code","source":"from sklearn.linear_model import LinearRegression\n\nlr = LinearRegression(fit_intercept=True)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:47:51.663258Z","iopub.execute_input":"2024-07-06T18:47:51.663775Z","iopub.status.idle":"2024-07-06T18:47:51.670602Z","shell.execute_reply.started":"2024-07-06T18:47:51.663723Z","shell.execute_reply":"2024-07-06T18:47:51.668976Z"},"trusted":true},"execution_count":7,"outputs":[]},{"cell_type":"code","source":"from sklearn.svm import LinearSVR\n\nlsvr = LinearSVR(C=1, loss='squared_epsilon_insensitive', fit_intercept=True, random_state=111, max_iter=10000000)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:47:54.761690Z","iopub.execute_input":"2024-07-06T18:47:54.762889Z","iopub.status.idle":"2024-07-06T18:47:54.772938Z","shell.execute_reply.started":"2024-07-06T18:47:54.762833Z","shell.execute_reply":"2024-07-06T18:47:54.770764Z"},"trusted":true},"execution_count":8,"outputs":[]},{"cell_type":"code","source":"from sklearn.neural_network import MLPRegressor\n\nann = MLPRegressor(hidden_layer_sizes=(100,), activation='logistic', solver='adam', alpha=0.0001, batch_size=20,\n                learning_rate='constant', learning_rate_init=0.001, max_iter=10000,random_state=111, 
n_iter_no_change=30)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:47:57.286351Z","iopub.execute_input":"2024-07-06T18:47:57.286862Z","iopub.status.idle":"2024-07-06T18:47:57.300315Z","shell.execute_reply.started":"2024-07-06T18:47:57.286820Z","shell.execute_reply":"2024-07-06T18:47:57.299057Z"},"trusted":true},"execution_count":9,"outputs":[]},{"cell_type":"code","source":"from sklearn.neighbors import KNeighborsRegressor\n\nknn = KNeighborsRegressor(n_neighbors=5, weights='distance', metric='manhattan', n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:48:00.021446Z","iopub.execute_input":"2024-07-06T18:48:00.021915Z","iopub.status.idle":"2024-07-06T18:48:00.028627Z","shell.execute_reply.started":"2024-07-06T18:48:00.021876Z","shell.execute_reply":"2024-07-06T18:48:00.027117Z"},"trusted":true},"execution_count":10,"outputs":[]},{"cell_type":"markdown","source":"### Fit the Stacking regressor","metadata":{}},{"cell_type":"markdown","source":"### Parameters\n\n#### `estimators` list of (str, estimator)\nBase estimators which will be stacked together. Each element of the list is defined as a tuple of string (i.e., name) and an estimator instance. An estimator can be set to ‘drop’ using `set_params`.\n\n#### `final_estimator` estimator, default=None\nA regressor which will be used to combine the base estimators. The default regressor is a RidgeCV.\n\n#### `cv` int, cross-validation generator, iterable, or “prefit”, default=None\nDetermines the cross-validation splitting strategy used in `cross_val_predict` to train `final_estimator`. Possible inputs for `cv` are:\n\n- None, to use the default 5-fold cross-validation,\n- integer, to specify the number of folds in a (Stratified) KFold,\n- An object to be used as a cross-validation generator,\n- An iterable yielding train, test splits,\n- \"prefit\" to assume the estimators are prefit. 
In this case, the estimators will not be refitted.\n\n#### `n_jobs` int, default=None\nThe number of jobs to run in parallel for all estimators fit. None means 1 unless in a `joblib.parallel_backend` context. -1 means using all processors. \n\n#### `passthrough` bool, default=False\nWhen False, only the predictions of estimators will be used as training data for `final_estimator`. When True, the `final_estimator` is trained on the predictions as well as the original training data.\n\n#### `verbose` int, default=0\nVerbosity level.","metadata":{}},{"cell_type":"code","source":"from sklearn.ensemble import StackingRegressor\nfrom sklearn.metrics import r2_score, root_mean_squared_error\n\nstack_model = StackingRegressor([(\"lr\",lr), (\"ann\",ann), (\"lsvr\",lsvr), (\"knn\",knn)], \n                               final_estimator=lr, cv=5, n_jobs=-1, passthrough=False, verbose=0)\nstack_model.fit(X_train, y_train_log)\n\ny_pred_train = stack_model.predict(X_train)\ny_pred_test = stack_model.predict(X_test)\n\n\n# R-squared (R2) score for training set\nr2_train = r2_score(y_train_log, y_pred_train)\n\n# Root Mean Squared Error (RMSE) for training set\nrmse_train = root_mean_squared_error(y_train_log, y_pred_train)\n\n# Mean Absolute Percentage Error (MAPE) for training set\n# Avoiding division by zero\ny_train_values = y_train_log.to_numpy()\ny_train_values = y_train_values.reshape(y_train_values.shape[0])\nmape_train = np.mean(np.abs((y_train_values - y_pred_train) / np.where(y_train_values == 0, 1, y_train_values))) * 100\n\n# Correlation between predicted and real values for training set\ncorr_train = np.corrcoef(y_train_values, y_pred_train.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Training Set:\", r2_train)\nprint(\"Root Mean Squared Error (RMSE) for Training Set:\", rmse_train)\nprint(\"Mean Absolute Percentage Error (MAPE) for Training Set:\", mape_train)\nprint(\"Correlation between predicted and real values for 
Training Set:\", corr_train)\n\nprint(\"#\"*100)\n#------------------------------------------------------------------------\n\n# R-squared (R2) score\nr2 = r2_score(y_test_log, y_pred_test)\n\n# Root Mean Squared Error (RMSE)\nrmse = root_mean_squared_error(y_test_log, y_pred_test)\n\n# Mean Absolute Percentage Error (MAPE)\n# Avoiding division by zero\ny_test_values = y_test_log.to_numpy()\ny_test_values = y_test_values.reshape(y_test_values.shape[0])\nmape = np.mean(np.abs((y_test_values - y_pred_test) / np.where(y_test_values == 0, 1, y_test_values))) * 100\n\n# Correlation between predicted and real values\ncorr = np.corrcoef(y_test_values, y_pred_test.ravel())[0, 1]\n\nprint(\"R-squared (R2) score for Test Set:\", r2)\nprint(\"Root Mean Squared Error (RMSE) for Test Set:\", rmse)\nprint(\"Mean Absolute Percentage Error (MAPE) for Test Set:\", mape)\nprint(\"Correlation between predicted and real values for Test Set:\", corr)","metadata":{"execution":{"iopub.status.busy":"2024-07-06T18:48:05.206206Z","iopub.execute_input":"2024-07-06T18:48:05.206664Z","iopub.status.idle":"2024-07-06T18:48:35.212973Z","shell.execute_reply.started":"2024-07-06T18:48:05.206628Z","shell.execute_reply":"2024-07-06T18:48:35.211161Z"},"trusted":true},"execution_count":11,"outputs":[{"name":"stdout","text":"R-squared (R2) score for Training Set: 0.9424400562956313\nRoot Mean Squared Error (RMSE) for Training Set: 0.09393135938720484\nMean Absolute Percentage Error (MAPE) for Training Set: 0.5845130141571946\nCorrelation between predicted and real values for Training Set: 0.9714100212919399\n####################################################################################################\nR-squared (R2) score for Test Set: 0.8925213515050023\nRoot Mean Squared Error (RMSE) for Test Set: 0.1288212890853498\nMean Absolute Percentage Error (MAPE) for Test Set: 0.8005537429241403\nCorrelation between predicted and real values for Test Set: 0.9488432382875527\n","output_type":"stream"}]}]}