{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":7364272,"sourceType":"datasetVersion","datasetId":4278011}],"dockerImageVersionId":30626,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"In this notebook, our focus shifts to the second level of data cleaning, encompassing crucial tasks such as **outlier** detection and handling, as well as addressing **missing values**. Outliers, which are data points significantly deviating from the norm, can distort analyses and impact **model performance**. By systematically detecting and appropriately handling outliers, we aim to enhance the robustness and reliability of our dataset. Simultaneously, addressing missing values is imperative for a comprehensive and accurate analysis. Whether through imputation techniques or strategic removal, handling missing values ensures that our dataset is **complete** and **representative**. This multifaceted approach to the second level of data cleaning lays the groundwork for more sophisticated analyses and reinforces the integrity of our data for subsequent modeling and interpretation.","metadata":{}},{"cell_type":"markdown","source":"### Read the dataset","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfiltered_df = pd.read_csv('/kaggle/input/bank-loan-cleaned-ver1/Bankloan_Cleanedv1.csv')","metadata":{"trusted":true},"execution_count":93,"outputs":[{"execution_count":93,"output_type":"execute_result","data":{"text/plain":"      age  employ  address  income  debtinc   creddebt   othdebt   ed  default\n0    41.0      17       12   176.0      9.3  11.359392  5.008608  3.0        1\n1    27.0      10        6    31.0     17.3   1.362202  4.000798  1.0        0\n2    40.0      15        7     NaN      5.5   0.856075  2.168925  1.0        0\n3    41.0      15       14   120.0      2.9   2.658720  0.821280  NaN        0\n4    24.0       2        0    28.0     17.3   1.787436  3.056564  2.0        1\n..    ...     ...      ...     ...      ...        ...       ...  ...      ...\n695  36.0       6       15    27.0      4.6   0.262062  0.979938  2.0        1\n696  29.0       6        4    21.0     11.5   0.369495  2.045505  2.0        0\n697  33.0      15        3    32.0      7.6   0.491264  1.940736  1.0        0\n698  45.0      19       22    77.0      8.4   2.302608  4.165392  1.0        0\n699  37.0      12       14     NaN     14.7   2.994684  3.473316  1.0        0\n\n[700 rows x 9 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>employ</th>\n      <th>address</th>\n      <th>income</th>\n      <th>debtinc</th>\n      <th>creddebt</th>\n      <th>othdebt</th>\n      <th>ed</th>\n      <th>default</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>41.0</td>\n      <td>17</td>\n      <td>12</td>\n      <td>176.0</td>\n      <td>9.3</td>\n      <td>11.359392</td>\n      <td>5.008608</td>\n      <td>3.0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>27.0</td>\n      <td>10</td>\n      <td>6</td>\n      <td>31.0</td>\n      <td>17.3</td>\n      <td>1.362202</td>\n      <td>4.000798</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>40.0</td>\n      <td>15</td>\n      <td>7</td>\n      <td>NaN</td>\n      <td>5.5</td>\n      <td>0.856075</td>\n      <td>2.168925</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>41.0</td>\n      <td>15</td>\n      <td>14</td>\n      <td>120.0</td>\n      <td>2.9</td>\n      <td>2.658720</td>\n      <td>0.821280</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>24.0</td>\n      <td>2</td>\n      <td>0</td>\n      <td>28.0</td>\n      <td>17.3</td>\n      <td>1.787436</td>\n      <td>3.056564</td>\n      <td>2.0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>695</th>\n      <td>36.0</td>\n      <td>6</td>\n      <td>15</td>\n      <td>27.0</td>\n      <td>4.6</td>\n      <td>0.262062</td>\n      <td>0.979938</td>\n      <td>2.0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>696</th>\n      <td>29.0</td>\n      <td>6</td>\n      <td>4</td>\n      <td>21.0</td>\n      <td>11.5</td>\n      <td>0.369495</td>\n      <td>2.045505</td>\n      <td>2.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>697</th>\n      <td>33.0</td>\n      <td>15</td>\n      <td>3</td>\n      <td>32.0</td>\n      <td>7.6</td>\n      <td>0.491264</td>\n      <td>1.940736</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>698</th>\n      <td>45.0</td>\n      <td>19</td>\n      <td>22</td>\n      <td>77.0</td>\n      <td>8.4</td>\n      <td>2.302608</td>\n      <td>4.165392</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>699</th>\n      <td>37.0</td>\n      <td>12</td>\n      <td>14</td>\n      <td>NaN</td>\n      <td>14.7</td>\n      <td>2.994684</td>\n      <td>3.473316</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n<p>700 rows × 9 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"markdown","source":"To enhance clarity and facilitate streamlined data processing, we **separate the dataset** into two distinct dataframes: one designated for the **y (target variable)** or response, and the other for the **X (input variables)** or predictors. This segregation allows for a more organized and efficient approach in preparing the data for subsequent modeling and analysis.","metadata":{}},{"cell_type":"code","source":"y = filtered_df.iloc[:,-1]\nX = filtered_df.iloc[:,0:-1]","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:24.436467Z","iopub.execute_input":"2024-01-17T15:19:24.436874Z","iopub.status.idle":"2024-01-17T15:19:24.443084Z","shell.execute_reply.started":"2024-01-17T15:19:24.436841Z","shell.execute_reply":"2024-01-17T15:19:24.441732Z"},"trusted":true},"execution_count":94,"outputs":[]},{"cell_type":"markdown","source":"### Data Leakage  \nData leakage poses a significant challenge in the area of machine learning and data analytics, emphasizing the critical importance of a well-considered evaluation design. Data leakage **occurs when information from the test set unintentionally influences the training process**, leading to **over-optimistic model performance**. To mitigate this risk, adopting a robust **evaluation design**, such as the **spliting method**, becomes imperative. By clearly delineating distinct sets for training and testing, this approach ensures that the model is assessed on **unseen data**, mirroring real-world scenarios. Equally crucial is the **preparation of the training set**, where potential sources of leakage are identified and addressed. Establishing a solid foundation through a carefully partitioned dataset and thorough data preparation not only guards against data leakage but also contributes to the development of more **reliable** and **generalizable** machine learning models.","metadata":{}},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\n\n# split into train and test sets\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)\n\ninputs = X_train","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:27.676047Z","iopub.execute_input":"2024-01-17T15:19:27.676998Z","iopub.status.idle":"2024-01-17T15:19:27.690042Z","shell.execute_reply.started":"2024-01-17T15:19:27.676952Z","shell.execute_reply":"2024-01-17T15:19:27.688763Z"},"trusted":true},"execution_count":95,"outputs":[]},{"cell_type":"markdown","source":"Create two lists distinguishing **categorical** and **continuous** variables based on their respective indices.","metadata":{}},{"cell_type":"code","source":"columns = inputs.columns\n\n# Choose categorical elements \ncategorical_indices = [-1]\n\n# Use a list comprehension to select the elements at the specified indices\ncategorical_fields = [columns[i] for i in categorical_indices]\n\n# Create a new list of columns excluding categorical_fields (continuous)\ncontinuous_fields = [j for j in columns if j not in categorical_fields]","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:32.155706Z","iopub.execute_input":"2024-01-17T15:19:32.156432Z","iopub.status.idle":"2024-01-17T15:19:32.161608Z","shell.execute_reply.started":"2024-01-17T15:19:32.156398Z","shell.execute_reply":"2024-01-17T15:19:32.160854Z"},"trusted":true},"execution_count":96,"outputs":[]},{"cell_type":"markdown","source":"### What are Outliers?\nAn outlier is an observation that is **unlike** the other observations. They are **rare**, **distinct**, or do not fit in some way.We will generally define outliers as samples that are **exceptionally far** from the mainstream of the data.","metadata":{}},{"cell_type":"markdown","source":"\n**Caution!!!** In data mining issues involving **Deviation Detection** tasks, the identification and management of outliers should be disregarded as part of the data cleaning process.","metadata":{}},{"cell_type":"markdown","source":"### Outlier Detection  \nOutlier detection is a crucial step in data analysis, employing various methods to identify and manage data points that significantly deviate from expected patterns. There are two main approaches to outlier detection:\n\n1. **One-Dimensional Methods:**  \n\n    1.1. Standard Deviation Method\n\n    1.2. Interquartile Range (IQR) Method\n\n2. **Multidimensional Method**","metadata":{}},{"cell_type":"markdown","source":"**Note!!!** Outlier detection is typically applied to **input variables** or features within a dataset.","metadata":{}},{"cell_type":"markdown","source":"### 1.One-Dimensional Methods  \n**Note** Outlier detection in one-dimensional methods can be applied only to **continuous variables**.\n\n### 1.1. Standard Deviation Method  \nIf we know that the distribution of values in the sample is **Normal(Gaussian) or Gaussian-like**, we can\nuse the standard deviation of the sample as a cut-off for identifying outliers. The Gaussian\ndistribution has the property that the standard deviation from the mean can be used to reliably\nsummarize the percentage of values in the sample.  \nWe can cover more of the data sample if we expand the range as follows:  \n\n* 1 Standard Deviation from the Mean: 68 percent.\n* 2 Standard Deviations from the Mean: 95 percent.\n* 3 Standard Deviations from the Mean: 99.7 percent.\n  \nA value that falls outside of **3 standard deviations** is part of the distribution, but it is an\nunlikely or rare event. Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For **smaller samples** of data, perhaps a value of **2 standard deviations** (95 percent) can be used, and for **larger samples**, perhaps a value of **4 standard deviations** (99.9 percent) can be used.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Iterate over each column in the DataFrame\nfor column in inputs[continuous_fields].columns:\n    # Extract the column data\n    data = inputs[column]\n\n    # Calculate summary statistics\n    data_mean, data_std = data.mean(), data.std()\n\n    # Define outliers\n    cut_off = data_std * 3\n    lower, upper = data_mean - cut_off, data_mean + cut_off\n    \n    globals()[f'lower_{column}_sigma'] = lower\n    globals()[f'upper_{column}_sigma'] = upper\n\n    # Identify outliers\n    globals()[f'outliers_{column}_sigma'] = [x for x in data if x < lower or x > upper]\n\n    # Print the results for each column\n    print(f\"Column: {column}\")\n    print(f\"Identified outliers: {len(globals()[f'outliers_{column}_sigma'])}\")\n    globals()[f'outliers_{column}_sigma'].sort()\n    print('outlier values: ' ,globals()[f'outliers_{column}_sigma'])\n    print(\"\\n\")","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:39.780890Z","iopub.execute_input":"2024-01-17T15:19:39.781270Z","iopub.status.idle":"2024-01-17T15:19:39.799496Z","shell.execute_reply.started":"2024-01-17T15:19:39.781241Z","shell.execute_reply":"2024-01-17T15:19:39.798269Z"},"trusted":true},"execution_count":97,"outputs":[{"name":"stdout","text":"Column: age\nIdentified outliers: 0\noutlier values:  []\n\n\nColumn: employ\nIdentified outliers: 4\noutlier values:  [29, 30, 31, 31]\n\n\nColumn: address\nIdentified outliers: 3\noutlier values:  [31, 31, 34]\n\n\nColumn: income\nIdentified outliers: 8\noutlier values:  [166.0, 177.0, 186.0, 221.0, 234.0, 242.0, 249.0, 446.0]\n\n\nColumn: debtinc\nIdentified outliers: 6\noutlier values:  [30.8, 32.5, 33.3, 33.4, 35.3, 36.6]\n\n\nColumn: creddebt\nIdentified outliers: 7\noutlier values:  [9.5934, 9.60048, 14.231448, 14.5962, 15.791776, 16.03147, 20.56131]\n\n\nColumn: othdebt\nIdentified outliers: 12\noutlier values:  [12.95853, 13.051206, 14.45273, 15.14916, 15.40539, 16.668126, 17.184552, 17.2038, 17.79899, 18.257382, 18.26913, 23.104224]\n\n\n","output_type":"stream"}]},{"cell_type":"code","source":"globals()[f'outliers_employ_sigma']\nglobals()[f'lower_employ_sigma']\nglobals()[f'upper_employ_sigma']","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Outlier Handling\nWhen dealing with detected outliers, two common approaches are employed:\n\n**Remove Rows:**\n\n    Exclude entire rows containing outliers from the dataset.\n**Coerce to Bounds:**\n\n    Modify outlier values to fall within an acceptable range, either by setting them to the lower or upper bound.  \n\nThese methods offer flexibility in managing the impact of outliers on data analysis, allowing analysts to choose the most suitable strategy based on the specific requirements of their analysis. Typically for smaller datasets remove rows method is applied.","metadata":{}},{"cell_type":"code","source":"inputs_sigma = inputs.copy()\n\n# Remove rows containing outliers\nremove_list = ['employ', 'address','debtinc']\n\n# Iterate over each column in remove list:\nfor column in remove_list:\n    column_out = inputs_sigma[column].isin(globals()[f'outliers_{column}_sigma'])\n    inputs_sigma = inputs_sigma[~column_out]\n\n\n    \n# Coerce outliers to lower or upper bound\ncoerce_list = ['income', 'creddebt', 'othdebt']\n\n# Iterate over each column in coerce list:\nfor column in coerce_list:\n    inputs_sigma[column] = inputs_sigma[column].apply(lambda x: globals()[f'lower_{column}_sigma'] if x < globals()[f'lower_{column}_sigma'] \n                                                          else (globals()[f'upper_{column}_sigma'] if x > globals()[f'upper_{column}_sigma'] \n                                                          else x))","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:49.015751Z","iopub.execute_input":"2024-01-17T15:19:49.016143Z","iopub.status.idle":"2024-01-17T15:19:49.032034Z","shell.execute_reply.started":"2024-01-17T15:19:49.016114Z","shell.execute_reply":"2024-01-17T15:19:49.030720Z"},"trusted":true},"execution_count":98,"outputs":[]},{"cell_type":"markdown","source":"### 1.One-Dimensional Methods  \n**Note** Outlier detection in one-dimensional methods can be applied only to **continuous variables**.\n\n### 1.2. Interquartile Range Method   \n**Not all data is normal** or normal enough to treat it as being drawn from a Gaussian distribution.\nA good statistic for summarizing a non-Gaussian distribution sample of data is the Interquartile\nRange, or **IQR** for short. The IQR is calculated as the difference between the 75th and the\n25th percentiles of the data.The IQR defines the middle 50 percent of the data, or the body of the data.  \nThe IQR can be used to identify outliers by defining limits on the sample values that are a\nfactor k of the IQR below the 25th percentile or above the 75th percentile. The common value\nfor the factor k is the value **1.5 or 2**. A factor k of 3 or more can be used to identify values that are\nextreme outliers or far outs. \n","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Iterate over each column in the DataFrame\nfor column in inputs[continuous_fields].columns:\n   \n    # Extract the column data\n    data = inputs[column]\n    \n    # calculate interquartile range\n    q25, q75 = data.quantile(0.25), data.quantile(0.75)\n    iqr = q75 - q25\n  \n    # calculate the outlier cutoff\n    cut_off = iqr * 2\n    lower, upper = q25 - cut_off, q75 + cut_off\n    \n    globals()[f'lower_{column}_iqr'] = lower\n    globals()[f'upper_{column}_iqr'] = upper\n    \n    # identify outliers\n    globals()[f'outliers_{column}_iqr'] = [x for x in data if x < lower or x > upper]\n    \n    print(f\"Column: {column}\")\n    print(f\"Identified outliers: {len(globals()[f'outliers_{column}_iqr'])}\")\n    globals()[f'outliers_{column}_iqr'].sort()\n    print('outlier values: ' ,globals()[f'outliers_{column}_iqr'])\n    print(\"\\n\")\n    ","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:19:56.216723Z","iopub.execute_input":"2024-01-17T15:19:56.217145Z","iopub.status.idle":"2024-01-17T15:19:56.242915Z","shell.execute_reply.started":"2024-01-17T15:19:56.217112Z","shell.execute_reply":"2024-01-17T15:19:56.241796Z"},"trusted":true},"execution_count":99,"outputs":[{"name":"stdout","text":"Column: age\nIdentified outliers: 0\noutlier values:  []\n\n\nColumn: employ\nIdentified outliers: 0\noutlier values:  []\n\n\nColumn: address\nIdentified outliers: 1\noutlier values:  [34]\n\n\nColumn: income\nIdentified outliers: 29\noutlier values:  [113.0, 113.0, 113.0, 114.0, 115.0, 116.0, 116.0, 118.0, 120.0, 120.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0, 136.0, 144.0, 145.0, 157.0, 159.0, 166.0, 177.0, 186.0, 221.0, 234.0, 242.0, 249.0, 446.0]\n\n\nColumn: debtinc\nIdentified outliers: 4\noutlier values:  [33.3, 33.4, 35.3, 36.6]\n\n\nColumn: creddebt\nIdentified outliers: 23\noutlier values:  [5.090526, 5.245296, 5.402, 5.439966, 5.549544, 5.574294, 5.781564, 6.1138, 6.226794, 6.58854, 6.91152, 6.935916, 6.94868, 7.32, 7.38738, 7.817144, 9.5934, 9.60048, 14.231448, 14.5962, 15.791776, 16.03147, 20.56131]\n\n\nColumn: othdebt\nIdentified outliers: 24\noutlier values:  [9.736768, 9.97464, 10.18356, 10.63062, 10.811388, 10.98, 11.723976, 11.893518, 12.07569, 12.42186, 12.659328, 12.714006, 12.95853, 13.051206, 14.45273, 15.14916, 15.40539, 16.668126, 17.184552, 17.2038, 17.79899, 18.257382, 18.26913, 23.104224]\n\n\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Outlier Handling\nWhen dealing with detected outliers, two common approaches are employed:\n\n**Remove Rows:**\n\n    Exclude entire rows containing outliers from the dataset.\n**Coerce to Bounds:**\n\n    Modify outlier values to fall within an acceptable range, either by setting them to the lower or upper bound.  \n\nThese methods offer flexibility in managing the impact of outliers on data analysis, allowing analysts to choose the most suitable strategy based on the specific requirements of their analysis. Typically for smaller datasets remove rows method is applied.","metadata":{}},{"cell_type":"code","source":"inputs_iqr = inputs.copy()\n\n# Remove rows containing outliers\nremove_list = ['employ', 'address','debtinc']\n\n# Iterate over each column in remove list:\nfor column in remove_list:\n    column_out = inputs_iqr[column].isin(globals()[f'outliers_{column}_iqr'])\n    inputs_iqr = inputs_iqr[~column_out]\n\n\n    \n# Coerce outliers to lower or upper bound\ncoerce_list = ['income', 'creddebt', 'othdebt']\n\n# Iterate over each column in coerce list:\nfor column in coerce_list:\n    inputs_iqr[column] = inputs_iqr[column].apply(lambda x: globals()[f'lower_{column}_iqr'] if x < globals()[f'lower_{column}_iqr'] \n                                                      else (globals()[f'upper_{column}_iqr'] if x > globals()[f'upper_{column}_iqr'] \n                                                      else x))","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:20:02.876333Z","iopub.execute_input":"2024-01-17T15:20:02.877592Z","iopub.status.idle":"2024-01-17T15:20:02.892551Z","shell.execute_reply.started":"2024-01-17T15:20:02.877546Z","shell.execute_reply":"2024-01-17T15:20:02.891341Z"},"trusted":true},"execution_count":100,"outputs":[]},{"cell_type":"markdown","source":"### Mixed Method\nIf we aim to utilize both the standard deviation and IQR methods for distinct lists and apply different outlier-handling methods to each list, the following code can be employed.","metadata":{}},{"cell_type":"code","source":"inputs_new = inputs.copy()\n\n# Remove rows containing outliers\nremove_list_sigma = ['employ']\nremove_list_iqr = ['address','debtinc']\n\n# Iterate over each column in remove list:\nfor column in (remove_list_sigma + remove_list_iqr):\n    if column in remove_list_sigma:\n        column_out = inputs_new[column].isin(globals()[f'outliers_{column}_sigma'])\n    else:\n        column_out = inputs_new[column].isin(globals()[f'outliers_{column}_iqr'])\n    inputs_new = inputs_new[~column_out]\n\n\n    \n# Coerce outliers to lower or upper bound\ncoerce_list_sigma = ['othdebt']\ncoerce_list_iqr = ['income', 'creddebt']\n\n# Iterate over each column in coerce list:\nfor column in (coerce_list_sigma + coerce_list_iqr):\n    if column in coerce_list_sigma:\n        inputs_new[column] = inputs_new[column].apply(lambda x: globals()[f'lower_{column}_sigma'] if x < globals()[f'lower_{column}_sigma'] \n                                                      else (globals()[f'upper_{column}_sigma'] if x > globals()[f'upper_{column}_sigma'] \n                                                      else x))\n    else:\n        inputs_new[column] = inputs_new[column].apply(lambda x: globals()[f'lower_{column}_iqr'] if x < globals()[f'lower_{column}_iqr'] \n                                                      else (globals()[f'upper_{column}_iqr'] if x > globals()[f'upper_{column}_iqr'] \n                                                      else x))","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:20:09.346169Z","iopub.execute_input":"2024-01-17T15:20:09.346554Z","iopub.status.idle":"2024-01-17T15:20:09.364767Z","shell.execute_reply.started":"2024-01-17T15:20:09.346524Z","shell.execute_reply":"2024-01-17T15:20:09.363615Z"},"trusted":true},"execution_count":101,"outputs":[]},{"cell_type":"markdown","source":"### 2. **Multidimensional Method**  \n\nOne of multidimensional methods is **Isolation Forest** method that serves as a powerful multidimensional approach to outlier detection. Leveraging the principles of **decision trees**, this algorithm isolates outliers by constructing trees and measuring the number of partitions required to isolate an observation. Outliers, being less frequent and more isolated, necessitate **fewer partitions** to be identified. This method excels in handling multidimensional datasets, providing a robust and efficient means to detect anomalies across **multiple variables** simultaneously. Its ability to isolate outliers without relying **on assumptions** about the data's distribution makes it particularly effective in diverse data environments, offering a valuable tool for data analysts and machine learning practitioners seeking to enhance the **quality** and **reliability** of their datasets.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.ensemble import IsolationForest\nfrom sklearn.preprocessing import StandardScaler, LabelEncoder\n\ninputs_iso = inputs.copy()\n\n# Discard rows with NaN valuse\ninputs_iso = inputs_iso.dropna()\n\n# Apply Z-score scaling to numerical columns\nscaler = StandardScaler()\ninputs_iso[continuous_fields] = scaler.fit_transform(inputs_iso[continuous_fields])\n\n# Apply label encoding to categorical columns\nlabel_encoder = LabelEncoder()\ninputs_iso[categorical_fields] = inputs_iso[categorical_fields].apply(label_encoder.fit_transform)\n\n# Fit Isolation Forest model\nclf = IsolationForest(contamination=0.01, random_state=42)\nclf.fit(inputs_iso)\n\n# Predict outliers\noutliers = clf.predict(inputs_iso)\n\n# Add the outlier predictions to your DataFrame\ninputs_iso['outlier'] = outliers\n\n# Display the DataFrame with outlier information\nprint(inputs_iso)\n\n# Calculate the percentage of outliers\npercentage_outliers = (outliers[outliers == -1].shape[0] / len(outliers)) * 100\nprint(f\"Percentage of outliers: {percentage_outliers:.2f}%\")","metadata":{"trusted":true},"execution_count":103,"outputs":[{"name":"stdout","text":"Percentage of outliers: 1.14%\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Outlier Handling\nAfter detecting outliers, we **discard** the corresponding rows from **both the inputs and target**, then concatenate them to form a new dataframe after handling outliers.","metadata":{}},{"cell_type":"code","source":"outlier_index = inputs_iso[inputs_iso['outlier'] == -1].index\ninputs_outprep = inputs.drop(outlier_index)\ny_train_outprep = y_train.drop(outlier_index)\n\ntrain_outprep = pd.concat([inputs_outprep, y_train_outprep], axis=1)","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:20:44.236063Z","iopub.execute_input":"2024-01-17T15:20:44.236442Z","iopub.status.idle":"2024-01-17T15:20:44.246641Z","shell.execute_reply.started":"2024-01-17T15:20:44.236413Z","shell.execute_reply":"2024-01-17T15:20:44.245110Z"},"trusted":true},"execution_count":104,"outputs":[]},{"cell_type":"markdown","source":"### Missing Values  \n#### Missing Checking: A Quick Guide\n**1. Initial Check:**\n   - Assess missing values by creating a **report** on the **percentage of records with missing data**.\n   - If the **percentage is low**, consider discarding those records.(**Discard Method**)\n\n**2. Detailed Check:**\n   - If the **percentage is high**, create a detailed report based on the **number of missing cells in each record**. Sort this report to identify records with a higher number of missing cells.\n   -  Records with a significant number of missing cells are considered **low quality**. Discarding such records is essential to maintain data quality and prevent misleading analyses.(**Discard Method**)\n\n**3. Feature-Wise Check:**\n   - Analyze missing values with creating a **report** on missing values for **each feature**.\n   - Use **simple imputation** methods (mean,mode,...) for features with low missing data.(**Impution Method**)\n   - Consider **advanced imputation** methods (iterative,k-NN,...) for features with more missing data.(**Impution Method**)","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Create a new column with the number of missing values in each row\ntrain_outprep['Num_Missing_Values'] = train_outprep.isnull().sum(axis=1)\n\n# Count and percentage of rows with missing values\nrows_with_missing_values = train_outprep[train_outprep['Num_Missing_Values'] > 0]\n\ntotal_rows = len(train_outprep)\nrows_with_missing_count = len(rows_with_missing_values)\npercentage_rows_with_missing = (rows_with_missing_count / total_rows) * 100\n\n# Display the report\nprint(\"Report on Rows with Missing Values:\")\nprint(f\"Total Rows: {total_rows}\")\nprint(f\"Rows with Missing Values: {rows_with_missing_count} ({percentage_rows_with_missing:.2f}%)\")\n\n\n# Display the DataFrame with the new column\nprint(\"\\nDataFrame with Num_Missing_Values column:\")\nprint(train_outprep.sort_values(by='Num_Missing_Values', ascending = False))\n","metadata":{"trusted":true},"execution_count":106,"outputs":[{"name":"stdout","text":"Report on Rows with Missing Values:\nTotal Rows: 485\nRows with Missing Values: 50 (10.31%)\n","output_type":"stream"}]},{"cell_type":"code","source":"# Discard rows with missing values\ntrain_outprp_no_missing = train_outprep.dropna()\n\n# Define the threshold for maximum allowable missing values per row\nmax_missing_values_threshold = 5\n\n# Filter rows based on the 'Num_Missing_Values' column\ntrain_outprep = train_outprep[train_outprep['Num_Missing_Values'] <= max_missing_values_threshold].iloc[:, :-1]\n","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import pandas as pd\n\n# Report on count and percentage of missing values in each column\nmissing_values_report = pd.DataFrame({\n    'Column': train_outprep.columns,\n    'Missing Values': train_outprep.isnull().sum(),\n    'Percentage Missing': train_outprep.isnull().mean() * 100\n})\n\n# Display the missing values report\nprint(\"Missing Values Report:\")\nprint(missing_values_report)","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:21:38.200821Z","iopub.execute_input":"2024-01-17T15:21:38.201197Z","iopub.status.idle":"2024-01-17T15:21:38.214068Z","shell.execute_reply.started":"2024-01-17T15:21:38.201169Z","shell.execute_reply":"2024-01-17T15:21:38.212924Z"},"trusted":true},"execution_count":107,"outputs":[{"name":"stdout","text":"Missing Values Report:\n                                Column  Missing Values  Percentage Missing\nage                                age              10            2.061856\nemploy                          employ               0            0.000000\naddress                        address               0            0.000000\nincome                          income              26            5.360825\ndebtinc                        debtinc               0            0.000000\ncreddebt                      creddebt               0            0.000000\nothdebt                        othdebt               0            0.000000\ned                                  ed              15            3.092784\ndefault                        default               0            0.000000\nNum_Missing_Values  Num_Missing_Values               0            0.000000\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Imputation Methods for Handling Missing Values\n\n#### 1. One-Dimensional Methods\n\nMissing values in a dataset can be handled using one-dimensional imputation methods. These methods focus on individual features without considering relationships with other variables.\n\n**Note:** One-dimensional imputation methods are typically employed when dealing with datasets featuring a **limited number of features**, especially in cases where the percentage of missing values in each feature is **low—typically less than 5%**. In situations where certain features are particularly crucial, a more stringent threshold of 3% missing values may be applied.\n\n \n\n##### 1-1. Impute with Fixed Values\n\nWhen dealing with missing data, one common approach is to replace the missing values with fixed values. This method is known for its **simplicity** and **efficiency** in handling missing values, making it a **quick** and **straightforward** approach. Several fixed value options are available:\n\n- **Mean Imputation:**\n  - *Explanation:*\n    - Mean imputation involves replacing missing values with the mean of the observed values in the respective column. This method is suitable when the missing values are assumed to be missing at random.\n  - *Applications:*\n    - Applicable when the distribution of the data is roughly symmetric.\n    - Useful for continuous variables without extreme outliers.\n\n- **Median Imputation:**\n  - *Explanation:*\n    - Similar to mean imputation, median imputation replaces missing values with the median of the observed values. This approach is robust to outliers, making it suitable for skewed distributions.\n  - *Applications:*\n    - Effective when dealing with data containing outliers.\n\n- **Mode Imputation:**\n  - *Explanation:*\n    - Mode imputation replaces missing values with the most frequent value in the column. This is suitable for categorical variables.\n  - *Applications:*\n    - Useful for handling missing values in categorical features.\n\n- **Constant Value Imputation:**\n  - *Explanation:*\n    - Replacing missing values with a predefined constant, which can be useful when a specific value is deemed appropriate.\n  - *Applications:*\n    - Applicable when missing values should be replaced with a known or meaningful constant.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.impute import SimpleImputer\n\ntrain_outprep_no_missing_fix = train_outprep.copy()\n\n# Create SimpleImputer instances for 'age' and 'ed' columns\nage_imputer = SimpleImputer(strategy='median') # you can use 'mean' \ned_imputer = SimpleImputer(strategy='most_frequent')  # When strategy = “constant”, fill_value is used to \n                                                      #replace all occurrences of missing_values\n\n# Impute missing values in 'age' column\ntrain_outprep_no_missing_fix['age'] = age_imputer.fit_transform(train_outprep_no_missing_fix[['age']])\n\n# Impute missing values in 'ed' column\ntrain_outprep_no_missing_fix['ed'] = ed_imputer.fit_transform(train_outprep_no_missing_fix[['ed']])\n\n# Display the DataFrame after imputation\nprint(\"DataFrame after Imputation:\")\nprint(train_outprep_no_missing_fix)\n\ntrain_outprep_no_missing_fix.info()","metadata":{"trusted":true},"execution_count":117,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nIndex: 485 entries, 286 to 37\nData columns (total 10 columns):\n #   Column              Non-Null Count  Dtype  \n---  ------              --------------  -----  \n 0   age                 485 non-null    float64\n 1   employ              485 non-null    int64  \n 2   address             485 non-null    int64  \n 3   income              459 non-null    float64\n 4   debtinc             485 non-null    float64\n 5   creddebt            485 non-null    float64\n 6   othdebt             485 non-null    float64\n 7   ed                  485 non-null    float64\n 8   default             485 non-null    int64  \n 9   Num_Missing_Values  485 non-null    int64  \ndtypes: float64(6), int64(4)\nmemory usage: 41.7 KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### 1. One-Dimensional Methods¶  \n\nMissing values in a dataset can be handled using one-dimensional imputation methods. These methods focus on individual features without considering relationships with other variables.\n\n**Note**: One-dimensional imputation methods are typically employed when dealing with datasets featuring a limited number of features, especially in cases where the percentage of missing values in each feature is **low—typically less than 5%**. In situations where certain features are particularly crucial, a more stringent threshold of 3% missing values may be applied. \n##### 1-2. Impute with Random Values Following Statistical Distribution\n\nIn certain scenarios, imputing missing values with random values generated from a statistical distribution is a suitable strategy. Unlike fixed values imputation, this method **preserves** the **variability** of features.\n\n- *Explanation:*\n  - This method involves replacing missing values with random numbers drawn from a distribution (e.g., normal distribution based on mean and standard deviation of observed values).\n- *Applications:*\n  - Appropriate when the missing values exhibit a similar distribution to the observed values.\n  - Useful for maintaining variability in the dataset.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nimport numpy as np\n\n# Create a copy of the DataFrame to store imputed values\ntrain_outprep_no_missing_rand = train_outprep.copy()\n\n# Identify columns to impute\ncolumns_to_impute = ['age', 'ed']\n\n# Impute missing values in 'age' and 'ed' columns with random values\nfor column in columns_to_impute:\n    missing_values = train_outprep_no_missing_rand[column].isnull()\n    num_missing = missing_values.sum()\n    \n    if num_missing > 0:\n        # Generate random values based on the distribution of existing values\n        np.random.seed(2024)\n        random_values = np.random.choice(train_outprep_no_missing_rand[column].dropna(), size=num_missing)\n        \n        # Assign the random values to missing values\n        train_outprep_no_missing_rand.loc[missing_values, column] = random_values\n\n# Display the DataFrame after imputation\nprint(\"DataFrame after Imputation with Random Values:\")\nprint(train_outprep_no_missing_rand)\n\ntrain_outprep_no_missing_rand.info()","metadata":{"trusted":true},"execution_count":119,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nIndex: 485 entries, 286 to 37\nData columns (total 10 columns):\n #   Column              Non-Null Count  Dtype  \n---  ------              --------------  -----  \n 0   age                 485 non-null    float64\n 1   employ              485 non-null    int64  \n 2   address             485 non-null    int64  \n 3   income              459 non-null    float64\n 4   debtinc             485 non-null    float64\n 5   creddebt            485 non-null    float64\n 6   othdebt             485 non-null    float64\n 7   ed                  485 non-null    float64\n 8   default             485 non-null    int64  \n 9   Num_Missing_Values  485 non-null    int64  \ndtypes: float64(6), int64(4)\nmemory usage: 41.7 KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### 2: Multi-Dimensional Methods\n\nMulti-dimensional imputation methods consider relationships among variables when imputing missing values. These methods take into account the entire dataset rather than individual features.  \n\n**Note:** Multi-dimensional imputation methods are commonly used when dealing with data sets that have a **large number of features**, especially in cases where the percentage of missing values in each feature is **high—typically greater than 5%**. In situations where certain features are particularly crucial, a more stringent threshold of 3% missing values may be applied.\n\n\n##### 2-1: Multivariate Methods Based on Iterative Models\n\nIterative imputation methods leverage iterative modeling to estimate missing values, considering relationships with other variables.\n\n- *Explanation:*\n  - Imputes missing values by iteratively updating estimates based on observed values and relationships with other variables.\n- *Applications:*\n  - Effective when the missingness of a variable is related to the values of other variables.\n  - Suitable for datasets with complex dependencies between variables.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.experimental import enable_iterative_imputer\nfrom sklearn.impute import IterativeImputer\n\n# Create a copy of the DataFrame to store imputed values\ntrain_outprep_no_missing_iter = train_outprep_no_missing_fix.copy()\n\n# Create IterativeImputer instance for 'income' column\nincome_imputer = IterativeImputer()\n\n# Identify columns to impute\ncolumns_to_impute = ['income']\n\n# Impute missing values in 'income' column\ntrain_outprep_no_missing_iter[columns_to_impute] = income_imputer.fit_transform(train_outprep_no_missing_iter[columns_to_impute])\n\n# Display the DataFrame after imputation\nprint(\"DataFrame after Imputation:\")\nprint(train_outprep_no_missing_iter)\n\ntrain_outprep_no_missing_iter.info()","metadata":{"trusted":true},"execution_count":122,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nIndex: 485 entries, 286 to 37\nData columns (total 10 columns):\n #   Column              Non-Null Count  Dtype  \n---  ------              --------------  -----  \n 0   age                 485 non-null    float64\n 1   employ              485 non-null    int64  \n 2   address             485 non-null    int64  \n 3   income              485 non-null    float64\n 4   debtinc             485 non-null    float64\n 5   creddebt            485 non-null    float64\n 6   othdebt             485 non-null    float64\n 7   ed                  485 non-null    float64\n 8   default             485 non-null    int64  \n 9   Num_Missing_Values  485 non-null    int64  \ndtypes: float64(6), int64(4)\nmemory usage: 41.7 KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### 2: Multi-Dimensional Methods\n\nMulti-dimensional imputation methods consider relationships among variables when imputing missing values. These methods take into account the entire dataset rather than individual features.  \n\n**Note:** Multi-dimensional imputation methods are commonly used when dealing with data sets that have a **large number of features**, especially in cases where the percentage of missing values in each feature is **high—typically greater than 5%**. In situations where certain features are particularly crucial, a more stringent threshold of 3% missing values may be applied.\n\n##### 2-2: Multivariate Methods Based on k-NN Model\n\nUtilizing k-NN (k-Nearest Neighbors) models for imputation involves predicting missing values based on the values of other data points.\n\n- *Explanation:*\n  - Missing values are imputed based on the values of their k-nearest neighbors in the feature space.\n- *Applications:*\n  - Useful when missing values are expected to be similar to other observations in the dataset.\n  - Effective for datasets where local patterns or clusters exist.\n\nThese methods offer a range of options for handling missing values in datasets, each with its own set of assumptions and applicability. The choice of method depends on the nature of the data, the reasons for missingness, and the characteristics of the variables involved. Always consider the context and objectives of the analysis when selecting an imputation strategy.\n","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.impute import KNNImputer\n\n# Create a copy of the DataFrame to store imputed values\ntrain_outprep_no_missing_knn = train_outprep_no_missing_fix.copy()\n\n# Create KNNImputer instance for 'income' column\nknn_imputer = KNNImputer()\n\n# Identify columns to impute\ncolumns_to_impute = ['income']\n\n# Impute missing values in 'income' column\ntrain_outprep_no_missing_knn[columns_to_impute] = knn_imputer.fit_transform(train_outprep_no_missing_knn[columns_to_impute])\n\n# Display the DataFrame after imputation\nprint(\"DataFrame after Imputation:\")\nprint(train_outprep_no_missing_knn)\n","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Check the features to ensure data compeletness.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Report on count and percentage of missing values in each column\nmissing_values_report = pd.DataFrame({\n    'Column': train_outprep_no_missing_knn.columns,\n    'Missing Values': train_outprep_no_missing_knn.isnull().sum(),\n    'Percentage Missing': train_outprep_no_missing_knn.isnull().mean() * 100\n})\n\n# Display the missing values report\nprint(\"Missing Values Report:\")\nprint(missing_values_report)","metadata":{"execution":{"iopub.status.busy":"2024-01-17T15:27:05.395942Z","iopub.execute_input":"2024-01-17T15:27:05.396315Z","iopub.status.idle":"2024-01-17T15:27:05.412438Z","shell.execute_reply.started":"2024-01-17T15:27:05.396286Z","shell.execute_reply":"2024-01-17T15:27:05.411206Z"},"trusted":true},"execution_count":124,"outputs":[{"name":"stdout","text":"Missing Values Report:\n                                Column  Missing Values  Percentage Missing\nage                                age               0                 0.0\nemploy                          employ               0                 0.0\naddress                        address               0                 0.0\nincome                          income               0                 0.0\ndebtinc                        debtinc               0                 0.0\ncreddebt                      creddebt               0                 0.0\nothdebt                        othdebt               0                 0.0\ned                                  ed               0                 0.0\ndefault                        default               0                 0.0\nNum_Missing_Values  Num_Missing_Values               0                 0.0\n","output_type":"stream"}]},{"cell_type":"code","source":"train_outprep_no_missing_knn.iloc[:,0:-1].to_csv('/kaggle/working/Bankloan_Cleanedv2.csv')","metadata":{"trusted":true},"execution_count":129,"outputs":[{"execution_count":129,"output_type":"execute_result","data":{"text/plain":"      age  employ  address      income  debtinc  creddebt    othdebt   ed  \\\n286  29.0      11        7   32.000000      6.0  0.927360   0.992640  1.0   \n146  28.0       1        3   26.000000     12.4  0.377208   2.846792  4.0   \n214  34.0      16        3   75.000000     10.4  3.954600   3.845400  1.0   \n528  51.0      31       14  249.000000      7.8  4.272840  15.149160  2.0   \n165  40.0      13       11  102.000000     18.9  6.226794  13.051206  2.0   \n..    ...     ...      ...         ...      ...       ...        ...  ...   \n144  41.0      16       17   68.000000      5.4  0.447984   3.224016  1.0   \n645  23.0       0        1   42.000000      3.9  1.018836   0.619164  2.0   \n72   47.0      26       21  100.000000     12.8  4.582400   8.217600  1.0   \n235  24.0       7        0   18.000000      6.5  0.526500   0.643500  1.0   \n37   32.0      12        1   44.448802     14.4  3.195936   4.580064  2.0   \n\n     default  \n286        0  \n146        0  \n214        0  \n528        0  \n165        1  \n..       ...  \n144        0  \n645        1  \n72         0  \n235        0  \n37         0  \n\n[485 rows x 9 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>employ</th>\n      <th>address</th>\n      <th>income</th>\n      <th>debtinc</th>\n      <th>creddebt</th>\n      <th>othdebt</th>\n      <th>ed</th>\n      <th>default</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>286</th>\n      <td>29.0</td>\n      <td>11</td>\n      <td>7</td>\n      <td>32.000000</td>\n      <td>6.0</td>\n      <td>0.927360</td>\n      <td>0.992640</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>146</th>\n      <td>28.0</td>\n      <td>1</td>\n      <td>3</td>\n      <td>26.000000</td>\n      <td>12.4</td>\n      <td>0.377208</td>\n      <td>2.846792</td>\n      <td>4.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>214</th>\n      <td>34.0</td>\n      <td>16</td>\n      <td>3</td>\n      <td>75.000000</td>\n      <td>10.4</td>\n      <td>3.954600</td>\n      <td>3.845400</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>528</th>\n      <td>51.0</td>\n      <td>31</td>\n      <td>14</td>\n      <td>249.000000</td>\n      <td>7.8</td>\n      <td>4.272840</td>\n      <td>15.149160</td>\n      <td>2.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>165</th>\n      <td>40.0</td>\n      <td>13</td>\n      <td>11</td>\n      <td>102.000000</td>\n      <td>18.9</td>\n      <td>6.226794</td>\n      <td>13.051206</td>\n      <td>2.0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>144</th>\n      <td>41.0</td>\n      <td>16</td>\n      <td>17</td>\n      <td>68.000000</td>\n      <td>5.4</td>\n      <td>0.447984</td>\n      <td>3.224016</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>645</th>\n      <td>23.0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>42.000000</td>\n      <td>3.9</td>\n      <td>1.018836</td>\n      <td>0.619164</td>\n      <td>2.0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>72</th>\n      <td>47.0</td>\n      <td>26</td>\n      <td>21</td>\n      <td>100.000000</td>\n      <td>12.8</td>\n      <td>4.582400</td>\n      <td>8.217600</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>235</th>\n      <td>24.0</td>\n      <td>7</td>\n      <td>0</td>\n      <td>18.000000</td>\n      <td>6.5</td>\n      <td>0.526500</td>\n      <td>0.643500</td>\n      <td>1.0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>32.0</td>\n      <td>12</td>\n      <td>1</td>\n      <td>44.448802</td>\n      <td>14.4</td>\n      <td>3.195936</td>\n      <td>4.580064</td>\n      <td>2.0</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n<p>485 rows × 9 columns</p>\n</div>"},"metadata":{}}]}]}