{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":7442761,"sourceType":"datasetVersion","datasetId":4332104},{"sourceId":7564893,"sourceType":"datasetVersion","datasetId":4404813}],"dockerImageVersionId":30635,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"\nIn data science, **preprocessing** techniques play a crucial role in enhancing the **quality** and **effectiveness** of machine learning models.   \n* **Feature construction** involves creating new features from existing ones to capture more relevant information or patterns.   \n* **Feature discretization** involves converting continuous variables into discrete categories, facilitating the handling of non-linear relationships and improving model interpretability.   \n* **Feature transformation** aims to modify the distribution of data, making it more suitable for modeling.   \n* **Encoding** is essential for converting categorical variables into numerical formats, enabling algorithms to process them effectively.   \n* **Scaling** involves standardizing or normalizing features to ensure that variables with different scales contribute equally to model training.  
\n\nThese techniques collectively contribute to the overall data preparation process, ensuring that input data is appropriately structured and informative for **machine learning algorithms**, ultimately leading to more **accurate** and **robust** models.","metadata":{}},{"cell_type":"markdown","source":"### Read the Dataset","metadata":{}},{"cell_type":"code","source":"import pandas as pd\ncleaned_df = pd.read_csv('/kaggle/input/bank-loan-cleaned-ver2/Bankloan_Cleanedv2.csv')\n\ncleaned_df[['ed', 'default']] = cleaned_df[['ed', 'default']].astype('int8').astype('str')\ncleaned_df[['ed', 'default']].info()","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Feature Construction  \n\nFeature construction methods involve **creating new variables** to enhance model performance. These methods include:   \n\n**Domain knowledge**, leveraging expertise in the specific field, guides the creation of meaningful features.  \n**Statistical insights**, such as interaction terms or transformations, help uncover underlying patterns within the data.","metadata":{}},{"cell_type":"markdown","source":"#### Feature Construction Based on Domain Knowledge  \n\nDomain knowledge-driven feature construction is a pivotal aspect of data preprocessing in data science. This technique involves leveraging insights and understanding from the specific field of study to engineer features that capture essential information for modeling. For instance,  \n* In healthcare, creating the Body Mass Index (**BMI**) as a feature involves using domain-specific knowledge about the relationship between a person's weight and height to derive a meaningful indicator of their overall health status.   \n* In manufacturing or architecture, domain expertise can be applied to create a feature representing the **area** of a space based on the length and width dimensions.   
\n* In finance, particularly in credit risk assessment, domain knowledge comes into play when constructing features like the **debt-to-income** ratio, which is a key indicator of an individual's financial health.\n* Also in finance, dividing **income and debt by their average** expresses each value relative to the typical level in the data, so that shifts in the overall distribution over the years are less likely to make the model results obsolete.\n* In e-commerce, customer behavior features, such as purchase **recency** and browsing patterns, informed by industry expertise, can significantly enhance predictive models.  \n\nThese examples underscore how incorporating domain-specific insights into feature construction can lead to more relevant and **impactful variables** for model training in various fields.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Calculate the means of 'income', 'creddebt', and 'othdebt'\nmean_income = cleaned_df['income'].mean()\nmean_creddebt = cleaned_df['creddebt'].mean()\nmean_othdebt = cleaned_df['othdebt'].mean()\n\n\n# Create new features by dividing the original columns by their means\ncleaned_df['income_ratio'] = cleaned_df['income'] / mean_income\ncleaned_df['creddebt_ratio'] = cleaned_df['creddebt'] / mean_creddebt\ncleaned_df['othdebt_ratio'] = cleaned_df['othdebt'] / mean_othdebt\n\n# Display the updated DataFrame\nprint(cleaned_df)\n","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Feature Construction Based on Statistical Relationship    \n\nFeature construction based on statistical insights involves utilizing exploratory data analysis (**EDA**) and statistical investigations to derive meaningful variables that enhance the predictive power of machine learning models. Through EDA, **patterns**, **trends**, and **relationships** within the data can be identified, guiding the creation of features that capture relevant information. 
Additionally, statistical **hypothesis testing** can be employed to validate assumptions and uncover potential relationships that may lead to the formation of new features. For example:   \n* In marketing, statistical analysis of customer purchase patterns may inspire the creation of a **loyalty index** as a feature.   \n* In manufacturing, statistical investigations into production efficiency might result in the formulation of a **quality control metric** as a novel feature.   \n* In healthcare, exploring correlations between sodium (Na) and potassium (K) levels and creating a scatter plot to visualize their relationship in the context of different drugs could indeed inspire the development of the Na to K ratio as a feature. The **Na to K ratio** is a relevant metric that may provide valuable insights into the balance of these electrolytes in a patient's body, and it could serve as a meaningful feature for predictive modeling or clinical analysis. \n* **Grouping categories** with low frequencies aims to create a category with sufficient frequency, enabling the identification of generalizable patterns during the modeling step.\n\nThese examples underscore how statistical insights can drive the construction of features that contribute valuable information to the modeling process.","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\ndef frequency_table(variable):\n    \n    # Get unique elements and their counts\n    unique_elements, counts = np.unique(variable, return_counts=True)\n\n    # Calculate percentages\n    percentages = (counts / len(variable)) * 100\n\n    # Create a dictionary to store the value counts and percentages\n    value_counts_and_percentages = zip(unique_elements, counts, percentages)\n\n    # Print the value counts and percentages\n    for i, j, k in value_counts_and_percentages:\n        print(f\"{i}: Count: {j}, Percentage: {k:.2f}%\")\n    
return\n\n\nfrequency_table(cleaned_df['ed'])","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"cleaned_df['ed'] = cleaned_df['ed'].replace('5', '4')","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"frequency_table(cleaned_df['ed'])","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Feature Discretization  \n\nDiscretization, a vital preprocessing technique in data science, involves **transforming continuous variables into discrete categories** or bins. This method proves beneficial in scenarios where machine learning algorithms improve with categorical data or when **simplicity** and **interpretability** are paramount. By discretizing continuous features, data scientists can effectively address **non-linear relationships** and mitigate the **impact of outliers**. Moreover, discretization aids in creating more interpretable models by allowing the identification of patterns and trends within specific intervals. Additionally, when dealing with **skewed distributions**, discretization can be advantageous as it helps manage the imbalance in data points across different values, contributing to improved model performance and robustness. In summary, discretization not only prepares data for modeling but also enhances interpretability and addresses issues associated with skewed distributions, making it an indispensable tool in the data scientist's toolkit. ","metadata":{}},{"cell_type":"markdown","source":"Discretization methods are:   \n\n1. **Domain Knowledge**   \n\n2. **Unsupervised Approaches**   \n* Equal-Width (Uniform)  \n* Equal-Frequency (Quantile) \n* K-means    \n\n3. 
**Supervised Approaches**\n* Chi-Merge","metadata":{}},{"cell_type":"markdown","source":"#### Domain Knowledge  \n\nThe Domain Knowledge method in discretization is a strategic approach employed in data science, where **experts** leverage their understanding of the **specific field** to guide the process of transforming continuous variables into discrete categories or bins. By tapping into **subject matter expertise**, practitioners can define **meaningful intervals** that align with the inherent characteristics of the data. This method ensures that the discretization aligns closely with the nuances of the domain, capturing relevant information effectively. For example, in healthcare, experts might discretize patient age into categories that align with different life stages, reflecting the unique health considerations associated with each phase. By incorporating domain knowledge into the discretization process, this method enhances the **relevance** and **interpretability** of the resulting discrete features, ultimately contributing to the success of machine learning models within the context of the particular field.","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt\nplt.hist(cleaned_df['age'], bins=25, edgecolor=\"black\")\nplt.xlabel(\"Age\")","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import pandas as pd\n\n# Define the age categories\nbins = [0, 30, 50, float('inf')]  # These are the bin edges\nlabels = [1, 2, 3]  # These are the corresponding labels\n\n# Create a new column 'age_cat_DK' with the specified categories\n# right=False makes each bin left-closed: [0, 30), [30, 50), [50, inf)\ncleaned_df['age_cat_DK'] = pd.cut(cleaned_df['age'], bins=bins, labels=labels, right=False)\n\n#  If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4].\n\n# Display the updated 
DataFrame\nprint(cleaned_df)\n\nprint('='*50)\n\nfrequency_table(cleaned_df['age_cat_DK'])","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Unsupervised Approaches  \n\nUnsupervised approaches to discretization involve methods that **do not rely on labeled target** variables. These techniques autonomously group continuous data into distinct intervals, allowing for pattern recognition without the need for predefined class labels. Unsupervised discretization is valuable for exploratory data analysis and can be particularly useful when the nature of the underlying patterns is unknown or complex.","metadata":{}},{"cell_type":"markdown","source":"#### Equal-Width (Uniform) Discretization Transform  \n\nEqual-Width (Uniform) Discretization is an **unsupervised approach** that divides the range of continuous data into **equal-ranged intervals**. This method creates **fixed-width bins**, disregarding the distribution of data, and is particularly straightforward to implement. 
While **simplicity** is a strength, it **may not capture variations** in data density, making it less effective for datasets with unevenly distributed values.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.preprocessing import KBinsDiscretizer\n\n# Define a list of variables\nvariables_list = ['age', 'employ']\n\ntemp_df_uniform = cleaned_df.copy()\n\n# Create an instance of KBinsDiscretizer\nkbin_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')\n\n# Iterate through the variables in the list\nfor variable in variables_list:\n    # Fit and transform the current variable\n    temp_df_uniform[f'{variable}_cat_uni'] = kbin_discretizer.fit_transform(temp_df_uniform[[variable]])\n\n    # Print the variable name and its bin edges\n    print(f'{variable}_cat_uni bin edges:', kbin_discretizer.bin_edges_[0])\n    \n    # Print the frequency table\n    frequency_table(temp_df_uniform[f'{variable}_cat_uni'])\n    print(\"\\n\")\n\n# Display the updated DataFrame\nprint(temp_df_uniform)\n","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Equal-Frequency (Quantile) Discretization Transform  \n\nEqual-Frequency (Quantile) Discretization is an **unsupervised approach** that divides continuous data into bins of approximately **equal frequency**. This method aims for each bin to contain roughly the same number of data points, aiding in **handling skewed distributions** and creating **balanced intervals**. 
It is particularly useful in scenarios where maintaining an equal distribution of data across categories is essential for subsequent analysis or modeling.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.preprocessing import KBinsDiscretizer\n\n# Define a list of variables\nvariables_list = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\ntemp_df_quantile = cleaned_df.copy()\n\n# Create an instance of KBinsDiscretizer\nkbin_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')\n\n# Iterate through the variables in the list\nfor variable in variables_list:\n    # Fit and transform the current variable\n    temp_df_quantile[f'{variable}_cat_qua'] = kbin_discretizer.fit_transform(temp_df_quantile[[variable]])\n\n    # Print the variable name and its bin edges\n    print(f'{variable}_cat_qua bin edges:', kbin_discretizer.bin_edges_[0])\n    \n    # Print the frequency table\n    frequency_table(temp_df_quantile[f'{variable}_cat_qua'])\n    print(\"\\n\")\n\n# Display the updated DataFrame\nprint(temp_df_quantile)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### K-means Discretization\nK-means Discretization Transform is an **unsupervised approach** that utilizes the **K-means clustering** algorithm to automatically group continuous data into discrete bins. It assigns each data point to the bin of its **nearest cluster centroid**, effectively creating distinct intervals without the need for predefined labels. 
This method is valuable for simplifying continuous features while preserving underlying patterns in an unsupervised manner.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.preprocessing import KBinsDiscretizer\n\n# Define a list of variables\nvariables_list = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\ntemp_df_kmeans = cleaned_df.copy()\n\n# Create an instance of KBinsDiscretizer\nkbin_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')\n\n# Iterate through the variables in the list\nfor variable in variables_list:\n    # Fit and transform the current variable\n    temp_df_kmeans[f'{variable}_cat_km'] = kbin_discretizer.fit_transform(temp_df_kmeans[[variable]])\n\n    # Print the variable name and its bin edges\n    print(f'{variable}_cat_km bin edges:', kbin_discretizer.bin_edges_[0])\n    \n    # Print the frequency table\n    frequency_table(temp_df_kmeans[f'{variable}_cat_km'])\n    print(\"\\n\")\n\n# Display the updated DataFrame\nprint(temp_df_kmeans)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Supervised Approaches\nSupervised approaches to discretization involve **utilizing target variable** information to guide the binning process, optimizing for predictive accuracy. These methods aim to identify discrete intervals that enhance the discriminatory power of features in alignment with the target variable. 
By incorporating the outcome variable during discretization, supervised approaches contribute to **more effective** feature engineering in the context of supervised learning tasks while it may lead to **overfitting**.","metadata":{}},{"cell_type":"markdown","source":"#### Chi-Merge  \nChi-Merge discretization is a supervised approach that **utilizes statistical significance tests**, such as the **chi-squared test**, to iteratively merge adjacent bins based on their similarity in terms of the target variable. This method seeks to create discrete intervals that **maximize the homogeneity** within each bin while emphasizing distinctions in the target variable between bins. Chi-Merge ensures a **balance between granularity and significance**, making it a valuable technique for supervised learning tasks where the predictive power of features is crucial.","metadata":{}},{"cell_type":"code","source":"pip install scorecardbundle","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from scorecardbundle.feature_discretization import ChiMerge as cm\n\nchi_merge_list = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\ntrans_cm = cm.ChiMerge(max_intervals=5, min_intervals=2, decimal=3,output_dataframe=True)\nresult_cm = trans_cm.fit_transform(cleaned_df[chi_merge_list], cleaned_df['default'].astype('int')) \ntrans_cm.boundaries_\n\n\n# Add -inf to the beginning of each array\nboundaries_dict = {key: np.insert(boundaries, 0, -np.inf) for key, boundaries in trans_cm.boundaries_.items()}\n\n# Iterate through the dictionary and add new columns to cleaned_df\nfor key, boundaries in boundaries_dict.items():\n    column_name = f\"{key}_cat_cm\"\n    cleaned_df[column_name] = pd.cut(cleaned_df[key], bins=boundaries, labels=False, right=False)\n    \n    # Print the variable name and its bin edges\n    print(f'{column_name} bin edges:', boundaries)\n    \n    # Print the frequency table\n    
frequency_table(cleaned_df[column_name])\n    print(\"\\n\")\n\n# Display the updated DataFrame\nprint(cleaned_df)\n\ncleaned_df.describe()","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Feature Transformation-Normalization\nFeature transformation plays a crucial role in data preprocessing, aiming to enhance the performance of statistical models by **making variable distributions more Gaussian**. Two prominent methods for achieving this are the **Box-Cox** and **Yeo-Johnson** transformations. While both methods strive to normalize data, they differ in their applicability and constraints. The Box-Cox transformation is restricted to strictly **positive input data**, limiting its use in scenarios where non-positive values are prevalent. On the other hand, the Yeo-Johnson transformation accommodates both positive and negative data, rendering it more versatile. It is essential to acknowledge that feature transformations can **compromise interpretability**, making results less intuitive. 
Therefore, they are best employed in problem domains where the **conceptual distribution is not clearly defined** or when **the primary focus is on model performance rather than interpretability.**","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import PowerTransformer\n\n# List of features to transform\nselected_features = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\n# Iterate through selected features\nfor feature in selected_features:\n    # Check if the feature contains zero or negative values (Box-Cox requires strictly positive data)\n    has_non_positive_values = (cleaned_df[feature] <= 0).any()\n\n    # Choose the appropriate transformation method\n    if has_non_positive_values:\n        transformer = PowerTransformer(method='yeo-johnson', standardize=False)\n    else:\n        transformer = PowerTransformer(method='box-cox', standardize=False)\n\n    # Fit and transform the feature, and store the result in a new column\n    cleaned_df[f\"{feature}_transformed\"] = transformer.fit_transform(cleaned_df[[feature]])\n\n\n    # Get the lambda parameter used for transformation\n    lambda_value = transformer.lambdas_[0]\n    print(f\"Lambda for {feature}: {lambda_value}\")\n    \n    # Plot histograms for original and transformed features\n    plt.figure(figsize=(7, 3))\n\n    plt.subplot(1, 2, 1)\n    plt.hist(cleaned_df[feature], bins=30, color='blue', alpha=0.7)\n    plt.title(f'Original {feature} Histogram')\n\n    plt.subplot(1, 2, 2)\n    plt.hist(cleaned_df[f\"{feature}_transformed\"], bins=30, color='green', alpha=0.7)\n    plt.title(f'Transformed {feature} Histogram')\n\n    plt.tight_layout()\n    plt.show()\n\n    \n# Display the transformed DataFrame\nprint('\\n')\nprint(cleaned_df)\n","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Feature Encoding  \nFeature encoding is a crucial step in preparing categorical data for machine learning models, aiming to convert non-numeric 
categories into a format that algorithms can effectively interpret. Here are some common techniques for feature encoding:\n\n* Label Encoding for Target\n\n* Ordinal Encoding for Features\n\n* One-Hot Encoding\n\n* Ordered Label Encoding","metadata":{}},{"cell_type":"markdown","source":"#### Label Encoding for Target\nLabel encoding involves assigning a unique numerical label to each category in the target variable. This is often employed when dealing with binary classification problems, where the target variable has only two classes. It simplifies the learning process for certain algorithms but should be used cautiously as it might introduce ordinal relationships that may not exist in the actual data.","metadata":{}},{"cell_type":"code","source":"# Apply Label Encoding for Target in dataset\n\nfrom sklearn.preprocessing import LabelEncoder\n\nlabel_encoder = LabelEncoder()\ncleaned_df['default-le'] = label_encoder.fit_transform(cleaned_df['default'])\n\ncleaned_df = cleaned_df.drop(['default-le'], axis=1)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Ordinal Encoding for Features\nOrdinal encoding is suitable for categorical features with an inherent order or ranking. It assigns integer values based on the specified order of the categories. 
This encoding method is beneficial when the categories have a meaningful and interpretable sequence, such as low, medium, and high.","metadata":{}},{"cell_type":"code","source":"# Apply Ordinal Encoding for Features in dataset\n\nfrom sklearn.preprocessing import OrdinalEncoder\n\nordinal_encoder = OrdinalEncoder()\ncleaned_df['ed_oe'] = ordinal_encoder.fit_transform(cleaned_df[['ed']])\nordinal_encoder.categories_\n\nordinal_encoder = OrdinalEncoder(categories=[['4', '3', '2', '1']])\ncleaned_df['ed_oe'] = ordinal_encoder.fit_transform(cleaned_df[['ed']])\n\ncleaned_df = cleaned_df.drop('ed_oe', axis=1)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.preprocessing import OrdinalEncoder\n\n# Sample DataFrame\ndata = {\n    'Size': ['Small', 'Medium', 'Large', 'Medium'],\n    'Color': ['Red', 'Green', 'Blue', 'Red'],\n    'Temperature': ['Cold', 'Hot', 'Warm', 'Hot']\n}\n\ndf = pd.DataFrame(data)\n\n# Define specific orders for each feature\nsize_order = ['Small', 'Medium', 'Large']\ntemperature_order = ['Cold', 'Warm', 'Hot']\n\n# Create a dictionary with feature names as keys and category orders as values\ncategories_dict = {\n    'Size': size_order,\n    'Temperature': temperature_order\n}\n\n# Apply ordinal encoding to each feature with the specified order\nordinal_encoder = OrdinalEncoder(categories=[categories_dict['Size'], categories_dict['Temperature']])\ndf[['Size_oe', 'Temperature_oe']] = ordinal_encoder.fit_transform(df[['Size', 'Temperature']])\n\n# Display the result\nprint(df)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### One-Hot Encoding\nOne-Hot Encoding transforms each category into a binary column, creating a binary matrix where each column corresponds to a unique category. This method is useful for nominal categorical features (categories without inherent order). 
One-Hot Encoding helps prevent the model from assuming ordinal relationships between the categories.","metadata":{}},{"cell_type":"code","source":"# Apply One-Hot Encoding\nfrom sklearn.preprocessing import OneHotEncoder\n\n# one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)\n# one_hot_encoder = OneHotEncoder(categories=[['1', '2']], handle_unknown='ignore' , sparse_output=False)\none_hot_encoder = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)\n\none_hot_encoded = one_hot_encoder.fit_transform(cleaned_df[['ed']])\n\n# Add results into dataframe\none_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out())\n\nencoded_df = pd.concat([cleaned_df.reset_index(drop=True), one_hot_encoded_df.reset_index(drop=True)], axis=1)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"#### Ordered Label Encoding\nOrdered label encoding assigns each category a numerical value derived from the target variable rather than from an arbitrary or alphabetical order. The example below uses scikit-learn's `TargetEncoder`, which replaces each category with a smoothed estimate of the mean target value for that category. This method is useful when the relationship between a category and the target matters for the model, but because it uses the target it should be fit on training data only to limit the risk of overfitting.","metadata":{}},{"cell_type":"code","source":"!pip install --upgrade scikit-learn","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.preprocessing import TargetEncoder\n\n\ntarget_encoder = TargetEncoder(target_type=\"auto\")  # {\"auto\", \"continuous\", \"binary\", \"multiclass\"}, default=\"auto\" Type of target.\ntarget_encoder.fit(cleaned_df[['ed']], cleaned_df['default'])\ntarget_encoded = target_encoder.transform(cleaned_df[['ed']])\n\n''' Note: fit(X, y).transform(X) does not equal fit_transform(X, y) \n    because a cross fitting scheme is used in fit_transform for encoding. 
\n'''\n# Add results into dataframe\ntarget_encoded_df = pd.DataFrame(target_encoded, columns=target_encoder.get_feature_names_out())\n\nencoded_df = pd.concat([cleaned_df.reset_index(drop=True), target_encoded_df.reset_index(drop=True)], axis=1)\n\n# Display the encoded DataFrame\nprint(encoded_df)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Feature Scaling  \nFeature scaling is a critical preprocessing step in machine learning that aims to **standardize** or **normalize** the range of independent variables or features of a dataset. Scaling ensures that the variables contribute equally to the analysis, preventing certain features from dominating due to their larger magnitudes. Here are some common techniques for feature scaling:\n\n* Min-Max Scaling:\nMin-Max scaling, also known as normalization, transforms the features to a specific range, typically between 0 and 1. The formula for Min-Max scaling is **(X - X_min) / (X_max - X_min)**, where X is the original value. This scaling method is sensitive to outliers but is effective in situations where the data needs to be constrained within a particular range.\n\n* Z-Score Scaling:\nZ-Score scaling, or standardization, transforms the features to have a mean of 0 and a standard deviation of 1. It is calculated using the formula **(X - mean) / standard deviation**, where X is the original value. Z-Score scaling does not bound values to a fixed range and is still affected by outliers, because the mean and standard deviation are both sensitive to extreme values; it is suitable for algorithms that assume a normal distribution of the data.\n\n* Robust Scaling:\nRobust scaling is an alternative method that uses the median and interquartile range (IQR) to scale the features. It is less sensitive to outliers than Min-Max scaling. The formula for robust scaling is **(X - median) / IQR**, where X is the original value. 
This method is advantageous when the dataset contains outliers that could significantly impact other scaling techniques.\n\nThe aim of feature scaling is to create a **level playing field for different features**, allowing machine learning algorithms to converge faster and perform more effectively. Scaling is particularly crucial for **distance-based** algorithms, such as k-nearest neighbors or support vector machines, where the scale of features directly influences the calculation of distances.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler\n\n# List of features to rescale\nselected_features = ['age','income_ratio', 'debtinc']\n\n# List of scaling methods\nscaling_methods = ['min-max', 'z-score']  # Select one or more from: 'min-max', 'z-score', 'robust-scaling', 'decimal-scaling'\n\n\n# Iterate through selected features\nfor feature in selected_features:\n    # Apply Min-Max Scaling\n    if 'min-max' in scaling_methods:\n        min_max_scaler = MinMaxScaler()\n        cleaned_df[f\"{feature}_min_max\"] = min_max_scaler.fit_transform(cleaned_df[[feature]])\n\n\n    # Apply Z-Score Scaling\n    if 'z-score' in scaling_methods:\n        z_score_scaler = StandardScaler()\n        cleaned_df[f\"{feature}_z_score\"] = z_score_scaler.fit_transform(cleaned_df[[feature]])\n\n\n    # Apply Robust Scaling (key matches the option name listed above)\n    if 'robust-scaling' in scaling_methods:\n        robust_scaler = RobustScaler()\n        cleaned_df[f\"{feature}_robust\"] = robust_scaler.fit_transform(cleaned_df[[feature]])\n\n\n    # Apply Decimal Scaling (key matches the option name listed above)\n    if 'decimal-scaling' in scaling_methods:\n        cleaned_df[f\"{feature}_decimal\"] = cleaned_df[feature] / 10**len(str(int(cleaned_df[feature].max())))\n\n\n# Display the scaled DataFrame\nprint(cleaned_df)","metadata":{},"execution_count":null,"outputs":[]}]}