{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":7273365,"sourceType":"datasetVersion","datasetId":4216596}],"dockerImageVersionId":30698,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Understanding Data Science Pipelines\n\n## What is a Pipeline?\n\nIn the context of data science projects, a pipeline refers to a systematic arrangement of sequential and functional operations that transform raw data into meaningful insights. It encompasses a series of interconnected steps designed to streamline the data processing workflow from start to finish.\n\n## Why Use Pipelines?\n\nPipelines offer several advantages in data science projects:\n\n- **Organization**: Pipelines provide a structured framework for organizing and managing the various stages of data processing, from data ingestion to model deployment.\n\n- **Efficiency**: By automating repetitive tasks and standardizing data processing procedures, pipelines enhance efficiency and reduce the time required for analysis.\n\n- **Consistency**: Pipelines ensure consistency in data handling and modeling techniques across different iterations of the project, leading to more reliable results.\n\n- **Scalability**: As projects evolve and datasets grow, pipelines can easily adapt to accommodate changes and scale up processing capabilities.\n\n## Components of a Data Science Pipeline\n\n\n### 1. 
Data Preprocessing\n   - Cleaning and filtering raw data to remove inconsistencies and outliers.\n   - Imputing missing values and handling data formatting issues.\n   - Feature engineering to create new features or transform existing ones for model input.\n\n### 2. Model Training and Evaluation\n   - Selecting appropriate machine learning algorithms based on project requirements.\n   - Splitting data into training and testing sets for model validation.\n   - Tuning model hyperparameters and evaluating model performance using suitable metrics.\n\n### 3. Model Deployment\n   - Integrating trained models into production environments for real-time inference.\n   - Monitoring model performance and updating models as needed.\n\n## Role of Pipelines in Data Science Projects\n\nPipelines serve as the backbone of data science projects, orchestrating the flow of data and operations from inception to deployment. They enable data scientists to iterate efficiently, experiment with different approaches, and deliver actionable insights to stakeholders.\n\n## Conclusion\n\nData science pipelines play a crucial role in managing the complexity of data processing workflows and ensuring the reliability and scalability of analytical solutions. 
By embracing the pipeline concept, data scientists can enhance productivity, maintain consistency, and drive impactful decision-making in their projects.\n","metadata":{"execution":{"iopub.status.busy":"2024-05-12T18:09:56.470648Z","iopub.execute_input":"2024-05-12T18:09:56.471062Z","iopub.status.idle":"2024-05-12T18:09:56.486900Z","shell.execute_reply.started":"2024-05-12T18:09:56.471034Z","shell.execute_reply":"2024-05-12T18:09:56.485069Z"}}},{"cell_type":"markdown","source":"### Read the Data","metadata":{}},{"cell_type":"code","source":"import pandas as pd\ndf = pd.read_csv('/kaggle/input/bank-loan/Bankloan.txt')","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:16:56.476968Z","iopub.execute_input":"2024-06-05T16:16:56.478235Z","iopub.status.idle":"2024-06-05T16:16:57.821300Z","shell.execute_reply.started":"2024-06-05T16:16:56.478183Z","shell.execute_reply":"2024-06-05T16:16:57.819978Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"markdown","source":"### Split Data into Train and Test Sets","metadata":{}},{"cell_type":"code","source":"y = df.iloc[:,-1]\nX = df.iloc[:,0:-1]\n\nfrom sklearn.model_selection import train_test_split\n\n# Split into train (70%) and test (30%) sets\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)\nX_train.shape,X_test.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:17:38.674742Z","iopub.execute_input":"2024-06-05T16:17:38.675235Z","iopub.status.idle":"2024-06-05T16:17:39.443593Z","shell.execute_reply.started":"2024-06-05T16:17:38.675192Z","shell.execute_reply":"2024-06-05T16:17:39.442305Z"},"trusted":true},"execution_count":2,"outputs":[{"execution_count":2,"output_type":"execute_result","data":{"text/plain":"((490, 8), (210, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"### Separate Categorical and Continuous Features","metadata":{}},{"cell_type":"code","source":"columns = X_train.columns\n\n# Positional indices of the categorical columns (index 1 is 'ed')\ncategorical_indices = 
[1]\n\n# Use a list comprehension to select the elements at the specified indices\ncategorical_fields = [columns[i] for i in categorical_indices]\n\n# Create a new list of columns excluding categorical_fields (continuous)\ncontinuous_fields = [j for j in columns if j not in categorical_fields]","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:17:44.100230Z","iopub.execute_input":"2024-06-05T16:17:44.100774Z","iopub.status.idle":"2024-06-05T16:17:44.109037Z","shell.execute_reply.started":"2024-06-05T16:17:44.100731Z","shell.execute_reply":"2024-06-05T16:17:44.107556Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"### Screen Features  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-cleaning-in-dm-featurescreening-consistency","metadata":{}},{"cell_type":"code","source":"def feature_screening(data, min_cv=0.1, mode_threshold=95, distinct_threshold=90):\n    processed_data = data.copy()\n\n    # Calculate the coefficient of variation (std / mean) for each continuous column\n    cv_values = processed_data[continuous_fields].std() / processed_data[continuous_fields].mean()\n\n    # Screen out columns whose CV is below the min_cv argument\n    screen_cv = cv_values[cv_values < min_cv].index.tolist()\n\n\n\n    # Define a threshold for the dominant category percentage\n    threshold = mode_threshold\n\n    # Calculate the percentage of the mode category for each column\n    mode_category = (processed_data[categorical_fields].apply(lambda x: x.value_counts().max() / len(x)) * 100)\n\n    # Select columns where the mode category percentage is greater than the threshold\n    screen_mode = mode_category[mode_category > threshold].index.tolist()\n\n\n\n    # Set a threshold for excluding columns \n    threshold = distinct_threshold\n\n    # Calculate the percentage of distinct categories in categorical variables\n    distinct_percentage = 
(processed_data[categorical_fields].apply(lambda x: x.dropna().nunique() / x.count()) * 100)\n\n    # Select categorical columns based on distinct percentage threshold\n    screen_distinct = distinct_percentage[distinct_percentage > threshold].index.tolist()\n\n    screened_features  = list(set(screen_cv + screen_mode + screen_distinct))\n    \n    return screened_features","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:19:39.730641Z","iopub.execute_input":"2024-06-05T16:19:39.731205Z","iopub.status.idle":"2024-06-05T16:19:39.745788Z","shell.execute_reply.started":"2024-06-05T16:19:39.731167Z","shell.execute_reply":"2024-06-05T16:19:39.743851Z"},"trusted":true},"execution_count":4,"outputs":[]},{"cell_type":"code","source":"drop_list = feature_screening(X_train, min_cv=0.1, mode_threshold=95, distinct_threshold=90)\n\nX_train = X_train.drop(drop_list, axis=1)\nX_test = X_test.drop(drop_list, axis=1)\n\nX_train.shape, X_test.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:19:47.249631Z","iopub.execute_input":"2024-06-05T16:19:47.250084Z","iopub.status.idle":"2024-06-05T16:19:47.282013Z","shell.execute_reply.started":"2024-06-05T16:19:47.250024Z","shell.execute_reply":"2024-06-05T16:19:47.280820Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"((490, 8), (210, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"### Handle Out-of-Range and Inconsistent Values  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-cleaning-in-dm-featurescreening-consistency","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\ndef range_consistency(data, target):\n    # Define ranges for each column\n    column_ranges = {\n        'age': (18, 70),\n        'employ': (0, 31),\n        'address': (0, 80),\n        'income': (0, 1000),\n        'debtinc': (0, 100),\n        'creddebt': (0, 30),\n        'othdebt': (0, 30)\n    }\n\n    # Iterate 
through each column and replace values outside the defined range with NaN\n    for column, (min_val, max_val) in column_ranges.items():\n        data[column] = data[column].apply(lambda x: x if min_val <= x <= max_val else None)\n\n    # Normalize malformed labels in the target to '0'\n    target = target.replace([':0', \"'0'\"], '0')\n    \n    return data, target","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:19:53.279571Z","iopub.execute_input":"2024-06-05T16:19:53.279991Z","iopub.status.idle":"2024-06-05T16:19:53.288911Z","shell.execute_reply.started":"2024-06-05T16:19:53.279957Z","shell.execute_reply":"2024-06-05T16:19:53.287471Z"},"trusted":true},"execution_count":6,"outputs":[]},{"cell_type":"code","source":"# Each call returns the cleaned features and the cleaned target together\nX_train, y_train = range_consistency(X_train, y_train)\nX_test, y_test = range_consistency(X_test, y_test)","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:19:57.959387Z","iopub.execute_input":"2024-06-05T16:19:57.959931Z","iopub.status.idle":"2024-06-05T16:19:57.993395Z","shell.execute_reply.started":"2024-06-05T16:19:57.959884Z","shell.execute_reply":"2024-06-05T16:19:57.991915Z"},"trusted":true},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":"### Handle Outliers  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-cleaning-in-dm-outliers-missing-values","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.ensemble import IsolationForest\nfrom sklearn.preprocessing import StandardScaler, LabelEncoder\n\ndef outlier_handling(data, contamination=0.01):\n    inputs_iso = data.copy()\n\n    # Discard rows with NaN values\n    inputs_iso = inputs_iso.dropna()\n\n    # Apply Z-score scaling to numerical columns\n    scaler = StandardScaler()\n    inputs_iso_array = scaler.fit_transform(inputs_iso)\n\n    # Fit Isolation Forest model\n    clf = IsolationForest(contamination = contamination, random_state=42)\n    clf.fit(inputs_iso_array)\n\n    
# Predict outliers\n    outliers = clf.predict(inputs_iso_array)\n\n    # Add the outlier predictions to your DataFrame\n    inputs_iso['outlier'] = outliers\n    \n    outlier_index = inputs_iso[inputs_iso['outlier'] == -1].index\n    \n    return outlier_index","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:04.119991Z","iopub.execute_input":"2024-06-05T16:20:04.120514Z","iopub.status.idle":"2024-06-05T16:20:04.507331Z","shell.execute_reply.started":"2024-06-05T16:20:04.120420Z","shell.execute_reply":"2024-06-05T16:20:04.506089Z"},"trusted":true},"execution_count":8,"outputs":[]},{"cell_type":"code","source":"outlier_index = outlier_handling(X_train, contamination=0.01)\n\nX_train = X_train.drop(outlier_index.tolist())\n\ny_train = y_train.drop(outlier_index.tolist())\n\nX_train.shape, y_train.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:08.455082Z","iopub.execute_input":"2024-06-05T16:20:08.456307Z","iopub.status.idle":"2024-06-05T16:20:08.894651Z","shell.execute_reply.started":"2024-06-05T16:20:08.456267Z","shell.execute_reply":"2024-06-05T16:20:08.893507Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"((485, 8), (485,))"},"metadata":{}}]},{"cell_type":"markdown","source":"### Handle Missing Values  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-cleaning-in-dm-outliers-missing-values","metadata":{}},{"cell_type":"code","source":"def missing_row_report(data, missrow=5):\n    processed_data = data.copy()\n\n    # Create a new column with the number of missing values in each row\n    processed_data['Num_Missing_Values'] = processed_data.isnull().sum(axis=1)\n\n    discard_missing_row = processed_data[processed_data['Num_Missing_Values'] > missrow].index.tolist()\n\n    return 
discard_missing_row","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:12.924006Z","iopub.execute_input":"2024-06-05T16:20:12.924471Z","iopub.status.idle":"2024-06-05T16:20:12.931623Z","shell.execute_reply.started":"2024-06-05T16:20:12.924430Z","shell.execute_reply":"2024-06-05T16:20:12.930379Z"},"trusted":true},"execution_count":10,"outputs":[]},{"cell_type":"code","source":"discard_missing_row = missing_row_report(X_train, missrow=5)\n\nX_train = X_train.drop(discard_missing_row)\ny_train = y_train.drop(discard_missing_row)\n\nX_train.shape, y_train.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:20.114453Z","iopub.execute_input":"2024-06-05T16:20:20.114947Z","iopub.status.idle":"2024-06-05T16:20:20.133130Z","shell.execute_reply.started":"2024-06-05T16:20:20.114901Z","shell.execute_reply":"2024-06-05T16:20:20.131593Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"((485, 8), (485,))"},"metadata":{}}]},{"cell_type":"code","source":"def missing_col_report(data, misscol=50):\n    processed_data = data.copy()\n\n    # Report on count and percentage of missing values in each column\n    missing_values_report = pd.DataFrame({\n        'Column': processed_data.columns,\n        'Missing Values': processed_data.isnull().sum(),\n        'Percentage Missing': processed_data.isnull().mean() * 100\n    })\n    discard_missing_col = missing_values_report[missing_values_report['Percentage Missing'] > misscol].index.tolist()\n    \n    return discard_missing_col","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:27.159310Z","iopub.execute_input":"2024-06-05T16:20:27.159766Z","iopub.status.idle":"2024-06-05T16:20:27.167502Z","shell.execute_reply.started":"2024-06-05T16:20:27.159731Z","shell.execute_reply":"2024-06-05T16:20:27.166169Z"},"trusted":true},"execution_count":12,"outputs":[]},{"cell_type":"code","source":"discard_missing_col = 
missing_col_report(X_train, misscol=50)\n\nX_train = X_train.drop(discard_missing_col, axis=1)\nX_test = X_test.drop(discard_missing_col, axis=1)\n\nX_train.shape, X_test.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:33.314900Z","iopub.execute_input":"2024-06-05T16:20:33.315353Z","iopub.status.idle":"2024-06-05T16:20:33.333126Z","shell.execute_reply.started":"2024-06-05T16:20:33.315318Z","shell.execute_reply":"2024-06-05T16:20:33.331724Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"((485, 8), (210, 8))"},"metadata":{}}]},{"cell_type":"code","source":"from sklearn.impute import KNNImputer, SimpleImputer\n\n\n# Define imputation strategies for each subset of columns\nknn_list = ['income']\ncat_simple_list = [i for i in categorical_fields if i not in knn_list]\ncon_simple_list = [i for i in continuous_fields if i not in knn_list]\n\ndef missing_imputer(train, test, knn_list, cat_list, con_list):\n    \n    knn_imputer = KNNImputer()\n    cat_imputer = SimpleImputer(strategy='most_frequent')\n    con_imputer = SimpleImputer(strategy='median')\n\n    # Fit the imputers on the training set and impute its missing values\n    train[knn_list] = knn_imputer.fit_transform(train[knn_list])\n    train[cat_list] = cat_imputer.fit_transform(train[cat_list])\n    train[con_list] = con_imputer.fit_transform(train[con_list])\n\n    # Impute the test set with the imputers fitted on the training set\n    test[knn_list] = knn_imputer.transform(test[knn_list])\n    test[cat_list] = cat_imputer.transform(test[cat_list])\n    test[con_list] = con_imputer.transform(test[con_list])\n    \n    return train, test","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:37.534440Z","iopub.execute_input":"2024-06-05T16:20:37.534874Z","iopub.status.idle":"2024-06-05T16:20:37.551374Z","shell.execute_reply.started":"2024-06-05T16:20:37.534840Z","shell.execute_reply":"2024-06-05T16:20:37.549945Z"},"trusted":true},"execution_count":14,"outputs":[]},{"cell_type":"code","source":"X_train, X_test = 
missing_imputer(X_train, X_test, knn_list, cat_simple_list, con_simple_list)\n\nX_train.info()\nX_test.info()","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:44.735360Z","iopub.execute_input":"2024-06-05T16:20:44.736382Z","iopub.status.idle":"2024-06-05T16:20:44.792712Z","shell.execute_reply.started":"2024-06-05T16:20:44.736337Z","shell.execute_reply":"2024-06-05T16:20:44.791144Z"},"trusted":true},"execution_count":15,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nIndex: 485 entries, 286 to 37\nData columns (total 8 columns):\n #   Column    Non-Null Count  Dtype  \n---  ------    --------------  -----  \n 0   age       485 non-null    float64\n 1   ed        485 non-null    float64\n 2   employ    485 non-null    float64\n 3   address   485 non-null    float64\n 4   income    485 non-null    float64\n 5   debtinc   485 non-null    float64\n 6   creddebt  485 non-null    float64\n 7   othdebt   485 non-null    float64\ndtypes: float64(8)\nmemory usage: 34.1 KB\n<class 'pandas.core.frame.DataFrame'>\nIndex: 210 entries, 681 to 103\nData columns (total 8 columns):\n #   Column    Non-Null Count  Dtype  \n---  ------    --------------  -----  \n 0   age       210 non-null    float64\n 1   ed        210 non-null    float64\n 2   employ    210 non-null    float64\n 3   address   210 non-null    float64\n 4   income    210 non-null    float64\n 5   debtinc   210 non-null    float64\n 6   creddebt  210 non-null    float64\n 7   othdebt   210 non-null    float64\ndtypes: float64(8)\nmemory usage: 14.8 KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Construct Features\nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-transformation","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n\ndef construct(train, test):\n    \n    # Calculate the mean of 'income' ,'creddebt' and 'othdebt'\n    mean_income = train['income'].mean()\n    mean_creddebt = train['creddebt'].mean()\n    
mean_othdebt = train['othdebt'].mean()\n\n\n    # Create new features by dividing the original columns by their means\n    train['income_ratio'] = train['income'] / mean_income\n    train['creddebt_ratio'] = train['creddebt'] / mean_creddebt\n    train['othdebt_ratio'] = train['othdebt'] / mean_othdebt\n\n    test['income_ratio'] = test['income'] / mean_income\n    test['creddebt_ratio'] = test['creddebt'] / mean_creddebt\n    test['othdebt_ratio'] = test['othdebt'] / mean_othdebt\n\n    train = train.drop(['income', 'creddebt', 'othdebt'], axis=1)\n    test = test.drop(['income', 'creddebt', 'othdebt'], axis=1)\n\n    # Merge education level 5 into level 4 ('ed' is numeric here, so match the number, not the string '5')\n    train['ed'] = train['ed'].replace(5, 4)\n    test['ed'] = test['ed'].replace(5, 4)\n    \n    return train, test","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:49.649478Z","iopub.execute_input":"2024-06-05T16:20:49.649892Z","iopub.status.idle":"2024-06-05T16:20:49.660482Z","shell.execute_reply.started":"2024-06-05T16:20:49.649862Z","shell.execute_reply":"2024-06-05T16:20:49.659143Z"},"trusted":true},"execution_count":16,"outputs":[]},{"cell_type":"code","source":"X_train, X_test = construct(X_train, X_test)\n\nX_train.info()\nX_test.info()","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:20:55.019483Z","iopub.execute_input":"2024-06-05T16:20:55.020027Z","iopub.status.idle":"2024-06-05T16:20:55.056096Z","shell.execute_reply.started":"2024-06-05T16:20:55.019983Z","shell.execute_reply":"2024-06-05T16:20:55.054670Z"},"trusted":true},"execution_count":17,"outputs":[{"name":"stdout","text":"<class 'pandas.core.frame.DataFrame'>\nIndex: 485 entries, 286 to 37\nData columns (total 8 columns):\n #   Column          Non-Null Count  Dtype  \n---  ------          --------------  -----  \n 0   age             485 non-null    float64\n 1   ed              485 non-null    float64\n 2   employ          485 non-null    float64\n 3   address         485 non-null    float64\n 4   debtinc         485 non-null    float64\n 5   
income_ratio    485 non-null    float64\n 6   creddebt_ratio  485 non-null    float64\n 7   othdebt_ratio   485 non-null    float64\ndtypes: float64(8)\nmemory usage: 34.1 KB\n<class 'pandas.core.frame.DataFrame'>\nIndex: 210 entries, 681 to 103\nData columns (total 8 columns):\n #   Column          Non-Null Count  Dtype  \n---  ------          --------------  -----  \n 0   age             210 non-null    float64\n 1   ed              210 non-null    float64\n 2   employ          210 non-null    float64\n 3   address         210 non-null    float64\n 4   debtinc         210 non-null    float64\n 5   income_ratio    210 non-null    float64\n 6   creddebt_ratio  210 non-null    float64\n 7   othdebt_ratio   210 non-null    float64\ndtypes: float64(8)\nmemory usage: 14.8 KB\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Discretize Features  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-transformation","metadata":{}},{"cell_type":"code","source":"pip install scorecardbundle","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:21:02.359496Z","iopub.execute_input":"2024-06-05T16:21:02.359961Z","iopub.status.idle":"2024-06-05T16:21:19.869152Z","shell.execute_reply.started":"2024-06-05T16:21:02.359922Z","shell.execute_reply":"2024-06-05T16:21:19.866568Z"},"trusted":true},"execution_count":18,"outputs":[{"name":"stdout","text":"Collecting scorecardbundle\n  Downloading scorecardbundle-1.2.2-py3-none-any.whl.metadata (1.4 kB)\nRequirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from scorecardbundle) (1.26.4)\nRequirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from scorecardbundle) (1.11.4)\nRequirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from scorecardbundle) (2.2.2)\nRequirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from scorecardbundle) (3.7.5)\nRequirement already satisfied: scikit-learn in 
/opt/conda/lib/python3.10/site-packages (from scorecardbundle) (1.2.2)\nRequirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (1.2.0)\nRequirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (0.12.1)\nRequirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (4.47.0)\nRequirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (1.4.5)\nRequirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (21.3)\nRequirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (9.5.0)\nRequirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (3.1.1)\nRequirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->scorecardbundle) (2.9.0.post0)\nRequirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->scorecardbundle) (2023.3.post1)\nRequirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->scorecardbundle) (2023.4)\nRequirement already satisfied: joblib>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->scorecardbundle) (1.4.0)\nRequirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->scorecardbundle) (3.2.0)\nRequirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->scorecardbundle) (1.16.0)\nDownloading scorecardbundle-1.2.2-py3-none-any.whl (28 kB)\nInstalling collected packages: scorecardbundle\nSuccessfully installed 
scorecardbundle-1.2.2\nNote: you may need to restart the kernel to use updated packages.\n","output_type":"stream"}]},{"cell_type":"code","source":"import numpy as np\nfrom scorecardbundle.feature_discretization import ChiMerge as cm\n\nchi_merge_list = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\ndef discretizer(train, test, y, chi_list):\n\n    trans_cm = cm.ChiMerge(max_intervals=5, min_intervals=2, decimal=3,output_dataframe=True)\n    trans_cm.fit(train[chi_list], y.astype('int')) \n\n    # Add -inf to the beginning of each array\n    boundaries_dict = {key: np.insert(boundaries, 0, -np.inf) for key, boundaries in trans_cm.boundaries_.items()}\n\n    # Iterate through the dictionary and add new columns to data\n    for key, boundaries in boundaries_dict.items():\n        column_name = f\"{key}_cat_cm\"\n        train[column_name] = pd.cut(train[key], bins=boundaries, labels=False, right=False)\n        test[column_name] = pd.cut(test[key], bins=boundaries, labels=False, right=False)\n        \n    return train, test\n","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:21:38.069625Z","iopub.execute_input":"2024-06-05T16:21:38.070111Z","iopub.status.idle":"2024-06-05T16:21:38.088653Z","shell.execute_reply.started":"2024-06-05T16:21:38.070060Z","shell.execute_reply":"2024-06-05T16:21:38.087106Z"},"trusted":true},"execution_count":19,"outputs":[]},{"cell_type":"code","source":"X_train, X_test = discretizer(X_train, X_test, y_train, chi_merge_list)\n\nX_train.shape, X_test.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:21:42.970790Z","iopub.execute_input":"2024-06-05T16:21:42.971242Z","iopub.status.idle":"2024-06-05T16:21:45.799385Z","shell.execute_reply.started":"2024-06-05T16:21:42.971208Z","shell.execute_reply":"2024-06-05T16:21:45.798136Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"((485, 15), (210, 
15))"},"metadata":{}}]},{"cell_type":"markdown","source":"### Transform Features  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-transformation","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import PowerTransformer\n\n# List of features to transform\nselected_features = ['age', 'employ', 'address', 'income_ratio', 'debtinc', 'creddebt_ratio', 'othdebt_ratio']\n\ndef transform(train, test, trans_list):\n    # Iterate through selected features\n    for feature in trans_list:\n        # Box-Cox requires strictly positive input, so check for zeros or negatives\n        has_nonpositive_values = (train[feature] <= 0).any()\n\n        # Choose the appropriate transformation method\n        if has_nonpositive_values:\n            transformer = PowerTransformer(method='yeo-johnson', standardize=False)\n        else:\n            transformer = PowerTransformer(method='box-cox', standardize=False)\n\n        # Fit on the training data, then store the transformed feature in a new column\n        train[f\"{feature}_transformed\"] = transformer.fit_transform(train[[feature]])\n        test[f\"{feature}_transformed\"] = transformer.transform(test[[feature]])\n        \n    return train, test","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:22:08.255885Z","iopub.execute_input":"2024-06-05T16:22:08.256426Z","iopub.status.idle":"2024-06-05T16:22:08.266593Z","shell.execute_reply.started":"2024-06-05T16:22:08.256385Z","shell.execute_reply":"2024-06-05T16:22:08.265277Z"},"trusted":true},"execution_count":21,"outputs":[]},{"cell_type":"code","source":"X_train, X_test = transform(X_train, X_test, selected_features)\n\nX_train.shape, 
X_test.shape","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:22:13.259701Z","iopub.execute_input":"2024-06-05T16:22:13.260222Z","iopub.status.idle":"2024-06-05T16:22:13.340780Z","shell.execute_reply.started":"2024-06-05T16:22:13.260186Z","shell.execute_reply":"2024-06-05T16:22:13.339235Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"((485, 22), (210, 22))"},"metadata":{}}]},{"cell_type":"markdown","source":"### Scale Features  \nMore details are in https://www.kaggle.com/code/zahrazolghadr/data-transformation","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.preprocessing import StandardScaler\n    \n# Apply Z-Score Scaling\nz_score_scaler = StandardScaler()\nX_train[X_train.columns.tolist()] = z_score_scaler.fit_transform(X_train)\nX_test[X_test.columns.tolist()] = z_score_scaler.transform(X_test)","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:22:24.884780Z","iopub.execute_input":"2024-06-05T16:22:24.885236Z","iopub.status.idle":"2024-06-05T16:22:24.905639Z","shell.execute_reply.started":"2024-06-05T16:22:24.885202Z","shell.execute_reply":"2024-06-05T16:22:24.904329Z"},"trusted":true},"execution_count":23,"outputs":[]},{"cell_type":"markdown","source":"### Create Different Datasets","metadata":{}},{"cell_type":"code","source":"original_features = ['age', 'ed', 'employ', 'address', 'debtinc', 'income_ratio', 'creddebt_ratio', 'othdebt_ratio']\n\n\ndiscretized_features = [ 'ed', 'age_cat_cm', 'employ_cat_cm','address_cat_cm', 'income_ratio_cat_cm', 'debtinc_cat_cm',\n                         'creddebt_ratio_cat_cm', 'othdebt_ratio_cat_cm']\n\ntransformed_features = ['ed','age_transformed','employ_transformed', 'address_transformed', 'income_ratio_transformed',\n                        'debtinc_transformed','creddebt_ratio_transformed','othdebt_ratio_transformed']\n\n\ndef scenario(data, scen_list):\n    return 
data[scen_list]\n","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:22:31.134641Z","iopub.execute_input":"2024-06-05T16:22:31.135119Z","iopub.status.idle":"2024-06-05T16:22:31.142725Z","shell.execute_reply.started":"2024-06-05T16:22:31.135084Z","shell.execute_reply":"2024-06-05T16:22:31.141289Z"},"trusted":true},"execution_count":24,"outputs":[]},{"cell_type":"code","source":"X_train_original = scenario(X_train, original_features)\nX_train_discretized = scenario(X_train, discretized_features)\nX_train_transformed = scenario(X_train, transformed_features)\n\nX_test_original = scenario(X_test, original_features)\nX_test_discretized = scenario(X_test, discretized_features)\nX_test_transformed = scenario(X_test, transformed_features)","metadata":{"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"          age        ed    employ   address   debtinc  income_ratio  \\\n286 -0.757182 -0.781351  0.421454 -0.214252 -0.625682 -3.943046e-01   \n146 -0.882261  2.385406 -1.105097 -0.785197  0.323823 -5.843492e-01   \n214 -0.131785 -0.781351  1.184729 -0.785197  0.027102  9.676817e-01   \n528  1.994565  0.274235  3.474556  0.784903 -0.358634  6.478975e+00   \n165  0.618692  0.274235  0.726764  0.356694  1.288163  1.822882e+00   \n..        ...       ...       ...       ...       ...           ...   \n144  0.743771 -0.781351  1.184729  1.213112 -0.714698  7.459630e-01   \n645 -1.507658  0.274235 -1.257752 -1.070670 -0.937238 -7.756359e-02   \n72   1.494247 -0.781351  2.711280  1.784057  0.383167  1.759534e+00   \n235 -1.382579 -0.781351 -0.189166 -1.213406 -0.551502 -8.377420e-01   \n37  -0.381943  0.274235  0.574109 -1.070670  0.620543  1.563056e-16   \n\n     creddebt_ratio  othdebt_ratio  \n286       -0.316469      -0.657086  \n146       -0.679458      -0.030961  \n214        1.680901       0.306257  \n528        1.890876       4.123403  \n165        3.180092       3.414949  \n..              
...            ...  \n144       -0.632760       0.096423  \n645       -0.256113      -0.783204  \n72         2.095123       1.782697  \n235       -0.580956      -0.774986  \n37         1.180335       0.554345  \n\n[485 rows x 8 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>ed</th>\n      <th>employ</th>\n      <th>address</th>\n      <th>debtinc</th>\n      <th>income_ratio</th>\n      <th>creddebt_ratio</th>\n      <th>othdebt_ratio</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>286</th>\n      <td>-0.757182</td>\n      <td>-0.781351</td>\n      <td>0.421454</td>\n      <td>-0.214252</td>\n      <td>-0.625682</td>\n      <td>-3.943046e-01</td>\n      <td>-0.316469</td>\n      <td>-0.657086</td>\n    </tr>\n    <tr>\n      <th>146</th>\n      <td>-0.882261</td>\n      <td>2.385406</td>\n      <td>-1.105097</td>\n      <td>-0.785197</td>\n      <td>0.323823</td>\n      <td>-5.843492e-01</td>\n      <td>-0.679458</td>\n      <td>-0.030961</td>\n    </tr>\n    <tr>\n      <th>214</th>\n      <td>-0.131785</td>\n      <td>-0.781351</td>\n      <td>1.184729</td>\n      <td>-0.785197</td>\n      <td>0.027102</td>\n      <td>9.676817e-01</td>\n      <td>1.680901</td>\n      <td>0.306257</td>\n    </tr>\n    <tr>\n      <th>528</th>\n      <td>1.994565</td>\n      <td>0.274235</td>\n      <td>3.474556</td>\n      <td>0.784903</td>\n      <td>-0.358634</td>\n      <td>6.478975e+00</td>\n      <td>1.890876</td>\n      <td>4.123403</td>\n    </tr>\n    <tr>\n      <th>165</th>\n      <td>0.618692</td>\n      <td>0.274235</td>\n      <td>0.726764</td>\n      <td>0.356694</td>\n      <td>1.288163</td>\n     
 <td>1.822882e+00</td>\n      <td>3.180092</td>\n      <td>3.414949</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>144</th>\n      <td>0.743771</td>\n      <td>-0.781351</td>\n      <td>1.184729</td>\n      <td>1.213112</td>\n      <td>-0.714698</td>\n      <td>7.459630e-01</td>\n      <td>-0.632760</td>\n      <td>0.096423</td>\n    </tr>\n    <tr>\n      <th>645</th>\n      <td>-1.507658</td>\n      <td>0.274235</td>\n      <td>-1.257752</td>\n      <td>-1.070670</td>\n      <td>-0.937238</td>\n      <td>-7.756359e-02</td>\n      <td>-0.256113</td>\n      <td>-0.783204</td>\n    </tr>\n    <tr>\n      <th>72</th>\n      <td>1.494247</td>\n      <td>-0.781351</td>\n      <td>2.711280</td>\n      <td>1.784057</td>\n      <td>0.383167</td>\n      <td>1.759534e+00</td>\n      <td>2.095123</td>\n      <td>1.782697</td>\n    </tr>\n    <tr>\n      <th>235</th>\n      <td>-1.382579</td>\n      <td>-0.781351</td>\n      <td>-0.189166</td>\n      <td>-1.213406</td>\n      <td>-0.551502</td>\n      <td>-8.377420e-01</td>\n      <td>-0.580956</td>\n      <td>-0.774986</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>-0.381943</td>\n      <td>0.274235</td>\n      <td>0.574109</td>\n      <td>-1.070670</td>\n      <td>0.620543</td>\n      <td>1.563056e-16</td>\n      <td>1.180335</td>\n      <td>0.554345</td>\n    </tr>\n  </tbody>\n</table>\n<p>485 rows × 8 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"code","source":"X_train_original.to_csv('X_train_original.csv', index=True)\nX_train_discretized.to_csv('X_train_discretized.csv', index=True)\nX_train_transformed.to_csv('X_train_transformed.csv', index=True)\nX_test_original.to_csv('X_test_original.csv', index=True)\nX_test_discretized.to_csv('X_test_discretized.csv', 
index=True)\nX_test_transformed.to_csv('X_test_transformed.csv', index=True)\ny_train.to_csv('y_train.csv', index=True)","metadata":{"execution":{"iopub.status.busy":"2024-06-05T16:35:30.701486Z","iopub.execute_input":"2024-06-05T16:35:30.702116Z","iopub.status.idle":"2024-06-05T16:35:30.765648Z","shell.execute_reply.started":"2024-06-05T16:35:30.702072Z","shell.execute_reply":"2024-06-05T16:35:30.764396Z"},"trusted":true},"execution_count":27,"outputs":[]},{"cell_type":"markdown","source":"### Fit the Model and Evaluate It","metadata":{}},{"cell_type":"code","source":"from sklearn.tree import DecisionTreeClassifier\nfrom sklearn.metrics import accuracy_score\n\ndef model(X_train, X_test, y_train, y_test):\n    # Fit a decision tree on one feature scenario and return its test accuracy.\n    # y_test is passed explicitly instead of being read from the global scope.\n\n    # Create a decision tree classifier with a fixed seed for reproducibility\n    clf = DecisionTreeClassifier(random_state=111)\n\n    # Train the classifier on the training data\n    clf.fit(X_train, y_train)\n\n    # Predict the labels for the test set\n    y_pred = clf.predict(X_test)\n\n    # Evaluate the model: fraction of test labels predicted correctly\n    accuracy = accuracy_score(y_test, y_pred)\n\n    return accuracy\n","metadata":{"execution":{"iopub.status.busy":"2024-05-14T15:02:21.676901Z","iopub.execute_input":"2024-05-14T15:02:21.677394Z","iopub.status.idle":"2024-05-14T15:02:21.687952Z","shell.execute_reply.started":"2024-05-14T15:02:21.677315Z","shell.execute_reply":"2024-05-14T15:02:21.686076Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"accuracy_original = model(X_train_original, X_test_original, y_train, y_test)\naccuracy_discretized = model(X_train_discretized, X_test_discretized, y_train, y_test)\naccuracy_transformed = model(X_train_transformed, X_test_transformed, y_train, y_test)\n\nprint(\"accuracy_original:\", accuracy_original)\nprint(\"accuracy_discretized:\", accuracy_discretized)\nprint(\"accuracy_transformed:\", 
accuracy_transformed)","metadata":{"execution":{"iopub.status.busy":"2024-05-14T15:02:55.631780Z","iopub.execute_input":"2024-05-14T15:02:55.632200Z","iopub.status.idle":"2024-05-14T15:02:55.667966Z","shell.execute_reply.started":"2024-05-14T15:02:55.632170Z","shell.execute_reply":"2024-05-14T15:02:55.666813Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_data = pd.concat((X_train_discretized,X_test_discretized), keys=['train','test'])\ny_data = pd.concat((y_train, y_test), keys=['train','test'])\nX_data.to_csv('X_data',index=True)\ny_data.to_csv('y_data_discretized_no_scaling',index=True)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}