{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":2232033,"sourceType":"datasetVersion","datasetId":1340957}],"dockerImageVersionId":30558,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"### Guideline Map: \n\n##### Previous Notebook:\n###### [Project 1: Prediction of drug prescription](https://www.kaggle.com/code/rouzbeh/prediction-of-drug-prescription)\n\n##### Next Notebook:\n###### [Basic Programming in Python: Dive into Python-Section 1](https://www.kaggle.com/code/rouzbeh/dive-into-python-section-1)   \n\n##### Course Content:   \n###### [1. Dive into Data Science: Content -->](https://www.kaggle.com/rouzbeh/1-dive-into-ds-content) ","metadata":{}},{"cell_type":"markdown","source":"# Loan Credit Prediction Overview\nThe objective of this project is to provide a rapid, practical overview of the **CRISP-DM methodology** for data-driven problem-solving. To achieve this, we used **ChatGPT 3.5** to generate Python code through a **step-by-step dialogue** based on **prompts**. The process involved **copying and pasting** the generated Python code into **Kaggle** code cells and then executing it.\n\nThis approach is particularly beneficial for individuals who are familiar with **Data Science challenges** and aim to solve problems **without in-depth knowledge** of techniques, algorithms, and coding. It streamlines the implementation process and allows for a quick understanding of the data science workflow.","metadata":{}},{"cell_type":"markdown","source":"# About Loan Dataset\n\nThis project uses publicly available data from LendingClub.com. 
Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back.\n\nWe will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full.\n\n### Here is what the columns represent:\n\n**credit.policy:** 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.  \n\n**purpose:** The purpose of the loan (takes values \"credit_card\", \"debt_consolidation\", \"educational\", \"major_purchase\", \"small_business\", and \"all_other\").  \n\n**int.rate:** The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.  \n\n**installment:** The monthly installments owed by the borrower if the loan is funded.  \n\n**log.annual.inc:** The natural log of the self-reported annual income of the borrower.  \n\n**dti:** The debt-to-income ratio of the borrower (amount of debt divided by annual income).  \n\n**fico:** The FICO credit score of the borrower.  \n\n**days.with.cr.line:** The number of days the borrower has had a credit line.  \n\n**revol.bal:** The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).  \n\n**revol.util:** The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).  \n\n**inq.last.6mths:** The borrower's number of inquiries by creditors in the last 6 months.  \n\n**delinq.2yrs:** The number of times the borrower had been 30+ days past due on a payment in the past 2 years.  \n\n**pub.rec:** The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).  
\n\n**not.fully.paid:** The target variable, indicating whether the borrower failed to pay the loan back in full.","metadata":{}},{"cell_type":"markdown","source":"**Prompt:**  \nHey GPT! I am beginner in data science skills. Please act as a senior data scientist and help me in problem solving with python. I have a dataset in kaggle profile (csv file) about loan data.\nFirst, i want read data in python and i certain that its true. Please give me python code with comment and explain.","metadata":{}},{"cell_type":"code","source":"# Import the Pandas library\nimport pandas as pd\n\n# Define the file path to the dataset (adjust it if your copy lives elsewhere)\nfile_path = '/kaggle/input/loan-data/loan_data.csv'\n\n# Use Pandas to read the CSV file into a DataFrame\ntry:\n    # Attempt to read the CSV file\n    df = pd.read_csv(file_path)\n\n    # If successful, display the first few rows of the dataset to verify\n    print(\"Data successfully loaded. Here are the first few rows:\")\n    print(df.head())\n\n    # Display basic information about the dataset\n    # (df.info() prints its report itself and returns None, so it is not wrapped in print)\n    print(\"\\nDataset Info:\")\n    df.info()\n\n    # Display summary statistics for numerical columns\n    print(\"\\nSummary Statistics:\")\n    print(df.describe())\n\n    # Check for missing values\n    print(\"\\nMissing Values:\")\n    print(df.isnull().sum())\n\nexcept FileNotFoundError:\n    # Handle the case where the file is not found\n    print(\"File not found. Please check the file path.\")\nexcept Exception as e:\n    # Handle other potential exceptions\n    print(f\"An error occurred: {str(e)}\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:56.636014Z","iopub.execute_input":"2023-11-25T17:50:56.636485Z","iopub.status.idle":"2023-11-25T17:50:57.276275Z","shell.execute_reply.started":"2023-11-25T17:50:56.636446Z","shell.execute_reply":"2023-11-25T17:50:57.275325Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nOK! 
I want change credit.policy and not.fully.paid features type int64 to boolean with True and False values.","metadata":{}},{"cell_type":"code","source":"# Convert 'credit.policy' and 'not.fully.paid' columns to boolean type\ndf['credit.policy'] = df['credit.policy'].astype(bool)\ndf['not.fully.paid'] = df['not.fully.paid'].astype(bool)\n\n# Verify the changes\nprint(df[['credit.policy', 'not.fully.paid']].dtypes)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.277887Z","iopub.execute_input":"2023-11-25T17:50:57.278704Z","iopub.status.idle":"2023-11-25T17:50:57.294001Z","shell.execute_reply.started":"2023-11-25T17:50:57.278673Z","shell.execute_reply":"2023-11-25T17:50:57.292000Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNext, i want sorted unique values for features {inq.last.6mths, delinq.2yrs, pub.rec} include percentage per values.","metadata":{}},{"cell_type":"code","source":"# List of features to analyze\nfeatures_to_analyze = ['inq.last.6mths', 'delinq.2yrs', 'pub.rec']\n\n# Loop through each feature\nfor feature in features_to_analyze:\n    # Get unique values and their counts\n    unique_values_counts = df[feature].value_counts()\n\n    # Calculate the percentage per unique value\n    percentages = (unique_values_counts / unique_values_counts.sum()) * 100\n\n    # Sort the unique values in ascending order\n    unique_values_counts = unique_values_counts.sort_index()\n    percentages = percentages.sort_index()\n\n    # Print the results\n    print(f\"Feature: {feature}\")\n    print(\"Unique Values\\tCounts\\tPercentage\")\n    for value, count, percentage in zip(unique_values_counts.index, unique_values_counts, percentages):\n        print(f\"{value}\\t\\t{count}\\t\\t{percentage:.2f}%\")\n    
print(\"\\n\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.295324Z","iopub.execute_input":"2023-11-25T17:50:57.295767Z","iopub.status.idle":"2023-11-25T17:50:57.314217Z","shell.execute_reply.started":"2023-11-25T17:50:57.295728Z","shell.execute_reply":"2023-11-25T17:50:57.313088Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want custom binning method for inq.last.6mths feature: (4 bins)  \n\nbin1 : value =0,  \nbin2 : value =1 & 2,  \nbin3 : value =3 & 4 & 5,  \nbin4 : value > 5  \n\nAfter that i want add to my data frame and show distribution of new discrete labels with percentage.","metadata":{}},{"cell_type":"code","source":"# Define custom bin labels and bins\ncustom_bins = [-1, 0, 2, 5, float('inf')]\nbin_labels = ['bin1', 'bin2', 'bin3', 'bin4']\n\n# Create a new column 'inq.last.6mths_bin' with custom bins\ndf['inq.last.6mths_bin'] = pd.cut(df['inq.last.6mths'], bins=custom_bins, labels=bin_labels)\n\n# Calculate the distribution of the new discrete labels\nlabel_distribution = df['inq.last.6mths_bin'].value_counts()\n\n# Calculate the percentage distribution\npercentage_distribution = (label_distribution / label_distribution.sum()) * 100\n\n# Sort the values based on the custom bin labels\nlabel_distribution = label_distribution[bin_labels]\npercentage_distribution = percentage_distribution[bin_labels]\n\n# Print the distribution of new discrete labels with percentages\nprint(\"Distribution of New Discrete Labels:\")\nprint(\"Label\\tCounts\\tPercentage\")\nfor label, count, percentage in zip(label_distribution.index, label_distribution, percentage_distribution):\n    print(f\"{label}\\t{count}\\t{percentage:.2f}%\")\n\n# Remove the temporary 'inq.last.6mths_bin' column if you don't need it\n# df.drop('inq.last.6mths_bin', axis=1, 
inplace=True)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.316288Z","iopub.execute_input":"2023-11-25T17:50:57.317184Z","iopub.status.idle":"2023-11-25T17:50:57.342502Z","shell.execute_reply.started":"2023-11-25T17:50:57.317147Z","shell.execute_reply":"2023-11-25T17:50:57.341318Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want custom binning method for delinq.2yrs feature: (3 bins)  \n\nbin1 : value =0,  \nbin2 : value =1,  \nbin3 : value > 1  \n\nAfter that i want add to my data frame and show distribution of new discrete labels with percentage.","metadata":{}},{"cell_type":"code","source":"# Define custom bin labels and bins\ncustom_bins_delinq = [-1, 0, 1, float('inf')]\nbin_labels_delinq = ['bin1', 'bin2', 'bin3']\n\n# Create a new column 'delinq.2yrs_bin' with custom bins\ndf['delinq.2yrs_bin'] = pd.cut(df['delinq.2yrs'], bins=custom_bins_delinq, labels=bin_labels_delinq)\n\n# Calculate the distribution of the new discrete labels\nlabel_distribution_delinq = df['delinq.2yrs_bin'].value_counts()\n\n# Calculate the percentage distribution\npercentage_distribution_delinq = (label_distribution_delinq / label_distribution_delinq.sum()) * 100\n\n# Sort the values based on the custom bin labels\nlabel_distribution_delinq = label_distribution_delinq[bin_labels_delinq]\npercentage_distribution_delinq = percentage_distribution_delinq[bin_labels_delinq]\n\n# Print the distribution of new discrete labels with percentages\nprint(\"Distribution of New Discrete Labels for 'delinq.2yrs':\")\nprint(\"Label\\tCounts\\tPercentage\")\nfor label, count, percentage in zip(label_distribution_delinq.index, label_distribution_delinq, percentage_distribution_delinq):\n    print(f\"{label}\\t{count}\\t{percentage:.2f}%\")\n\n# Remove the temporary 'delinq.2yrs_bin' column if you don't need it\n# df.drop('delinq.2yrs_bin', axis=1, 
inplace=True)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.345782Z","iopub.execute_input":"2023-11-25T17:50:57.346122Z","iopub.status.idle":"2023-11-25T17:50:57.362925Z","shell.execute_reply.started":"2023-11-25T17:50:57.346093Z","shell.execute_reply":"2023-11-25T17:50:57.361886Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want custom binning method for pub.rec feature: (2 bins)  \n\nbin1 : value =0,  \nbin2 : value > 0  \n\nAfter that i want add to my data frame and show distribution of new discrete labels with percentage.","metadata":{}},{"cell_type":"code","source":"# Define custom bin labels and bins\ncustom_bins_pub_rec = [-1, 0, float('inf')]\nbin_labels_pub_rec = ['bin1', 'bin2']\n\n# Create a new column 'pub.rec_bin' with custom bins\ndf['pub.rec_bin'] = pd.cut(df['pub.rec'], bins=custom_bins_pub_rec, labels=bin_labels_pub_rec)\n\n# Calculate the distribution of the new discrete labels\nlabel_distribution_pub_rec = df['pub.rec_bin'].value_counts()\n\n# Calculate the percentage distribution\npercentage_distribution_pub_rec = (label_distribution_pub_rec / label_distribution_pub_rec.sum()) * 100\n\n# Sort the values based on the custom bin labels\nlabel_distribution_pub_rec = label_distribution_pub_rec[bin_labels_pub_rec]\npercentage_distribution_pub_rec = percentage_distribution_pub_rec[bin_labels_pub_rec]\n\n# Print the distribution of new discrete labels with percentages\nprint(\"Distribution of New Discrete Labels for 'pub.rec':\")\nprint(\"Label\\tCounts\\tPercentage\")\nfor label, count, percentage in zip(label_distribution_pub_rec.index, label_distribution_pub_rec, percentage_distribution_pub_rec):\n    print(f\"{label}\\t{count}\\t{percentage:.2f}%\")\n\n# Remove the temporary 'pub.rec_bin' column if you don't need it\n# df.drop('pub.rec_bin', axis=1, 
inplace=True)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.364424Z","iopub.execute_input":"2023-11-25T17:50:57.364789Z","iopub.status.idle":"2023-11-25T17:50:57.383933Z","shell.execute_reply.started":"2023-11-25T17:50:57.364759Z","shell.execute_reply":"2023-11-25T17:50:57.382695Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNow, i want bar chart for my categorical features (types are object and boolean).","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt\nimport seaborn as sns\n\n# List of categorical features (including boolean features)\ncategorical_features = ['purpose', 'credit.policy', 'not.fully.paid', 'inq.last.6mths_bin', 'delinq.2yrs_bin', 'pub.rec_bin']\n\n# Create subplots for each categorical feature\nplt.figure(figsize=(16, 10))\n\nfor i, feature in enumerate(categorical_features, 1):\n    plt.subplot(2, 3, i)\n    plt.title(f'Bar Chart for {feature}')\n    \n    if feature in ['credit.policy', 'not.fully.paid']:\n        sns.countplot(data=df, x=feature, palette='Set2')\n    else:\n        sns.countplot(data=df, x=feature, palette='viridis')\n    \n    plt.xticks(rotation=45)\n\nplt.tight_layout()\nplt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:57.385571Z","iopub.execute_input":"2023-11-25T17:50:57.386009Z","iopub.status.idle":"2023-11-25T17:50:59.763435Z","shell.execute_reply.started":"2023-11-25T17:50:57.385969Z","shell.execute_reply":"2023-11-25T17:50:59.762192Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI see in revol.util feature some records are over 100%. 
please show me these records.","metadata":{}},{"cell_type":"code","source":"# Filter and display records where 'revol.util' is greater than 100%\ninvalid_revol_util_records = df[df['revol.util'] > 100]\n\n# Print the records\nprint(\"Records with 'revol.util' > 100%:\")\nprint(invalid_revol_util_records)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:59.765176Z","iopub.execute_input":"2023-11-25T17:50:59.765895Z","iopub.status.idle":"2023-11-25T17:50:59.792798Z","shell.execute_reply.started":"2023-11-25T17:50:59.765843Z","shell.execute_reply":"2023-11-25T17:50:59.791497Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want change these values to value=100 in original dataset and show them.","metadata":{}},{"cell_type":"code","source":"# Remember which records have 'revol.util' greater than 100% before changing them\nover_100 = df['revol.util'] > 100\n\n# Cap those records at 100%\ndf.loc[over_100, 'revol.util'] = 100\n\n# Display the modified records (selected with the saved mask, since none exceed 100 anymore)\nprint(\"Records after changing 'revol.util' values:\")\nprint(df[over_100])\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:59.794351Z","iopub.execute_input":"2023-11-25T17:50:59.794674Z","iopub.status.idle":"2023-11-25T17:50:59.804170Z","shell.execute_reply.started":"2023-11-25T17:50:59.794645Z","shell.execute_reply":"2023-11-25T17:50:59.802621Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNow i want histogram for all numerical features in dataset (types are int and float)","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt\n\n# Select only numerical columns\nnumerical_columns = df.select_dtypes(include=['int64', 'float64'])\n\n# Set the number of bins for the histograms\nnum_bins = 30  # You can adjust this value based on your preference\n\n# Calculate the number of rows and columns for subplots\nnum_features = len(numerical_columns.columns)\nnum_rows = (num_features - 1) // 3 + 
1\nnum_cols = min(3, num_features)\n\n# Create subplots for each numerical feature\nplt.figure(figsize=(16, 4 * num_rows))  # Adjust the height based on the number of rows\nfor i, column in enumerate(numerical_columns.columns, 1):\n    plt.subplot(num_rows, num_cols, i)\n    plt.title(f'Histogram for {column}')\n    plt.hist(df[column], bins=num_bins, edgecolor='k')\n    plt.xlabel(column)\n    plt.ylabel('Frequency')\n\nplt.tight_layout()\nplt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:50:59.805987Z","iopub.execute_input":"2023-11-25T17:50:59.806614Z","iopub.status.idle":"2023-11-25T17:51:03.226526Z","shell.execute_reply.started":"2023-11-25T17:50:59.806570Z","shell.execute_reply":"2023-11-25T17:51:03.225303Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNext, i want logarithm transformation on revol.bal and show histogram of this.","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt\nimport numpy as np\n\n# Apply logarithm transformation to 'revol.bal'\ndf['log_revol_bal'] = np.log1p(df['revol.bal'])\n\n# Create a histogram of the transformed values\nplt.figure(figsize=(8, 6))\nplt.title('Histogram of Log-transformed revol.bal')\nplt.hist(df['log_revol_bal'], bins=20, edgecolor='k')\nplt.xlabel('Log-transformed revol.bal')\nplt.ylabel('Frequency')\nplt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:03.228134Z","iopub.execute_input":"2023-11-25T17:51:03.229319Z","iopub.status.idle":"2023-11-25T17:51:03.510870Z","shell.execute_reply.started":"2023-11-25T17:51:03.229277Z","shell.execute_reply":"2023-11-25T17:51:03.509421Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nShow me correlation matrix  between all numerical features and run spearman statistical test for them. 
","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nimport scipy.stats as stats\n\n# Select only numerical columns\nnumerical_columns = df.select_dtypes(include=['int64', 'float64'])\n\n# Calculate the correlation matrix using the Spearman method\ncorrelation_matrix = numerical_columns.corr(method='spearman')\n\n# Display the correlation matrix\nprint(\"Spearman Correlation Matrix:\")\nprint(correlation_matrix)\n\n# Perform Spearman rank correlation test\nprint(\"\\nSpearman Rank Correlation Test:\")\nfor column1 in numerical_columns.columns:\n    for column2 in numerical_columns.columns:\n        if column1 != column2:\n            correlation, p_value = stats.spearmanr(numerical_columns[column1], numerical_columns[column2])\n            print(f\"{column1} vs. {column2}:\")\n            print(f\"  - Spearman correlation: {correlation:.4f}\")\n            print(f\"  - p-value: {p_value:.4f}\\n\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:03.512475Z","iopub.execute_input":"2023-11-25T17:51:03.513197Z","iopub.status.idle":"2023-11-25T17:51:03.912574Z","shell.execute_reply.started":"2023-11-25T17:51:03.513159Z","shell.execute_reply":"2023-11-25T17:51:03.911218Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want a heatmap for correlation matrix","metadata":{}},{"cell_type":"code","source":"import seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Create a heatmap for the correlation matrix\nplt.figure(figsize=(10, 8))\nsns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)\nplt.title('Spearman Correlation 
Heatmap')\nplt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:03.914050Z","iopub.execute_input":"2023-11-25T17:51:03.914418Z","iopub.status.idle":"2023-11-25T17:51:04.807181Z","shell.execute_reply.started":"2023-11-25T17:51:03.914387Z","shell.execute_reply":"2023-11-25T17:51:04.805870Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNow i want one hot encoding for purpose and add to dataset.","metadata":{}},{"cell_type":"code","source":"# Perform one-hot encoding for the 'purpose' column\none_hot_encoded_purpose = pd.get_dummies(df['purpose'], prefix='purpose')\n\n# Concatenate the one-hot encoded columns with the original DataFrame\ndf = pd.concat([df, one_hot_encoded_purpose], axis=1)\n\n# Drop the original 'purpose' column, as it's no longer needed\ndf.drop('purpose', axis=1, inplace=True)\n\n# Display the updated DataFrame with one-hot encoding\nprint(df.head())\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:04.812707Z","iopub.execute_input":"2023-11-25T17:51:04.813086Z","iopub.status.idle":"2023-11-25T17:51:04.837644Z","shell.execute_reply.started":"2023-11-25T17:51:04.813052Z","shell.execute_reply":"2023-11-25T17:51:04.836455Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"df.info()","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:04.839604Z","iopub.execute_input":"2023-11-25T17:51:04.840500Z","iopub.status.idle":"2023-11-25T17:51:04.859418Z","shell.execute_reply.started":"2023-11-25T17:51:04.840459Z","shell.execute_reply":"2023-11-25T17:51:04.858229Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want drop features:{revol.bal, inq.last.6mths, delinq.2yrs, pub.rec} from df.","metadata":{}},{"cell_type":"code","source":"# Drop the {revol.bal, inq.last.6mths, delinq.2yrs, pub.rec} columns from the DataFrame\ndf.drop(['revol.bal', 'inq.last.6mths', 
'delinq.2yrs', 'pub.rec'], axis=1, inplace=True)\n\n# Display the updated DataFrame\nprint(df.head())\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:04.861364Z","iopub.execute_input":"2023-11-25T17:51:04.862229Z","iopub.status.idle":"2023-11-25T17:51:04.883552Z","shell.execute_reply.started":"2023-11-25T17:51:04.862186Z","shell.execute_reply":"2023-11-25T17:51:04.882289Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNext, i want lable encoding for category type features.","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import LabelEncoder\n\n# List of categorical columns to label encode\ncategorical_columns_to_encode = ['inq.last.6mths_bin', 'delinq.2yrs_bin', 'pub.rec_bin']\n\n# Initialize the LabelEncoder\nlabel_encoder = LabelEncoder()\n\n# Apply label encoding to each specified column\nfor column in categorical_columns_to_encode:\n    df[column] = label_encoder.fit_transform(df[column])\n\n# Display the updated DataFrame with label encoding applied\nprint(df.head())\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:04.885067Z","iopub.execute_input":"2023-11-25T17:51:04.885440Z","iopub.status.idle":"2023-11-25T17:51:04.998426Z","shell.execute_reply.started":"2023-11-25T17:51:04.885407Z","shell.execute_reply":"2023-11-25T17:51:04.997116Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nNext, split all data to train and test set for modeling and evaluation and than show me its doing correct or not.","metadata":{}},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\n\n# Define your target variable (not.fully.paid) and features\nX = df.drop('not.fully.paid', axis=1)\ny = df['not.fully.paid']\n\n# Split the data into a training set (80%) and a test set (20%)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Check the 
shapes of the resulting sets to verify the split\nprint(f\"X_train shape: {X_train.shape}\")\nprint(f\"X_test shape: {X_test.shape}\")\nprint(f\"y_train shape: {y_train.shape}\")\nprint(f\"y_test shape: {y_test.shape}\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:05.000200Z","iopub.execute_input":"2023-11-25T17:51:05.000687Z","iopub.status.idle":"2023-11-25T17:51:05.139194Z","shell.execute_reply.started":"2023-11-25T17:51:05.000642Z","shell.execute_reply":"2023-11-25T17:51:05.137793Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nI want doing normalization on just numerical features (int and float) in training set, and apply that on numerical features in test set.","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import StandardScaler\n\n# Select only numerical columns in the training and test sets\nnumerical_columns = X_train.select_dtypes(include=['int64', 'float64'])\nnumerical_columns_test = X_test.select_dtypes(include=['int64', 'float64'])\n\n# Initialize the StandardScaler\nscaler = StandardScaler()\n\n# Fit the scaler on the training data and transform it\nX_train[numerical_columns.columns] = scaler.fit_transform(numerical_columns)\n\n# Transform the test data using the same scaler\nX_test[numerical_columns_test.columns] = 
scaler.transform(numerical_columns_test)\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:05.141541Z","iopub.execute_input":"2023-11-25T17:51:05.142025Z","iopub.status.idle":"2023-11-25T17:51:05.162776Z","shell.execute_reply.started":"2023-11-25T17:51:05.141971Z","shell.execute_reply":"2023-11-25T17:51:05.161857Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_train.head()","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:05.164643Z","iopub.execute_input":"2023-11-25T17:51:05.165518Z","iopub.status.idle":"2023-11-25T17:51:05.197133Z","shell.execute_reply.started":"2023-11-25T17:51:05.165477Z","shell.execute_reply":"2023-11-25T17:51:05.195840Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_test.head()","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:05.198967Z","iopub.execute_input":"2023-11-25T17:51:05.199441Z","iopub.status.idle":"2023-11-25T17:51:05.227132Z","shell.execute_reply.started":"2023-11-25T17:51:05.199400Z","shell.execute_reply":"2023-11-25T17:51:05.225658Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nPlease recommend me 3 best models for my problem.","metadata":{}},{"cell_type":"markdown","source":"**GPT Response:**  \n\nTo recommend three machine learning models for your problem of predicting whether a borrower is not fully paid (a binary classification problem), I'll suggest some commonly used models known for their effectiveness in such tasks. You can try these and compare their performance to see which one works best for your specific dataset:\n\n**Logistic Regression:**\n\n1. Logistic Regression is a simple yet powerful algorithm for binary classification problems like yours.\n2. It's interpretable, easy to implement, and serves as a good baseline model.\n3. Start with Logistic Regression to establish a baseline performance.  \n\n**Random Forest Classifier:**\n\n1. 
Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.\n2. It often performs well \"out of the box\" without much tuning.\n3. Random Forest can handle both numerical and categorical features, making it suitable for your dataset.  \n\n**Gradient Boosting Classifier (e.g., XGBoost or LightGBM):**\n\n1. Gradient Boosting is another ensemble technique that builds an ensemble of decision trees in a sequential manner.\n2. Models like XGBoost or LightGBM are popular choices and are known for their excellent predictive performance.\n3. They can handle complex relationships in the data and often yield top-tier results.  \n\nYou can start by implementing and training these three models on your dataset. Be sure to perform proper preprocessing, feature engineering, and hyperparameter tuning to get the best performance. Additionally, consider evaluating the models using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC AUC to assess their performance.\n\nRemember that the choice of the best model depends on your dataset's characteristics, so it's a good practice to experiment with different algorithms and select the one that provides the highest predictive accuracy and meets your specific business or research goals.","metadata":{}},{"cell_type":"markdown","source":"**Prompt:**  \nPlease run 3 models that recommend me and evaluate thats.","metadata":{}},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier\nfrom xgboost import XGBClassifier\nfrom sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n\n# Split the data into training and test sets (if not already done)\n# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Initialize and train the models\nmodels = {\n    'Logistic Regression': 
LogisticRegression(random_state=42),\n    'Random Forest': RandomForestClassifier(random_state=42),\n    'XGBoost': XGBClassifier(random_state=42)\n}\n\nfor name, model in models.items():\n    model.fit(X_train, y_train)\n    y_pred = model.predict(X_test)\n    \n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    precision = precision_score(y_test, y_pred)\n    recall = recall_score(y_test, y_pred)\n    f1 = f1_score(y_test, y_pred)\n    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])\n    \n    # Print the evaluation metrics\n    print(f\"Model: {name}\")\n    print(f\"Accuracy: {accuracy:.4f}\")\n    print(f\"Precision: {precision:.4f}\")\n    print(f\"Recall: {recall:.4f}\")\n    print(f\"F1 Score: {f1:.4f}\")\n    print(f\"ROC AUC Score: {roc_auc:.4f}\")\n    print(\"-----------------------------\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:05.228800Z","iopub.execute_input":"2023-11-25T17:51:05.229156Z","iopub.status.idle":"2023-11-25T17:51:09.062164Z","shell.execute_reply.started":"2023-11-25T17:51:05.229127Z","shell.execute_reply":"2023-11-25T17:51:09.061155Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nPlease show me confusion matrix for test set.","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Assuming you have already trained a model (e.g., Logistic Regression)\nmodel = LogisticRegression(random_state=42)\nmodel.fit(X_train, y_train)\n\n# Make predictions on the test data\ny_pred = model.predict(X_test)\n\n# Generate a confusion matrix\ncm = confusion_matrix(y_test, y_pred)\n\n# Create a heatmap for the confusion matrix\nplt.figure(figsize=(8, 6))\nsns.heatmap(cm, annot=True, fmt=\"d\", cmap=\"Blues\", cbar=False)\nplt.xlabel('Predicted')\nplt.ylabel('Actual')\nplt.title('Confusion 
Matrix')\nplt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:09.066434Z","iopub.execute_input":"2023-11-25T17:51:09.067452Z","iopub.status.idle":"2023-11-25T17:51:09.379741Z","shell.execute_reply.started":"2023-11-25T17:51:09.067408Z","shell.execute_reply":"2023-11-25T17:51:09.378487Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**   \nI want over sampling on minor class of training set, and then running all models and evaluate again.","metadata":{}},{"cell_type":"code","source":"from imblearn.over_sampling import SMOTE\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier\nfrom xgboost import XGBClassifier\nfrom sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n\n# Split the data into training and test sets (if not already done)\n# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Apply SMOTE to oversample the minority class in the training data\nsmote = SMOTE(random_state=42)\nX_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)\n\n# Initialize and train the models\nmodels = {\n    'Logistic Regression': LogisticRegression(random_state=42),\n    'Random Forest': RandomForestClassifier(random_state=42),\n    'XGBoost': XGBClassifier(random_state=42)\n}\n\nfor name, model in models.items():\n    model.fit(X_train_resampled, y_train_resampled)\n    y_pred = model.predict(X_test)\n    \n    # Evaluate the model\n    accuracy = accuracy_score(y_test, y_pred)\n    precision = precision_score(y_test, y_pred)\n    recall = recall_score(y_test, y_pred)\n    f1 = f1_score(y_test, y_pred)\n    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])\n    \n    # Print the evaluation metrics\n    print(f\"Model: {name}\")\n    print(f\"Accuracy: {accuracy:.4f}\")\n    
print(f\"Precision: {precision:.4f}\")\n    print(f\"Recall: {recall:.4f}\")\n    print(f\"F1 Score: {f1:.4f}\")\n    print(f\"ROC AUC Score: {roc_auc:.4f}\")\n    print(\"-----------------------------\")\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:09.381212Z","iopub.execute_input":"2023-11-25T17:51:09.381580Z","iopub.status.idle":"2023-11-25T17:51:15.005268Z","shell.execute_reply.started":"2023-11-25T17:51:09.381551Z","shell.execute_reply":"2023-11-25T17:51:15.004309Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Prompt:**  \nPlease show the new confusion matrices for the 3 models on the test set.","metadata":{}},{"cell_type":"code","source":"from imblearn.over_sampling import SMOTE\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier\nfrom xgboost import XGBClassifier\nfrom sklearn.metrics import confusion_matrix\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Split the data into training and test sets (if not already done)\n# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Apply SMOTE to oversample the minority class in the training data\nsmote = SMOTE(random_state=42)\nX_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)\n\n# Initialize and train the models\nmodels = {\n    'Logistic Regression': LogisticRegression(random_state=42),\n    'Random Forest': RandomForestClassifier(random_state=42),\n    'XGBoost': XGBClassifier(random_state=42)\n}\n\nfor name, model in models.items():\n    model.fit(X_train_resampled, y_train_resampled)\n    y_pred = model.predict(X_test)\n    \n    # Generate a confusion matrix\n    cm = confusion_matrix(y_test, y_pred)\n    \n    # Create a heatmap for the confusion matrix\n    plt.figure(figsize=(8, 6))\n    sns.heatmap(cm, annot=True, fmt=\"d\", cmap=\"Blues\", cbar=False)\n    
plt.xlabel('Predicted')\n    plt.ylabel('Actual')\n    plt.title(f'Confusion Matrix for {name}')\n    plt.show()\n","metadata":{"execution":{"iopub.status.busy":"2023-11-25T17:51:15.006895Z","iopub.execute_input":"2023-11-25T17:51:15.007769Z","iopub.status.idle":"2023-11-25T17:51:20.822502Z","shell.execute_reply.started":"2023-11-25T17:51:15.007732Z","shell.execute_reply":"2023-11-25T17:51:20.821293Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Guideline Map: \n\n##### Previous Notebook:\n###### [Project 1: Prediction of drug prescription](https://www.kaggle.com/code/rouzbeh/prediction-of-drug-prescription)\n\n##### Next Notebook:\n###### [Basic Programming in Python: Dive into Python-Section 1](https://www.kaggle.com/code/rouzbeh/dive-into-python-section-1)   \n\n##### Course Content:   \n###### [1. Dive into Data Science: Content -->](https://www.kaggle.com/rouzbeh/1-dive-into-ds-content)  \n\n##### Next Courses:   \n###### [2. Statistics for Data Science -->](https://www.kaggle.com/code/rouzbeh/2-statistics-for-ds-content)  \n###### [3. Data Mining and Applied Machine Learning -->](https://www.kaggle.com/code/rouzbeh/3-data-mining-applied-machine-learning-content)","metadata":{}}]}