{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"Naive Bayes is a popular and **simple** machine learning algorithm based on **Bayes' theorem**, with a **strong assumption of feature independence** given the class label. Despite its simplicity, Naive Bayes often performs **surprisingly well** in practice, particularly on **text classification** tasks such as spam detection and sentiment analysis. One of its key advantages is its **data efficiency**: it needs relatively **little training data** to estimate its parameters. It is also **computationally cheap** to train and can handle **large feature spaces**. Another advantage is its **interpretability**, as the probabilistic nature of the model makes its predictions easy to understand. However, Naive Bayes may suffer from its overly simplistic assumption of feature independence, which **often does not hold in real-world** datasets. This can lead to suboptimal performance, especially in domains with highly correlated features. Additionally, Naive Bayes tends to be outperformed by more complex models when the dataset is large and diverse. 
Despite its limitations, Naive Bayes remains a **useful and widely used algorithm**, particularly in scenarios where simplicity and **speed are prioritized over predictive accuracy**.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"code","source":"import pandas as pd\n\n# Read the training datasets\nX_train_original = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_train_original.csv')\nX_train_original = X_train_original.set_index('Unnamed: 0')\n\nX_train_transformed = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_train_transformed.csv')\nX_train_transformed = X_train_transformed.set_index('Unnamed: 0')\n\n# Read the test datasets\nX_test_original = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_test_original.csv')\nX_test_original = X_test_original.set_index('Unnamed: 0')\n\nX_test_transformed = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_test_transformed.csv')\nX_test_transformed = X_test_transformed.set_index('Unnamed: 0')\n\n# Read the target variable for training and testing\ny_train = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/y_train.csv')\ny_train = y_train.set_index('Unnamed: 0').squeeze()\n\ny_test = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/y_test.csv')\ny_test = y_test.set_index('Unnamed: 0').squeeze() ","metadata":{"trusted":true},"execution_count":68,"outputs":[{"execution_count":68,"output_type":"execute_result","data":{"text/plain":"                 age        ed    employ   address   debtinc  income_ratio  \\\nUnnamed: 0                                                                   \n286        -0.757182 -0.781351  0.421454 -0.214252 -0.625682 -3.943046e-01   \n146        -0.882261  2.385406 -1.105097 -0.785197  0.323823 -5.843492e-01   \n214        -0.131785 -0.781351  1.184729 -0.785197  0.027102  9.676817e-01   \n528         1.994565  0.274235  3.474556  0.784903 -0.358634  6.478975e+00   \n165    
     0.618692  0.274235  0.726764  0.356694  1.288163  1.822882e+00   \n...              ...       ...       ...       ...       ...           ...   \n144         0.743771 -0.781351  1.184729  1.213112 -0.714698  7.459630e-01   \n645        -1.507658  0.274235 -1.257752 -1.070670 -0.937238 -7.756359e-02   \n72          1.494247 -0.781351  2.711280  1.784057  0.383167  1.759534e+00   \n235        -1.382579 -0.781351 -0.189166 -1.213406 -0.551502 -8.377420e-01   \n37         -0.381943  0.274235  0.574109 -1.070670  0.620543  1.563056e-16   \n\n            creddebt_ratio  othdebt_ratio  \nUnnamed: 0                                 \n286              -0.316469      -0.657086  \n146              -0.679458      -0.030961  \n214               1.680901       0.306257  \n528               1.890876       4.123403  \n165               3.180092       3.414949  \n...                    ...            ...  \n144              -0.632760       0.096423  \n645              -0.256113      -0.783204  \n72                2.095123       1.782697  \n235              -0.580956      -0.774986  \n37                1.180335       0.554345  \n\n[485 rows x 8 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>ed</th>\n      <th>employ</th>\n      <th>address</th>\n      <th>debtinc</th>\n      <th>income_ratio</th>\n      <th>creddebt_ratio</th>\n      <th>othdebt_ratio</th>\n    </tr>\n    <tr>\n      <th>Unnamed: 0</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>286</th>\n      
<td>-0.757182</td>\n      <td>-0.781351</td>\n      <td>0.421454</td>\n      <td>-0.214252</td>\n      <td>-0.625682</td>\n      <td>-3.943046e-01</td>\n      <td>-0.316469</td>\n      <td>-0.657086</td>\n    </tr>\n    <tr>\n      <th>146</th>\n      <td>-0.882261</td>\n      <td>2.385406</td>\n      <td>-1.105097</td>\n      <td>-0.785197</td>\n      <td>0.323823</td>\n      <td>-5.843492e-01</td>\n      <td>-0.679458</td>\n      <td>-0.030961</td>\n    </tr>\n    <tr>\n      <th>214</th>\n      <td>-0.131785</td>\n      <td>-0.781351</td>\n      <td>1.184729</td>\n      <td>-0.785197</td>\n      <td>0.027102</td>\n      <td>9.676817e-01</td>\n      <td>1.680901</td>\n      <td>0.306257</td>\n    </tr>\n    <tr>\n      <th>528</th>\n      <td>1.994565</td>\n      <td>0.274235</td>\n      <td>3.474556</td>\n      <td>0.784903</td>\n      <td>-0.358634</td>\n      <td>6.478975e+00</td>\n      <td>1.890876</td>\n      <td>4.123403</td>\n    </tr>\n    <tr>\n      <th>165</th>\n      <td>0.618692</td>\n      <td>0.274235</td>\n      <td>0.726764</td>\n      <td>0.356694</td>\n      <td>1.288163</td>\n      <td>1.822882e+00</td>\n      <td>3.180092</td>\n      <td>3.414949</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>144</th>\n      <td>0.743771</td>\n      <td>-0.781351</td>\n      <td>1.184729</td>\n      <td>1.213112</td>\n      <td>-0.714698</td>\n      <td>7.459630e-01</td>\n      <td>-0.632760</td>\n      <td>0.096423</td>\n    </tr>\n    <tr>\n      <th>645</th>\n      <td>-1.507658</td>\n      <td>0.274235</td>\n      <td>-1.257752</td>\n      <td>-1.070670</td>\n      <td>-0.937238</td>\n      <td>-7.756359e-02</td>\n      <td>-0.256113</td>\n      <td>-0.783204</td>\n    </tr>\n    <tr>\n      <th>72</th>\n      <td>1.494247</td>\n      <td>-0.781351</td>\n      
<td>2.711280</td>\n      <td>1.784057</td>\n      <td>0.383167</td>\n      <td>1.759534e+00</td>\n      <td>2.095123</td>\n      <td>1.782697</td>\n    </tr>\n    <tr>\n      <th>235</th>\n      <td>-1.382579</td>\n      <td>-0.781351</td>\n      <td>-0.189166</td>\n      <td>-1.213406</td>\n      <td>-0.551502</td>\n      <td>-8.377420e-01</td>\n      <td>-0.580956</td>\n      <td>-0.774986</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>-0.381943</td>\n      <td>0.274235</td>\n      <td>0.574109</td>\n      <td>-1.070670</td>\n      <td>0.620543</td>\n      <td>1.563056e-16</td>\n      <td>1.180335</td>\n      <td>0.554345</td>\n    </tr>\n  </tbody>\n</table>\n<p>485 rows × 8 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"markdown","source":"In scikit-learn, there are five Naive Bayes classifiers available:  \n\n1. **Gaussian Naive Bayes (GaussianNB)**:\n   - Assumes that the continuous features follow a Gaussian distribution within each class. Suitable for classification tasks with continuous features. Often performs well when the features are approximately normally distributed and the assumption of feature independence holds.  \n   \n\n2. **Multinomial Naive Bayes (MultinomialNB)**:\n   - Designed for classification tasks with discrete features, such as text classification with word counts. Assumes features are generated from a multinomial distribution. Particularly effective for text classification with bag-of-words counts; fractional counts such as TF-IDF often work in practice as well.  \n   \n\n3. **Bernoulli Naive Bayes (BernoulliNB)**:\n   - Designed for binary/boolean features. Assumes each feature is a binary variable, which suits tasks like text classification where the presence or absence of a word is informative. Works on binary feature vectors such as word-occurrence indicators; real-valued inputs such as TF-IDF weights are thresholded into 0/1 by the `binarize` parameter rather than used as-is.  \n   \n\n4. **Categorical Naive Bayes (CategoricalNB)**:\n   - Introduced in scikit-learn version 0.22. Assumes features are categorical rather than continuous or binary. 
It's suitable for classification tasks where each feature takes one of a fixed set of discrete levels, whether the feature has many levels or few; each feature is expected to be encoded as non-negative integers (for example with `OrdinalEncoder`).  \n   \n\n5. **Complement Naive Bayes (ComplementNB)**:\n   - Introduced in scikit-learn version 0.20. Similar to Multinomial Naive Bayes, but it estimates each class's parameters from the samples that do not belong to that class (its complement). Especially useful for imbalanced datasets where one class is much more frequent than the others.\n","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.metrics import confusion_matrix, classification_report\n\nmodel = GaussianNB(priors=None)\n\nmodel.fit(X_train_original, y_train)\n\ny_pred_train = model.predict(X_train_original)\ny_pred_test = model.predict(X_test_original)\n\ncm_train = confusion_matrix(y_train, y_pred_train)\nreport_train = classification_report(y_train, y_pred_train)\n\ncm_test = confusion_matrix(y_test, y_pred_test)\nreport_test = classification_report(y_test, y_pred_test)\n\nprint(\"Evaluation the Model on Training Set\")\nprint(f\"Confusion Matrix:\\n{cm_train}\")\nprint(f\"Classification Report:\\n{report_train}\")\nprint(\"-\"*80)\nprint(\"Evaluation the Model on Testing Set\")\nprint(f\"Confusion Matrix:\\n{cm_test}\")\nprint(f\"Classification Report:\\n{report_test}\")","metadata":{"execution":{"iopub.status.busy":"2024-06-05T19:18:13.840338Z","iopub.execute_input":"2024-06-05T19:18:13.840768Z","iopub.status.idle":"2024-06-05T19:18:13.881809Z","shell.execute_reply.started":"2024-06-05T19:18:13.840735Z","shell.execute_reply":"2024-06-05T19:18:13.880603Z"},"trusted":true},"execution_count":62,"outputs":[{"name":"stdout","text":"Evaluation the Model on Training Set\nConfusion Matrix:\n[[316  48]\n [ 50  71]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.86      0.87      0.87       364\n           1 
      0.60      0.59      0.59       121\n\n    accuracy                           0.80       485\n   macro avg       0.73      0.73      0.73       485\nweighted avg       0.80      0.80      0.80       485\n\n--------------------------------------------------------------------------------\nEvaluation the Model on Testing Set\nConfusion Matrix:\n[[133  20]\n [ 26  31]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.84      0.87      0.85       153\n           1       0.61      0.54      0.57        57\n\n    accuracy                           0.78       210\n   macro avg       0.72      0.71      0.71       210\nweighted avg       0.77      0.78      0.78       210\n\n","output_type":"stream"}]},{"cell_type":"code","source":"model.class_count_\nmodel.class_prior_\nmodel.classes_\nmodel.n_features_in_","metadata":{"trusted":true},"execution_count":51,"outputs":[{"execution_count":51,"output_type":"execute_result","data":{"text/plain":"array([364., 121.])"},"metadata":{}}]},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.metrics import confusion_matrix, classification_report\n\nmodel = GaussianNB(priors=None)\n\nmodel.fit(X_train_transformed, y_train)\n\ny_pred_train = model.predict(X_train_transformed)\ny_pred_test = model.predict(X_test_transformed)\n\ncm_train = confusion_matrix(y_train, y_pred_train)\nreport_train = classification_report(y_train, y_pred_train)\n\ncm_test = confusion_matrix(y_test, y_pred_test)\nreport_test = classification_report(y_test, y_pred_test)\n\nprint(\"Evaluation the Model on Training Set\")\nprint(f\"Confusion Matrix:\\n{cm_train}\")\nprint(f\"Classification Report:\\n{report_train}\")\nprint(\"-\"*80)\nprint(\"Evaluation the Model on Testing Set\")\nprint(f\"Confusion Matrix:\\n{cm_test}\")\nprint(f\"Classification 
Report:\\n{report_test}\")","metadata":{"execution":{"iopub.status.busy":"2024-06-05T19:20:09.790570Z","iopub.execute_input":"2024-06-05T19:20:09.791061Z","iopub.status.idle":"2024-06-05T19:20:09.830906Z","shell.execute_reply.started":"2024-06-05T19:20:09.791029Z","shell.execute_reply":"2024-06-05T19:20:09.829711Z"},"trusted":true},"execution_count":63,"outputs":[{"name":"stdout","text":"Evaluation the Model on Training Set\nConfusion Matrix:\n[[316  48]\n [ 55  66]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.85      0.87      0.86       364\n           1       0.58      0.55      0.56       121\n\n    accuracy                           0.79       485\n   macro avg       0.72      0.71      0.71       485\nweighted avg       0.78      0.79      0.79       485\n\n--------------------------------------------------------------------------------\nEvaluation the Model on Testing Set\nConfusion Matrix:\n[[135  18]\n [ 26  31]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.84      0.88      0.86       153\n           1       0.63      0.54      0.58        57\n\n    accuracy                           0.79       210\n   macro avg       0.74      0.71      0.72       210\nweighted avg       0.78      0.79      0.79       210\n\n","output_type":"stream"}]},{"cell_type":"code","source":"import pandas as pd \n\nX_data = pd.read_csv('/kaggle/input/bankloan-ready-to-dt/X_data_discretized_no_scaling')\nX_data = X_data.set_index(['Unnamed: 0', 'Unnamed: 1'])\n\nX_train_discretized = X_data.loc['train']\nX_test_discretized = X_data.loc['test']","metadata":{"trusted":true},"execution_count":66,"outputs":[{"execution_count":66,"output_type":"execute_result","data":{"text/plain":"             ed  age_cat_cm  employ_cat_cm  address_cat_cm  \\\nUnnamed: 1                                                   \n286         1.0           0              2               2   \n146 
        4.0           0              0               1   \n214         1.0           1              3               1   \n528         2.0           3              3               3   \n165         2.0           3              2               3   \n...         ...         ...            ...             ...   \n144         1.0           3              3               4   \n645         2.0           0              0               1   \n72          1.0           3              3               4   \n235         1.0           0              2               1   \n37          2.0           1              2               1   \n\n            income_ratio_cat_cm  debtinc_cat_cm  creddebt_ratio_cat_cm  \\\nUnnamed: 1                                                               \n286                           1               0                      3   \n146                           1               1                      0   \n214                           3               1                      4   \n528                           4               0                      4   \n165                           3               2                      4   \n...                         ...             ...                    ...   \n144                           3               0                      0   \n645                           1               0                      4   \n72                            3               1                      4   \n235                           0               0                      0   \n37                            1               1                      4   \n\n            othdebt_ratio_cat_cm  \nUnnamed: 1                        \n286                            0  \n146                            1  \n214                            3  \n528                            3  \n165                            3  \n...                          ...  
\n144                            1  \n645                            0  \n72                             3  \n235                            0  \n37                             3  \n\n[485 rows x 8 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ed</th>\n      <th>age_cat_cm</th>\n      <th>employ_cat_cm</th>\n      <th>address_cat_cm</th>\n      <th>income_ratio_cat_cm</th>\n      <th>debtinc_cat_cm</th>\n      <th>creddebt_ratio_cat_cm</th>\n      <th>othdebt_ratio_cat_cm</th>\n    </tr>\n    <tr>\n      <th>Unnamed: 1</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>286</th>\n      <td>1.0</td>\n      <td>0</td>\n      <td>2</td>\n      <td>2</td>\n      <td>1</td>\n      <td>0</td>\n      <td>3</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>146</th>\n      <td>4.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>214</th>\n      <td>1.0</td>\n      <td>1</td>\n      <td>3</td>\n      <td>1</td>\n      <td>3</td>\n      <td>1</td>\n      <td>4</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>528</th>\n      <td>2.0</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>4</td>\n      <td>0</td>\n      <td>4</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>165</th>\n      <td>2.0</td>\n      <td>3</td>\n      <td>2</td>\n      <td>3</td>\n      <td>3</td>\n      <td>2</td>\n      <td>4</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      
<th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>144</th>\n      <td>1.0</td>\n      <td>3</td>\n      <td>3</td>\n      <td>4</td>\n      <td>3</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>645</th>\n      <td>2.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>4</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>72</th>\n      <td>1.0</td>\n      <td>3</td>\n      <td>3</td>\n      <td>4</td>\n      <td>3</td>\n      <td>1</td>\n      <td>4</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>235</th>\n      <td>1.0</td>\n      <td>0</td>\n      <td>2</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>2.0</td>\n      <td>1</td>\n      <td>2</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>4</td>\n      <td>3</td>\n    </tr>\n  </tbody>\n</table>\n<p>485 rows × 8 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.naive_bayes import CategoricalNB\nfrom sklearn.metrics import confusion_matrix, classification_report\n\nmodel = CategoricalNB(fit_prior=True, class_prior=None)\n\nmodel.fit(X_train_discretized, y_train)\n\ny_pred_train = model.predict(X_train_discretized)\ny_pred_test = model.predict(X_test_discretized)\n\ncm_train = confusion_matrix(y_train, y_pred_train)\nreport_train = classification_report(y_train, y_pred_train)\n\ncm_test = confusion_matrix(y_test, y_pred_test)\nreport_test = classification_report(y_test, y_pred_test)\n\nprint(\"Evaluation the Model on Training Set\")\nprint(f\"Confusion Matrix:\\n{cm_train}\")\nprint(f\"Classification Report:\\n{report_train}\")\nprint(\"-\"*80)\nprint(\"Evaluation the Model on Testing 
Set\")\nprint(f\"Confusion Matrix:\\n{cm_test}\")\nprint(f\"Classification Report:\\n{report_test}\")","metadata":{"execution":{"iopub.status.busy":"2024-06-05T19:24:32.359112Z","iopub.execute_input":"2024-06-05T19:24:32.359531Z","iopub.status.idle":"2024-06-05T19:24:32.406481Z","shell.execute_reply.started":"2024-06-05T19:24:32.359500Z","shell.execute_reply":"2024-06-05T19:24:32.405133Z"},"trusted":true},"execution_count":67,"outputs":[{"name":"stdout","text":"Evaluation the Model on Training Set\nConfusion Matrix:\n[[327  37]\n [ 50  71]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.87      0.90      0.88       364\n           1       0.66      0.59      0.62       121\n\n    accuracy                           0.82       485\n   macro avg       0.76      0.74      0.75       485\nweighted avg       0.81      0.82      0.82       485\n\n--------------------------------------------------------------------------------\nEvaluation the Model on Testing Set\nConfusion Matrix:\n[[135  18]\n [ 35  22]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.79      0.88      0.84       153\n           1       0.55      0.39      0.45        57\n\n    accuracy                           0.75       210\n   macro avg       0.67      0.63      0.64       210\nweighted avg       0.73      0.75      0.73       210\n\n","output_type":"stream"}]}]}