{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":8616088,"sourceType":"datasetVersion","datasetId":4906925}],"dockerImageVersionId":30732,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"**K-Nearest Neighbors** (KNN) is a simple, **non-parametric** algorithm used for **classification** and regression. It operates on the principle that *similar data points are likely to be found near each other*. When classifying a new data point, KNN identifies the **'k' closest training examples** in the feature space and assigns the class **most common** among these neighbors. Notably, KNN is an **instance-based** learning method, meaning *it does not learn a discriminative function* from the training data. Instead, it stores all the training instances, which **requires significant memory**, especially for large datasets. 
As a result, KNN can be **computationally expensive** during prediction, as it must compute the distance between the query instance and all training samples to determine the nearest neighbors.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"markdown","source":"### Read the Datasets","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Read the training datasets\nX_train_original = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_train_original.csv')\nX_train_original = X_train_original.set_index('Unnamed: 0')\n\nX_train_transformed = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_train_transformed.csv')\nX_train_transformed = X_train_transformed.set_index('Unnamed: 0')\n\nX_train_discretized = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_train_discretized.csv')\nX_train_discretized = X_train_discretized.set_index('Unnamed: 0')\n\n# Read the test datasets\nX_test_original = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_test_original.csv')\nX_test_original = X_test_original.set_index('Unnamed: 0')\n\nX_test_transformed = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_test_transformed.csv')\nX_test_transformed = X_test_transformed.set_index('Unnamed: 0')\n\nX_test_discretized = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/X_test_discretized.csv')\nX_test_discretized = X_test_discretized.set_index('Unnamed: 0')\n\n# Read the target variable for training and testing\ny_train = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/y_train.csv')\ny_train = y_train.set_index('Unnamed: 0').squeeze()\n\ny_test = pd.read_csv('/kaggle/input/bankloan-ready-to-modeling/y_test.csv')\ny_test = y_test.set_index('Unnamed: 0').squeeze() 
","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:47:33.253750Z","iopub.execute_input":"2024-06-21T17:47:33.254217Z","iopub.status.idle":"2024-06-21T17:47:33.326072Z","shell.execute_reply.started":"2024-06-21T17:47:33.254183Z","shell.execute_reply":"2024-06-21T17:47:33.324952Z"},"trusted":true},"execution_count":9,"outputs":[]},{"cell_type":"markdown","source":"### Implement the KNN model  \nConsider these model parameters:    \n\n- **n_neighbors** (int, default=5): Number of neighbors to use by default for kneighbors queries.\n\n- **weights** (`{'uniform', 'distance'}`, callable or None, default=`'uniform'`):\n  Weight function used in prediction. Possible values:\n  - `'uniform'`: Uniform weights. All points in each neighborhood are weighted equally.\n  - `'distance'`: Weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.\n\n- **p** (float, default=2):\n  Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. This parameter is expected to be positive.\n\n- **metric** (str or callable, default=`'minkowski'`):\n  Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2. 
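
A minimal sketch of how `p` and `weights` behave, assuming a hypothetical toy dataset (`X_toy`, `y_toy`, `query`) and a hypothetical helper `minkowski`; none of these names come from this notebook or from the bank-loan data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def minkowski(a, b, p):
    # Hypothetical helper: Minkowski distance of order p between two points.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
d_manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
d_euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5

# Toy data (an assumption for illustration): one class-1 point close to the
# query and two class-0 points farther away.
X_toy = np.array([[0.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y_toy = np.array([1, 0, 0])
query = np.array([[1.0, 0.0]])  # distances to the three points: 1, 2, 3

# 'uniform': every neighbor gets one vote, so class 0 wins 2 to 1.
uniform_pred = (
    KNeighborsClassifier(n_neighbors=3, weights='uniform')
    .fit(X_toy, y_toy)
    .predict(query)[0]
)

# 'distance': votes are weighted by 1/distance, so class 1 gets 1.0 and
# class 0 gets 1/2 + 1/3, which is smaller.
distance_pred = (
    KNeighborsClassifier(n_neighbors=3, weights='distance')
    .fit(X_toy, y_toy)
    .predict(query)[0]
)

print(f'Manhattan: {d_manhattan}, Euclidean: {d_euclidean}')
print(f'uniform vote: {uniform_pred}, distance-weighted vote: {distance_pred}')
```

With the same three neighbors, the predicted class flips between the two weighting schemes, which is why `weights` is worth including in a hyperparameter search.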
\n    \n*The parameter descriptions above are presented in a format similar to that of the scikit-learn documentation.*","metadata":{}},{"cell_type":"code","source":"from sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.metrics import confusion_matrix, classification_report\n\nmodel = KNeighborsClassifier(n_neighbors=5, weights='uniform', p=2, metric='minkowski', n_jobs=-1)\n\nmodel.fit(X_train_original, y_train)\n\ny_pred_train = model.predict(X_train_original)\ny_pred_test = model.predict(X_test_original)\n\ncm_train = confusion_matrix(y_train, y_pred_train)\nreport_train = classification_report(y_train, y_pred_train)\n\ncm_test = confusion_matrix(y_test, y_pred_test)\nreport_test = classification_report(y_test, y_pred_test)\n\nprint(\"Evaluating the Model on the Training Set\")\nprint(f\"Confusion Matrix:\\n{cm_train}\")\nprint(f\"Classification Report:\\n{report_train}\")\nprint(\"-\"*80)\nprint(\"Evaluating the Model on the Testing Set\")\nprint(f\"Confusion Matrix:\\n{cm_test}\")\nprint(f\"Classification Report:\\n{report_test}\")","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:51:25.373757Z","iopub.execute_input":"2024-06-21T17:51:25.374193Z","iopub.status.idle":"2024-06-21T17:51:25.490252Z","shell.execute_reply.started":"2024-06-21T17:51:25.374162Z","shell.execute_reply":"2024-06-21T17:51:25.489051Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"Evaluating the Model on the Training Set\nConfusion Matrix:\n[[347  17]\n [ 59  62]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.85      0.95      0.90       364\n           1       0.78      0.51      0.62       121\n\n    accuracy                           0.84       485\n   macro avg       0.82      0.73      0.76       485\nweighted avg       0.84      0.84      0.83       485\n\n--------------------------------------------------------------------------------\nEvaluating the Model on the Testing 
Set\nConfusion Matrix:\n[[138  15]\n [ 34  23]]\nClassification Report:\n              precision    recall  f1-score   support\n\n           0       0.80      0.90      0.85       153\n           1       0.61      0.40      0.48        57\n\n    accuracy                           0.77       210\n   macro avg       0.70      0.65      0.67       210\nweighted avg       0.75      0.77      0.75       210\n\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Identify the Optimal Model through Hyperparameter Tuning","metadata":{}},{"cell_type":"markdown","source":"Since KNN does not inherently address class imbalance, I used an **oversampling method** (SMOTE) on the training data to tackle this challenge. The approach is described in this notebook: https://www.kaggle.com/code/zahrazolghadr/overcoming-imbalanced-data-challenges","metadata":{}},{"cell_type":"code","source":"%pip install -U imbalanced-learn","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from imblearn.over_sampling import SMOTE\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.metrics import make_scorer, f1_score\nfrom sklearn.model_selection import GridSearchCV\n\ndef KNN_modeling(X_train, y_train, X_test, y_test):\n    \n    # Perform SMOTE oversampling on the training data only.\n    # Note: resampling before GridSearchCV lets synthetic samples reach the\n    # validation folds, so the cross-validated F1 is likely optimistic\n    # compared with the score on the untouched test set.\n    smote = SMOTE(random_state=1)\n    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)\n\n    # Define a custom scorer for class '1' F1-score\n    def custom_f1_scorer(y_true, y_pred):\n        return f1_score(y_true, y_pred, pos_label=1)\n\n    custom_scorer = make_scorer(custom_f1_scorer)\n\n    # Initialize the KNeighborsClassifier\n    knn = KNeighborsClassifier()\n\n    # Set up the parameter grid\n    param_grid = {\n        'n_neighbors': [3, 5, 7, 9], \n        'weights': ['uniform', 'distance'],\n        'metric': ['euclidean', 'manhattan']\n    }\n\n    # Perform Grid Search with custom F1 scorer\n    grid_search = GridSearchCV(knn, param_grid, scoring=custom_scorer, cv=5, 
n_jobs=-1)\n\n    # Fit the model\n    grid_search.fit(X_train_smote, y_train_smote)\n\n    # Print the best parameters and the best score\n    print(\"Best parameters found: \", grid_search.best_params_)\n    print(\"Best custom F1 score: \", grid_search.best_score_)\n\n    # Evaluate the best model on the test set\n    best_knn = grid_search.best_estimator_\n    y_pred = best_knn.predict(X_test)\n    custom_f1 = custom_f1_scorer(y_test, y_pred)\n\n    print(f\"Custom F1 Score on test set: {custom_f1}\")\n    \n    return\n","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:57:11.388934Z","iopub.execute_input":"2024-06-21T17:57:11.389351Z","iopub.status.idle":"2024-06-21T17:57:11.401436Z","shell.execute_reply.started":"2024-06-21T17:57:11.389320Z","shell.execute_reply":"2024-06-21T17:57:11.399932Z"},"trusted":true},"execution_count":11,"outputs":[]},{"cell_type":"code","source":"KNN_modeling(X_train_original, y_train, X_test_original, y_test)","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:57:47.228151Z","iopub.execute_input":"2024-06-21T17:57:47.228573Z","iopub.status.idle":"2024-06-21T17:57:50.198603Z","shell.execute_reply.started":"2024-06-21T17:57:47.228540Z","shell.execute_reply":"2024-06-21T17:57:50.196498Z"},"trusted":true},"execution_count":12,"outputs":[{"name":"stdout","text":"Best parameters found:  {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}\nBest custom F1 score:  0.8656608469568337\nCustom F1 Score on test set: 0.5528455284552845\n","output_type":"stream"}]},{"cell_type":"code","source":"KNN_modeling(X_train_transformed, y_train, X_test_transformed, 
y_test)","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:58:30.358939Z","iopub.execute_input":"2024-06-21T17:58:30.359346Z","iopub.status.idle":"2024-06-21T17:58:30.936241Z","shell.execute_reply.started":"2024-06-21T17:58:30.359311Z","shell.execute_reply":"2024-06-21T17:58:30.934997Z"},"trusted":true},"execution_count":13,"outputs":[{"name":"stdout","text":"Best parameters found:  {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}\nBest custom F1 score:  0.868057339264564\nCustom F1 Score on test set: 0.5468750000000001\n","output_type":"stream"}]},{"cell_type":"code","source":"KNN_modeling(X_train_discretized, y_train, X_test_discretized, y_test)","metadata":{"execution":{"iopub.status.busy":"2024-06-21T17:58:44.773618Z","iopub.execute_input":"2024-06-21T17:58:44.774858Z","iopub.status.idle":"2024-06-21T17:58:45.371044Z","shell.execute_reply.started":"2024-06-21T17:58:44.774811Z","shell.execute_reply":"2024-06-21T17:58:45.369530Z"},"trusted":true},"execution_count":14,"outputs":[{"name":"stdout","text":"Best parameters found:  {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}\nBest custom F1 score:  0.861610328665692\nCustom F1 Score on test set: 0.4901960784313725\n","output_type":"stream"}]}]}