{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30673,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"Data: https://www.kaggle.com/competitions/DontGetKicked/data.  ","metadata":{}},{"cell_type":"markdown","source":"To apply data cleaning follow the below steps:  \n1. exclude features that are not approparite for this problem:\n*  **PurchDate** : Dates by themselves are not suitable for analysis unless recurring indicators are extracted from them (such as month, day, day of the week), or the intervals between dates are calculated.\n*  **VehYear**: \"VehicleAge\" is present in the data, which is a better alternative.\n*  **Model, Trim and Submodel**: The number of classes in these features is high, and additional expertise is required to merge classes with low frequency.\n*  **WheelTypeID**: \"WheelType\" is present in data, which is better alternative.\n*  **BYRNO**: Its just an ID.\n*  **VNZIP1 and VNST**: They often do not directly contribute to predictive power unless specific location-based insights are needed.  \n2. Set **\"RefId\"** as index.\n3. Define y (target) and X (inputs), then partition the data into training and test sets. Assign 80% of the rows to    the training set and 20% to the test set, selected randomly.  \n\n**Perform the remaining data cleaning and preparation steps on the training dataset.**  \n\n4. For continuous fields, apply the logical range below and convert any out-of-range values to null:\n    - 'VehicleAge': (0,30)\n    - 'VehOdo': (0,120000)\n    - 'MMRAcquisitionAuctionAveragePrice': (800,46000)\n    - 'MMRAcquisitionAuctionCleanPrice': (1000,46000)\n    - 'MMRAcquisitionRetailAveragePrice': (1000,46000)\n    - 'MMRAcquisitonRetailCleanPrice': (1000,46000)\n    - 'MMRCurrentAuctionAveragePrice': (300,46000)  \n    - 'MMRCurrentAuctionCleanPrice': (400,46000)   \n    - 'MMRCurrentRetailAveragePrice': (800,46000)\n    - 'MMRCurrentRetailCleanPrice': (1000,46000)\n    - 'VehBCost': (1000,46000)                         \n    - 'WarrantyCost': (400,8000)    \n     \n5. Correct the inconsistencies in the codes of categorical variables. Convert the value 'NOT AVAIL' in the **'color'** variable to null.\n\n6. In the **'color'** and **'make'** variables, group classes with less than 1% frequency as 'OTHER'.\n\n7. Apply feature screening with the following steps:\n    - Features with a coefficient of variation less than **0.1** for continuous variables.\n    - Features where the mode category percentage is greater than **99%** for categorical variables.\n    - Features with a percentage of unique categories exceeding **90%** for categorical variables.\n\n8. In the **'PRIMEUNIT'** and **'AUCGUART'** variables, there are more than 95% missing values. Perform an appropriate hypothesis test on the non-null values between these two variables and the target variable to determine if a significant relationship exists. If a significant relationship is found, convert the null values to a class labeled 'unknown'. If no significant relationship is found, discard the variable.\n\n9. Discard the outliers detected using the **'IsolationForest'** method with the contamination parameter set to **0.01**. Note that for applying this multidimensional method, you should create a copy of the data that has been encoded (categorical features) and scaled (all features).\n\n10. To handle the missing values, follow these steps:\n    - Discard rows with **4 or more** null values in the variables related to prices (from 'MMRAcquisitionAuctionAveragePrice' to 'MMRCurrentRetailCleanPrice').\n    - After this step, discard rows with **50% or more** null values across all fields.\n    - Impute the remaining missing values using the **median** for continuous fields and the **mode** for categorical fields.\n\n\n","metadata":{}}]}