{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":2564,"databundleVersionId":29456,"sourceType":"competition"}],"dockerImageVersionId":30673,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"Go to this link: https://www.kaggle.com/competitions/DontGetKicked/data.  \n\nUse the 'training.csv' dataset and read 'Carvana_Data_Dictionary.txt' for the metadata. You can use the code snippet below to read the training dataset. Once loaded, the raw training DataFrame should have 72,983 rows and 34 columns.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"code","source":"import pandas as pd\n\n# Load the raw training set from the Kaggle competition input directory\ndf = pd.read_csv('/kaggle/input/DontGetKicked/training.csv')","metadata":{"execution":{"iopub.status.busy":"2024-10-08T18:42:18.367423Z","iopub.execute_input":"2024-10-08T18:42:18.367817Z","iopub.status.idle":"2024-10-08T18:42:20.137822Z","shell.execute_reply.started":"2024-10-08T18:42:18.367759Z","shell.execute_reply":"2024-10-08T18:42:20.136764Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"markdown","source":"To perform a comprehensive Exploratory Data Analysis (EDA), follow these steps to thoroughly understand and analyze the data.  \n1. Review the 'Carvana_Data_Dictionary.txt' to understand the meaning of each column. Use `df.info()` to inspect the data types.\n2. Perform Exploratory Data Analysis (EDA) using `ydata_profiling` on the training dataset.\n3. 
Conduct a thorough analysis for deeper insights into the data:\n   - Check the minimum and maximum values of continuous fields to identify any out-of-range values.\n   - Examine categorical fields for inconsistent or unexpected categories.\n   - Analyze the distribution of continuous fields to detect potential outliers.\n   - Identify missing values in both categorical and continuous fields.\n   - Assess the normality of continuous fields.\n   - Review the distribution of categorical fields to spot rare categories (e.g., less than 1% of rows in large datasets) or target imbalance.\n   - Compare the distribution of each field across the target classes.\n   - Evaluate the correlations between continuous fields, using a heatmap to visualize the correlation values with ChatGPT.\n4. Create a checklist for data cleaning and preparation based on the findings from the univariate and multivariate assessments.\n","metadata":{}},{"cell_type":"markdown","source":"**Note:** If you encounter an error related to `numba` while running `ydata_profiling`, try the following command:","metadata":{}},{"cell_type":"code","source":"!pip install --upgrade numba pandas visions ydata_profiling","metadata":{},"execution_count":null,"outputs":[]}]}