{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":7722356,"sourceType":"datasetVersion","datasetId":4511074}],"dockerImageVersionId":30673,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"## Sampling Techniques for Data Exploration and Analysis\nIn data science, working with **large datasets** is common. However, analyzing the entire dataset can be **time-consuming** and **computationally expensive**. Sampling techniques provide a way to select a **representative subset** of data that allows you to draw inferences about the entire population. This notebook explores four common sampling techniques:   \n* Random Sampling  \n* Systematic Sampling   \n* Stratified Sampling  \n* Cluster Sampling   \n\nWe'll demonstrate these techniques using Python's random and pandas libraries.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\ndf = pd.read_csv('/kaggle/input/house-price-prepared/data_prepared.csv')\n\ndf.set_index('Id', inplace=True)\n\ndf.shape","metadata":{"execution":{"iopub.status.busy":"2024-04-17T19:23:51.199890Z","iopub.execute_input":"2024-04-17T19:23:51.201859Z","iopub.status.idle":"2024-04-17T19:23:51.238391Z","shell.execute_reply.started":"2024-04-17T19:23:51.201805Z","shell.execute_reply":"2024-04-17T19:23:51.237322Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(1001, 66)"},"metadata":{}}]},{"cell_type":"markdown","source":"### 1. 
Random Sampling\n\nRandom sampling involves selecting data points from the population with an **equal probability** of being chosen. This ensures that all elements have a **fair chance** of being included in the sample.\n\n**Explanation:**\n\nRandomly selects a specific number of elements from the entire dataset.  \n\nSuitable when you want a **general overview** of the population without focusing on specific subgroups.  \n\n**Applications:**\n\nInitial exploration of a dataset to get a sense of its central tendencies (mean, median) and spread (variance, standard deviation).  \n\nCreating training and testing sets for machine learning models when the data is well-mixed.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\n# Sample size or ratio\nsample_size = 150\nsample_frac = 0.15\n\n# Randomly select a sample\nrandom_sample1 = df.sample(n=sample_size, random_state=42)   \nrandom_sample1.shape\n\n# n: Number of items from axis to return. Cannot be used with `frac`\n# frac: Fraction of axis items to return. Cannot be used with `n`.\n# replace: Allow or disallow sampling of the same row more than once.\n\nrandom_sample2 = df.sample(frac=sample_frac, replace=True, random_state=42)  \nrandom_sample2.shape","metadata":{"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(150, 66)"},"metadata":{}}]},{"cell_type":"markdown","source":"### 2. Systematic Sampling\n\nSystematic sampling involves selecting elements from the population at a **fixed interval**. This interval is calculated by dividing the population size by the desired sample size. You randomly choose a starting point within this interval and then select elements at the calculated intervals.\n\n**Explanation:**\n\nElements are chosen at regular intervals throughout the **sorted data**.  
\n\nUseful when the data is ordered and you want to ensure even **coverage across the range**.\n\n**Applications:**\n\nWhen drawing a simple random sample from the full population is impractical, since only a random start and a fixed step are needed.  \n\nWhen the population is sorted based on a certain criterion.","metadata":{}},{"cell_type":"code","source":"import random\n\nsorted_df = df.sort_values(by='SalePrice')  # Sort the data by a column\nsorted_df.index[:50]\n\n# Sample size\nsample_size = 100\n\n# Calculate the sampling interval (integer division)\nsampling_interval = len(sorted_df) // sample_size\n\n# Choose a random starting point within the interval\nrandom.seed(717)\nrandom_start = random.randint(0, sampling_interval - 1)\n\n# Systematic sample: every sampling_interval-th row, starting at random_start.\n# If len(sorted_df) is not an exact multiple of sample_size, this can return\n# more rows than sample_size.\nsystematic_sample = sorted_df.iloc[random_start::sampling_interval, :]\nsystematic_sample.index[:5]\n\nsystematic_sample.shape","metadata":{"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"Index([1381, 1322, 69, 99, 1023], dtype='int64', name='Id')"},"metadata":{}}]},{"cell_type":"markdown","source":"### 3. Stratified Sampling\n\nStratified sampling involves dividing the population into **subgroups (strata)** based on a specific characteristic (e.g., neighborhood, price range). Then, a proportional sample is drawn from each stratum, **ensuring representation** of all subgroups in the final sample.\n\n**Explanation:**\n\nCreates a sample that reflects the same proportions of subgroups as in the population.  \n\nUseful when you want to ensure representativeness of subgroups that might be underrepresented in random sampling.\n\n**Applications:**\n\n**Customer segmentation:** When studying customer behavior, stratified sampling can ensure your sample includes customers from each segment (e.g., age groups, income levels) in the same proportion as the overall customer base.  
\n\n**Opinion polls:** In polls, stratified sampling helps ensure the sample reflects the population demographics (e.g., age, gender, location) for more accurate results.","metadata":{}},{"cell_type":"code","source":"# Consider 'Neighborhood' for stratification\nstrata = df['Neighborhood'].value_counts().sort_values(ascending=False)\n\n# Sampling fraction per stratum\nsample_ratio_per_strata = 0.2  # Adjust for desired sample size from each stratum\n\n# Stratified sample\nstratified_sample = pd.DataFrame()\n\n# Note: sample(frac=...) rounds the number of rows drawn per stratum, so a\n# very small stratum can contribute zero rows at this ratio.\nfor neighborhood, count in strata.items():\n    sample = df[df['Neighborhood'] == neighborhood].sample(frac=sample_ratio_per_strata, random_state=42)\n    stratified_sample = pd.concat([stratified_sample, sample])\n\n# Explore the stratified sample\nstratified_sample.shape\n\nstratified_sample['Neighborhood'].value_counts().sort_values(ascending=False)","metadata":{"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"Neighborhood\nNAmes      31\nCollgCr    21\nOldTown    14\nEdwards    14\nSomerst    13\nGilbert    11\nNWAmes     10\nNridgHt    10\nSawyer     10\nBrkSide     9\nSawyerW     9\nMitchel     7\nCrawfor     6\nNoRidge     6\nTimber      5\nIDOTRR      4\nClearCr     4\nSWISU       3\nStoneBr     3\nBlmngtn     2\nBrDale      2\nMeadowV     2\nVeenker     1\nNPkVill     1\nName: count, dtype: int64"},"metadata":{}}]},{"cell_type":"markdown","source":"### 4. Cluster Sampling  \n\nCluster sampling involves dividing the population into clusters, then randomly selecting some clusters and sampling all individuals within those clusters.  \n\n**Explanation:**\n\nUseful when population units naturally cluster together.  
\n\nCost-effective compared to other sampling methods when clusters are well-defined.\n\n**Applications:**\n\n**Market research:** When studying consumer preferences in a large geographical area, cluster sampling can be used to select representative cities or towns and then survey all residents within those chosen locations.  \n\n**Medical research:** In clinical trials, hospitals or clinics might be clustered based on patient demographics. Then, a random sample of these clusters can be selected for the study.","metadata":{}},{"cell_type":"code","source":"import random\n\nsample_size = 5\n\nclusters = df['Neighborhood'].unique()\n\n# Select a random sample of clusters (adjust sample_size)\nrandom.seed(717)\nclusters_rand = random.sample(list(clusters), sample_size)\n\n# Filter data for clusters in the sample\ncluster_sample = df[df['Neighborhood'].isin(clusters_rand)]\ncluster_sample.shape\n\ncluster_sample['Neighborhood'].unique()","metadata":{"trusted":true},"execution_count":51,"outputs":[{"execution_count":51,"output_type":"execute_result","data":{"text/plain":"array(['OldTown', 'Edwards', 'Mitchel', 'SWISU', 'Blueste'], dtype=object)"},"metadata":{}}]}]}