Overview

Introduction

For this assignment, we wrangled a real-life dataset to practice different data wrangling techniques. The tasks were:
• To conduct data exploration, preparation and transformation using different methods
• To prepare the data for modeling, then build and evaluate a simple linear regression model
• To document the analysis, comparisons and findings

Steps Taken

  1. Outlier Handling: I detected outliers and handled them with Winsorizing, which caps extreme values at chosen percentile limits instead of removing them. This reduced the influence of extreme values on subsequent analyses.
  2. Distribution: To improve variable distributions, I applied the Yeo-Johnson transformation, a power transformation that, unlike Box-Cox, also accepts zero and negative values. This brought the variables closer to a normal distribution and made them better suited to subsequent modeling.
  3. Missing Data Imputation: To deal with missing values, I used mean imputation for numerical variables and mode imputation for categorical variables. This kept every row in the dataset instead of dropping incomplete records.
  4. Categorical Data Encoding: Because the dataset contains categorical variables, I applied ordinal encoding and one-hot encoding to convert them into numeric form. This made the categorical information usable in the analysis and the model.
  5. Binning: For selected variables, I used equal-width binning to discretize continuous values into intervals of equal size. This gave a more structured representation of the data and simplified subsequent analysis.
  6. Scaling: Using max-absolute scaling, I rescaled each feature by dividing it by its maximum absolute value, so that all values fall within [-1, 1]. This method does not shift or center the data, so zero entries stay zero and the shape of each distribution is preserved.
  7. Polynomial Expansion: Through polynomial expansion, I generated new features by raising existing features to powers and multiplying them together (interaction terms). This added complexity to the model and potentially captured higher-order relationships.
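
The steps above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the code used in the assignment: every function name, the percentile limits, the fixed Yeo-Johnson lambda, the category order and the bin count are assumptions chosen for the example. In practice the Yeo-Johnson lambda is usually estimated from the data rather than fixed by hand.

```python
import math
from itertools import combinations_with_replacement
from statistics import mean, mode


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Cap values at index-based percentile limits (assumed limits)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct * last)]
    high = ordered[round(upper_pct * last)]
    return [min(max(v, low), high) for v in values]


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one value for a fixed (assumed) lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lam) - 1) / (2 - lam)


def impute_mean(values):
    """Replace None with the mean of the observed numeric values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None with the most frequent observed category."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def ordinal_encode(values, order):
    """Map each category to its position in an explicit order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot(values):
    """One 0/1 column per category, columns in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals.

    Assumes max(values) > min(values); the maximum goes in the last bin.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value so results lie in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def polynomial_features(row, degree=2):
    """All products of features up to the given degree (no bias term)."""
    n = len(row)
    return [
        math.prod(row[i] for i in combo)
        for d in range(1, degree + 1)
        for combo in combinations_with_replacement(range(n), d)
    ]


print(winsorize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], 0.1, 0.9))
# [2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]
print(max_abs_scale([-4, 2, 1]))     # [-1.0, 0.5, 0.25]
print(polynomial_features([2, 3]))   # [2, 3, 4, 6, 9]
```

Each function handles a single column (or a single row, for the polynomial expansion) so that the steps can be read independently. A real pipeline would normally fit the limits, fill values and scale factors on the training data only and reuse them on the test data.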

Final Video

Slides