Overview

Introduction

The assignment required us to wrangle a real-life dataset in order to understand different data wrangling techniques. We were tasked with:
• Conducting data exploration, preparation and transformation through different methods
• Preparing the data for modeling, then building and evaluating a simple linear regression model
• Documenting the analysis, comparison and findings

Steps Taken

Outlier Handling: I detected outliers and handled them with the Winsorizing technique, which caps extreme values at chosen limits instead of removing them. This reduced the influence of extreme values on subsequent analyses.
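To make the capping idea concrete, here is a minimal sketch of percentile-based Winsorizing in plain Python. It is not the code used in the assignment: the function names, the sample `prices` list and the 10th/90th percentile limits are all assumptions chosen for illustration.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 1]) of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Cap values below/above the given percentiles (hypothetical helper)."""
    ordered = sorted(values)
    low_cap = percentile(ordered, lower_q)
    high_cap = percentile(ordered, upper_q)
    # Values inside the caps are left untouched; extremes are pulled to the cap.
    return [min(max(v, low_cap), high_cap) for v in values]


# Made-up data: 250 is an obvious outlier.
prices = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]
capped = winsorize(prices, lower_q=0.1, upper_q=0.9)
print(capped)
```

Every row is kept, so the sample size is unchanged; only the tails are pulled in.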
Distribution: To address skewed variable distributions, I applied the Yeo-Johnson transformation. This made the variables more symmetric and better suited to subsequent modeling.
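The Yeo-Johnson transformation is a piecewise power transform that, unlike Box-Cox, also accepts zero and negative values. The sketch below applies the standard formula for a fixed parameter `lmbda`. In practice the parameter is normally estimated from the data (for example by scikit-learn's `PowerTransformer(method="yeo-johnson")`); fixing it at 0 and using the sample `skewed` list are assumptions made only to keep the example short.

```python
import math


def yeo_johnson(y, lmbda):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if y >= 0:
        if abs(lmbda) < 1e-12:           # lambda == 0 branch
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:           # lambda == 2 branch
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# Made-up, right-skewed data including a negative value.
skewed = [-2.0, 0.0, 1.0, 3.0, 50.0]
transformed = [yeo_johnson(v, lmbda=0.0) for v in skewed]
print(transformed)
```

With `lmbda=1` the transform leaves the data unchanged, which is a handy sanity check.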
Missing Data Imputation: To deal with missing values in the dataset, I used mean imputation for numerical variables and mode imputation for categorical variables. This kept every record in the dataset instead of discarding rows with gaps.
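As an illustration of the two imputation strategies, the following sketch fills missing entries (represented here by `None`) with the column mean or mode. The helper names and the `ages` and `cities` columns are hypothetical and are not taken from the assignment's dataset.

```python
from statistics import mean, mode


def impute_mean(column):
    """Replace None with the mean of the observed numeric values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def impute_mode(column):
    """Replace None with the most frequent observed category."""
    observed = [v for v in column if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in column]


# Made-up columns with one missing value each.
ages = [20, None, 30, 40]
cities = ["Perth", "Sydney", None, "Sydney"]

ages_filled = impute_mean(ages)
cities_filled = impute_mode(cities)
print(ages_filled, cities_filled)
```

To avoid leaking information into the evaluation, the fill values would usually be computed on the training split only and then reused on the test split.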
Categorical Data Encoding: Because the dataset contained categorical variables, I applied ordinal encoding and one-hot encoding. This converted the categorical information into a numeric form that could be used in the analysis.
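The difference between the two encodings can be sketched as follows: ordinal encoding maps categories that have a natural order to integers, while one-hot encoding creates one 0/1 column per category. The function names and the `sizes` and `colours` columns are assumptions for illustration only.

```python
def ordinal_encode(column, order):
    """Map each category to its position in an explicit ordering."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in column]


def one_hot_encode(column):
    """Return the sorted category names and one 0/1 indicator row per value."""
    categories = sorted(set(column))
    rows = [[1 if v == c else 0 for c in categories] for v in column]
    return categories, rows


# Made-up columns: sizes have an order, colours do not.
sizes = ["small", "large", "medium", "small"]
colours = ["red", "blue", "red", "green"]

size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])
colour_names, colour_rows = one_hot_encode(colours)
print(size_codes)
print(colour_names, colour_rows)
```

One-hot encoding avoids implying an order between unordered categories, at the cost of adding one column per category.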
Binning: For selected variables, I used equal-width binning to discretize continuous values into intervals of the same size. This gave a more structured representation of the data and simplified subsequent analysis.
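Equal-width binning splits the range between the minimum and maximum into intervals of identical width and assigns each value the index of its interval. A minimal sketch, assuming three bins and a made-up `ages` column:

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    # The maximum lands exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up data: range 18..78 split into three bins of width 20.
ages = [18, 22, 35, 47, 51, 64, 78]
bins = equal_width_bins(ages, 3)
print(bins)
```

Because the bin edges depend on the minimum and maximum, outliers should be handled before binning, otherwise most values can collapse into a single bin.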
Scaling: I applied max absolute scaling, which divides each feature by its largest absolute value so that all values fall between -1 and 1. This method preserves the sign and relative spacing of the data (and any zero entries) while putting features on a comparable scale for modeling.
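Max absolute scaling can be sketched in a few lines. The helper name and the `temps` column are assumptions for illustration; scikit-learn offers the same operation as `MaxAbsScaler`.

```python
def max_abs_scale(column):
    """Divide every value by the largest absolute value in the column."""
    largest = max(abs(v) for v in column)
    if largest == 0:
        # An all-zero column cannot be scaled; return it unchanged.
        return list(column)
    return [v / largest for v in column]


# Made-up data with negative and positive values.
temps = [-40.0, -10.0, 0.0, 20.0]
scaled = max_abs_scale(temps)
print(scaled)
```

Unlike standardization, this does not center the data, so zeros stay zero and signs are unchanged.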
Polynomial Expansion: Through polynomial expansion, I generated new features by multiplying existing features with themselves and with each other. This added complexity to the model and allowed it to capture potential higher-order and interaction relationships.
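A degree-2 expansion of two features a and b yields a, b, a², ab and b². The sketch below generates these terms for any number of features; the function name and the choice of degree 2 are assumptions, and the term ordering happens to match scikit-learn's `PolynomialFeatures(include_bias=False)`.

```python
from itertools import combinations_with_replacement


def polynomial_expand(row, degree=2):
    """Return all products of the row's features up to the given degree."""
    expanded = []
    for d in range(1, degree + 1):
        # Each combination of feature indices (with repeats) is one product term.
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1
            for i in combo:
                term *= row[i]
            expanded.append(term)
    return expanded


# Made-up row with two features a=2, b=3 -> [a, b, a*a, a*b, b*b].
features = polynomial_expand([2, 3])
print(features)
```

The number of generated features grows quickly with the degree and the number of inputs, which is why the expansion is usually limited to a low degree.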
