Overview
Introduction
This assignment required us to wrangle a real-world dataset in order to practice different
data wrangling techniques. We were tasked:
• To conduct data exploration, preparation and transformation using different
methods.
• To prepare the data for modeling, then build and evaluate a simple linear regression
model.
• To document the analysis, comparison and findings.
Steps Taken
- Outlier Handling: I detected outliers and handled them with the Winsorizing technique, which caps extreme values at a chosen limit instead of removing the affected rows. This reduced the influence of extreme values on subsequent analyses.
- Distribution: To improve the distribution of the variables, I applied the Yeo-Johnson transformation, a power transform that also accepts zero and negative values. This made the variables better suited to subsequent modeling.
- Missing Data Imputation: To deal with missing values in the dataset, I used mean imputation for numerical variables and mode imputation for categorical variables. This kept every record in the dataset instead of dropping incomplete rows.
- Categorical Data Encoding: Because the dataset contains categorical variables, I applied ordinal and one-hot encoding. This converted the categories into numeric form so that they could be used in the analysis.
- Binning: For selected variables, I used equal-width binning to discretize continuous values into intervals of equal size. This gave a more structured representation of the data and simplified subsequent analysis.
- Scaling: I applied max absolute scaling, which divides each feature by its maximum absolute value so that all values fall within [-1, 1]. Because this method does not shift or center the data, zero entries stay zero and the relative spacing of values is preserved.
- Polynomial Expansion: Through polynomial expansion, I generated new features by raising existing features to powers and multiplying them with one another. These extra terms let a linear model capture higher-order and interaction relationships, at the cost of a more complex model.
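
The steps above can be sketched in plain Python. This is a minimal illustration that uses only the standard library. It is not the code used in the assignment: the function names, the 5th/95th percentile cut points, the fixed lambda values and the sample inputs are all assumptions made for the example. In practice, the Yeo-Johnson lambda is usually estimated from the data rather than fixed by hand.

```python
import math
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given percentiles (cut points are assumed)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value for a fixed lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lam) - 1) / (2 - lam)


def impute(values, strategy="mean"):
    """Fill None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def ordinal_encode(values, order):
    """Map each category to its rank in an explicitly supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot(values):
    """Build one 0/1 column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 over equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value so results lie in [-1, 1]."""
    peak = max(abs(v) for v in values) or 1.0
    return [v / peak for v in values]


def degree2_features(a, b):
    """Degree-2 polynomial expansion of two features: a, b, a^2, a*b, b^2."""
    return [a, b, a * a, a * b, b * b]


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise each helper.
    print(winsorize([1, 2, 3, 4, 100]))
    print([yeo_johnson(x, 0.5) for x in (-2.0, 0.0, 3.0)])
    print(impute([22, None, 35, 41, 30]))
    print(impute(["a", None, "b", "a"], strategy="mode"))
    print(ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"]))
    print(one_hot(["red", "blue", "red"]))
    print(equal_width_bins([0, 2, 5, 9, 10], n_bins=2))
    print(max_abs_scale([-4, 2, 1]))
    print(degree2_features(2, 3))
```

Each helper takes a single column as a list, so the order in which the steps are applied stays visible. Note that winsorize expects a column without missing entries, so in this sketch imputation would need to run before it.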