Home Credit Default Risk Prediction: Data Science Final Project

Data Science, Final Project, Portfolio / By Hijir

This Data Science Final Project, which received the Best Group and Best Student awards, predicts Home Credit customer default risk. It covers data preprocessing, EDA, feature engineering, and machine learning with XGBoost and Stacking, and it provides actionable insights to strengthen credit risk management and reduce default rates.

Project Overview

The Data Science Final Project focuses on predicting the risk of customer default using historical loan application data. By leveraging robust data preprocessing, feature engineering, and advanced machine learning techniques, this project aims to provide actionable insights to improve credit risk management for Home Credit.

Workflow Stages

Stage 0: Preparation

Tasks: Data collection, initial data inspection, and understanding business requirements.
Deliverables: Raw datasets and project goals.

Stage 1: Exploratory Data Analysis (EDA)

Techniques:
- Statistical analysis of key variables.
- Visualizations to identify patterns and correlations.
Objective: Identify trends and potential predictors for default risk.
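
As a rough illustration of the EDA checks described above, the sketch below computes the class balance, the share of missing values, the default rate per category, and the correlation of numeric features with the target. It runs on synthetic data, and every column name (`income`, `credit_amount`, `contract_type`, `default`) is an assumption made for the example, not a column taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loan application data (all columns are hypothetical).
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(50, 300, n)
credit = rng.uniform(100, 1000, n)
# Assumed relationship for the toy data: default gets likelier as credit/income grows.
p_default = 1 / (1 + np.exp(-(credit / income - 6)))
df = pd.DataFrame({
    "income": income,
    "credit_amount": credit,
    "contract_type": rng.choice(["cash", "revolving"], n),
    "default": (rng.random(n) < p_default).astype(int),
})

# Class balance: how rare is the positive (default) class?
class_share = df["default"].value_counts(normalize=True)

# Share of missing values per column.
missing_share = df.isna().mean()

# Default rate within each category of a categorical variable.
default_rate_by_type = df.groupby("contract_type")["default"].mean()

# Linear correlation of each numeric feature with the target.
target_corr = df[["income", "credit_amount"]].corrwith(df["default"])

print(class_share, missing_share, default_rate_by_type, target_corr, sep="\n\n")
```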

Stage 2: Preprocessing

1. Data Cleansing: Handling missing values, removing duplicates, and transforming features.
2. Feature Engineering: Adding new features such as credit ratios and tenure groups.
3. Encoding: Binary and one-hot encoding for categorical variables.
4. Handling Class Imbalance: Using oversampling (SMOTE) and undersampling techniques.
5. Normalization: Standardizing numerical features to improve model performance.
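
The five preprocessing steps can be sketched end to end as follows. This is a minimal example on synthetic data, not the project's actual pipeline: the column names, the tenure bins, and the 70/30 split are all assumptions. Random oversampling with scikit-learn's `resample` stands in for SMOTE here to keep the example dependency-light. Resampling and scaler fitting are applied to the training split only, so the test split stays untouched.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in data; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "credit_amount": rng.uniform(100, 1000, n),
    "income": rng.uniform(50, 300, n),
    "years_employed": rng.uniform(0, 30, n),
    "contract_type": rng.choice(["cash", "revolving"], n),
    "default": np.array([1] * 20 + [0] * 180),  # 10% positives
})
df.loc[0, "income"] = np.nan  # inject one missing value to clean up

# 1. Data cleansing: drop duplicates, impute missing income with the median.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# 2. Feature engineering: a credit ratio and tenure groups (bins are assumed).
df["credit_income_ratio"] = df["credit_amount"] / df["income"]
df["tenure_group"] = pd.cut(
    df["years_employed"], bins=[0, 2, 5, 10, np.inf],
    labels=["0-2", "2-5", "5-10", "10+"], include_lowest=True,
)

# 3. Encoding: one-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["contract_type", "tenure_group"], dtype=int)

# Split before resampling and scaling so the test set remains unseen.
X, y = df.drop(columns="default"), df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 4. Class imbalance: random oversampling of the minority class (SMOTE stand-in).
train = X_train.assign(default=y_train)
majority = train[train["default"] == 0]
minority = train[train["default"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up], ignore_index=True)
X_train_bal, y_train_bal = train_bal.drop(columns="default"), train_bal["default"]

# 5. Standardization: fit on the training data, apply to both splits.
scale_cols = ["credit_amount", "income", "years_employed", "credit_income_ratio"]
scaler = StandardScaler().fit(X_train_bal[scale_cols])
X_train_bal[scale_cols] = scaler.transform(X_train_bal[scale_cols])
X_test = X_test.copy()
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

print(y_train_bal.value_counts())
```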

Stage 3: Machine Learning

Models Used:

- AdaBoost: Builds a strong classifier by fitting weak classifiers sequentially and reweighting the training examples so that each new classifier focuses on the cases misclassified earlier.
- Decision Tree: A flexible algorithm for classification and regression, capable of handling numeric and categorical data while generating intuitive tree structures.
- Random Forest: An ensemble of decision trees trained on random data subsets, suitable for classification and regression tasks.
- KNN (K-Nearest Neighbors): A lazy learning algorithm that classifies data points based on their neighbors, effective for handling local data patterns.
- XGBoost: Known for its high speed and efficiency, XGBoost excels at processing sparse data and controlling overfitting through regularization.
- Stacking: Combines predictions from multiple base models through meta-modeling to enhance prediction accuracy.
- Logistic Regression: A linear classifier widely used as a baseline for predicting loan defaults, with a focus on interest rates as a key predictor.
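
To make the stacking idea concrete, here is a minimal sketch using scikit-learn's `StackingClassifier`. The choice of base models, the logistic regression meta-model, and the synthetic dataset are assumptions for illustration, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs no extra dependency. It is not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the loan applications.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base models (assumed selection); gradient boosting stands in for XGBoost.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# The meta-model is trained on out-of-fold probabilities from the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"Stacking test ROC AUC: {test_auc:.3f}")
```

Training the meta-model on out-of-fold predictions (the `cv` argument) keeps it from simply memorizing base-model outputs on data those models have already seen.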

Evaluation Metrics: ROC-AUC, Recall, Precision, F1-Score, and F2-Score.
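
For reference, the F-beta score is defined as (1 + beta^2) * precision * recall / (beta^2 * precision + recall). With beta = 2, recall is weighted more heavily than precision, which suits default prediction when a missed defaulter is assumed to cost more than a false alarm. The sketch below computes precision, recall, and F-beta from scratch on made-up labels; the function name and the sample labels are illustrative only.

```python
def precision_recall_fbeta(y_true, y_pred, beta=2.0):
    """Return (precision, recall, F-beta) for binary labels where 1 = default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Made-up example: 10 actual defaulters, 10 non-defaulters.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8  # 6 TP, 4 FN, 2 FP, 8 TN

precision, recall, f2 = precision_recall_fbeta(y_true, y_pred, beta=2.0)
_, _, f1 = precision_recall_fbeta(y_true, y_pred, beta=1.0)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
# precision=0.750 recall=0.600 F1=0.667 F2=0.625
```

Because recall (0.600) is lower than precision (0.750) in this example, F2 comes out below F1.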

Tuning: Hyperparameter optimization for improved model performance.
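
One common way to run such a search is a randomized search with stratified cross-validation, scored on F2. The sketch below is an assumption about how this could look, not the project's actual setup: the parameter grid, the number of iterations, and the synthetic data are invented for illustration, and `GradientBoostingClassifier` again stands in for XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Score candidates by F2 so that recall on defaulters dominates the selection.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Hypothetical search space; the values are illustrative, not the project's.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring=f2_scorer,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F2: {search.best_score_:.3f}")
```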

Model Performance

Below are the key evaluation metrics for the best-performing model, XGBoost, before and after hyperparameter tuning:

Metric	XGBoost Before Tuning	XGBoost After Tuning
ROC AUC (Train Set)	0.9646	0.9861
ROC AUC (Test Set)	0.9469	0.9470
Accuracy (Train Set)	0.9466	0.9397
Accuracy (Test Set)	0.9387	0.9397
F2-Score (Test Set)	0.8483	0.8519
Recall (Cross-Validation Train)	0.8197	0.8239
Recall (Test Set)	0.8176	0.8220
Recall (Cross-Validation Test)	0.8165	0.8208
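
A quick way to read the table is to compare the train/test gap before and after tuning. The short calculation below uses only the figures reported above; the variable names are illustrative.

```python
# Values copied from the table above.
roc_auc = {
    "before": {"train": 0.9646, "test": 0.9469},
    "after": {"train": 0.9861, "test": 0.9470},
}
test_metrics = {
    "F2": {"before": 0.8483, "after": 0.8519},
    "Recall": {"before": 0.8176, "after": 0.8220},
}

# Train-minus-test ROC AUC gap: a larger gap suggests more overfitting.
auc_gap = {stage: round(v["train"] - v["test"], 4) for stage, v in roc_auc.items()}

# Test-set change attributable to tuning.
auc_test_gain = round(roc_auc["after"]["test"] - roc_auc["before"]["test"], 4)
test_gains = {m: round(v["after"] - v["before"], 4) for m, v in test_metrics.items()}

print("ROC AUC gap (train - test):", auc_gap)  # {'before': 0.0177, 'after': 0.0391}
print("ROC AUC test gain:", auc_test_gain)     # 0.0001
print("Test-set gains:", test_gains)           # {'F2': 0.0036, 'Recall': 0.0044}
```

On these numbers, tuning lifted train ROC AUC far more than test ROC AUC, so the train/test gap roughly doubled while the test-set gains in F2 and recall stayed below half a percentage point.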

