Home Credit Default Risk Prediction: Data Science Final Project

This Data Science Final Project, which received the Best Group and Best Student awards, predicts Home Credit customer default risk. It covers data preprocessing, EDA, feature engineering, and machine learning with XGBoost and Stacking, and provides actionable insights to enhance credit risk management and reduce default rates.

Project Overview

The Data Science Final Project focuses on predicting the risk of customer default using historical loan application data. By leveraging robust data preprocessing, feature engineering, and advanced machine learning techniques, this project aims to provide actionable insights to improve credit risk management for Home Credit.

Workflow Stages

Stage 0: Preparation

  • Tasks: Data collection, initial data inspection, and understanding business requirements.
  • Deliverables: Raw datasets and project goals.

Stage 1: Exploratory Data Analysis (EDA)

  • Techniques:
    • Statistical analysis of key variables.
    • Visualizations to identify patterns and correlations.
  • Objective: Identify trends and potential predictors for default risk.

Stage 2: Preprocessing

    1. Data Cleansing: Handling missing values, removing duplicates, and transforming features.
    2. Feature Engineering: Adding new features such as credit ratios and tenure groups.
    3. Encoding: Binary and one-hot encoding for categorical variables.
    4. Handling Class Imbalance: Using oversampling (SMOTE) and undersampling techniques.
    5. Normalization: Standardizing numerical features (zero mean, unit variance), which mainly benefits distance-based and linear models such as KNN and Logistic Regression.
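
The preprocessing steps above can be sketched on a toy dataset. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the records, the column names (`credit`, `income`, `contract`, `default`), and the helper functions are all hypothetical, and plain random oversampling stands in for SMOTE, which synthesizes new minority samples rather than duplicating existing ones.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy records; column names are illustrative, not the project's schema.
rows = [
    {"credit": 200_000.0, "income": 100_000.0, "contract": "cash", "default": 0},
    {"credit": 450_000.0, "income": 150_000.0, "contract": "revolving", "default": 0},
    {"credit": 300_000.0, "income": 60_000.0, "contract": "cash", "default": 1},
    {"credit": 120_000.0, "income": 80_000.0, "contract": "cash", "default": 0},
]

def add_credit_ratio(rows):
    """Feature engineering: credit amount relative to income."""
    return [{**r, "credit_income_ratio": r["credit"] / r["income"]} for r in rows]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded

def random_oversample(rows, label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    A simple stand-in for SMOTE; assumes both classes are present.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[label] == 1]
    neg = [r for r in rows if r[label] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def standardize(rows, column):
    """Rescale a numeric column to zero mean and unit (population) variance.

    Assumes the column is not constant (non-zero standard deviation).
    """
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]

# Apply the steps in the order listed above.
prepared = add_credit_ratio(rows)
prepared = one_hot(prepared, "contract")
balanced = random_oversample(prepared, "default")
scaled = standardize(balanced, "credit_income_ratio")
```

In a real workflow, resampling and scaling statistics would normally be computed on the training split only, so that information from the test set does not leak into the model.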

Stage 3: Machine Learning

Models Used:

    • AdaBoost: Combines weak classifiers into a strong classifier by adaptively reweighting the training samples that earlier rounds misclassified, so simple base models are sufficient.
    • Decision Tree: A flexible algorithm for classification and regression, capable of handling numeric and categorical data while generating intuitive tree structures.
    • Random Forest: An ensemble of decision trees trained on random data subsets, suitable for classification and regression tasks.
    • KNN (K-Nearest Neighbors): A lazy learning algorithm that classifies data points based on their neighbors, effective for handling local data patterns.
    • XGBoost: Known for its high speed and efficiency, XGBoost excels at processing sparse data and controlling overfitting through regularization.
    • Stacking: Combines predictions from multiple base models through meta-modeling to enhance prediction accuracy.
    • Logistic Regression: A proven method for predicting loan defaults, with a focus on interest rates as a key predictor.

Evaluation Metrics: ROC-AUC, Recall, Precision, F1-Score, and F2-Score.
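
For reference, the F2-Score is the F-beta score with beta = 2, which weights recall more heavily than precision. That suits default prediction, where missing a defaulter is usually costlier than flagging a good customer. The sketch below computes precision, recall, and F-beta from confusion-matrix counts; the function name and the counts are hypothetical and are not taken from the project's results.

```python
def precision_recall_fbeta(tp, fp, fn, beta=2.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical counts: 80 defaulters caught, 40 false alarms, 20 defaulters missed.
p, r, f2 = precision_recall_fbeta(tp=80, fp=40, fn=20, beta=2.0)
_, _, f1 = precision_recall_fbeta(tp=80, fp=40, fn=20, beta=1.0)
```

With these counts recall (0.80) exceeds precision (about 0.67), so the F2-Score comes out higher than the F1-Score.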

Tuning: Hyperparameter optimization for improved model performance.

Model Performance

Below are the key evaluation metrics for the best-performing model, XGBoost, before and after hyperparameter tuning:

| Metric | XGBoost Before Tuning | XGBoost After Tuning |
|---|---|---|
| ROC AUC (Train Set) | 0.9646 | 0.9861 |
| ROC AUC (Test Set) | 0.9469 | 0.9470 |
| Accuracy (Train Set) | 0.9466 | 0.9397 |
| Accuracy (Test Set) | 0.9387 | 0.9397 |
| F2-Score (Test Set) | 0.8483 | 0.8519 |
| Recall (Cross-Validation Train) | 0.8197 | 0.8239 |
| Recall (Test Set) | 0.8176 | 0.8220 |
| Recall (Cross-Validation Test) | 0.8165 | 0.8208 |

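These results indicate that hyperparameter tuning produced small gains on held-out data: test recall rose from 0.8176 to 0.8220 and the test F2-Score from 0.8483 to 0.8519, while test ROC AUC was essentially unchanged (0.9469 to 0.9470). Train ROC AUC rose more sharply (0.9646 to 0.9861), so the gap between train and test performance widened after tuning. Overall, the tuned model is slightly better at correctly identifying high-risk borrowers.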
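
The size of these changes can be checked with a few lines of arithmetic. The sketch below copies the reported values from the table above and computes the before/after differences and the train-test ROC AUC gap; the variable names are illustrative only.

```python
# Reported values from the table above: (before tuning, after tuning).
metrics = {
    "ROC AUC (Train Set)": (0.9646, 0.9861),
    "ROC AUC (Test Set)": (0.9469, 0.9470),
    "F2-Score (Test Set)": (0.8483, 0.8519),
    "Recall (Test Set)": (0.8176, 0.8220),
}

# Change in each metric after tuning, rounded to the table's precision.
deltas = {name: round(after - before, 4) for name, (before, after) in metrics.items()}

# Train-test ROC AUC gap before and after tuning (larger gap = more overfitting risk).
train_before, train_after = metrics["ROC AUC (Train Set)"]
test_before, test_after = metrics["ROC AUC (Test Set)"]
gap_before = round(train_before - test_before, 4)
gap_after = round(train_after - test_after, 4)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
print(f"Train-test AUC gap: {gap_before:.4f} before, {gap_after:.4f} after")
```

The test-set changes are all below half a percentage point, while the train-test AUC gap roughly doubles, which is worth keeping in mind when judging how much the tuning helped.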

Stage 4: Final Presentation

Deliverables:

    • Summary of model results and business insights.
    • Recommendations for improving credit policy and reducing default rates.

Key Insights and Results

  • The XGBoost model provided the best balance between recall and precision, making it suitable for identifying high-risk customers.
  • The optimized model successfully reduced the predicted default rate from 8.04% to 0.96%, aligning with Home Credit’s business goals.

Business Recommendations

  1. Focus on Customer Segmentation and Retargeting:
    • Target high-risk sectors like Consumer Electronics and Connectivity with specialized credit offers.
  2. Enhance Default Prediction Accuracy:
    • Use demographic and behavioral features for better risk profiling.
  3. Flexible Credit Policies:
    • Offer competitive interest rates for low-risk customers and incentivize loyalty.
Certificate of Awardee - Hijir Della Wirasti - The Best Group of Final Project (Byte Me)
Certificate of Awardee - Hijir Della Wirasti - The Best Student of Final Project (Byte Me)

Code & Documentation

🔗 Find the complete code and documentation on my GitHub:
