Classification models to predict 2-year serious delinquency risk using raw and WOE-transformed datasets with logistic regression and machine learning techniques.
A Comparative Study of Classification Models and WOE-Based Feature Transformation
This project predicts whether a client will experience serious delinquency within the next two years using classification models. The dataset was preprocessed with imputation, outlier treatment, and Weight of Evidence (WOE) transformation to improve interpretability and model effectiveness, especially for logistic regression.
Three machine learning models were evaluated to identify the best-performing approach in terms of accuracy, recall, and business relevance.
🧠 Models Compared
Logistic Regression (with and without WOE)
Random Forest Classifier
XGBoost Classifier
🧩 Feature Strategy
Raw Features – Cleaned but untransformed dataset
WOE-Transformed Features – Variables transformed using Weight of Evidence to support interpretability and improve logistic regression performance
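As a rough illustration of the WOE transformation, the sketch below computes one WOE value per bin, assuming the common convention WOE = ln(share of non-delinquents in the bin / share of delinquents in the bin). The function name `woe_table`, the bin labels, the toy data, and the `smoothing` parameter are illustrative assumptions and are not taken from the project notebooks; WOE binning libraries may use the opposite sign or a different smoothing rule.

```python
import math
from collections import defaultdict


def woe_table(bins, targets, smoothing=0.5):
    """Return {bin_label: WOE}, with WOE = ln(non-event share / event share).

    `targets` holds 1 for a delinquent client and 0 otherwise.
    `smoothing` is an assumed additive constant, not a project setting.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [non_events, events]
    for label, y in zip(bins, targets):
        counts[label][1 if y else 0] += 1

    total_good = sum(c[0] for c in counts.values())
    total_bad = sum(c[1] for c in counts.values())
    n_bins = len(counts)

    table = {}
    for label, (good, bad) in counts.items():
        # Additive smoothing keeps the log finite when a bin is all-good or all-bad.
        good_share = (good + smoothing) / (total_good + smoothing * n_bins)
        bad_share = (bad + smoothing) / (total_bad + smoothing * n_bins)
        table[label] = math.log(good_share / bad_share)
    return table


# Hypothetical utilization bins with made-up outcomes (1 = delinquent).
bins = ["low"] * 10 + ["mid"] * 10 + ["high"] * 10
targets = [0] * 9 + [1] + [0] * 7 + [1] * 3 + [0] * 4 + [1] * 6
table = woe_table(bins, targets)
for label in ("low", "mid", "high"):
    print(label, round(table[label], 3))
```

Under this sign convention, bins dominated by non-delinquents get positive WOE and riskier bins get negative WOE. Each raw value is then replaced by the WOE of its bin before fitting logistic regression.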
🔍 Key Findings
Random Forest delivered the best balance of performance metrics:
ROC AUC: 0.8569
Recall: 0.72
Precision: 0.22
Logistic Regression performed well with WOE-transformed features:
Recall: 0.70
Precision: 0.19
Enabled creation of an interpretable scorecard
XGBoost had the highest precision for non-delinquents but low recall on the delinquent class (0.16), making it less suitable when the goal is to minimize false negatives (missed delinquencies).
Key predictors across models included:
Revolving Utilization of Unsecured Lines
Past Due Counts (30–59, 60–89, 90+)
Age
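To make the reported recall and precision figures concrete, the sketch below converts them into counts per 100 actual delinquents, using recall = TP / (TP + FN) and precision = TP / (TP + FP). It assumes both metrics refer to the delinquent class; the function name `per_100_delinquents` is hypothetical. XGBoost is left out because its precision on the delinquent class is not reported.

```python
def per_100_delinquents(recall, precision):
    """Translate recall and precision into counts per 100 actual delinquents."""
    caught = 100 * recall            # true positives
    missed = 100 - caught            # false negatives
    flagged = caught / precision     # all clients flagged as delinquent
    false_alarms = flagged - caught  # false positives
    return {"caught": caught, "missed": missed, "false_alarms": false_alarms}


# Metrics as reported in the findings above.
random_forest = per_100_delinquents(recall=0.72, precision=0.22)
logistic_woe = per_100_delinquents(recall=0.70, precision=0.19)

for name, result in (("Random Forest", random_forest), ("Logistic Regression (WOE)", logistic_woe)):
    print(name, {key: round(value) for key, value in result.items()})
```

Under these assumptions, Random Forest catches about 72 of every 100 delinquents at the cost of roughly 255 false alarms, while logistic regression with WOE catches about 70 with roughly 298 false alarms.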
🛠️ Tools & Libraries Used
Python, Jupyter Notebook
pandas, numpy, seaborn, matplotlib
scikit-learn, XGBoost
WOE Binning tools, scikit-plot
📁 Repository Structure
credit-risk-prediction/
data/ # Raw dataset
notebooks/ # Model training and evaluation (LR, RF, XGBoost)
Random Forest is the most effective model for credit delinquency prediction in this comparison, with the highest recall (0.72) at a precision of 0.22.
Logistic Regression with WOE remains highly interpretable and practical for deployment via scorecards.
Combining Random Forest's accuracy with Logistic Regression's explainability provides a strong, business-ready solution for credit scoring systems.
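For readers unfamiliar with scorecards, the sketch below shows one widely used way to turn a WOE-based logistic regression into points. It is not necessarily the scaling used in this project. It assumes the model predicts the log-odds of delinquency, that WOE is defined as ln(non-delinquent share / delinquent share), and it uses hypothetical scaling parameters (`pdo`, `base_score`, `base_odds`) and made-up coefficients.

```python
import math


def scorecard_points(woe, coef, intercept, n_features,
                     pdo=20.0, base_score=600.0, base_odds=50.0):
    """Points for one attribute of one feature (all defaults are assumptions).

    pdo:        points needed to double the odds of being a good client
    base_score: score assigned at `base_odds` (good:bad)
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    # The model outputs log-odds of delinquency, so the sign is flipped:
    # higher points then mean lower risk.
    return -(coef * woe + intercept / n_features) * factor + offset / n_features


# Made-up two-feature model; the coefficients are negative because
# positive WOE marks safer bins under the convention assumed here.
intercept = -math.log(50.0)
coefs = [-1.0, -0.8]
client_woes = [math.log(2), 0.0]

points = [scorecard_points(w, c, intercept, len(coefs)) for w, c in zip(client_woes, coefs)]
total_score = sum(points)
print([round(p, 1) for p in points], round(total_score, 1))
```

Summing the points over all features gives `offset + factor * ln(odds of being good)`, so each feature's contribution to the final score can be read directly from the card.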