Credit_Risk_Analysis

Using Supervised Machine Learning to Predict Credit Risk

Background

Overview of Analysis

This project consists of three technical analysis deliverables and a written report.

Purpose

Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Therefore, we needed to employ different techniques to train and evaluate models with unbalanced classes.

Using the credit card dataset from LendingClub, a peer-to-peer lending services company, we'll oversample the data using the RandomOverSampler and SMOTE algorithms, and undersample the data using the ClusterCentroids algorithm. Then, we'll use a combinatorial approach of over- and undersampling with the SMOTEENN algorithm. Finally, we'll compare two machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk.

Resources

Data source: credit card loan dataset from LendingClub

Software: Python, scikit-learn, imbalanced-learn

Methodology

D1: Use Resampling Models to Predict Credit Risk

Using the imbalanced-learn and scikit-learn libraries, we evaluated three machine learning models by using resampling to determine which was better at predicting credit risk. First, we used the oversampling RandomOverSampler and SMOTE algorithms, and then we used the undersampling ClusterCentroids algorithm. Using these algorithms, we resampled the dataset, viewed the count of the targeted classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
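The sketch below shows roughly what this workflow looks like for the RandomOverSampler case; SMOTE follows the same pattern, and ClusterCentroids only swaps the resampler (it lives in imblearn.under_sampling). It is a minimal illustration rather than the exact notebook code, and the file name and column names are assumptions.

```python
from collections import Counter

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler  # SMOTE is also here; ClusterCentroids is in imblearn.under_sampling
from imblearn.metrics import classification_report_imbalanced

# Hypothetical pre-cleaned loan data: numeric features plus a binary loan_status target
df = pd.read_csv("loan_data_cleaned.csv")   # file name is an assumption
X = df.drop(columns="loan_status")          # column name is an assumption
y = df["loan_status"]                       # "high_risk" / "low_risk"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Resample the training data and view the count of the target classes
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print(Counter(y_resampled))

# Train a logistic regression classifier on the resampled data
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)

# Balanced accuracy score, confusion matrix, and imbalanced classification report
y_pred = model.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```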


D2: Use the SMOTEENN algorithm to Predict Credit Risk

Using the imbalanced-learn and scikit-learn libraries, we used a combinatorial approach of over- and undersampling with the SMOTEENN algorithm to determine if the results from the combinatorial approach were better at predicting credit risk than the resampling algorithms from Deliverable 1. Using the SMOTEENN algorithm, we resampled the dataset, viewed the count of the targeted classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
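A minimal sketch of this step, assuming the same train/test split and the same logistic regression settings as in Deliverable 1:

```python
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.combine import SMOTEENN
from imblearn.metrics import classification_report_imbalanced

# Combination over- and undersampling of the training data
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
print(Counter(y_resampled))

# Same logistic regression classifier and metrics as in Deliverable 1
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```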


D3: Use Ensemble Classifiers to Predict Credit Risk

Using the imblearn.ensemble library, we trained and compared two different ensemble classifiers, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk and evaluated each model. For both algorithms, we viewed the count of the target classes, trained the ensemble classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report. For the BalancedRandomForestClassifier, we also sorted the features in descending order by feature importance.
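A minimal sketch of this step, again assuming the train/test split from Deliverable 1; `n_estimators=100` is an assumption rather than a value taken from the notebook.

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from imblearn.metrics import classification_report_imbalanced

# Balanced random forest: each tree is trained on a balanced bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))

# Features sorted in descending order by feature importance
importances = sorted(zip(brf.feature_importances_, X_train.columns), reverse=True)

# Easy ensemble: boosted learners trained on balanced bootstrap samples
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
y_pred = eec.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```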


Results

D1. For all three algorithms, the following have been completed:

RandomOverSampler

Findings:

An accuracy score for the model is calculated:

Figure (1.1) RandomOverSampler balanced accuracy report


A confusion matrix has been generated:

Figure (1.2) RandomOverSampler confusion matrix


An imbalanced classification report has been generated:

Figure (1.3) RandomOverSampler imbalanced classification report


SMOTE

Findings:

An accuracy score for the model is calculated:

Figure (1.4) SMOTE balanced accuracy report


A confusion matrix has been generated:

Figure (1.5) SMOTE confusion matrix


An imbalanced classification report has been generated:

Figure (1.6) SMOTE imbalanced classification report


ClusterCentroids

Findings:

An accuracy score for the model is calculated:

Figure (1.7) ClusterCentroids balanced accuracy report


A confusion matrix has been generated:

Figure (1.8) ClusterCentroids confusion matrix


An imbalanced classification report has been generated:

Figure (1.9) ClusterCentroids imbalanced classification report


D2. The combinatorial SMOTEENN algorithm does the following:

SMOTEENN

Findings:

An accuracy score for the model is calculated:

Figure (1.10) SMOTEENN balanced accuracy report


A confusion matrix has been generated:

Figure (1.11) SMOTEENN confusion matrix


An imbalanced classification report has been generated:

Figure (1.12) SMOTEENN imbalanced classification report


D3. Both ensemble algorithms do the following:

BalancedRandomForestClassifier

Findings:

An accuracy score for the model is calculated:

Figure (1.13) BalancedRandomForestClassifier balanced accuracy report


A confusion matrix has been generated:

Figure (1.14) BalancedRandomForestClassifier confusion matrix


An imbalanced classification report has been generated:

Figure (1.15) BalancedRandomForestClassifier imbalanced classification report


The features are sorted in descending order by feature importance:

Figure (1.16) Features sorted in descending order by feature importance


EasyEnsembleClassifier

Findings:

An accuracy score for the model is calculated:

Figure (1.17) EasyEnsembleClassifier balanced accuracy report


A confusion matrix has been generated:

Figure (1.18) EasyEnsembleClassifier confusion matrix


An imbalanced classification report has been generated:

Figure (1.19) EasyEnsembleClassifier imbalanced classification report


Summary

The recall (sensitivity) for predicting the high_risk and low_risk classes is in line across most of the models. However, the precision for predicting high_risk is much lower than it is for predicting low_risk, and this lower precision is reflected in a depressed F1 score.

In this scenario, the sensitivity is very high while the precision is very low; such a model is not very useful on its own, so we also look at the F1 value. A pronounced imbalance between sensitivity and precision yields a low F1 score, because F1 is the harmonic mean of the two.
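As a quick illustration of why (with made-up numbers, not figures from our results), a very low precision dominates the F1 score even when sensitivity is high:

```python
# Illustrative values only, not taken from our results
precision = 0.01     # almost every predicted high_risk loan is actually low_risk
sensitivity = 0.90   # but most true high_risk loans are caught

f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
print(round(f1, 3))  # ~0.02: the harmonic mean is dragged toward the lower value
```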

The F1 values for our models:

The accuracy scores for our models:

To summarize our results, we focus on the targeted class (high_risk) across the six models:

In general, the models were not very good at predicting high risk, since the F1 values for most models fell between 0.06 and 0.07. However, of the six models we would recommend the EasyEnsembleClassifier: it did a better job of classifying the data, improved the high_risk F1 value from 0.06 to 0.16, and achieved a better accuracy score of 93.2%.

For future evaluations, we may want to explore the use of precision-recall curves to compare model performance on imbalanced datasets.
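A minimal sketch of how such a curve could be generated for one of the fitted models; the `model` variable (e.g. the EasyEnsembleClassifier) and the hold-out set names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# `model` is assumed to be an already-fitted classifier, and X_test / y_test
# the hold-out set from the earlier sketches.
pos_idx = list(model.classes_).index("high_risk")
probas = model.predict_proba(X_test)[:, pos_idx]   # probability of the high_risk class

precision, recall, _ = precision_recall_curve(y_test, probas, pos_label="high_risk")

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve (high_risk class)")
plt.show()
```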

References

Markdown

scikit-learn

imbalanced-learn