Project: ML - Logistic Regression, Random Forests (Lending Club)


Problem:

  • Predict the outcome of Lending Club loans to inform a conservative investor, using stats on lenders & borrowers for approved and declined loan applications (loan amount, interest rate, employment length, income, installments)
  • Binary classification (loan fully paid or charged off) using LogisticRegression, KNeighborsClassifier and RandomForestClassifier


Tools:

  • Feature Preparation, Selection and Engineering (transforming and processing):
    • drop columns that leak information about the future, don't affect the borrower's ability to pay the loan, need to be cleaned up, require a lot of processing, or contain redundant information
    • identify and clean the target column (deal with imbalance in the target using a class_weight penalty)
    • identify numerical columns needing conversion
    • handle missing values (HMV): remove columns with more than 1% missing values, drop rows with missing values
    • categorical features
      • transform numerical to categorical (for those where numbers have no meaning)
      • identify text columns to make categorical:
        • only those with a few unique values
        • remove low variance columns (more than 95% of the values are the same)
      • deal with NaNs, then dummy code
    • reshuffle the DataFrame
    • Feature Selection using RFECV
  • Models: Logistic Regression, KNN, Random Forest Classifier
  • Model validation and hyperparameter search: GridSearchCV, K-fold validation and predictions
  • Error Metrics: TPR and FPR, f1_score, recall_score, precision_score, accuracy and ROC_AUC


load defaults

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import re
import requests 

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates
from datetime import datetime
from IPython.display import display, Math

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))


#specific to this project
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score
from sklearn.metrics import classification_report, confusion_matrix

print("Defaults Loaded")
Defaults Loaded


Dataset: Lending Club stats

In [77]:
loans = pd.read_csv('./data/loans_2007.csv', low_memory=False)
display(loans.iloc[:3,:13])
print("Number of Columns: {:d}".format(len(loans.columns)))
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership
0 1077501 1296599.0 5000.0 5000.0 4975.0 36 months 10.65% 162.87 B B2 NaN 10+ years RENT
1 1077430 1314167.0 2500.0 2500.0 2500.0 60 months 15.27% 59.83 C C4 Ryder < 1 year RENT
2 1077175 1313524.0 2400.0 2400.0 2400.0 36 months 15.96% 84.33 C C5 NaN 10+ years RENT
Number of Columns: 52


1- Feature Preparation, Selection and Engineering:

  • identify and clean the target column (look out for class imbalance)
  • identify columns that:
    • leak information about the future (after the loan was funded)
    • don't affect the borrower's ability to pay back the loan
    • are poorly formatted and need to be cleaned up
    • require more data or a lot of processing
    • contain redundant information
  • drop columns with only one unique value (after removing nan)
  • handle missing values:
    • remove columns with more than 1% missing values, keep employment length as it is likely a good predictor
    • drop rows with missing values
  • identify numerical columns needing conversion and extraneous columns
  • dummy coding for categorical features
    • transform numerical to categorical (for those where numbers have no meaning)
    • identify text columns to make categorical:
      • only those with a few unique values
      • remove low variance columns (more than 95% of the values are the same; see the sketch after this list)
    • deal with NaNs, then dummy code
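
The low-variance step above is only described in the plan; a minimal sketch of one way to implement it (the column loop and the 95% threshold are illustrative assumptions, applied to the loans DataFrame loaded above):

In [ ]:
#sketch: drop columns where a single value accounts for more than 95% of rows
#(not one of the original cells; the threshold is an assumption)
low_variance_cols = []
for col in loans.columns:
    top_share = loans[col].value_counts(normalize=True, dropna=False).iloc[0]
    if top_share > 0.95:
        low_variance_cols.append(col)

loans = loans.drop(columns=low_variance_cols)
print("Dropped:", low_variance_cols)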

target column

  • loan_status is the only column that describes whether the loan was paid on time, is delayed, or defaulted
  • it contains text values and needs to be converted to numerical
In [78]:
print(loans['loan_status'].value_counts())
Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

only 'Fully Paid' and 'Charged Off' describe loans with a final outcome (the other statuses are ongoing)

  • select only rows with loan_status equal to 'Fully Paid' or 'Charged Off'
  • assign 'Fully Paid' to 1 and 'Charged Off' to 0, so that we have a binary classification problem
In [79]:
sel = (loans['loan_status']== 'Fully Paid') | (loans['loan_status']== 'Charged Off')
loans = loans[sel]

mapping_dict = {
    "loan_status": {
        "Charged Off": 0,
        "Fully Paid": 1
    }
}
loans = loans.replace(mapping_dict)
print(loans['loan_status'].value_counts())
1    33136
0     5634
Name: loan_status, dtype: int64

there is class imbalance in the target column; keep this in mind
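
A quick sketch (not one of the original cells) to make the imbalance explicit as proportions:

In [ ]:
#sketch: class proportions; roughly 85% of the remaining loans are fully paid (class 1)
print(loans['loan_status'].value_counts(normalize=True))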

clean features

In [80]:
cols_to_drop = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade', 'emp_title', 
                'issue_d', 'zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 
                'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 
                'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt']
loans.drop(cols_to_drop, axis=1, inplace=True)
print(len(loans.columns))


#drop columns with only one unique value (after removing nan)
drop_columns = []

for element in loans.columns:
    if(len(loans[element].dropna().unique())==1):
        drop_columns.append(element)
        
loans.drop(drop_columns, axis=1, inplace=True)
print(len(loans.columns))

#remove columns with more than 1% missing values (keep employment length as it is likely a good predictor)
null_counts = loans.isnull().sum()
#print(null_counts[null_counts>len(loans)*0.01])
print(null_counts[null_counts>0.])
loans.drop('pub_rec_bankruptcies', axis=1, inplace=True)

#drop rows with missing values
loans.dropna(axis=0, inplace=True)
null_counts = loans.isnull().sum()
#print(null_counts[null_counts>0])

#write file
loans.to_csv('./data/filtered_loans_2007.csv', index=False)
32
23
emp_length              1036
title                     11
revol_util                50
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

identify numerical columns needing conversion and extraneous columns

In [81]:
loans = pd.read_csv('./data/filtered_loans_2007.csv')
object_columns_df = loans.select_dtypes(include=['object'])
display(object_columns_df[:1])
term int_rate emp_length home_ownership verification_status purpose title addr_state earliest_cr_line revol_util last_credit_pull_d
0 36 months 10.65% 10+ years RENT Verified credit_card Computer AZ Jan-1985 83.7% Jun-2016
In [82]:
#convert to numeric:
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype(float)
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype(float)

#remove extraneous
cols_to_drop = ['last_credit_pull_d', 'earliest_cr_line']
loans.drop(cols_to_drop, axis=1, inplace=True)

dummy coding:

  • home_ownership, verification_status, emp_length, term: few discrete values; encode as dummy variables
  • purpose, title: overlapping information; keep purpose since title contains 18881 discrete values
  • addr_state: 50 different values; remove
In [83]:
#cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state', 'purpose','title']

#for element in cols:    
#    print(len(object_columns_df[element].value_counts()))
#    print(object_columns_df[element].value_counts()[:5])
#    print('\n')

#drop 'addr_state' and 'title'
cols_to_drop = ['addr_state', 'title']
loans.drop(cols_to_drop, axis=1, inplace=True)

#map emp_length
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans = loans.replace(mapping_dict)

#dummy code 'home_ownership', 'verification_status', 'purpose' and 'term'
cols = ['home_ownership', 'verification_status', 'purpose', 'term']
for element in cols:
    loans[element] = loans[element].astype('category')

dummy_df = pd.get_dummies(loans[cols])
loans = pd.concat([loans, dummy_df], axis=1)
loans.drop(cols, axis=1, inplace=True)

print(len(loans.columns))

#Shuffle DF
np.random.seed(1) 
loans = loans.iloc[np.random.permutation(len(loans))]

#write file
loans.to_csv('./data/cleaned_loans_2007.csv', index=False)
38


2 - RFECV for Logistic Regression & Random Forest Classifier

load cleaned data:

In [ ]:
loans = pd.read_csv('./data/cleaned_loans_2007.csv')

feature_columns = loans.drop('loan_status', axis=1).columns.tolist()
target = 'loan_status'
In [60]:
def select_features(df, target, model):    
    #select numeric and drop NaNs
    df_new = df.select_dtypes([np.number]).dropna(axis=1)
    #split into features and target
    all_X = df_new.drop(target,axis=1)
    all_y = df_new[target]
        
    #cv is the number of folds
    selector = RFECV(model, cv=10)
    selector.fit(all_X, all_y)  
    optimized_columns = list(all_X.columns[selector.support_])
    
    print("Best Columns \n"+"-"*12+"\n{}\n".format(optimized_columns))    
    return optimized_columns

model = RandomForestClassifier(n_estimators=50, random_state=1, min_samples_leaf=5, class_weight='balanced')
optimized_columns_RFC = select_features(loans[:int(len(loans)/5.)], target, model)

model = LogisticRegression(solver = 'lbfgs', class_weight='balanced', max_iter=3000)
optimized_columns_LR = select_features(loans[:int(len(loans)/5.)], target, model)
Best Columns 
------------
['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'open_acc', 'revol_bal', 'revol_util', 'total_acc', 'term_ 60 months']

Best Columns 
------------
['term_ 36 months']


3 - Model Selection with GridSearchCV

In [85]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings(action='ignore', category=UndefinedMetricWarning)

def select_model(df, features_list, target, models_to_fit):    
    
    dicts= [ {
               "name": "LogisticRegression",
               "estimator": LogisticRegression(max_iter = 5000),
               "hyperparameters": 
                 {                
                   "solver": ["newton-cg", "lbfgs", "liblinear"],  
                   "class_weight": ["balanced", ""]                      
                 }
             },
             {
               "name": "KNeighborsClassifier",
               "estimator": KNeighborsClassifier(),
               "hyperparameters": 
                 {
                   "n_neighbors": range(1,20,2),
                   "weights": ["distance", "uniform"],
                   "algorithm": ["ball_tree", "kd_tree", "brute"],
                   "p": [1,2]
                 }
             },
             {
               "name": "RandomForestClassifier",
               "estimator": RandomForestClassifier(),
               "hyperparameters": 
                 {
                   "n_estimators": [5, 20, 100],
                   "criterion": ["entropy", "gini"],
                   "max_depth": [2, 5, 10],
                   "max_features": ["log2", "sqrt"],
                   "min_samples_leaf": [1, 5, 8],
                   "min_samples_split": [2, 3, 5], 
                   "class_weight": [None, "balanced", {0: 3, 1: 1}, {0: 5, 1: 1}]               
                 }
             } ]    
    
    scoring = {'ROC_AUC':'roc_auc', 'Accuracy':'accuracy', 
               'precision_1': make_scorer(precision_score, pos_label=1),
               'recall_1': make_scorer(recall_score, pos_label=1),
               'f1_1': make_scorer(f1_score, pos_label=1),
               'precision_0': make_scorer(precision_score, pos_label=0),
               'recall_0': make_scorer(recall_score, pos_label=0),
               'f1_0': make_scorer(f1_score, pos_label=0)}
    
    all_y = df[target]
    for element in dicts:
        if(element['name'] not in models_to_fit):
            continue
        print(element['name'])
        print('-'*len(element['name']))
        
        all_X = df[features_list[element['name']]]
        model = element['estimator']
        grid = GridSearchCV(model, element['hyperparameters'], cv=10, scoring=scoring, refit='ROC_AUC', iid=True)
        grid.fit(all_X, all_y)
        
        element['best_params'] = grid.best_params_
        element['best_score'] = grid.best_score_
        element['best_estimator'] = grid.best_estimator_          
        for scorer in scoring:          
            print(f"{scorer}: {max(grid.cv_results_['mean_test_'+scorer]):0.3f}")
        print("Best Parameters: {}".format(grid.best_params_))
        print("Best Score: {:0.3f}\n\n".format(grid.best_score_))
        
        
        #for scorer in scoring:
        #    print(cv_results_'_<scorer_name>')
       
    return dicts
        
models_to_fit = ['LogisticRegression','KNeighborsClassifier','RandomForestClassifier']
optimized_columns = {'LogisticRegression': optimized_columns_LR, 
                     'KNeighborsClassifier': optimized_columns_LR, 
                     'RandomForestClassifier': optimized_columns_RFC}
 
model_dicts = select_model(loans[:int(len(loans)/10.)], optimized_columns, target, models_to_fit)


print("model selection finished")
LogisticRegression
------------------
ROC_AUC: 0.615
Accuracy: 0.853
precision_1: 0.891
recall_1: 1.000
f1_1: 0.920
precision_0: 0.262
recall_0: 0.447
f1_0: 0.330
Best Parameters: {'class_weight': 'balanced', 'solver': 'newton-cg'}
Best Score: 0.615


KNeighborsClassifier
--------------------
ROC_AUC: 0.615
Accuracy: 0.853
precision_1: 0.864
recall_1: 1.000
f1_1: 0.920
precision_0: 0.076
recall_0: 0.134
f1_0: 0.097
Best Parameters: {'algorithm': 'brute', 'n_neighbors': 19, 'p': 1, 'weights': 'distance'}
Best Score: 0.615


RandomForestClassifier
----------------------
ROC_AUC: 0.699
Accuracy: 0.854
precision_1: 0.909
recall_1: 1.000
f1_1: 0.921
precision_0: 0.588
recall_0: 0.591
f1_0: 0.364
Best Parameters: {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 20}
Best Score: 0.699


model selection finished


4 - Cross_Val Predictions for best model (RandomForestClassifier)

In [86]:
kf = KFold(10, shuffle=True, random_state=1)

best_model =  {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 
               'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 20}
model = RandomForestClassifier()
model.set_params(**best_model)  

predictions = cross_val_predict(model, loans[optimized_columns_RFC], loans[target], cv=kf)
predictions = pd.Series(predictions)

#classification report
print(classification_report(loans[target],predictions))

c_matrix = confusion_matrix(loans[target],predictions)
print(c_matrix)
#sklearn confusion_matrix: rows are true labels, columns are predictions -> [[tn, fp], [fn, tp]]
tn = c_matrix[0][0]
fp = c_matrix[0][1]
fn = c_matrix[1][0]
tp = c_matrix[1][1]

tpr = tp/(tp+fn)
fpr = fp/(fp+tn)

print("TPR:{:0.3f}, FPR:{:0.3f}".format(tpr, fpr))
              precision    recall  f1-score   support

           0       0.22      0.64      0.33      5389
           1       0.91      0.63      0.75     32286

   micro avg       0.63      0.63      0.63     37675
   macro avg       0.57      0.63      0.54     37675
weighted avg       0.81      0.63      0.69     37675

[[ 3435  1954]
 [11878 20408]]
TPR:0.632, FPR:0.363


5 - Logistic Regression Model with different weights

  • a good first model for binary classification problems:
    • quick to train, so we can iterate quickly
    • less prone to overfitting than more complex models like decision trees
    • easy to interpret
  • Error Metric
    • In this problem we are mostly concerned with false positives (they cost money) and false negatives (they lose potential profit)
    • because of the class imbalance a classifier can predict 1 for every row and still achieve high accuracy, so we rely on false positives and false negatives instead
    • optimize for high recall (true positive rate) and low fall-out (false positive rate); see the sketch after this list:
    • $TPR = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$
    • $FPR = \frac{\mathrm{False\ Positives}}{\mathrm{False\ Positives} + \mathrm{True\ Negatives}}$
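
For reference, a minimal sketch of these two rates computed from sklearn's confusion matrix (y_true and y_pred are placeholders for the target column and the cross-validated predictions):

In [ ]:
#sketch: TPR/FPR from a confusion matrix (y_true, y_pred are placeholders)
#sklearn layout with labels=[0, 1]: [[tn, fp], [fn, tp]]
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
tpr = tp / (tp + fn)   #recall of the positive class
fpr = fp / (fp + tn)   #fall-out
print("TPR:{:0.3f}, FPR:{:0.3f}".format(tpr, fpr))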

Logistic Regression Model with k-fold predictions

In [8]:
loans = pd.read_csv('./data/cleaned_loans_2007.csv')

features_df = loans.drop('loan_status', axis=1)
target = 'loan_status'

lr = LogisticRegression(solver='lbfgs', max_iter=30000)
kf = KFold(10, random_state=1)

predictions = cross_val_predict(lr, features_df, loans[target], cv=kf)
predictions = pd.Series(predictions)

tp = len(predictions[(loans[target]==1) & (predictions==1)])
tn = len(predictions[(loans[target]==0) & (predictions==0)])
fp = len(predictions[(loans[target]==0) & (predictions==1)])
fn = len(predictions[(loans[target]==1) & (predictions==0)])

tpr = tp/(tp+fn)
fpr = fp/(fp+tn)

print("TPR:{:0.3f}, FPR:{:0.3f}".format(tpr, fpr))
TPR:0.999, FPR:0.997

Logistic Regression Model with class_weight='balanced'

correct for the imbalance by penalizing misclassifications of the less prevalent class more heavily than the other class:

  • class_weight='balanced'
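
For reference, 'balanced' weights each class by n_samples / (n_classes * class_count); a small sketch (not part of the original run) of the penalty this implies here:

In [ ]:
#sketch: weights implied by class_weight='balanced'
#formula: n_samples / (n_classes * count_of_each_class)
counts = loans['loan_status'].value_counts().sort_index()   #counts for classes 0 and 1
weights = len(loans) / (2 * counts)
print(weights)                    #per-class weight
print(weights[0] / weights[1])    #relative penalty for misclassifying a 0 (roughly 6)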
In [4]:
lr = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=3000)
#cross validation across all the rows of the training data (k = n folds, i.e. leave-one-out validation)
#kf = KFold(features.shape[0], random_state=1)
kf = KFold(10, random_state=1)
predictions = cross_val_predict(lr, features_df, loans[target], cv=kf)
predictions = pd.Series(predictions)

tp = len(predictions[(loans[target]==1) & (predictions==1)])
tn = len(predictions[(loans[target]==0) & (predictions==0)])
fp = len(predictions[(loans[target]==0) & (predictions==1)])
fn = len(predictions[(loans[target]==1) & (predictions==0)])

tpr = tp/(tp+fn)
fpr = fp/(fp+tn)

print("TPR:{:0.3f}, FPR:{:0.3f}".format(tpr, fpr))
TPR:0.540, FPR:0.334

Logistic Regression Model with class_weight=penalty dictionary

setting class_weight to 'balanced' assigns a penalty of roughly 6 for misclassifying a 0 (there are roughly 6 times more 1's than 0's)

  • we can set the penalty manually to try to lower the false positive rate (a higher penalty for misclassifying the negative class, 0)
In [5]:
penalty = {0: 10, 1: 1}
lr = LogisticRegression(solver = 'lbfgs', class_weight=penalty, max_iter=3000)
#kf = KFold(features.shape[0], random_state=1)
kf = KFold(10, random_state=1)
predictions = cross_val_predict(lr, features_df, loans[target], cv=kf)
predictions = pd.Series(predictions)

tp = len(predictions[(loans[target]==1) & (predictions==1)])
tn = len(predictions[(loans[target]==0) & (predictions==0)])
fp = len(predictions[(loans[target]==0) & (predictions==1)])
fn = len(predictions[(loans[target]==1) & (predictions==0)])

tpr = tp/(tp+fn)
fpr = fp/(fp+tn)

print("TPR:{:0.3f}, FPR:{:0.3f}".format(tpr, fpr))
TPR:0.162, FPR:0.067

a lower false positive rate, at the expense of the true positive rate
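
To make that tradeoff explicit, a hedged sketch that sweeps a few manual class-0 penalties (the weight values are illustrative; it reuses features_df, target and kf from the cells above):

In [ ]:
#sketch: sweep a few class-0 penalties and watch TPR and FPR both fall
for w in [2, 5, 10, 20]:
    lr = LogisticRegression(solver='lbfgs', class_weight={0: w, 1: 1}, max_iter=3000)
    preds = pd.Series(cross_val_predict(lr, features_df, loans[target], cv=kf))
    tp = ((loans[target]==1) & (preds==1)).sum()
    tn = ((loans[target]==0) & (preds==0)).sum()
    fp = ((loans[target]==0) & (preds==1)).sum()
    fn = ((loans[target]==1) & (preds==0)).sum()
    print("weight {:2d} -> TPR:{:0.3f}, FPR:{:0.3f}".format(w, tp/(tp+fn), fp/(fp+tn)))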

In [ ]: