Project: ML - Logistic Regression, One-versus-all Method for MultiClass Classification + RFC & GBC (Car Origins)

Problem:

  • Predict continent of origin from technical car properties: year built, cylinders, horsepower, weight, mpg, acceleration, etc.
  • Multiclass classification (USA = 1, Europe = 2, Asia = 3) with LogisticRegression + the one-vs-all method, RandomForestClassifier & GradientBoostingClassifier


Tools:

  • Feature Engineering: Dummy coding, RFECV
  • Models: LogisticRegression + the one-vs-all method, RandomForestClassifier & GradientBoostingClassifier
  • Model validation and hyperparameter search: GridSearchCV, K-fold cross-validation and cross-validated predictions
  • Error Metrics: Accuracy, TPR, TNR, MCC, classification_report, confusion_matrix


load defaults

In [103]:
import numpy as np
import pandas as pd
import seaborn as sns
import re
import requests 

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates
from datetime import datetime
from IPython.display import display, Math

from functions import *
import myML_functions as myML_functions

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))


#specific to this project
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score

print("Defaults Loaded")
Defaults Loaded


Dataset: car properties and continent of origin

In [37]:
cars = pd.read_csv("./data/auto.csv")

#find unique values
unique_regions = cars['origin'].unique()
print(cars['origin'].value_counts())
1    245
3     79
2     68
Name: origin, dtype: int64

dummy coding

In [38]:
#categorical columns: cylinders, year, origin

#set prefix
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)

dummy_years = pd.get_dummies(cars['year'], prefix='year')
cars = pd.concat([cars, dummy_years], axis = 1)
cars.drop(['cylinders','year'], axis=1, inplace=True)

display(cars.iloc[:3,:15])
mpg displacement horsepower weight acceleration origin cyl_3 cyl_4 cyl_5 cyl_6 cyl_8 year_70 year_71 year_72 year_73
0 18.0 307.0 130.0 3504.0 12.0 1 0 0 0 0 1 1 0 0 0
1 15.0 350.0 165.0 3693.0 11.5 1 0 0 0 0 1 1 0 0 0
2 18.0 318.0 150.0 3436.0 11.0 1 0 0 0 0 1 1 0 0 0

shuffle rows and split into train and test

In [100]:
np.random.seed(1)
shuffled_cars = cars.iloc[np.random.permutation(len(cars))]

train = shuffled_cars[0: int(0.7*len(shuffled_cars))]
test = shuffled_cars[int(0.7*len(shuffled_cars)):]

df_clean = shuffled_cars
target = 'origin'
target_df = df_clean['origin']

print("done")
done
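
For reference, sklearn's train_test_split can produce an equivalent shuffled 70/30 split in one call; a minimal sketch (an illustrative alternative, so the exact rows will differ from the numpy permutation above):

#equivalent shuffled split with sklearn; train_alt/test_alt are
#illustrative names, not used elsewhere in this notebook
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(cars, train_size=0.7, shuffle=True, random_state=1)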


1 - Logistic Regression Model: One-versus-all method

  • Logistic Regression outputs a probability (that a given row should be labeled 1). In binary classification we can set a threshold and assign 1 to probabilities above it and 0 below; the default threshold in LogisticRegression is 0.5

One-versus-all method for multiclass classification: choose one category as the positive case and group all others into the negative case

  • convert the problem into n binary classification problems
  • train 3 models:
    • cars made in America = 1, Europe and Asia = 0
    • cars made in Europe = 1, America and Asia = 0
    • cars made in Asia = 1, America and Europe = 0

Then, for each observation, choose the label (prediction) of the model that assigns the highest probability.
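
For reference, scikit-learn packages this pattern as OneVsRestClassifier; a minimal sketch against the train/test split above (the manual loop in 1.1 below is equivalent, and feature_cols here just mirrors the dummy columns it uses):

#sketch of the same one-vs-all scheme with sklearn's built-in wrapper
from sklearn.multiclass import OneVsRestClassifier

feature_cols = [c for c in train.columns if c.startswith('cyl') or c.startswith('year')]
ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
ovr.fit(train[feature_cols], train['origin'])
ovr_probs = ovr.predict_proba(test[feature_cols])   #one probability column per class
ovr_preds = ovr.predict(test[feature_cols])         #label of the highest-probability model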

1.1 - Fit model

In [6]:
unique_origins = cars["origin"].unique()
unique_origins.sort()

models = {}

#train model just with year and cylinder dummy columns
cols = train.columns
cols_to_keep = (cols.str.contains('cyl') | cols.str.contains('year'))

testing_probs = pd.DataFrame(columns=unique_origins)

#train on train set
X = train[cols[cols_to_keep]]
for element in unique_origins:
    #binary target: True where origin equals the current label, False otherwise
    y = train['origin'] == element
    models[element] = LogisticRegression(solver='lbfgs')
    models[element].fit(X, y)

print("models fitted")
models fitted
In [7]:
#predict on the test set
X_test = test[cols[cols_to_keep]]
for element in unique_origins:
    testing_probs[element] = models[element].predict_proba(X_test)[:,1]
    
display(testing_probs[:5])
1 2 3
0 0.958401 0.035453 0.018950
1 0.979050 0.013642 0.023675
2 0.277985 0.356383 0.351180
3 0.983759 0.013213 0.018183
4 0.340172 0.261673 0.377821
In [8]:
#label = column with the highest probability for each test row
predicted_origins = testing_probs.idxmax(axis=1)
#testing_probs carries a fresh 0..n-1 index, so this assignment aligns on
#index; rows of cars without a prediction become NaN and are dropped
cars['predicted_label'] = predicted_origins
cars.dropna(axis=0, inplace=True, how='any')
cars['predicted_label'] = cars['predicted_label'].astype('int')
display(cars['predicted_label'].iloc[:5])
0    1
1    1
2    2
3    1
4    3
Name: predicted_label, dtype: int64


1.2 - Error Metrics

Accuracy

In [9]:
cars = cars.rename(columns={'origin': 'actual_label'})
matches = cars['actual_label']==cars['predicted_label']
correct_predictions = cars[matches]

accuracy = len(correct_predictions)/len(cars)
print("Accuracy = {:0.3f}".format(accuracy))
Accuracy = 0.492

Recall/Sensitivity(True Positive Rate), $TPR = \frac{True\,Positives}{True\,Positives+False\,Negatives}$

In [10]:
true_positives = 0
for element in  cars["actual_label"].unique():
    my_filter = (cars["predicted_label"] == element) & (cars["actual_label"] == element)
    true_positives += len(cars[my_filter])
    
false_negatives = 0
for element in  cars["actual_label"].unique():
    my_filter = (cars["predicted_label"] != element) & (cars["actual_label"] == element)
    false_negatives += len(cars[my_filter])    
    
sensitivity = (true_positives)/(true_positives+false_negatives)
print("Sensitivity = {:0.3f}".format(sensitivity))
Sensitivity = 0.492
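
As a cross-check (not part of the original notebook): pooling TP and FN across classes is exactly micro-averaged recall, which for a single-label multiclass problem coincides with accuracy, so sklearn should reproduce the value above:

#micro-averaged recall pools TP/FN over all classes, matching the loops above
from sklearn.metrics import recall_score
print(recall_score(cars['actual_label'], cars['predicted_label'], average='micro'))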

Specificity (True Negative rate), $TNR = \frac{True\,Negatives}{False\,Positives+True\,Negatives}$

In [11]:
true_negatives = 0
for element in  cars["actual_label"].unique():
    my_filter = (cars["predicted_label"] != element) & (cars["actual_label"] != element)
    true_negatives += len(cars[my_filter])
    
false_positives = 0
for element in  cars["actual_label"].unique():
    my_filter = (cars["predicted_label"] == element) & (cars["actual_label"] != element)
    false_positives += len(cars[my_filter])  
    
specificity = true_negatives/(true_negatives+false_positives)
print("Specificity = {:0.3f}".format(specificity))
Specificity = 0.746
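
sklearn has no specificity scorer, but the same pooled counts can be read off the confusion matrix; a sketch equivalent to the loops above:

#per-class counts from the confusion matrix, then pooled TNR
cm_tmp = confusion_matrix(cars['actual_label'], cars['predicted_label'], labels=[1, 2, 3])
TP = np.diag(cm_tmp)
FP = cm_tmp.sum(axis=0) - TP          #predicted as class k, actually another class
FN = cm_tmp.sum(axis=1) - TP          #actually class k, predicted as another class
TN = cm_tmp.sum() - (TP + FP + FN)    #everything else, per class
print("Specificity = {:0.3f}".format(TN.sum()/(TN.sum()+FP.sum())))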

Matthews Correlation Coefficient, $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

In [12]:
MCC = ((true_positives*true_negatives-false_positives*false_negatives)/
       np.sqrt((true_positives+false_positives)*(true_positives+false_negatives)*
        (true_negatives+false_positives)*(true_negatives+false_negatives)))
print("MCC = {:0.3f}".format(MCC))
MCC = 0.237
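
For comparison, sklearn implements the multiclass generalization of MCC (Gorodkin's R_K statistic) directly; since that definition is not the pooled-count formula above, the two values need not match exactly:

#multiclass MCC straight from the label vectors
from sklearn.metrics import matthews_corrcoef
print("MCC = {:0.3f}".format(matthews_corrcoef(cars['actual_label'], cars['predicted_label'])))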
In [29]:
print(classification_report(cars['actual_label'],cars['predicted_label']))
cm = confusion_matrix(cars['actual_label'], cars['predicted_label'], labels=[1, 2, 3])
myML_functions.print_cm(cm, labels=['USA (1)', 'Europe (2)', ' Asia (3)'])
              precision    recall  f1-score   support

           1       0.75      0.62      0.68        86
           2       0.11      0.17      0.13        18
           3       0.10      0.14      0.12        14

   micro avg       0.49      0.49      0.49       118
   macro avg       0.32      0.31      0.31       118
weighted avg       0.57      0.49      0.53       118

    true / pred    USA (1) Europe (2)   Asia (3) 
       USA (1)       53.0       19.0       14.0 
    Europe (2)       11.0        3.0        4.0 
      Asia (3)        7.0        5.0        2.0 


2 - Feature Selection with RFECV

In [49]:
model = LogisticRegression(solver = 'lbfgs', class_weight='balanced', max_iter=3000, multi_class='ovr')
optimized_columns_LR = myML_functions.select_features_RFECV(df_clean, target, model)

model = RandomForestClassifier(n_estimators=50, random_state=1, min_samples_leaf=5, class_weight='balanced')
optimized_columns_RFC = myML_functions.select_features_RFECV(df_clean, target, model)

model = GradientBoostingClassifier(learning_rate=0.01, n_estimators=50,subsample=0.6,random_state=42)  
optimized_columns_GBC = myML_functions.select_features_RFECV(df_clean, target, model)
Best Columns, LogisticRegression model: ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration', 'cyl_3', 'cyl_4', 'cyl_5', 'cyl_6', 'cyl_8', 'year_70', 'year_71', 'year_72', 'year_73', 'year_74', 'year_75', 'year_76', 'year_77', 'year_78', 'year_79', 'year_80', 'year_81', 'year_82']

----------------------------------------------------

Best Columns, RandomForestClassifier model: ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration', 'cyl_4']

----------------------------------------------------

Best Columns, GradientBoostingClassifier model: ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration', 'cyl_3', 'cyl_4', 'cyl_5', 'cyl_6', 'year_82']

----------------------------------------------------
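
For reference, select_features_RFECV presumably wraps something like the bare RFECV call below; the cv and scoring settings here are assumptions, not the helper's actual configuration:

#bare-bones recursive feature elimination with cross-validation
estimator = LogisticRegression(solver='lbfgs', max_iter=3000, multi_class='ovr')
features = df_clean.drop(columns=[target])
selector = RFECV(estimator, step=1, cv=5, scoring='accuracy')
selector.fit(features, df_clean[target])
print(list(features.columns[selector.support_]))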


3 - Model Selection with GridSearchCV

In [151]:
def select_model(df, target, models_to_fit):                              
                              
    dicts= [ {
               "name": "LogisticRegression",
               "estimator": LogisticRegression(max_iter = 100000, multi_class='auto'),
               "hyperparameters": 
                 {                
                   "solver": ["lbfgs", "liblinear"],  
                   "class_weight": ["balanced", ""]                      
                 }
             },             
             {
               "name": "RandomForestClassifier",
               "estimator": RandomForestClassifier(),
               "hyperparameters": 
                 {
                   "n_estimators": [5, 20, 100],
                   "criterion": ["entropy", "gini"],
                   "max_depth": [2, 5, 10],
                   "max_features": ["log2", "sqrt"],
                   "min_samples_leaf": [1, 5, 8],
                   "min_samples_split": [2, 3, 5], 
                   "class_weight": [None, "balanced"]               
                 }
             },
             {
               "name": "GradientBoostingClassifier",
               "estimator": GradientBoostingClassifier(),
               "hyperparameters": 
                 {
                   "n_estimators": [5, 2, 10],  
                   "max_features": ["auto", "log2", "sqrt"],
                   "learning_rate":[0.01, 0.05, 0.1, 0.5],
                   "subsample":[0.1, 0.5, 1.0],  
                   "random_state":[1]     
                 }
             }]    
    
    scoring = {'Accuracy':'accuracy', 
               'precision': make_scorer(precision_score, average='weighted'),
               'recall': make_scorer(recall_score, average='weighted'),
               'f1': make_scorer(f1_score, average='weighted')}
    
    all_y = df[target]
    
    for key, models_list in models_to_fit.items():        
        print(key)
        print('-'*len(key))
        start = time.time()
        for element in dicts:
            if models_list[0] == element['name']:                
                all_X = df[models_list[1]]              
                model = element['estimator']
                grid = GridSearchCV(model, element['hyperparameters'], cv=10, scoring=scoring, 
                                    refit='f1', iid=True, n_jobs=1)
                grid.fit(all_X, all_y)
        
                element['best_params'] = grid.best_params_
                element['best_score'] = grid.best_score_
                element['best_estimator'] = grid.best_estimator_          
                for scorer in scoring:          
                    print(f"{scorer}: {max(grid.cv_results_['mean_test_'+scorer]):0.3f}")
                print("Best Parameters: {}".format(grid.best_params_))
                print("Best Score: {:0.3f}\n".format(grid.best_score_))
        
        print(f"Time elapsed: {(time.time()-start)/60.:0.2f} mins\n\n")
       
    return dicts
In [152]:
import time
import warnings

from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
models_to_fit = {'LogisticRegression': ['LogisticRegression', optimized_columns_LR],                    
                 'RandomForestClassifier': ['RandomForestClassifier', optimized_columns_RFC],
                 'GradientBoostingClassifier': ['GradientBoostingClassifier', optimized_columns_GBC]}

model_dicts = select_model(df_clean, target, models_to_fit)

print("model selection finished")
LogisticRegression
------------------
Accuracy: 0.765
precision: 0.800
recall: 0.765
f1: 0.769
Best Parameters: {'class_weight': 'balanced', 'solver': 'liblinear'}
Best Score: 0.769

Time elapsed: 0.23 mins


RandomForestClassifier
----------------------
Accuracy: 0.890
precision: 0.900
recall: 0.890
f1: 0.887
Best Parameters: {'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Score: 0.887

Time elapsed: 13.87 mins


GradientBoostingClassifier
--------------------------
Accuracy: 0.852
precision: 0.863
recall: 0.852
f1: 0.849
Best Parameters: {'learning_rate': 0.5, 'max_features': 'auto', 'n_estimators': 10, 'random_state': 1, 'subsample': 1.0}
Best Score: 0.849

Time elapsed: 0.63 mins


model selection finished


4 - Cross-Validated Predictions for the Best Model (RandomForestClassifier)

In [153]:
kf = KFold(10, shuffle=True, random_state=1)

#best hyperparameters found by the grid search in section 3 (n_jobs added for parallel fitting)
best_model = {'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 
              'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100, 'n_jobs': -1}
model = RandomForestClassifier()
model.set_params(**best_model)

predictions = cross_val_predict(model, df_clean[optimized_columns_RFC], target_df, cv=kf)
predictions = pd.Series(predictions)

print(classification_report(target_df, predictions))
cm = confusion_matrix(target_df, predictions, labels=[1, 2, 3])
myML_functions.print_cm(cm, labels=['USA (1)', 'Europe (2)', ' Asia (3)'])
              precision    recall  f1-score   support

           1       0.92      0.93      0.92       245
           2       0.79      0.68      0.73        68
           3       0.80      0.87      0.84        79

   micro avg       0.88      0.88      0.88       392
   macro avg       0.84      0.83      0.83       392
weighted avg       0.87      0.88      0.87       392

    true / pred    USA (1) Europe (2)   Asia (3) 
       USA (1)      228.0       11.0        6.0 
    Europe (2)       11.0       46.0       11.0 
      Asia (3)        9.0        1.0       69.0