Predicting High Risk Credit Card Customers using Linear Support Vector Machine Model - Paolo G. Hilado

Situationer

This case presents a model that financial institutions can use to predict high-risk credit card customers from lifestyle variables. A linear support vector machine (SVM) classifies each applicant as either high risk or not.

Load Data

#load kernlab package
library(kernlab)
#Open file
data <- read.table("credit_card_data.txt")

Check Data Structure

#Check the structure of the data
str(data)
'data.frame':   654 obs. of  11 variables:
 $ V1 : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 2 2 ...
 $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
 $ V3 : num  0 4.46 0.5 1.54 5.62 ...
 $ V4 : num  1.25 3.04 1.5 3.75 1.71 ...
 $ V5 : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ V6 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 2 ...
 $ V7 : int  1 6 0 5 0 0 0 0 0 0 ...
 $ V8 : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 1 2 2 1 ...
 $ V9 : int  202 43 280 100 120 360 164 80 180 52 ...
 $ V10: int  0 560 824 3 0 0 31285 1349 314 1442 ...
 $ V11: Factor w/ 2 levels "Passed","High-Risk": 2 2 2 2 2 2 2 2 2 2 ...
#Check the first and last 6 rows
head(data); tail(data)
   V1    V2    V3   V4  V5  V6 V7  V8  V9 V10       V11
1 Yes 30.83 0.000 1.25 Yes  No  1 Yes 202   0 High-Risk
2  No 58.67 4.460 3.04 Yes  No  6 Yes  43 560 High-Risk
3  No 24.50 0.500 1.50 Yes Yes  0 Yes 280 824 High-Risk
4 Yes 27.83 1.540 3.75 Yes  No  5  No 100   3 High-Risk
5 Yes 20.17 5.625 1.71 Yes Yes  0 Yes 120   0 High-Risk
6 Yes 32.08 4.000 2.50 Yes Yes  0  No 360   0 High-Risk
     V1    V2     V3   V4 V5  V6 V7  V8  V9 V10    V11
649 Yes 40.58  3.290 3.50 No Yes  0  No 400   0 Passed
650 Yes 21.08 10.085 1.25 No Yes  0 Yes 260   0 Passed
651  No 22.67  0.750 2.00 No  No  2  No 200 394 Passed
652  No 25.25 13.500 2.00 No  No  1  No 200   1 Passed
653 Yes 17.92  0.205 0.04 No Yes  0 Yes 280 750 Passed
654 Yes 35.00  3.375 8.29 No Yes  0  No   0   0 Passed

Preprocess continuous variables via normalization

#preProcess() comes from the caret package
library(caret)
nvar <- preProcess(data[,-11], method = c("center", "scale"))
data[,-11] <- predict(nvar, data[,-11])
#Factor variables are ignored by this step; only the numeric columns are transformed
nvar
Created from 654 samples and 10 variables

Pre-processing:
  - centered (6)
  - ignored (4)
  - scaled (6)
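
Under the hood, centering and scaling is just (x - mean)/sd applied per numeric column. A quick base-R check on a toy vector (not the credit data) confirms this is the transformation being applied:

```r
# Toy check (base R only): center/scale is (x - mean(x)) / sd(x) per column,
# which is also what scale() computes by default
x <- c(30.8, 58.7, 24.5, 27.8, 20.2)
manual <- (x - mean(x)) / sd(x)
auto <- as.numeric(scale(x))
all.equal(manual, auto)   # TRUE
```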

Data Partition: 70% Training and 30% Testing

library(caret)
#Seed before the random partition so the split itself is reproducible
set.seed(2)
inTrain <- createDataPartition(data$V11, p = .70, list = FALSE)
train <- data[inTrain,]
test <- data[-inTrain,]
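
createDataPartition samples within each class, so the 70/30 split preserves the Passed / High-Risk ratio. A base-R sketch of that stratification idea, using hypothetical class counts (the real counts are not shown above):

```r
# Base-R sketch of a stratified 70% split (class counts are hypothetical)
y <- factor(rep(c("Passed", "High-Risk"), c(360, 294)))
set.seed(2)
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, round(0.7 * length(i)))))
round(prop.table(table(y)), 2)        # overall class balance
round(prop.table(table(y[idx])), 2)   # training balance is nearly identical
```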

Applying a Scaled Linear Support Vector Machine

#Apply ksvm model with scale
set.seed(789); Smod <- ksvm(V11~., data = train, type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
 Setting default kernel parameters  
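
The cost parameter C = 100 penalizes margin violations heavily, and it is worth checking that results are not sensitive to this choice. A minimal sketch on synthetic two-class data (not the credit data), assuming kernlab is installed:

```r
library(kernlab)
# Sketch: train the same linear SVM at several C values on synthetic data
# and compare training error; a scan over orders of magnitude is typical
set.seed(42)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 2), ncol = 2))
y <- factor(rep(c("Passed", "High-Risk"), each = 30))
for (C in c(0.01, 1, 100)) {
  m <- ksvm(X, y, type = "C-svc", kernel = "vanilladot", C = C)
  cat("C =", C, " training error =", m@error, "\n")
}
```

On the credit data, the same loop over `train` (with held-out accuracy on `test` instead of training error) would show whether C = 100 is actually needed.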

Checking the Coefficients

#This call lists the model's attributes, including the coefficients. The output
#is very long (about 13 pages), so it is not shown here; feel free to run it on
#your own machine.
attributes(Smod)

Recovering the Linear Decision Function a*scaled(x) + a0

#The coefficient vector a applied to the scaled predictors is:
scaled_a <- colSums(Smod@xmatrix[[1]] * Smod@coef[[1]])
#The intercept a0 is the negative of the model's offset b:
scaled_a0 <- -Smod@b
#Inspect the coefficients; most are near zero, so V5 dominates the decision
scaled_a
         V1No         V1Yes            V2            V3            V4 
 1.111791e-04 -1.111791e-04  2.359807e-05 -9.056175e-06 -3.390583e-05 
        V5Yes         V6Yes            V7         V8Yes            V9 
 2.000095e+00  4.860085e-05  3.953579e-05  1.081370e-04 -1.769631e-05 
          V10 
 3.338148e-04 
scaled_a0
[1] -1.000053
#Obtain predicted values from the fitted model (prediction is deterministic,
#so no seed is needed)
pred <- predict(Smod, test[,1:10])
#Predicted classes for the test set:
pred
  [1] High-Risk High-Risk High-Risk High-Risk Passed    High-Risk High-Risk
  [8] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [15] High-Risk High-Risk High-Risk Passed    High-Risk High-Risk Passed   
 [22] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [29] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [36] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [43] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [50] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [57] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [64] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
 [71] High-Risk Passed    Passed    Passed    Passed    Passed    Passed   
 [78] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
 [85] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
 [92] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
 [99] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[106] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[113] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[120] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[127] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[134] Passed    High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[141] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[148] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[155] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[162] Passed    High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[169] High-Risk High-Risk High-Risk Passed    Passed    Passed    Passed   
[176] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[183] Passed    Passed    Passed    Passed    Passed    Passed    Passed   
[190] Passed    Passed    Passed    Passed    Passed    Passed   
Levels: Passed High-Risk

Check Model Accuracy

#see what fraction of the model's predictions match the
#actual classification (Check model accuracy)
acc <- sum(pred == test[,11])/nrow(test)
round(acc*100, 2)
[1] 84.62
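
Accuracy alone does not distinguish the two error types, and for a lender a false "Passed" (approving a high-risk applicant) is the costly mistake. A small base-R illustration on toy vectors (not the actual test predictions):

```r
# Toy example: a confusion table separates the two error types that a single
# accuracy number hides
pred_toy   <- factor(c("Passed", "Passed", "High-Risk", "High-Risk", "Passed"))
actual_toy <- factor(c("Passed", "High-Risk", "High-Risk", "High-Risk", "Passed"))
table(Predicted = pred_toy, Actual = actual_toy)
sum(pred_toy == actual_toy) / length(actual_toy)   # accuracy = 0.8
```

On the real test set the same table comes from `table(Predicted = pred, Actual = test[,11])`, or from caret's `confusionMatrix()`, which also reports sensitivity and specificity.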

Our linear SVM reaches about 84.6% accuracy on the held-out test set, a reasonable result for this classification task.

Paolo G. Hilado

August 31, 2019