Predicting High Risk Credit Card Customers using Linear Support Vector Machine Model - Paolo G. Hilado

Situationer

This case presents a model that financial institutions can use to predict high-risk credit card customers from lifestyle variables. It applies a linear support vector machine (SVM) model to classify whether an applicant is considered a high-risk credit card customer or not.

Load Data

#load kernlab package
library(kernlab)
#Read the credit card data file
data <- read.table("credit_card_data.txt")
# Check data frame structure
str(data)
'data.frame':   654 obs. of  11 variables:
 $ V1 : int  1 0 0 1 1 1 1 0 1 1 ...
 $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
 $ V3 : num  0 4.46 0.5 1.54 5.62 ...
 $ V4 : num  1.25 3.04 1.5 3.75 1.71 ...
 $ V5 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V6 : int  0 0 1 0 1 1 1 1 1 1 ...
 $ V7 : int  1 6 0 5 0 0 0 0 0 0 ...
 $ V8 : int  1 1 1 0 1 0 0 1 1 0 ...
 $ V9 : int  202 43 280 100 120 360 164 80 180 52 ...
 $ V10: int  0 560 824 3 0 0 31285 1349 314 1442 ...
 $ V11: int  1 1 1 1 1 1 1 1 1 1 ...
# Convert the categorical columns to factors
data[,c(1,5,6,7,8,11)] <- lapply(data[,c(1,5,6,7,8,11)], factor)
str(data)
'data.frame':   654 obs. of  11 variables:
 $ V1 : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 1 2 2 ...
 $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
 $ V3 : num  0 4.46 0.5 1.54 5.62 ...
 $ V4 : num  1.25 3.04 1.5 3.75 1.71 ...
 $ V5 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ V6 : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 2 2 2 2 ...
 $ V7 : Factor w/ 23 levels "0","1","2","3",..: 2 7 1 6 1 1 1 1 1 1 ...
 $ V8 : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 2 2 1 ...
 $ V9 : int  202 43 280 100 120 360 164 80 180 52 ...
 $ V10: int  0 560 824 3 0 0 31285 1349 314 1442 ...
 $ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
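
A quick optional check (not part of the original steps) confirms the conversion and shows how wide each factor is; V7's 23 levels will expand into many dummy columns when the SVM builds its model matrix:

#Optional check: number of levels in each converted factor column
sapply(data[,c(1,5,6,7,8,11)], nlevels)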

Preprocess continuous variables via standardization (centering and scaling)

library(caret)
nvar <- preProcess(data[,-11], method = c("center", "scale"))
# Factor variables are ignored by this step; only the 5 numeric columns are transformed
nvar
Created from 654 samples and 10 variables

Pre-processing:
  - centered (5)
  - ignored (5)
  - scaled (5)
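
Note that preProcess only estimates the transformation; applying it requires a predict() call. The analysis below leaves data as-is because ksvm is later called with scaled = TRUE and standardizes the numeric predictors internally, but applying the transformation explicitly would look like this (a sketch; data_std is a hypothetical name):

#Sketch: apply the centering/scaling estimated above (not used further here,
#since ksvm(scaled = TRUE) rescales the numeric predictors itself)
data_std <- predict(nvar, data[,-11])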

Data Partition: 70% Training and 30% Testing

#Seed before the random split so the partition is reproducible; the
#subsetting itself is deterministic and needs no further seeds
set.seed(2); inTrain <- createDataPartition(data$V11, p = .70, list = FALSE)
train <- data[inTrain,]
test <- data[-inTrain,]
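
Since createDataPartition samples within the levels of V11, both splits should preserve the class ratio; a quick check (not in the original write-up):

#Confirm the stratified split kept the class proportions of V11
prop.table(table(train$V11))
prop.table(table(test$V11))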

Applying a Scaled Linear Support Vector Machine

#Fit a linear (vanilladot) C-classification SVM with C = 100; scaled = TRUE
#lets ksvm standardize the numeric predictors internally
set.seed(789); Smod <- ksvm(V11~., data = train, type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
 Setting default kernel parameters  
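
The choice of C = 100 is taken as given here. If one wanted to sanity-check it, ksvm accepts a cross argument that computes a k-fold cross-validation error, retrievable with kernlab's cross() accessor; a sketch comparing a few candidate values:

#Sketch: compare candidate C values by 5-fold cross-validation error (lower is better)
for (C_try in c(0.01, 1, 100, 10000)) {
  set.seed(789)
  m <- ksvm(V11~., data = train, type = "C-svc", kernel = "vanilladot",
            C = C_try, scaled = TRUE, cross = 5)
  cat("C =", C_try, "cross-validation error =", cross(m), "\n")
}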

Checking the Coefficients

#attributes() lists every slot of the fitted ksvm object, including the
#coefficients. The output is suppressed here because it runs to about 13
#pages; feel free to run the code on your own machine at any time.
attributes(Smod)
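
Individual slots can also be pulled out directly instead of dumping everything; for instance, using standard kernlab accessors:

#Inspect selected pieces of the fitted model instead of the full dump
nSV(Smod)    #number of support vectors
error(Smod)  #training error
Smod@b       #intercept term b (used below to recover the equation)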

Recover the linear kernel's decision equation a*scaled(x) + a0

#The coefficient vector a (weights on the scaled predictors) is:
scaled_a <- colSums(Smod@xmatrix[[1]] * Smod@coef[[1]])
#The constant a0 is the negative of the model's intercept b:
scaled_a0 <- -Smod@b
#The separating hyperplane is a*scaled(x) + a0 = 0. In the output below, note
#that names such as V10 and V11 are dummy columns for the two levels of factor
#V1, while V10.1 is the original numeric variable V10.
scaled_a
         V10          V11           V2           V3           V4          V51 
 0.002379988 -0.002379988 -0.007717582 -0.004641372  0.013255107  2.011036840 
         V61          V71          V72          V73          V74          V75 
 0.010451678  0.027300780  0.014492811  0.022311870  0.032874312  0.021441293 
         V76          V77          V78          V79         V710         V711 
 0.025629810  0.007092351  0.032509436  0.026753560  0.010881315  0.033839039 
        V712         V713         V714         V715         V716         V717 
 0.022215982  0.000000000  0.000000000  0.000000000  0.000000000  0.005494827 
        V719         V720         V723         V740         V767          V81 
 0.000000000 -0.319768354  0.000000000  0.000000000  0.026479287 -0.003445462 
          V9        V10.1 
-0.001186389  0.126561046 
scaled_a0
[1] -1.008793
#Obtain predicted classes from the fitted model (predict() is deterministic,
#so no seed is needed here)
pred <- predict(Smod, test[,1:10])
#Predicted class for every test case:
pred
  [1] 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
 [75] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[186] 0 0 0 0 0 0 0 0 0 0
Levels: 0 1
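
As an optional consistency check (not part of the original analysis), predict can also return raw decision values with type = "decision"; the sign of a*scaled(x) + a0 for each test case determines which side of the separating hyperplane it falls on, and hence its predicted class:

#Decision values f(x) = a*scaled(x) + a0 for the test cases; sign(f) gives the class side
dec <- predict(Smod, test[,1:10], type = "decision")
head(dec)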

Check Model Accuracy

#Check model accuracy: the fraction of the model's predictions that
#match the actual classification in the test set
acc <- sum(pred == test[,11])/nrow(test)
round(acc*100, 2)
[1] 87.69
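
Accuracy alone can hide asymmetric errors, and approving a high-risk applicant is typically costlier than declining a safe one, so a confusion matrix gives a fuller picture; caret (already loaded) provides one:

#Cross-tabulate predicted vs. actual classes for per-class error rates,
#sensitivity, and specificity
confusionMatrix(pred, test$V11)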

We see that our linear SVM classifies about 88% of the held-out test cases correctly, which is solid performance for this task. Time to save our model for future use.

#Save the fitted model to disk for later reuse
saveRDS(Smod, "HiRiskMod.rds")
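
To reuse the saved model later, it can be reloaded with readRDS and applied to fresh applicants (a minimal sketch; new_applicants is a hypothetical data frame with the same ten predictor columns and factor codings as the training data):

#Reload the fitted model in a fresh session and score new applicants
library(kernlab)
Smod2 <- readRDS("HiRiskMod.rds")
#predict(Smod2, new_applicants)  #returns factor "0"/"1" risk classifications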