Predicting High Risk Credit Card Customers using Linear Support Vector Machine Model - Paolo G. Hilado
Situationer
This case presents a model that financial institutions can use to predict high-risk credit card customers from lifestyle variables. A linear support vector machine classifies each applicant as high risk or not.
Load Data
#load kernlab package
library(kernlab)
#Open file
data <- read.table("credit_card_data.txt")
Check Data Structure
#Check the structure of the data
str(data)
'data.frame': 654 obs. of 11 variables:
$ V1 : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 2 2 ...
$ V2 : num 30.8 58.7 24.5 27.8 20.2 ...
$ V3 : num 0 4.46 0.5 1.54 5.62 ...
$ V4 : num 1.25 3.04 1.5 3.75 1.71 ...
$ V5 : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
$ V6 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 2 ...
$ V7 : int 1 6 0 5 0 0 0 0 0 0 ...
$ V8 : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 1 2 2 1 ...
$ V9 : int 202 43 280 100 120 360 164 80 180 52 ...
$ V10: int 0 560 824 3 0 0 31285 1349 314 1442 ...
$ V11: Factor w/ 2 levels "Passed","High-Risk": 2 2 2 2 2 2 2 2 2 2 ...
#Check the first and last 6 rows
head(data); tail(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Yes 30.83 0.000 1.25 Yes No 1 Yes 202 0 High-Risk
2 No 58.67 4.460 3.04 Yes No 6 Yes 43 560 High-Risk
3 No 24.50 0.500 1.50 Yes Yes 0 Yes 280 824 High-Risk
4 Yes 27.83 1.540 3.75 Yes No 5 No 100 3 High-Risk
5 Yes 20.17 5.625 1.71 Yes Yes 0 Yes 120 0 High-Risk
6 Yes 32.08 4.000 2.50 Yes Yes 0 No 360 0 High-Risk
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
649 Yes 40.58 3.290 3.50 No Yes 0 No 400 0 Passed
650 Yes 21.08 10.085 1.25 No Yes 0 Yes 260 0 Passed
651 No 22.67 0.750 2.00 No No 2 No 200 394 Passed
652 No 25.25 13.500 2.00 No No 1 No 200 1 Passed
653 Yes 17.92 0.205 0.04 No Yes 0 Yes 280 750 Passed
654 Yes 35.00 3.375 8.29 No Yes 0 No 0 0 Passed
# The response variable (V11) was relabeled Passed / High-Risk (code not shown)
Preprocess Continuous Variables via Centering and Scaling
#preProcess() comes from the caret package; load it before use
library(caret)
nvar <- preProcess(data[,-11], method = c("center", "scale"))
data[,-11] <- predict(nvar, data[,-11])
# Factor variables are automatically ignored in this step
nvar
Created from 654 samples and 10 variables
Pre-processing:
- centered (6)
- ignored (4)
- scaled (6)
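The output above confirms that the six numeric columns were centered and scaled while the four factor columns were ignored. As a minimal illustration of what that transform does to a numeric column, base R's scale() applies the same centering and scaling as preProcess:

```r
# Illustration of centering and scaling on a toy numeric vector
x <- c(10, 20, 30, 40)
z <- as.numeric(scale(x))   # (x - mean(x)) / sd(x)
round(mean(z), 10)          # mean becomes 0
sd(z)                       # standard deviation becomes 1
```

After this transform every numeric predictor contributes on a comparable scale, which matters for distance-based methods such as SVMs.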
Data Partition; 70% Training and 30% Testing
library(caret)
#Set the seed before createDataPartition() so the random split is reproducible
set.seed(2)
inTrain <- createDataPartition(data$V11, p = .70, list = FALSE)
train <- data[inTrain,]
test <- data[-inTrain,]
Applying a Scaled Linear Support Vector Machine
#Fit a linear (vanilladot) C-classification SVM; scaled = TRUE lets
#ksvm rescale the predictors internally as well
set.seed(789); Smod <- ksvm(V11~., data = train, type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
Setting default kernel parameters
Checking the Coefficients
#attributes(Smod) lists every slot of the fitted model, including the
#coefficients. The output is very long (about 13 pages), so it is not
#echoed here; feel free to run it on your own machine.
attributes(Smod)
Recovering the Linear Decision Function a*scaled(x) + a0
#The coefficient vector a of the linear decision function is:
scaled_a <- colSums(Smod@xmatrix[[1]] * Smod@coef[[1]])
#The constant a0 is the negative of the model intercept b:
scaled_a0 <- -Smod@b
#Inspect the fitted coefficients of the equation
scaled_a
V1No V1Yes V2 V3 V4
1.111791e-04 -1.111791e-04 2.359807e-05 -9.056175e-06 -3.390583e-05
V5Yes V6Yes V7 V8Yes V9
2.000095e+00 4.860085e-05 3.953579e-05 1.081370e-04 -1.769631e-05
V10
3.338148e-04
scaled_a0
[1] -1.000053
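The recovered equation can be applied to an observation by hand. A minimal sketch with hypothetical coefficient values (stand-ins, not the scaled_a and scaled_a0 printed above), assuming a positive decision value maps to "High-Risk":

```r
# Hypothetical coefficients and intercept (NOT the fitted values above)
a  <- c(0.5, -0.25, 2.0)   # stands in for scaled_a
a0 <- -1.0                 # stands in for scaled_a0
x  <- c(0.2, -1.1, 0.8)    # one already-scaled observation
score <- sum(a * x) + a0   # the linear decision value a*scaled(x) + a0
# Which side of the hyperplane maps to which label depends on the
# factor coding; here we assume score > 0 means "High-Risk"
label <- if (score > 0) "High-Risk" else "Passed"
score  # 0.975
label  # "High-Risk"
```

This is exactly the computation ksvm performs internally for a linear kernel; predict() below automates it for the whole test set.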
#Obtain predicted classes for the test set from the fitted model
#(prediction is deterministic, so no seed is needed here)
pred <- predict(Smod, test[,1:10])
#Predicted classes for the test observations:
pred
[1] High-Risk High-Risk High-Risk High-Risk Passed High-Risk High-Risk
[8] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[15] High-Risk High-Risk High-Risk Passed High-Risk High-Risk Passed
[22] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[29] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[36] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[43] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[50] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[57] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[64] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[71] High-Risk Passed Passed Passed Passed Passed Passed
[78] Passed Passed Passed Passed Passed Passed Passed
[85] Passed Passed Passed Passed Passed Passed Passed
[92] Passed Passed Passed Passed Passed Passed Passed
[99] Passed Passed Passed Passed Passed Passed Passed
[106] Passed Passed Passed Passed Passed Passed Passed
[113] Passed Passed Passed Passed Passed Passed Passed
[120] Passed Passed Passed Passed Passed Passed Passed
[127] Passed Passed Passed Passed Passed Passed Passed
[134] Passed High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[141] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[148] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[155] High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[162] Passed High-Risk High-Risk High-Risk High-Risk High-Risk High-Risk
[169] High-Risk High-Risk High-Risk Passed Passed Passed Passed
[176] Passed Passed Passed Passed Passed Passed Passed
[183] Passed Passed Passed Passed Passed Passed Passed
[190] Passed Passed Passed Passed Passed Passed
Levels: Passed High-Risk
Check Model Accuracy
#Compute the fraction of predictions that match the
#actual classifications (model accuracy)
acc <- sum(pred == test[,11])/nrow(test)
round(acc*100, 2)
[1] 84.62
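Overall accuracy hides which class drives the errors; a confusion matrix built with table() makes that visible, which matters here because misclassifying a high-risk applicant as Passed is costlier than the reverse. A sketch with made-up labels (illustration only, not the model's output above):

```r
# Hypothetical predicted vs. actual labels (illustration only)
actual <- factor(c("Passed", "Passed", "High-Risk", "High-Risk", "High-Risk"))
pred   <- factor(c("Passed", "High-Risk", "High-Risk", "High-Risk", "Passed"))
table(Predicted = pred, Actual = actual)   # rows = predictions, columns = truth
acc <- sum(pred == actual) / length(actual)
acc  # 0.6
```

Running table(Predicted = pred, Actual = test[,11]) on the real test set would break the 84.62% accuracy down into the two error types.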