Predicting House Prices using Gradient Boosting Machine Learning - Paolo G. Hilado

Situationer

This case study shows how a machine learning model such as gradient boosting can be applied in real estate by predicting house prices from variables such as lot size, presence of a garage, number of stories, and number of bathrooms.

Load Data

library(xlsx)
library(caret)
# Read the housing data from the first sheet of the workbook
# (sheetName takes the sheet's name; sheetIndex would take a number)
data <- read.xlsx("Housing Data.xlsx", sheetName = "Sheet1")
#Check the Structure of the Data
str(data)
'data.frame':   546 obs. of  12 variables:
 $ price   : num  42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
 $ lotsize : num  5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
 $ bedrooms: num  3 2 3 3 2 3 3 3 3 3 ...
 $ bathrms : num  1 1 1 1 1 1 2 1 1 2 ...
 $ stories : num  2 1 1 2 1 1 2 3 1 4 ...
 $ driveway: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ recroom : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
 $ fullbase: Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
 $ gashw   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ airco   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
 $ garagepl: num  1 0 0 0 0 0 2 0 0 1 ...
 $ prefarea: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#Check the first and last 6 rows
head(data); tail(data)
  price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
1 42000    5850        3       1       2      yes      no      yes    no
2 38500    4000        2       1       1      yes      no       no    no
3 49500    3060        3       1       1      yes      no       no    no
4 60500    6650        3       1       2      yes     yes       no    no
5 61000    6360        2       1       1      yes      no       no    no
6 66000    4160        3       1       1      yes     yes      yes    no
  airco garagepl prefarea
1    no        1       no
2    no        0       no
3    no        0       no
4    no        0       no
5    no        0       no
6   yes        0       no
     price lotsize bedrooms bathrms stories driveway recroom fullbase
541  85000    6525        3       2       4      yes      no       no
542  91500    4800        3       2       4      yes     yes       no
543  94000    6000        3       2       4      yes      no       no
544 103000    6000        3       2       4      yes     yes       no
545 105000    6000        3       2       2      yes     yes       no
546 105000    6000        3       1       2      yes      no       no
    gashw airco garagepl prefarea
541    no    no        1       no
542    no   yes        0       no
543    no   yes        0       no
544    no   yes        1       no
545    no   yes        1       no
546    no   yes        1       no

Check for Multicollinearity

library(car)
mod1 <- lm(price~., data = data)
vif(mod1)
 lotsize bedrooms  bathrms  stories driveway  recroom fullbase    gashw 
1.321632 1.365633 1.282494 1.478584 1.163091 1.210501 1.316543 1.038246 
   airco garagepl prefarea 
1.201397 1.200839 1.147639 
## All variance inflation factors are close to 1, well below the common cutoff of 5, so multicollinearity is not a concern here
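
As a quick programmatic check (a small sketch, not part of the original run; the cutoff of 5 is a common rule of thumb):

# Flag any predictor whose VIF exceeds the rule-of-thumb cutoff of 5
vifs <- vif(mod1)
vifs[vifs > 5]  # returns an empty vector here, i.e. nothing to flag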

Preprocess continuous variables via centering and scaling

nvar <- preProcess(data[,-1], method = c("center", "scale"))
data[,-1] <- predict(nvar, data[,-1])
# caret skips the factor variables; only the 5 numeric predictors are centered and scaled
nvar
Created from 546 samples and 11 variables

Pre-processing:
  - centered (5)
  - ignored (6)
  - scaled (5)
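
As a sanity check (a minimal sketch, not in the original output), the transformed numeric predictors should now have mean 0 and standard deviation 1:

# Verify centering and scaling on the numeric predictors only
num <- sapply(data[,-1], is.numeric)
round(colMeans(data[,-1][, num]), 3)      # all ~0 after centering
round(apply(data[,-1][, num], 2, sd), 3)  # all ~1 after scaling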

Data Partition; 70% Training and 30% Testing

set.seed(2)  # seed before partitioning so the split is reproducible
inTrain <- createDataPartition(y = data$price, p = .70, list = FALSE)
train <- data[inTrain, ]
test <- data[-inTrain, ]
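
A quick check (not shown in the original output) confirms the split covers roughly 70% and 30% of the 546 rows:

# Row counts of the training and testing sets
nrow(train); nrow(test)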

Setting up 5-fold cross-validation

Ctrl <- trainControl(method = "cv", number = 5)
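
A possible variant (not used in this analysis): repeating the 5 folds several times gives a lower-variance estimate of out-of-sample error. The object name CtrlRep below is illustrative:

# Optional: 5-fold CV repeated 3 times for a steadier error estimate
CtrlRep <- trainControl(method = "repeatedcv", number = 5, repeats = 3)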

Creating the Gradient Boosting Model

set.seed(3)  # seed so the cross-validation fold assignment is reproducible
modGBM <- train(price~., data = train, method = "gbm", trControl = Ctrl, verbose = FALSE)
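
By default, caret searches a small internal grid over the gbm hyperparameters. If explicit control is preferred, a custom grid can be passed via tuneGrid (a sketch; the values below are illustrative, not the ones selected in this run):

# Illustrative grid over the four gbm tuning parameters
gbmGrid <- expand.grid(n.trees = c(50, 100, 150),
                       interaction.depth = 1:3,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
# modGBM2 <- train(price~., data = train, method = "gbm",
#                  trControl = Ctrl, tuneGrid = gbmGrid, verbose = FALSE)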

Checking out Influential Variables

summary(modGBM$finalModel)

                    var   rel.inf
lotsize         lotsize 54.326086
bathrms         bathrms 16.649759
aircoyes       aircoyes  9.671063
stories         stories  6.134566
prefareayes prefareayes  2.887274
garagepl       garagepl  2.752367
recroomyes   recroomyes  1.708425
fullbaseyes fullbaseyes  1.666379
drivewayyes drivewayyes  1.546283
gashwyes       gashwyes  1.461250
bedrooms       bedrooms  1.196548
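
caret provides an equivalent, model-agnostic view of the same ranking (a sketch, not in the original output):

# Importances rescaled to 0-100 by caret
varImp(modGBM)
# plot(varImp(modGBM))  # optional dotchart of the ranking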

Sample of the model's predictions on the test set

predGBM <- predict(modGBM, test)
predGBM
  [1]  60805.56  64731.47  34731.97  35356.82  45755.08  42220.44  65347.27
  [8]  43677.72  46526.51  64916.61  47700.05  64923.79  53217.18  70054.73
 [15]  35308.83  36100.53  62826.29  73295.09  68450.53  61232.40  74066.37
 [22]  46102.45  76860.34  76305.49  65708.21  50627.31  51944.65  85756.93
 [29]  54581.47  49549.65  50646.34  70878.44  42902.88  62505.03  62237.89
 [36]  84370.42  50589.78  67742.32  52990.03  42331.28  58361.99  40191.52
 [43]  50656.50  53377.02  38330.76  46726.91  58771.75  51865.83  61901.42
 [50]  37091.88  51434.14  33256.05  54450.64  42689.66 102532.41  40211.58
 [57]  54859.85  48470.98  78387.16  43677.72  54919.86  46589.11  54103.71
 [64]  66712.01  51396.61  52493.31  60741.53  54760.02  60868.80  55810.21
 [71]  50619.01  42689.66  57760.09  78131.05  50837.96  60661.46  61232.40
 [78]  45110.17  58343.44  51963.87  85821.21  56045.71  68769.43  73458.27
 [85]  90660.61  72116.87  84539.71  75553.07 106527.82 109901.34  69371.36
 [92] 114135.63  48742.84  78178.36  57424.26  56045.71  70408.57  78409.42
 [99]  74018.72  69994.17  93270.75 105046.21  97005.64  99743.58  99159.37
[106]  76934.89 109901.34  91702.66 109884.20  95230.04  77750.61  84751.39
[113] 100103.04  89766.74  83961.47 105473.04  78589.02 105046.21  65090.82
[120]  90639.49  81589.80 121115.26  58083.66  59248.58  65751.85  54371.79
[127]  69417.03  64619.06  77964.84  77561.83  67589.03  81385.85 116106.12
[134]  67549.53  52011.11 100793.16  52929.44  72824.00  71708.65  55453.71
[141]  64217.62  65933.91  65162.06  76439.30  92121.36  91315.51  68736.35
[148]  57492.86  64313.21 101420.38 102499.30  87475.46 109271.05 105389.04
[155]  97281.07 103846.20  92172.51  69425.37  87325.10  95541.38  97475.13
[162] 100956.47
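
Complementary error metrics can be obtained with caret's postResample() on the same held-out predictions (a sketch; its output is not reproduced here):

# RMSE, R-squared, and MAE on the test set
postResample(pred = predGBM, obs = test$price)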

Checking Model Accuracy using Mean Absolute Percentage Error

library(ie2misc)
error <- mape(predGBM, test$price)  # mean absolute percentage error (%)
Accuracy <- 100 - error             # expressed as percent accuracy
round(Accuracy, 2)
[1] 81.69
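
For transparency, the same figure can be computed by hand, since MAPE is simply the mean of the absolute percentage errors (a minimal sketch, assuming no zero prices in the test set):

# MAPE = mean(|actual - predicted| / |actual|) * 100
mape_manual <- mean(abs(test$price - predGBM) / abs(test$price)) * 100
round(100 - mape_manual, 2)  # should match the reported accuracy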

With a mean absolute percentage error of roughly 18%, i.e. an accuracy of about 81.7%, the gradient boosting model predicts house prices on this dataset well enough to be useful for forecasting.

Paolo G. Hilado

2019-09-01