Predicting House Prices using Gradient Boost Machine Learning - Paolo G. Hilado
- Situationer
- Load Data
- Check for Multicollinearity
- Preprocess continuous variables via normalization
- Data Partition; 70% Training and 30% Testing
- Setting up 5-fold Cross-Validation
- Creating the Gradient Boost Machine Learning Model
- Checking out Influential Variables
- Sample Predictions from the Model
- Checking Model Accuracy using Mean Absolute Percentage Error
Situationer
This case study presents how a machine learning model such as gradient boosting can be used in real estate to predict house prices from variables such as lot area, presence of a garage, number of stories, and number of bathrooms.
Load Data
library(xlsx)
library(caret)
#Open file
data <- read.xlsx("Housing Data.xlsx", sheetName = "Sheet1")
#Check the Structure of the Data
str(data)
'data.frame': 546 obs. of 12 variables:
$ price : num 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
$ lotsize : num 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
$ bedrooms: num 3 2 3 3 2 3 3 3 3 3 ...
$ bathrms : num 1 1 1 1 1 1 2 1 1 2 ...
$ stories : num 2 1 1 2 1 1 2 3 1 4 ...
$ driveway: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ recroom : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
$ fullbase: Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
$ gashw : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ airco : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
$ garagepl: num 1 0 0 0 0 0 2 0 0 1 ...
$ prefarea: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#Check the first and last 6 rows
head(data); tail(data)
price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
1 42000 5850 3 1 2 yes no yes no
2 38500 4000 2 1 1 yes no no no
3 49500 3060 3 1 1 yes no no no
4 60500 6650 3 1 2 yes yes no no
5 61000 6360 2 1 1 yes no no no
6 66000 4160 3 1 1 yes yes yes no
airco garagepl prefarea
1 no 1 no
2 no 0 no
3 no 0 no
4 no 0 no
5 no 0 no
6 yes 0 no
price lotsize bedrooms bathrms stories driveway recroom fullbase
541 85000 6525 3 2 4 yes no no
542 91500 4800 3 2 4 yes yes no
543 94000 6000 3 2 4 yes no no
544 103000 6000 3 2 4 yes yes no
545 105000 6000 3 2 2 yes yes no
546 105000 6000 3 1 2 yes no no
gashw airco garagepl prefarea
541 no no 1 no
542 no yes 0 no
543 no yes 0 no
544 no yes 1 no
545 no yes 1 no
546 no yes 1 no
Check for Multicollinearity
library(car)
mod1 <- lm(price~., data = data)
vif(mod1)
lotsize bedrooms bathrms stories driveway recroom fullbase gashw
1.321632 1.365633 1.282494 1.478584 1.163091 1.210501 1.316543 1.038246
airco garagepl prefarea
1.201397 1.200839 1.147639
## All variance inflation factors are well below the usual cutoff of 5, so multicollinearity is not a concern
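As a sanity check, any one of these VIFs can be reproduced by hand: regress that predictor on the other predictors and apply VIF = 1 / (1 - R²). A minimal sketch for lotsize, assuming the data frame loaded above:

```r
# Reproduce the VIF for lotsize by hand:
# regress lotsize on every other predictor (price excluded)
aux <- lm(lotsize ~ . - price, data = data)
r2  <- summary(aux)$r.squared
1 / (1 - r2)   # should match vif(mod1)["lotsize"] above
```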
Preprocess continuous variables via normalization
nvar <- preProcess(data[,-1], method = c("center", "scale"))
data[,-1] <- predict(nvar, data[,-1])
# Factor variables are ignored by this step; only the numeric columns are centered and scaled
nvar
Created from 546 samples and 11 variables
Pre-processing:
- centered (5)
- ignored (6)
- scaled (5)
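What preProcess does to each numeric column is the standard z-score transform, x' = (x - mean(x)) / sd(x). A minimal sketch of the equivalent manual computation, using the first few raw lotsize values from the data as an example vector:

```r
# z-score normalization by hand on an example vector
x <- c(5850, 4000, 3060, 6650, 6360)   # first few raw lotsize values
z <- (x - mean(x)) / sd(x)
round(z, 3)

# base R's scale() gives the same result
all.equal(as.numeric(scale(x)), z)
```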
Data Partition; 70% Training and 30% Testing
set.seed(2)  # seed must be set before the partition for it to be reproducible
inTrain <- createDataPartition(y = data$price, p = .70, list = FALSE)
train <- data[inTrain,]
test <- data[-inTrain,]
Setting up 5-fold Cross-Validation
Ctrl <- trainControl(method = "cv", number = 5)
Creating the Gradient Boost Machine Learning Model
modGBM <- train(price~., data = train, method = "gbm", trControl = Ctrl, verbose = FALSE)
Checking out Influential Variables
summary(modGBM$finalModel)
var rel.inf
lotsize lotsize 54.326086
bathrms bathrms 16.649759
aircoyes aircoyes 9.671063
stories stories 6.134566
prefareayes prefareayes 2.887274
garagepl garagepl 2.752367
recroomyes recroomyes 1.708425
fullbaseyes fullbaseyes 1.666379
drivewayyes drivewayyes 1.546283
gashwyes gashwyes 1.461250
bedrooms bedrooms 1.196548
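caret also exposes the same gbm relative influences through varImp(), which rescales them to a 0–100 range and is convenient for plotting:

```r
# Same relative influence as summary(modGBM$finalModel), rescaled to 0-100
varImp(modGBM)
plot(varImp(modGBM))
```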
Sample Predictions from the Model
predGBM <- predict(modGBM, test)
predGBM
[1] 60805.56 64731.47 34731.97 35356.82 45755.08 42220.44 65347.27
[8] 43677.72 46526.51 64916.61 47700.05 64923.79 53217.18 70054.73
[15] 35308.83 36100.53 62826.29 73295.09 68450.53 61232.40 74066.37
[22] 46102.45 76860.34 76305.49 65708.21 50627.31 51944.65 85756.93
[29] 54581.47 49549.65 50646.34 70878.44 42902.88 62505.03 62237.89
[36] 84370.42 50589.78 67742.32 52990.03 42331.28 58361.99 40191.52
[43] 50656.50 53377.02 38330.76 46726.91 58771.75 51865.83 61901.42
[50] 37091.88 51434.14 33256.05 54450.64 42689.66 102532.41 40211.58
[57] 54859.85 48470.98 78387.16 43677.72 54919.86 46589.11 54103.71
[64] 66712.01 51396.61 52493.31 60741.53 54760.02 60868.80 55810.21
[71] 50619.01 42689.66 57760.09 78131.05 50837.96 60661.46 61232.40
[78] 45110.17 58343.44 51963.87 85821.21 56045.71 68769.43 73458.27
[85] 90660.61 72116.87 84539.71 75553.07 106527.82 109901.34 69371.36
[92] 114135.63 48742.84 78178.36 57424.26 56045.71 70408.57 78409.42
[99] 74018.72 69994.17 93270.75 105046.21 97005.64 99743.58 99159.37
[106] 76934.89 109901.34 91702.66 109884.20 95230.04 77750.61 84751.39
[113] 100103.04 89766.74 83961.47 105473.04 78589.02 105046.21 65090.82
[120] 90639.49 81589.80 121115.26 58083.66 59248.58 65751.85 54371.79
[127] 69417.03 64619.06 77964.84 77561.83 67589.03 81385.85 116106.12
[134] 67549.53 52011.11 100793.16 52929.44 72824.00 71708.65 55453.71
[141] 64217.62 65933.91 65162.06 76439.30 92121.36 91315.51 68736.35
[148] 57492.86 64313.21 101420.38 102499.30 87475.46 109271.05 105389.04
[155] 97281.07 103846.20 92172.51 69425.37 87325.10 95541.38 97475.13
[162] 100956.47
Checking Model Accuracy using Mean Absolute Percentage Error
library(ie2misc)
error <- mape(predGBM, test$price)
Accuracy <- 100 - error
round(Accuracy, 2)
[1] 81.69
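MAPE is the mean of the absolute percent deviations of predictions from observed prices, so the accuracy figure above can be reproduced without ie2misc. A minimal sketch, assuming predGBM and test from the steps above:

```r
# MAPE by hand: mean of |observed - predicted| / |observed|, expressed in percent
mape_manual <- mean(abs((test$price - predGBM) / test$price)) * 100
round(100 - mape_manual, 2)   # should match the accuracy reported above
```

Note that MAPE weights percentage errors equally across cheap and expensive houses, so a fixed dollar error counts more heavily against low-priced homes.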