Overview
I recently had the great pleasure to meet with Professor Allan Just and he introduced me to eXtreme Gradient Boosting (XGBoost). I have extended the earlier work on my old blog by comparing the results across XGBoost, Gradient Boosting (GBM), Random Forest, Lasso, and Best Subset. The ensemble method is powerful as it combines the predictions from multiple machine learning algorithms together to make more accurate predictions than an individual model. Random Forest is among the most famous ones and it is easy to use. Random Forest is based on bagging (bootstrap aggregation) which averages the results over many decision trees from sub-samples. It further limits its search to only 1/3 of the features (in regression) to fit each tree, weakening the correlations among decision trees. It is easy to get a lower testing error compared to linear models.
Boosting takes slower steps, building predictors sequentially rather than independently. It repeatedly exploits the patterns in the residuals, adds weak learners, and gradually strengthens the model. By combining the advantages of random forest and gradient boosting, XGBoost gave a prediction error ten times lower than boosting or random forest in my case.
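To make the residual-fitting idea concrete, here is a toy sketch in R of boosting with small regression trees. It is illustrative only and not part of the analysis below; the objects X and y and the use of rpart stumps are my own assumptions.
library(rpart)  # small regression trees as weak learners

simple_boost <- function(X, y, n_trees = 100, eta = 0.1) {
  pred <- rep(mean(y), length(y))          # start from the mean prediction
  for (b in seq_len(n_trees)) {
    resid <- y - pred                      # residuals left by the current model
    fit <- rpart(resid ~ ., data = data.frame(resid = resid, X),
                 control = rpart.control(maxdepth = 2))
    pred <- pred + eta * predict(fit, data.frame(X))   # small step toward the residuals
  }
  pred                                     # fitted values on the training data
}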
(Correction, 18.10.3) When I tested this dataset further, I realized that result was a mistake. There is indeed no magic in machine learning: when I get a testing error ten times smaller than other methods give, I should first ask whether something went wrong. In the corrected results, XGBoost still gives the lowest testing RMSE, but it is close to the other two tree-based methods.
Link to the earlier version: Model Selection using Lasso and Best Subset
About the Data
In sub-Saharan Africa, the region where deprivations in living conditions are most severe, slum dwellers represent an estimated 56% of the region's urban population (UN Habitat, 2016). Measuring informal settlements reliably is a critical challenge for the United Nations in monitoring the Sustainable Development Goals (SDGs) under its 2030 Agenda for Sustainable Development. The data were collected by Slum Dwellers International (SDI), which was nominated for the Nobel Peace Prize in 2014. By extracting valuable information from the data, we can help the UN create better SDG indicators that make full use of the available data.
In this exercise, we model only Share_Temporary (the share of temporary structures in a slum) as the dependent variable. The independent variables are monitoring indicators such as water, sanitation, housing conditions, and overcrowding in African slum settlements. The dataset dimension is 973 x 153.
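The code below refers to a training/testing split and to design matrices (train_idx, X_train, Y_train, dtrain, ...) created before modeling. A minimal sketch of that setup, assuming the cleaned data frame is called mydata; the split ratio and seed here are my own assumptions.
library(xgboost)

set.seed(123)
train_idx <- sample(nrow(mydata), 0.75 * nrow(mydata))   # 75/25 split (assumed)

# Model matrices for glmnet / xgboost (expand factors to dummies, drop intercept)
X <- model.matrix(Share_Temporary ~ ., data = mydata)[, -1]
Y <- mydata$Share_Temporary
X_train <- X[train_idx, ];  Y_train <- Y[train_idx]
X_test  <- X[-train_idx, ]; Y_test  <- Y[-train_idx]

# xgboost uses its own data structure
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = Y_test)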
1. Extreme Gradient Boosting
- Random search: randomize the parameters and keep a record of the best combination found so far.
- It turns out to be a very interesting way to scan for hyperparameters; it will take a while for 100 iterations.
- The package xgboost is really fast.
library(xgboost)

# Randomize and bound
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0

set.seed(1234)
# In reality, might need 100 or 200 iters
for (iter in 1:10) {
  param <- list(objective = "reg:linear",            # For regression
                eval_metric = "rmse",                # rmse is used for regression
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .1),             # Learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8),
                min_child_weight = sample(5:10, 1),  # These two are important
                max_delta_step = sample(5:10, 1)     # Can help to focus error into a small range
  )
  cv.nround <- 1000
  cv.nfold <- 5                        # 5-fold cross-validation
  seed.number <- sample.int(10000, 1)  # Set seed for the cv
  set.seed(seed.number)

  mdcv <- xgb.cv(data = dtrain, params = param,
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = FALSE, early_stopping_rounds = 8, maximize = FALSE)

  min_rmse_index <- mdcv$best_iteration
  min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean

  if (min_rmse < best_rmse) {
    best_rmse <- min_rmse
    best_rmse_index <- min_rmse_index
    best_seednumber <- seed.number
    best_param <- param
  }
}
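The fitted model xg_mod used below is then trained with the best parameters and seed found by the random search. A minimal sketch of that step and of the reported test MSE; this is my reconstruction, reusing the dtest and Y_test objects assumed earlier.
set.seed(best_seednumber)
xg_mod <- xgb.train(params = best_param, data = dtrain, nrounds = best_rmse_index)

# Test-set prediction and MSE
yhat_xgb <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xgb - Y_test)^2))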
- The best tuning parameters
## objective eval_metric max_depth eta subsample colsample_bytree
## 1 reg:linear rmse 9 0.09822 0.64 0.6853
## min_child_weight max_delta_step best_rmse_index best_rmse best_seednumber
## 1 6 8 56 0.2102 3660
- MSE
## [22:09:43] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
## [1] 0.04237
- Feature Importance
importance_matrix <- xgb.importance(colnames(X_train), model = xg_mod)
# Use `xgb.plot.importance`, which creates a _barplot_, or use `xgb.ggplot.importance`
library(Ckmeans.1d.dp) # for xgb.ggplot.importance
xgb.ggplot.importance(importance_matrix, top_n = 15, measure = "Gain")
- Plot just one tree as an example (use trees = 1; tree indices are 0-based)
library("DiagrammeR")
xgb.plot.tree(model = xg_mod, trees = 1, feature_names = colnames(X_train))
- Project all the trees onto one summary tree and plot it: a huge plot
xgb.plot.multi.trees(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
2. Gradient boosting
- Use library gbm.
- Tuning method: use the train function from caret to scan a grid of parameters.
library(gbm) # for Gradient boosting
library(caret) # scan the parameter grid using `train` function
# time_now <- Sys.time()
para_grid <- expand.grid(n.trees = 20 * c(50:100),
                         shrinkage = c(0.1, 0.05, 0.01),
                         interaction.depth = c(1, 3, 5),
                         n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
set.seed(123)
gbm_caret <- train(Share_Temporary ~ ., data = mydata[train_idx, ],
                   distribution = "gaussian", method = "gbm",
                   trControl = trainControl, verbose = FALSE,
                   tuneGrid = para_grid, metric = "RMSE", bag.fraction = 0.75)
# Sys.time() - time_now
## Time difference of 2.283 mins
- The tuning parameters that give the lowest cross-validated RMSE on the training set.
## n.trees interaction.depth shrinkage n.minobsinnode
## 36 1700 1 0.01 10
- MSE
## [1] 0.04838
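For reference, the test-set MSE above can be obtained by predicting with the tuned caret model; a minimal sketch, with object names following the setup sketched in the data section.
# caret refits the best tuning combination on the full training set,
# so predict() on the train object uses the selected parameters
yhat_gbm <- predict(gbm_caret, newdata = mydata[-train_idx, ])
(MSE_boost <- mean((yhat_gbm - mydata$Share_Temporary[-train_idx])^2))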
3. Random Forest
- Use library randomForest.
library(randomForest)
rf.fit <- randomForest(Share_Temporary ~ ., data = mydata2, subset = train_idx)
# Test on the held-out data: mydata2[-train_idx, ]
yhat_bag <- predict(rf.fit, newdata = mydata2[-train_idx,])
- MSE on the testing dataset:
## [1] 0.04359
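The MSE above compares those predictions with the observed shares; a one-line sketch under the same assumed setup.
(MSE_rForest <- mean((yhat_bag - mydata2$Share_Temporary[-train_idx])^2))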
- Feature Importance (showing top 15)
- Variables ranked high show the greatest relative importance in the tree model.
- For example, Monthly Water Cost, Resettled Housing, and Population Estimate are the most influential features.
varImpPlot(rf.fit, n.var=15)
4. Lasso
Use library glmnet.
Lasso is a shrinkage approach to feature selection. The tuning parameter lambda controls the magnitude of the penalty: an increasing penalty shrinks coefficients toward zero. The advantage of a linear model is that the result is highly interpretable. We use cross-validation to choose lambda and the corresponding features.
The dotted line on the left is lambda.min, the lambda that gives the lowest cross-validated MSE. The dotted line on the right is lambda.1se; its MSE is not the lowest but is within one standard error of the minimum, and it keeps even fewer features in the model. We use lambda.1se in our case.
# Use cross-validation to select the lambda
cv_lasso = cv.glmnet(X_train, Y_train, alpha=1) # Lasso regression
plot(cv_lasso)
# lambda selected by 1se rule
(best_lam <- cv_lasso$lambda.1se)
## [1] 0.03845
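The lasso_mod object used in the next chunk is the lasso fit itself on the training data, which is not shown above; a minimal sketch of that step (my reconstruction).
library(glmnet)  # if not already loaded
# Fit the full lasso path on the training data; predictions below use s = best_lam
lasso_mod <- glmnet(X_train, Y_train, alpha = 1)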
- MSE
# Check prediction error in the testing dataset
lasso_pred <- predict(lasso_mod, s = best_lam, newx = X_test)
# The Mean squared error (MSE)
(MSE_Lasso <- mean((lasso_pred - Y_test)^2))
## [1] 0.06751
The regression model for the selected lambda (lasso): we extract the variables with nonzero coefficients from the selected model and run a linear regression on them. The model uses 17 variables. The most influential predictors selected by the lasso include Resettled Housing, canal as a location problem, Water_MonthlyCost, wells as a water source, and Eviction Threats. The signs show the direction of each association: for example, resettled housing and a higher monthly water cost are associated with fewer temporary structures, while canal-related location problems and eviction threats are associated with more. Relative importance is shown by the standardized regression coefficients (stb), sorted in decreasing order of their absolute values.
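The refit on the selected variables is not shown either; a minimal sketch of how reg_lasso_mod and reg_lasso_summary could be built, reusing the matrices assumed earlier (my reconstruction, not the original code).
# Variables with nonzero lasso coefficients at lambda.1se
sel_coef <- as.matrix(coef(cv_lasso, s = "lambda.1se"))
sel_vars <- setdiff(rownames(sel_coef)[sel_coef != 0], "(Intercept)")

# Refit an ordinary linear regression on the selected columns only
reg_lasso_mod <- lm(Y_train ~ ., data = data.frame(Y_train, X_train[, sel_vars]))
reg_lasso_summary <- summary(reg_lasso_mod)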
# lm.beta() computes standardized coefficients (assumed to be QuantPsyc::lm.beta)
coef_table2 <- data.frame(reg_lasso_summary$coefficients, stb = c(0, lm.beta(reg_lasso_mod)))
coef_table2[order(abs(coef_table2$stb), decreasing = T),]
## Estimate Std..Error t.value Pr...t..
## B14__resettled -1.500e-01 3.232e-02 -4.641 4.404e-06
## DD1_Location_Problemscanal 1.896e-01 3.278e-02 5.785 1.261e-08
## FF11_Water_MonthlyCost -3.354e-06 6.940e-07 -4.832 1.788e-06
## FF1_8_Water_Sourceswells -1.146e-01 2.294e-02 -4.995 8.084e-07
## Eviction_Threats 1.053e-01 2.428e-02 4.337 1.736e-05
## B14__declared_legal_protected 8.688e-02 2.967e-02 2.928 3.561e-03
## DD1_Location_Problemsslope 8.235e-02 2.366e-02 3.481 5.422e-04
## EE2B_Current_Eviction_Seriousnessmedium -2.149e-01 6.549e-02 -3.282 1.101e-03
## GG1_Sewer_Line 7.801e-02 2.500e-02 3.120 1.908e-03
## FF1_8_Water_Sourceswater_tankers -1.203e-01 3.949e-02 -3.046 2.434e-03
## GG7_Managerprivate -5.983e-02 2.417e-02 -2.475 1.363e-02
## DD1_Location_Problemsflood_prone_area 4.860e-02 2.254e-02 2.156 3.152e-02
## FF1_8_Water_Sourcescommunity_taps 4.666e-02 2.929e-02 1.593 1.118e-01
## (Intercept) 3.949e-01 3.957e-02 9.979 1.479e-21
## stb
## B14__resettled -0.19503
## DD1_Location_Problemscanal 0.18221
## FF11_Water_MonthlyCost -0.17854
## FF1_8_Water_Sourceswells -0.15557
## Eviction_Threats 0.14234
## B14__declared_legal_protected 0.11390
## DD1_Location_Problemsslope 0.10926
## EE2B_Current_Eviction_Seriousnessmedium -0.10030
## GG1_Sewer_Line 0.09478
## FF1_8_Water_Sourceswater_tankers -0.09144
## GG7_Managerprivate -0.07741
## DD1_Location_Problemsflood_prone_area 0.06608
## FF1_8_Water_Sourcescommunity_taps 0.05158
## (Intercept) 0.00000
5. Best Subset
Use library leaps.
Best subset is a subset-selection approach to feature selection. Unlike stepwise or forward selection, best subset in theory checks all possible feature combinations. Since I select from 49 predictors but cap the subset size at 25, there are C(49,25) + C(49,24) + … + C(49,0) ≈ 345 trillion models to check. As I discussed in my earlier post, it is not feasible to scan all of them; both R and SAS use the branch-and-bound algorithm to speed up the calculation. Without cross-validation, we could use the traditional criteria to choose a model: adjusted R-squared, Cp (AIC), or BIC.
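The fitting call is not shown above; a minimal sketch using leaps::regsubsets, assuming the candidate predictors have already been reduced to the data frame used for modeling (the option values here are my assumptions).
library(leaps)

# Exhaustive (branch-and-bound) search over subsets of up to 25 predictors
best_sub <- regsubsets(Share_Temporary ~ ., data = mydata[train_idx, ],
                       nvmax = 25, method = "exhaustive", really.big = TRUE)
summary(best_sub)$bic   # e.g., inspect BIC across subset sizes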
The tuning parameter is the number of predictors to use. The selected number of features also happens to be 17.
Cross-validation selects more features than BIC but fewer than adjusted R-squared or Cp (AIC). Below are the selected regression model and the standardized parameter estimates (stb), showing relative feature importance in decreasing order.
## b.Estimate b.Std..Error b.t.value
## B14__resettled -2.275e-01 4.111e-02 -5.5342
## DD1_Location_Problemscanal 2.076e-01 4.235e-02 4.9029
## FF11_Water_MonthlyCost -3.553e-06 9.722e-07 -3.6545
## Eviction_Threats 1.227e-01 4.541e-02 2.7018
## B14__declared_legal_protected 1.203e-01 4.067e-02 2.9583
## FF1_8_Water_Sourceswater_tankers -1.820e-01 5.533e-02 -3.2893
## FF1_8_Water_Sourcesshared_taps -1.117e-01 4.894e-02 -2.2831
## DD1_Location_Problemsflood_prone_area 6.877e-02 3.054e-02 2.2516
## GG7_10_Toilet_Typesindividual_toilets -6.214e-02 4.268e-02 -1.4561
## FF1_8_Water_Sourcessprings -5.563e-02 4.252e-02 -1.3085
## JJ1_Electricity_Availableyes 5.644e-02 4.699e-02 1.2012
## DD1_Location_Problemsgarbage_dump -3.611e-02 3.347e-02 -1.0791
## EE2A_Current_Eviction_Threat 2.456e-02 4.528e-02 0.5425
## FF1_8_Water_Sourcesrivers -4.108e-02 5.889e-02 -0.6976
## FF1_8_Water_Sourcesdams -2.846e-02 7.258e-02 -0.3921
## FF12_Water_CollectionTime30_minutes 1.275e-02 4.478e-02 0.2847
## DD1_Location_Problemsroad_side -1.899e-03 3.160e-02 -0.0601
## (Intercept) 3.932e-01 7.437e-02 5.2871
## b.Pr...t.. stb
## B14__resettled 6.856e-08 -0.294884
## DD1_Location_Problemscanal 1.554e-06 0.205337
## FF11_Water_MonthlyCost 3.044e-04 -0.185969
## Eviction_Threats 7.293e-03 0.164439
## B14__declared_legal_protected 3.341e-03 0.156849
## FF1_8_Water_Sourceswater_tankers 1.125e-03 -0.135517
## FF1_8_Water_Sourcesshared_taps 2.313e-02 -0.096201
## DD1_Location_Problemsflood_prone_area 2.508e-02 0.093098
## GG7_10_Toilet_Typesindividual_toilets 1.464e-01 -0.058293
## FF1_8_Water_Sourcessprings 1.917e-01 -0.056320
## JJ1_Electricity_Availableyes 2.306e-01 0.050303
## DD1_Location_Problemsgarbage_dump 2.814e-01 -0.044732
## EE2A_Current_Eviction_Threat 5.879e-01 0.031196
## FF1_8_Water_Sourcesrivers 4.860e-01 -0.027726
## FF1_8_Water_Sourcesdams 6.953e-01 -0.015868
## FF12_Water_CollectionTime30_minutes 7.760e-01 0.011364
## DD1_Location_Problemsroad_side 9.521e-01 -0.002456
## (Intercept) 2.408e-07 0.000000
- MSE
## [1] 0.06979
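The test MSE above comes from predicting with the selected 17-variable linear model on the held-out rows; a sketch under the assumed setup, where sel_vars_bs is a hypothetical name for the variables chosen by cross-validation.
# Refit the chosen model on the training rows, then predict the test rows
bs_fit  <- lm(Y_train ~ ., data = data.frame(Y_train, X_train[, sel_vars_bs]))
bs_pred <- predict(bs_fit, newdata = data.frame(X_test[, sel_vars_bs]))
(MSE_best.subset <- mean((bs_pred - Y_test)^2))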
Compare MSEs
(In the original version I wrote that the advantage of XGBoost was highly distinguishable; see the correction above.)
- XGBoost shows an advantage in RMSE, but it is not a dramatic one.
- XGBoost's real advantages include its speed and its ability to handle missing values.
## MSE_xgb MSE_boost MSE_Lasso MSE_rForest MSE_best.subset
## 1 0.04237 0.04838 0.06751 0.04359 0.06979
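The comparison table above can be assembled from the individual MSE objects computed in each section; a possible one-liner, assuming each MSE was stored under the names shown in the output.
(MSE_compare <- data.frame(MSE_xgb, MSE_boost, MSE_Lasso, MSE_rForest, MSE_best.subset))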