It is always a good idea to study a packaged algorithm with a simple example. Inspired by my colleague Kodi’s excellent work showing how xgboost handles missing values, I use a simple 5x2 dataset to show how shrinkage and DART influence the growth of trees in the model.

Data

set.seed(123)
n0 <- 5
X <-  data.frame(x1 = runif(n0), x2 = runif(n0))
Y <-  c(1, 5, 20, 50, 100)
cbind(X, Y)
##       x1      x2   Y
## 1 0.2876 0.04556   1
## 2 0.7883 0.52811   5
## 3 0.4090 0.89242  20
## 4 0.8830 0.55144  50
## 5 0.9405 0.45661 100

Shrinkage

  1. Step size shrinkage is the major tool designed to prevent overfitting (over-specialization).
  • The R documentation says that the learning rate eta has range [0, 1], but xgboost accepts any value of \(\eta \ge 0\). Here I set eta = 2: the model then predicts perfectly after two steps, the train-rmse is 0 from iteration 2 on, and only two trees are effectively used.
    Of course, it is a bad idea to use a very large eta in real applications, as the trees will not be helpful and the predicted values will be very wrong.
  • The max_depth is the maximum depth of a tree. I set it to 10, but it won’t be reached here.
  • The other regularization parameters (gamma, lambda, alpha) are left at their defaults.
  • By setting base_score = 0 in xgboost, we can add up the values in the leaves of the two trees to get every number in Y: 1, 5, 20, 50, 100 (see the check after the output below).
# plain gbtree booster with an aggressive learning rate (eta = 2) for illustration
param_gbtree <- list(objective = 'reg:linear', nrounds = 3, 
                   eta = 2,
                   max_depth = 10,
                   min_child_weight = 0,
                   booster = 'gbtree'
)

simple.xgb.output <- function(param, ...){
  set.seed(1234)
  m <- xgboost(data = as.matrix(X), label = Y, params = param,
               nrounds = param$nrounds, 
               base_score = 0)
  cat('Evaluation log showing training error:\n')
  print(m$evaluation_log)
  # ntreelimit makes prediction use all nrounds trees (relevant for the dart booster below)
  pred <- predict(m, as.matrix(X), ntreelimit = param$nrounds)
  cat('Predicted values of Y: \n')
  print(pred)
  pred2 <- predict(m, as.matrix(X), predcontrib = TRUE)
  cat("SHAP value for X: \n")
  print(pred2)
  p <- xgb.plot.tree(model = m)
  p
}
simple.xgb.output(param_gbtree)
## [1]  train-rmse:22.360680 
## [2]  train-rmse:0.000000 
## [3]  train-rmse:0.000000 
## Evaluation log showing training error:
##    iter train_rmse
## 1:    1      22.36
## 2:    2       0.00
## 3:    3       0.00
## Predicted values of Y: 
## [1]   1   5  20  50 100
## SHAP value for X: 
##          x1     x2 BIAS
## [1,] -29.33 -4.867 35.2
## [2,] -26.00 -4.200 35.2
## [3,] -24.93  9.733 35.2
## [4,]  16.50 -1.700 35.2
## [5,]  66.50 -1.700 35.2
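
To double-check the leaf-sum claim above (my own addition, not part of the original post), the eta = 2 model can be rebuilt outside the helper and its trees inspected directly; the object name m_eta2 is mine, and xgb.model.dt.tree() / predleaf = TRUE are just one way to look at the leaves:
# Rebuild the eta = 2 model so its trees can be inspected (my addition)
set.seed(1234)
m_eta2 <- xgboost(data = as.matrix(X), label = Y, params = param_gbtree,
                  nrounds = param_gbtree$nrounds, base_score = 0)
# Leaf values in the dump already include the eta scaling, so summing the leaves
# reached by each observation across the trees reproduces the predictions (here, Y itself)
xgb.model.dt.tree(model = m_eta2)                 # 'Quality' is the value of each leaf
predict(m_eta2, as.matrix(X), predleaf = TRUE)    # which leaf each observation falls into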
  • If eta = 1, no perfect prediction can be made within three rounds and the trees grow in a more conservative manner (see the check after the output below).
# same setting as above, but with eta = 1
param_gbtree <- list(objective = 'reg:linear', nrounds = 3, 
                   eta = 1,
                   max_depth = 10,
                   min_child_weight = 0,
                   booster = 'gbtree'
)
simple.xgb.output(param_gbtree)
## [1]  train-rmse:22.831995 
## [2]  train-rmse:11.415998 
## [3]  train-rmse:5.707999 
## Evaluation log showing training error:
##    iter train_rmse
## 1:    1     22.832
## 2:    2     11.416
## 3:    3      5.708
## Predicted values of Y: 
## [1]  0.875  4.375 17.500 50.000 87.500
## SHAP value for X: 
##          x1     x2  BIAS
## [1,] -27.18 -3.999 32.05
## [2,] -24.20 -3.478 32.05
## [3,] -24.11  9.564 32.05
## [4,]  20.41 -2.463 32.05
## [5,]  56.98 -1.525 32.05
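
A small side observation of my own (not in the original post): with eta = 1 the train-rmse above does not reach zero but halves at every round, which is what the more conservative growth looks like on this toy data. The numbers below are copied from the evaluation log above:
# My own check: with eta = 1 the train-rmse halves each round (values copied from the log)
rmse_eta1 <- c(22.831995, 11.415998, 5.707999)
all.equal(rmse_eta1, rmse_eta1[1] / 2^(0:2), tolerance = 1e-6)   # TRUE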

DART: Dropouts meet MART

DART (paper on JMLR) adopts the dropout idea from neural networks and applies it to boosted regression trees (i.e., MART: Multiple Additive Regression Trees). DART aims to further prevent over-specialization. It requires setting booster = 'dart' in xgboost and tuning several additional hyper-parameters (Ref: R documentation).
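
As a rough mental model (my own sketch, not xgboost code: the helper fit_regression_tree() is hypothetical, and the rescaling that xgboost applies to the new and the dropped trees is omitted), one DART boosting iteration proceeds roughly like this:
# Rough sketch of one DART iteration (simplified; fit_regression_tree() is a
# hypothetical helper and the DART weight normalization is omitted)
dart_iteration <- function(trees, X, Y, rate_drop, skip_drop, one_drop) {
  if (runif(1) < skip_drop) {
    dropped <- integer(0)                       # dropout skipped: behaves like gbtree
  } else {
    dropped <- which(runif(length(trees)) < rate_drop)
    if (one_drop && length(dropped) == 0 && length(trees) > 0)
      dropped <- sample(seq_along(trees), 1)    # force at least one tree to be dropped
  }
  kept <- setdiff(seq_along(trees), dropped)
  # prediction from the trees that were NOT dropped
  partial_pred <- Reduce(`+`, lapply(trees[kept], function(f) f(X)), 0)
  # the new tree is fitted against the residual of the remaining ensemble
  new_tree <- fit_regression_tree(X, Y - partial_pred)
  c(trees, list(new_tree))
}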

skip_drop

  • The skip_drop (default = 0, range [0, 1]) is the probability of skipping the dropout procedure during a boosting iteration. It has a higher priority than the other DART parameters. If skip_drop = 1, the dropout is always skipped and dart is the same as gbtree. The setting below gives the same result as the eta = 2 gbtree above (results omitted; a quick check follows the code):
param_dart0 <- list(objective = 'reg:linear', nrounds = 3, 
                    eta = 2,
                    max_depth = 10,
                    booster = 'dart',
                    skip_drop = 1,  # = 1 means dropout is always skipped, i.e. same as gbtree
                    rate_drop = 1,  # ignored, since the dropout is always skipped
                    one_drop = 1    # ignored as well
)
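
To confirm the equivalence without reprinting the full output, here is my own check (not in the original post); modifyList() is only used to derive the matching eta = 2 gbtree configuration:
# My check: with skip_drop = 1 the dart booster should reproduce the eta = 2
# gbtree predictions exactly
param_gbtree_eta2 <- modifyList(param_dart0, list(booster = 'gbtree',
                                                  skip_drop = NULL,
                                                  rate_drop = NULL,
                                                  one_drop  = NULL))
set.seed(1234)
m_dart_skip <- xgboost(data = as.matrix(X), label = Y, params = param_dart0,
                       nrounds = param_dart0$nrounds, base_score = 0)
set.seed(1234)
m_gb_eta2 <- xgboost(data = as.matrix(X), label = Y, params = param_gbtree_eta2,
                     nrounds = param_gbtree_eta2$nrounds, base_score = 0)
all.equal(predict(m_dart_skip, as.matrix(X), ntreelimit = param_dart0$nrounds),
          predict(m_gb_eta2, as.matrix(X), ntreelimit = param_dart0$nrounds))  # should be TRUE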

rate_drop

  • When the dropout is not skipped (i.e., unless skip_drop intervenes), rate_drop (default = 0, range [0, 1]) determines the fraction of previous trees to drop before the model update in every iteration.
  • The DART paper (JMLR) says that dropout makes DART fall between gbtree and random forest: “If no tree is dropped, DART is the same as MART (gbtree); if all the trees are dropped, DART is no different than random forest.”
  • If rate_drop = 1, then all the trees are dropped at every iteration and a random forest of trees is built. In our case of a very simple dataset, the ‘random forest’ just repeats the same tree nrounds times (see the check after the output below):
param_dart1 <- list(objective = 'reg:linear', nrounds = 3, 
                   eta = 2,
                   max_depth = 10,
                   booster = 'dart',
                   skip_drop = 0,  
                   rate_drop = 1,  # drop all of the existing trees at every iteration
                   one_drop = 1
)
simple.xgb.output(param_dart1)
## [1]  train-rmse:22.360680 
## [2]  train-rmse:16.948286 
## [3]  train-rmse:19.388212 
## Evaluation log showing training error:
##    iter train_rmse
## 1:    1      22.36
## 2:    2      16.95
## 3:    3      19.39
## Predicted values of Y: 
## [1]  0.5833  2.9167 11.6667 58.3333 58.3333
## SHAP value for X: 
##          x1      x2  BIAS
## [1,] -22.94 -2.8389 26.37
## [2,] -21.00 -2.4500 26.37
## [3,] -20.38  5.6778 26.37
## [4,]  32.96 -0.9917 26.37
## [5,]  32.96 -0.9917 26.37
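
One way to see the repetition (my own addition, not in the original post): rebuild the model outside the helper (I call it m_rf) and dump its trees with xgb.dump() for eyeballing:
# My check of the "repeated tree" claim: dump the rate_drop = 1 model and compare
# its trees; the split structure should repeat, while leaf values may differ by
# DART's rescaling of new and dropped trees
set.seed(1234)
m_rf <- xgboost(data = as.matrix(X), label = Y, params = param_dart1,
                nrounds = param_dart1$nrounds, base_score = 0)
tree_dump <- xgb.dump(m_rf)
split(tree_dump, cumsum(grepl("^booster", tree_dump)))   # one list element per tree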

one_drop

  • If one_drop = 1, then at least one tree is always dropped. If I set rate_drop = 0 but one_drop = 1, the dropping still happens and the trees are built in a more conservative manner. Since the first tree is dropped when the second one is built, the second tree is the same as the first one.
param_dart2 <- list(objective = 'reg:linear', nrounds = 5, 
                   eta = 2,
                   max_depth = 10,
                   booster = 'dart',
                   skip_drop = 0,  
                   rate_drop = 0,  # but one_drop = 1 still forces at least one tree to be dropped
                   one_drop = 1
)
simple.xgb.output(param_dart2)
## [1]  train-rmse:22.360680 
## [2]  train-rmse:16.948286 
## [3]  train-rmse:15.066435 
## [4]  train-rmse:15.237643 
## [5]  train-rmse:14.225216 
## Evaluation log showing training error:
##    iter train_rmse
## 1:    1      22.36
## 2:    2      16.95
## 3:    3      15.07
## 4:    4      15.24
## 5:    5      14.23
## Predicted values of Y: 
## [1]  0.7037  5.8848 14.7737 53.2510 68.8066
## SHAP value for X: 
##          x1     x2  BIAS
## [1,] -24.09 -3.893 28.68
## [2,] -19.28 -3.522 28.68
## [3,] -18.88  4.974 28.68
## [4,]  22.36  2.211 28.68
## [5,]  41.70 -1.578 28.68
  • A similar conservative effect appears if I set rate_drop to a non-zero value (0.5 here) with one_drop = 0:
param_dart3 <- list(objective = 'reg:linear', nrounds = 5, 
                   eta = 2,
                   max_depth = 10,
                   booster = 'dart',
                   skip_drop = 0,  
                   rate_drop = 0.5,  # drop about half of the existing trees at every iteration
                   one_drop = 0
)
simple.xgb.output(param_dart3)
## [1]  train-rmse:22.360680 
## [2]  train-rmse:0.000000 
## [3]  train-rmse:7.453556 
## [4]  train-rmse:9.316948 
## [5]  train-rmse:9.823564 
## Evaluation log showing training error:
##    iter train_rmse
## 1:    1     22.361
## 2:    2      0.000
## 3:    3      7.454
## 4:    4      9.317
## 5:    5      9.824
## Predicted values of Y: 
## [1]  0.8333  4.1667 16.6667 63.8889 83.3333
## SHAP value for X: 
##          x1     x2  BIAS
## [1,] -28.89 -4.056 33.78
## [2,] -26.11 -3.500 33.78
## [3,] -25.22  8.111 33.78
## [4,]  31.53 -1.417 33.78
## [5,]  50.97 -1.417 33.78

Setting one_drop = 1 on top of this also gives a more conservative result, and a smaller train-rmse when using the same number of rounds.