It is always a good idea to study a packaged algorithm with a simple example. Inspired by my colleague Kodi’s excellent work showing how `xgboost` handles missing values, I tried a simple 5x2 dataset to show how shrinkage and DART influence the growth of trees in the model.

# Data

```
set.seed(123)
n0 <- 5
X <- data.frame(x1 = runif(n0), x2 = runif(n0))
Y <- c(1, 5, 20, 50, 100)
cbind(X, Y)
```

```
## x1 x2 Y
## 1 0.2876 0.04556 1
## 2 0.7883 0.52811 5
## 3 0.4090 0.89242 20
## 4 0.8830 0.55144 50
## 5 0.9405 0.45661 100
```

# Shrinkage

- Step size shrinkage was the major tool designed to prevent overfitting (over-specialization).

- The R documentation says that the learning rate `eta` has range [0, 1], but `xgboost` accepts any value of \(eta \ge 0\). Here I select `eta = 2`; the model then predicts perfectly in two steps: the train-rmse is 0 from iteration 2 onward, so only two trees are needed.

Of course, it is a bad idea to use a very large *eta* in real applications, as the trees will not be helpful and the predicted values will be very wrong.

- `max_depth` is the maximum depth of a tree; I set it to 10, but that depth won’t be reached.

- Apart from the default L2 penalty on the leaf weights (`lambda = 1`), no other regularization is applied (`alpha` and `gamma` default to 0).
- By setting `base_score = 0` in `xgboost`, we can add up the values in the leaves of the two trees to get every number in Y: 1, 5, 20, 50, 100.

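```
library(xgboost)

# gbtree with a deliberately large learning rate (eta = 2)
# note: 'reg:linear' is a deprecated alias of 'reg:squarederror' in newer xgboost releases
param_gbtree <- list(objective = 'reg:linear', nrounds = 3,
                     eta = 2,
                     max_depth = 10,
                     min_child_weight = 0,
                     booster = 'gbtree'
)

# Fit on the toy data and print the evaluation log, predictions, and SHAP values
simple.xgb.output <- function(param, ...){
  set.seed(1234)
  m = xgboost(data = as.matrix(X), label = Y, params = param,
              nrounds = param$nrounds,
              base_score = 0)
  # the log holds the training RMSE; no separate test set is used here
  cat('Evaluation log showing testing error:\n')
  print(m$evaluation_log)
  pred <- predict(m, as.matrix(X), ntreelimit = param$nrounds)
  cat('Predicted values of Y: \n')
  print(pred)
  # per-feature contributions (SHAP values) plus the BIAS column
  pred2 <- predict(m, as.matrix(X), predcontrib = TRUE)
  cat("SHAP value for X: \n")
  print(pred2)
  # plot the trees of the fitted model
  p <- xgb.plot.tree(model = m)
  p
}
simple.xgb.output(param_gbtree)
```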

```
## [1] train-rmse:22.360680
## [2] train-rmse:0.000000
## [3] train-rmse:0.000000
## Evaluation log showing testing error:
## iter train_rmse
## 1: 1 22.36
## 2: 2 0.00
## 3: 3 0.00
## Predicted values of Y:
## [1] 1 5 20 50 100
## SHAP value for X:
## x1 x2 BIAS
## [1,] -29.33 -4.867 35.2
## [2,] -26.00 -4.200 35.2
## [3,] -24.93 9.733 35.2
## [4,] 16.50 -1.700 35.2
## [5,] 66.50 -1.700 35.2
```
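The two-step perfect fit can be checked by hand. The sketch below is not `xgboost` itself; it assumes squared-error loss, the default `lambda = 1`, and a leaf layout inferred from the printed output rather than read from the tree dump (rows 1 to 3 each in their own leaf and rows 4 and 5 sharing a leaf in the first tree). The helper names `leaf_weight`, `split_gain`, and `boost_round` are made up for illustration. Under these assumptions a single-row leaf gets the weight residual / 2, which `eta = 2` scales back to the full residual.

```r
# Toy targets from the Data section; base_score = 0 means predictions start at 0
Y      <- c(1, 5, 20, 50, 100)
eta    <- 2   # learning rate used in the run above
lambda <- 1   # assumed: xgboost's default L2 penalty on leaf weights

# Optimal leaf weight for squared-error loss: sum of residuals / (n + lambda)
leaf_weight <- function(res) sum(res) / (length(res) + lambda)

# Split gain up to a constant factor; a negative value means the split does not pay off
split_gain <- function(res_left, res_right) {
  score <- function(res) sum(res)^2 / (length(res) + lambda)
  score(res_left) + score(res_right) - score(c(res_left, res_right))
}

# One boosting round for a given (assumed) assignment of rows to leaves
boost_round <- function(pred, leaves) {
  res <- Y - pred
  for (idx in leaves) pred[idx] <- pred[idx] + eta * leaf_weight(res[idx])
  pred
}

rmse <- function(pred) sqrt(mean((Y - pred)^2))

# Round 1: separating rows 4 and 5 has negative gain, so they are assumed to share a leaf
gain_45 <- split_gain(50, 100)
pred1 <- boost_round(rep(0, 5), list(1, 2, 3, c(4, 5)))

# Round 2: only row 4 still has a residual, so isolating it is enough
pred2 <- boost_round(pred1, list(4, c(1, 2, 3, 5)))

print(gain_45)                      # -1250
print(pred1); print(rmse(pred1))    # 1 5 20 100 100, rmse about 22.36
print(pred2); print(rmse(pred2))    # 1 5 20 50 100, rmse 0
```

The first-round rmse of about 22.36 and the zero error in round 2 agree with the evaluation log above, which suggests the assumed leaf layout is at least consistent with what `xgboost` built.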

- If `eta = 1`, no perfect prediction is reached and the trees grow in a more conservative manner:

```
# same gbtree settings as before, but with eta = 1
param_gbtree <- list(objective = 'reg:linear', nrounds = 3,
eta = 1,
max_depth = 10,
min_child_weight = 0,
booster = 'gbtree'
)
simple.xgb.output(param_gbtree)
```

```
## [1] train-rmse:22.831995
## [2] train-rmse:11.415998
## [3] train-rmse:5.707999
## Evaluation log showing testing error:
## iter train_rmse
## 1: 1 22.832
## 2: 2 11.416
## 3: 3 5.708
## Predicted values of Y:
## [1] 0.875 4.375 17.500 50.000 87.500
## SHAP value for X:
## x1 x2 BIAS
## [1,] -27.18 -3.999 32.05
## [2,] -24.20 -3.478 32.05
## [3,] -24.11 9.564 32.05
## [4,] 20.41 -2.463 32.05
## [5,] 56.98 -1.525 32.05
```
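The halving of the train-rmse from round to round can be reproduced with a few lines of base R. This is a sketch under the same assumptions as before, not a call into `xgboost`: squared-error loss, the default `lambda = 1`, and a leaf layout inferred from the printed predictions (rows 4 and 5 share a leaf in round 1, every row has its own leaf in rounds 2 and 3). With `eta = 1`, a single-row leaf removes only half of the remaining residual per round.

```r
Y      <- c(1, 5, 20, 50, 100)
eta    <- 1
lambda <- 1   # assumed: xgboost's default L2 penalty on leaf weights

pred <- rep(0, 5)   # base_score = 0

# Assumed leaf membership per round (inferred from the output, not from the tree dump)
leaves_by_round <- list(list(1, 2, 3, c(4, 5)), as.list(1:5), as.list(1:5))

rmse_log <- numeric(0)
for (leaves in leaves_by_round) {
  res <- Y - pred
  for (idx in leaves) {
    # leaf weight = sum of residuals / (rows in leaf + lambda), scaled by eta
    pred[idx] <- pred[idx] + eta * sum(res[idx]) / (length(idx) + lambda)
  }
  rmse_log <- c(rmse_log, sqrt(mean((Y - pred)^2)))
}

print(round(rmse_log, 3))   # 22.832 11.416 5.708
print(pred)                 # 0.875 4.375 17.500 50.000 87.500
```

Row 4 is already fitted exactly after round 1 because the shared leaf weight (50 + 100) / 3 = 50 happens to equal its target; the other rows keep 1/8 of their original value as residual after three rounds.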

# DART: Dropout - MART

DART (paper on JMLR) adapted the dropout method from neural networks to boosted regression trees (i.e., MART: Multiple Additive Regression Trees). DART aims to further prevent over-specialization. It requires selecting `booster = 'dart'` in `xgboost` and tuning several hyper-parameters (Ref: R documents).

`skip_drop`

- `skip_drop` (default = 0, range [0, 1]) is the probability of skipping dropout. It has a higher priority than the other DART parameters. If `skip_drop = 1`, the dropout procedure is always skipped and `dart` is the same as `gbtree`. The setting below gives the same result as the `gbtree` above (results omitted):

```
param_gbtree <- list(objective = 'reg:linear', nrounds = 3,
eta = 2,
max_depth = 10,
booster = 'dart',
skip_drop = 1, # = 1 means always skip, = gbtree
rate_drop = 1, # doesn't matter since drop is always skipped
one_drop = 1
)
```

`rate_drop`

- If `skip_drop` \(\ne 1\), `rate_drop` (default = 0, range [0, 1]) drops a fraction of the existing trees before the model update in every iteration in which dropout is not skipped.
- The DART paper says that dropout places DART between gbtree and random forest: “If no tree is dropped, DART is the same as MART (`gbtree`); if all the trees are dropped, DART is no different than random forest.”
- If `rate_drop = 1`, all the trees are dropped and a random forest of trees is built. In our case of a very simple dataset, the ‘random forest’ just repeats the same tree `nrounds` times:

```
param_dart1 <- list(objective = 'reg:linear', nrounds = 3,
eta = 2,
max_depth = 10,
booster = 'dart',
skip_drop = 0,
rate_drop = 1, # drop all existing trees in every iteration
one_drop = 1
)
simple.xgb.output(param_dart1)
```

```
## [1] train-rmse:22.360680
## [2] train-rmse:16.948286
## [3] train-rmse:19.388212
## Evaluation log showing testing error:
## iter train_rmse
## 1: 1 22.36
## 2: 2 16.95
## 3: 3 19.39
## Predicted values of Y:
## [1] 0.5833 2.9167 11.6667 58.3333 58.3333
## SHAP value for X:
## x1 x2 BIAS
## [1,] -22.94 -2.8389 26.37
## [2,] -21.00 -2.4500 26.37
## [3,] -20.38 5.6778 26.37
## [4,] 32.96 -0.9917 26.37
## [5,] 32.96 -0.9917 26.37
```
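The predictions above can be reproduced by hand if one assumes the default DART normalization (`normalize_type = 'tree'`), under which a new tree gets the weight 1 / (k + eta) and each of the k dropped trees is rescaled by k / (k + eta). The sketch below is base R, not `xgboost`; it also assumes the default `lambda = 1` and the same leaf layout as in the `gbtree` run (rows 4 and 5 share a leaf). Because every earlier tree is dropped, each new tree is fit to Y itself and is therefore identical to the first one.

```r
Y       <- c(1, 5, 20, 50, 100)
eta     <- 2
lambda  <- 1   # assumed: xgboost's default L2 penalty on leaf weights
nrounds <- 3

# The tree fit to Y from a zero prediction, leaf values already multiplied by eta.
# Assumed leaves: rows 1 to 3 alone, rows 4 and 5 together.
leaves <- list(1, 2, 3, c(4, 5))
tree <- numeric(5)
for (idx in leaves) tree[idx] <- eta * sum(Y[idx]) / (length(idx) + lambda)

weights  <- numeric(0)   # one scaling weight per tree in the ensemble
rmse_log <- numeric(0)
for (i in seq_len(nrounds)) {
  k <- length(weights)   # number of dropped trees (all existing ones)
  if (k == 0) {
    weights <- 1         # nothing to drop in the first round
  } else {
    weights <- c(weights * k / (k + eta),   # rescale the dropped trees
                 1 / (k + eta))             # weight of the new tree
  }
  pred <- sum(weights) * tree               # all trees are identical here
  rmse_log <- c(rmse_log, sqrt(mean((Y - pred)^2)))
}

print(round(weights, 4))    # 0.1667 0.1667 0.2500
print(round(rmse_log, 2))   # 22.36 16.95 19.39
print(round(pred, 4))       # 0.5833 2.9167 11.6667 58.3333 58.3333
```

After three rounds the weights sum to 7/12, so the ensemble predicts 7/12 of the single tree's output, which agrees with the printed predictions and explains why the train-rmse rises again in round 3.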

`one_drop`

- If `one_drop = 1`, at least one tree is always dropped. If I set `rate_drop = 0` but `one_drop = 1`, dropout still takes place, and the trees are built in a more conservative manner. Since the first tree is dropped in the second iteration, the second tree is the same as the first one:

```
param_dart2 <- list(objective = 'reg:linear', nrounds = 5,
eta = 2,
max_depth = 10,
booster = 'dart',
skip_drop = 0,
rate_drop = 0, # no tree is dropped by rate; one_drop below still forces one drop
one_drop = 1
)
simple.xgb.output(param_dart2)
```

```
## [1] train-rmse:22.360680
## [2] train-rmse:16.948286
## [3] train-rmse:15.066435
## [4] train-rmse:15.237643
## [5] train-rmse:14.225216
## Evaluation log showing testing error:
## iter train_rmse
## 1: 1 22.36
## 2: 2 16.95
## 3: 3 15.07
## 4: 4 15.24
## 5: 5 14.23
## Predicted values of Y:
## [1] 0.7037 5.8848 14.7737 53.2510 68.8066
## SHAP value for X:
## x1 x2 BIAS
## [1,] -24.09 -3.893 28.68
## [2,] -19.28 -3.522 28.68
## [3,] -18.88 4.974 28.68
## [4,] 22.36 2.211 28.68
## [5,] 41.70 -1.578 28.68
```

- A similar conservative effect appears if I set `rate_drop` to a non-zero value:

```
param_dart3 <- list(objective = 'reg:linear', nrounds = 5,
eta = 2,
max_depth = 10,
booster = 'dart',
skip_drop = 0,
rate_drop = 0.5, # drop about half of the existing trees in each iteration
one_drop = 0
)
simple.xgb.output(param_dart3)
```

```
## [1] train-rmse:22.360680
## [2] train-rmse:0.000000
## [3] train-rmse:7.453556
## [4] train-rmse:9.316948
## [5] train-rmse:9.823564
## Evaluation log showing testing error:
## iter train_rmse
## 1: 1 22.361
## 2: 2 0.000
## 3: 3 7.454
## 4: 4 9.317
## 5: 5 9.824
## Predicted values of Y:
## [1] 0.8333 4.1667 16.6667 63.8889 83.3333
## SHAP value for X:
## x1 x2 BIAS
## [1,] -28.89 -4.056 33.78
## [2,] -26.11 -3.500 33.78
## [3,] -25.22 8.111 33.78
## [4,] 31.53 -1.417 33.78
## [5,] 50.97 -1.417 33.78
```

Setting `one_drop = 1` also makes the result more conservative, and gives a smaller train-rmse for the same number of rounds.