XGBoost
eXtreme Gradient Boosting
Tong He
Overview
· Introduction
· Basic Walkthrough
· Real World Application
· Model Specification
· Parameter Introduction
· Advanced Features
· Kaggle Winning Solution
Introduction
Introduction
Nowadays we have plenty of machine learning models. The most well-known ones are

· Linear/Logistic Regression
· k-Nearest Neighbours
· Support Vector Machines
· Tree-based Models
  - Decision Tree
  - Random Forest
  - Gradient Boosting Machine
· Neural Networks
Introduction
XGBoost is short for eXtreme Gradient Boosting. It is

· An open-sourced tool
  - Computation in C++
  - R/python/Julia interfaces provided
· A variant of the gradient boosting machine
  - Tree-based model
· The winning model for several Kaggle competitions
Introduction
XGBoost is currently hosted on GitHub.

· The primary author of the model and the C++ implementation is Tianqi Chen.
· The author of the R package is Tong He.
Introduction
XGBoost is widely used in Kaggle competitions. The reasons to choose XGBoost include

· Easy to use
  - Easy to install.
  - Highly developed R/python interface for users.
· Efficiency
  - Automatic parallel computation on a single machine.
  - Can be run on a cluster.
· Accuracy
  - Good results for most data sets.
· Feasibility
  - Customized objective and evaluation
  - Tunable parameters
Basic Walkthrough
We introduce the R package for XGBoost. To install, please run

devtools::install_github('dmlc/xgboost',subdir='R-package')

This command downloads the package from GitHub and compiles it automatically on your
machine. On Windows, Rtools is therefore required.
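If building from source is not convenient, the package can also be installed from CRAN (assuming the released version is recent enough for your needs):

install.packages('xgboost')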
Basic Walkthrough
XGBoost provides a data set to demonstrate its usage.

This data set includes information about some kinds of mushrooms. The features are binary,
indicating whether the mushroom has a given characteristic. The target variable is whether the
mushroom is poisonous.
require(xgboost)
## Loading required package: xgboost
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train = agaricus.train
test = agaricus.test
Basic Walkthrough
Let's investigate the data first.

We can see that the data is a dgCMatrix class object. This is a sparse matrix class from the
package Matrix. A sparse matrix is more memory-efficient for certain kinds of data.
str(train$data)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
## ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
## ..@ Dim : int [1:2] 6513 126
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=fla
## ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
Basic Walkthrough
To use XGBoost to classify poisonous mushrooms, the minimum information we need to
provide is:

1. Input features
   · XGBoost allows both dense and sparse matrices as the input.
2. Target variable
   · A numeric vector. Use integers starting from 0 for classification, or real values for regression.
3. Objective
   · For regression use 'reg:linear'
   · For binary classification use 'binary:logistic'
4. Number of iterations
   · The number of trees added to the model.
Basic Walkthrough
To run XGBoost, we can use the following command:

bst = xgboost(data = train$data, label = train$label,
nround = 2, objective = "binary:logistic")

## [0] train-error:0.000614
## [1] train-error:0.001228

The output is the classification error on the training data set.
Basic Walkthrough
Sometimes we might want to measure the classification performance by the Area Under the Curve (AUC):
bst = xgboost(data = train$data, label = train$label, nround = 2,
objective = "binary:logistic", eval_metric = "auc")
## [0] train-auc:0.999238
## [1] train-auc:0.999238
Basic Walkthrough
To predict, you can simply write
pred = predict(bst, test$data)
head(pred)
## [1] 0.2582498 0.7433221 0.2582498 0.2582498 0.2576509 0.2750908
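As a quick sketch of what to do with these probabilities, we can threshold them at 0.5 and compute the error on the test set (assuming the pred and test objects from above):

pred.label = as.numeric(pred > 0.5)
mean(pred.label != test$label)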
Basic Walkthrough
Cross validation is an important method to measure the model's predictive power, as well as
the degree of overfitting. XGBoost provides a convenient function to do cross validation in one
line of code.

Notice that the only difference in arguments between xgb.cv and xgboost is the additional nfold
parameter. To perform cross validation on a certain set of parameters, we just need to copy
them to the xgb.cv function and add the number of folds.
cv.res = xgb.cv(data = train$data, nfold = 5, label = train$label, nround = 2,
objective = "binary:logistic", eval_metric = "auc")
## [0] train-auc:0.998668+0.000354 test-auc:0.998497+0.001380
## [1] train-auc:0.999187+0.000785 test-auc:0.998700+0.001536
Basic Walkthrough
xgb.cv returns a data.table object containing the cross validation results. This is helpful for
choosing the correct number of iterations.
cv.res
## train.auc.mean train.auc.std test.auc.mean test.auc.std
## 1: 0.998668 0.000354 0.998497 0.001380
## 2: 0.999187 0.000785 0.998700 0.001536
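As a small sketch of that use, we can pick the round with the best mean test AUC directly from the returned table (trivial with only 2 rounds here, but the same line works for a larger nround):

best.nround = which.max(cv.res$test.auc.mean)
best.nround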
Real World Experiment
Higgs Boson Competition
The debut of XGBoost was in the Higgs Boson competition.

Tianqi introduced the tool along with benchmark code which achieved the top 10% at the
beginning of the competition.

By the end of the competition, it was already the most widely used tool in that competition.
Higgs Boson Competition
XGBoost offers the script on GitHub.
To run the script, prepare a data directory and download the competition data into this
directory.
Higgs Boson Competition
Firstly we prepare the environment
require(xgboost)
require(methods)
testsize = 550000
Higgs Boson Competition
Then we can read in the data
# Read the training data: column 1 is EventId, columns 2-31 are the 30 features,
# column 32 is the event weight, and column 33 is the label ("s" = signal, "b" = background)
dtrain = read.csv("data/training.csv", header=TRUE)
dtrain[33] = dtrain[33] == "s"
label = as.numeric(dtrain[[33]])
data = as.matrix(dtrain[2:31])
# Rescale the weights to the size of the test set, then sum them for each class
weight = as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos <- sum(weight * (label==1.0))
sumwneg <- sum(weight * (label==0.0))
Higgs Boson Competition
The data contains missing values, and they are marked as -999. We can construct an
xgb.DMatrix object containing the weight and missing-value information.
xgmat = xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
Higgs Boson Competition
The next step is to set the basic parameters
param = list("objective" = "binary:logitraw",
"scale_pos_weight" = sumwneg / sumwpos,
"bst:eta" = 0.1,
"bst:max_depth" = 6,
"eval_metric" = "auc",
"eval_metric" = "ams@0.15",
"silent" = 1,
"nthread" = 16)
Higgs Boson Competition
We then start the training step
bst = xgboost(params = param, data = xgmat, nround = 120)
Higgs Boson Competition
Then we read in the test data
dtest = read.csv("data/test.csv", header=TRUE)
data = as.matrix(dtest[2:31])
xgmat = xgb.DMatrix(data, missing = -999.0)
Higgs Boson Competition
We can now make predictions on the test data set.
ypred = predict(bst, xgmat)
Higgs Boson Competition
Finally we output the prediction according to the required format.
Please submit the result to see your performance :)
rorder = rank(ypred, ties.method="first")
threshold = 0.15
ntop = length(rorder) - as.integer(threshold*length(rorder))
plabel = ifelse(rorder > ntop, "s", "b")
# EventId column of the test data, needed for the submission file
idx = dtest[[1]]
outdata = list("EventId" = idx,
"RankOrder" = rorder,
"Class" = plabel)
write.csv(outdata, file = "submission.csv", quote=FALSE, row.names=FALSE)
Higgs Boson Competition
Besides the good performance, efficiency is also a highlight of XGBoost.

The following plot shows the running time result on the Higgs Boson data set.
Higgs Boson Competition
After some feature engineering and parameter tuning, one can achieve around 25th place with a
single model on the leaderboard. This is an article written by a former physicist introducing his
solution with a single XGBoost model:

https://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/

In our post-competition attempts, we achieved 11th on the leaderboard with a single XGBoost
model.
Model Specification
Training Objective
To understand the other parameters, one needs a basic understanding of the model behind them.

Suppose we have $K$ trees; the model is

$$\sum_{k=1}^{K} f_k$$

where each $f_k$ is the prediction from a decision tree. The model is a collection of decision trees.
Training Objective
Having all the decision trees, we make predictions by

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

where $x_i$ is the feature vector for the $i$-th data point.

Similarly, the prediction at the $t$-th step can be defined as

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)$$
Training Objective
To train the model, we need to optimize a loss function.

Typically, we use

· Root Mean Squared Error for regression
  - $L = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
· LogLoss for binary classification
  - $L = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$
· mlogloss for multi-class classification
  - $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\log(p_{i,j})$
Training Objective
Regularization is another important part of the model. A good regularization term controls the
complexity of the model, which prevents overfitting.

Define

$$\Omega = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaves, and $w_j^2$ is the score on the $j$-th leaf.
Training Objective
Putting the loss function and regularization together, we have the objective of the model:

$$Obj = L + \Omega$$

where the loss function controls the predictive power, and the regularization controls the simplicity.
Training Objective
In XGBoost, we use gradient descent to optimize the objective.

Given an objective $Obj(y, \hat{y})$ to optimize, gradient descent is an iterative technique which
calculates

$$\partial_{\hat{y}} Obj(y, \hat{y})$$

at each iteration. Then we improve $\hat{y}$ along the direction of the gradient to minimize the
objective.
Training Objective
Recall the definition of the objective $Obj = L + \Omega$. For an iterative algorithm we can re-define the
objective function as

$$Obj^{(t)} = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t}\Omega(f_i) = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t}\Omega(f_i)$$

To optimize it by gradient descent, we need to calculate the gradient. The performance can also
be improved by considering both the first and the second order gradients:

$$\partial_{\hat{y}_i^{(t)}} Obj^{(t)}, \qquad \partial^2_{\hat{y}_i^{(t)}} Obj^{(t)}$$
Training Objective
Since we don't have derivatives for every objective function, we calculate the second order Taylor
approximation of it:

$$Obj^{(t)} \simeq \sum_{i=1}^{N}\left[L(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \sum_{i=1}^{t}\Omega(f_i)$$

where

· $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$
· $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$
Training Objective
Removing the constant terms, we get

$$Obj^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

This is the objective at the $t$-th step. Our goal is to find an $f_t$ that optimizes it.
Tree Building Algorithm
The tree structure in XGBoost leads to the core problem:

how can we find a tree that improves the prediction along the gradient?
Tree Building Algorithm
Every decision tree looks like this
Each data point flows to one of the leaves following the direction on each node.
Tree Building Algorithm
The core concepts are:

· Internal Nodes
  - Each internal node splits the flow of data points by one of the features.
  - The condition on the edge specifies what data can flow through.
· Leaves
  - Data points that reach a leaf are assigned a weight.
  - The weight is the prediction.
Tree Building Algorithm
Two key questions for building a decision tree are
1. How to find a good structure?
2. How to assign prediction score?
We want to solve these two problems with the idea of gradient descent.
Tree Building Algorithm
Let us assume that we already have the solution to question 1.

We can mathematically define a tree as

$$f_t(x) = w_{q(x)}$$

where $q(x)$ is a "directing" function which assigns every data point to the $q(x)$-th leaf.

This definition describes the prediction process on a tree as

· Assign the data point $x$ to a leaf by $q$
· Assign the corresponding score $w_{q(x)}$ on the $q(x)$-th leaf to the data point.
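A toy sketch of this definition, with a hypothetical one-split tree (a stump) on the first feature; q() maps a data point to a leaf index and w holds one score per leaf:

# q() directs a data point to leaf 1 or leaf 2 (this toy q shadows base::q only here)
q = function(x) if (x[1] < 0.5) 1 else 2
# one score per leaf
w = c(-0.4, 0.7)
# f_t(x) = w[q(x)]
f_t = function(x) w[q(x)]
f_t(c(0.3, 1.0))   # returns -0.4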
Tree Building Algorithm
Define the index set

$$I_j = \{\, i \mid q(x_i) = j \,\}$$

This set contains the indices of the data points that are assigned to the $j$-th leaf.
Tree Building Algorithm
Then we rewrite the objective as

$$Obj^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2
= \sum_{j=1}^{T}\left[\Big(\sum_{i\in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i\in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T$$

Since all the data points on the same leaf share the same prediction, this form sums the
prediction by leaves.
Tree Building Algorithm
It is a quadratic problem in $w_j$, so it is easy to find the best $w_j$ that optimizes $Obj$:

$$w_j^* = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}$$

The corresponding value of $Obj$ is

$$Obj^{(t)} = -\frac{1}{2}\sum_{j=1}^{T}\frac{\big(\sum_{i\in I_j} g_i\big)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T$$
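A minimal sketch of these two formulas in R, assuming g and h are the first and second order gradients of the data points that fall in one leaf (the gamma*T term is added once per tree, so it is omitted here):

leaf.weight = function(g, h, lambda = 1) -sum(g) / (sum(h) + lambda)
leaf.obj    = function(g, h, lambda = 1) -0.5 * sum(g)^2 / (sum(h) + lambda)
# toy values
leaf.weight(g = c(0.3, -0.2, 0.5), h = c(0.2, 0.25, 0.2))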
Tree Building Algorithm
The leaf score

$$w_j = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}$$

relates to

· the first and second order gradients of the loss function, $g$ and $h$
· the regularization parameter $\lambda$
Tree Building Algorithm
Now we come back to the first question: How to find a good structure?
We can further split it into two sub-questions:
1. How to choose the feature to split?
2. When to stop the split?
Tree Building Algorithm
In each split, we want to greedily find the best splitting point that optimizes the objective.

For each feature:

1. Sort the values.
2. Scan for the best splitting point.

Then choose the best feature among all of them.
Tree Building Algorithm
Now we give a definition of "the best split" in terms of the objective.

Every time we do a split, we are changing a leaf into an internal node.
Tree Building Algorithm
Let

· $I$ be the set of indices of data points assigned to this node
· $I_L$ and $I_R$ be the sets of indices of data points assigned to the two new leaves.

Recall that the best value of the objective on the $j$-th leaf is

$$Obj^{(t)} = -\frac{1}{2}\frac{\big(\sum_{i\in I_j} g_i\big)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma$$
Tree Building Algorithm
The gain of the split is

$$gain = \frac{1}{2}\left[\frac{\big(\sum_{i\in I_L} g_i\big)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\big(\sum_{i\in I_R} g_i\big)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\big(\sum_{i\in I} g_i\big)^2}{\sum_{i\in I} h_i + \lambda}\right] - \gamma$$
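A sketch (not XGBoost's actual implementation) of the sort-and-scan split search for a single numeric feature, using this gain formula; x is the feature column and g, h are the per-point first and second order gradients:

best.split = function(x, g, h, lambda = 1, gamma = 0) {
  ord = order(x)                             # sort by feature value
  x = x[ord]; g = g[ord]; h = h[ord]
  GL = cumsum(g); HL = cumsum(h)             # sums over the left child I_L
  GR = sum(g) - GL; HR = sum(h) - HL         # sums over the right child I_R
  score = function(G, H) G^2 / (H + lambda)
  gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(sum(g), sum(h))) - gamma
  k = which.max(head(gain, -1))              # the last position leaves the right child empty
  list(threshold = x[k], gain = gain[k])
}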
Tree Building Algorithm
To build a tree, we find the best splitting point recursively until we reach the maximum
depth.

Then we prune out the nodes with a negative gain in a bottom-up order.
Tree Building Algorithm
XGBoost can handle missing values in the data.

For each node, we guide all the data points with a missing value

· to the left subnode, and calculate the maximum gain
· to the right subnode, and calculate the maximum gain
· then choose the direction with the larger gain

Finally every node has a "default direction" for missing values, as sketched below.
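A toy sketch of this rule for one fixed split: GL/HL and GR/HR are the gradient and hessian sums of the non-missing rows on each side, and Gna/Hna the sums over rows whose feature value is missing (all values hypothetical):

score = function(G, H, lambda = 1) G^2 / (H + lambda)
GL = -3.2; HL = 4.1; GR = 2.7; HR = 3.5; Gna = -0.8; Hna = 1.2
gain.left  = score(GL + Gna, HL + Hna) + score(GR, HR)          # missing values go left
gain.right = score(GL, HL) + score(GR + Gna, HR + Hna)          # missing values go right
if (gain.left >= gain.right) "left" else "right"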
Tree Building Algorithm
To sum up, the outline of the algorithm is

· Iterate for nround times
  - Grow the tree to the maximum depth
    - Find the best splitting point
    - Assign weights to the two new leaves
  - Prune the tree to delete nodes with negative gain
Parameters
Parameter Introduction
XGBoost has plenty of parameters. We can group them into

1. General parameters
   · Number of threads
2. Booster parameters
   · Step size
   · Regularization
3. Task parameters
   · Objective
   · Evaluation metric
Parameter Introduction
After the introduction of the model, we can understand the parameters provided in XGBoost.

To check the parameter list, one can look into

· The documentation of xgb.train.
· The documentation in the repository.
Parameter Introduction
General parameters:

· nthread
  - Number of parallel threads.
· booster
  - gbtree: tree-based model.
  - gblinear: linear function.
Parameter Introduction
Parameters for the Tree Booster

· eta
  - Step size shrinkage used in each update to prevent overfitting.
  - Range [0, 1], default 0.3
· gamma
  - Minimum loss reduction required to make a split.
  - Range [0, ∞], default 0
Parameter Introduction
Parameters for the Tree Booster

· max_depth
  - Maximum depth of a tree.
  - Range [1, ∞], default 6
· min_child_weight
  - Minimum sum of instance weight needed in a child.
  - Range [0, ∞], default 1
· max_delta_step
  - Maximum delta step we allow each tree's weight estimation to be.
  - Range [0, ∞], default 0
Parameter Introduction
Parameters for the Tree Booster

· subsample
  - Subsample ratio of the training instances.
  - Range (0, 1], default 1
· colsample_bytree
  - Subsample ratio of columns when constructing each tree.
  - Range (0, 1], default 1
Parameter Introduction
Parameters for the Linear Booster

· lambda
  - L2 regularization term on weights
  - default 0
· alpha
  - L1 regularization term on weights
  - default 0
· lambda_bias
  - L2 regularization term on bias
  - default 0
Parameter Introduction
Objectives

· "reg:linear": linear regression, the default option.
· "binary:logistic": logistic regression for binary classification, outputs a probability
· "multi:softmax": multiclass classification using the softmax objective, requires num_class to be set
· User specified objective
Parameter Introduction
Evaluation Metrics

· "rmse"
· "logloss"
· "error"
· "auc"
· "merror"
· "mlogloss"
· User specified evaluation metric
Guide on Parameter Tuning
It is nearly impossible to give a set of universally optimal parameters, or a global algorithm for
finding them.

The key points of parameter tuning are

· Control overfitting
· Deal with imbalanced data
· Trust the cross validation
Guide on Parameter Tuning
The "Bias-Variance Tradeoff", or the "Accuracy-Simplicity Tradeoff", is the main idea behind
controlling overfitting.

For the booster-specific parameters, we can group them as

· Controlling the model complexity
  - max_depth, min_child_weight and gamma
· Robustness to noise
  - subsample, colsample_bytree
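For illustration only, a sketch of a more conservative parameter list along these two axes (the values here are assumptions; the actual ones should always come from cross validation on your data):

param = list(max_depth = 4, min_child_weight = 2, gamma = 1,    # limit model complexity
             subsample = 0.8, colsample_bytree = 0.8)           # add robustness to noise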
Guide on Parameter Tuning
Sometimes the data is imbalanced among classes.

· If you only care about the ranking order
  - Balance the positive and negative weights with scale_pos_weight
  - Use "auc" as the evaluation metric
· If you care about predicting the right probability
  - You cannot re-balance the dataset
  - Setting the parameter max_delta_step to a finite number (say 1) will help convergence
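A small sketch of the ranking-order case, assuming a 0/1 label vector y: set scale_pos_weight to the negative/positive ratio and evaluate with AUC.

spw = sum(y == 0) / sum(y == 1)
param = list(objective = "binary:logistic", eval_metric = "auc",
             scale_pos_weight = spw)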
Guide on Parameter Tuning
To select ideal parameters, use the result from xgb.cv.

· Trust the score on the test folds
· Use early.stop.round to detect when the test score keeps getting worse
· If overfitting is observed, reduce the stepsize eta and increase nround at the same time
Advanced Features
Advanced Features
There are plenty of highlights in XGBoost:

· Customized objective and evaluation metric
· Prediction from cross validation
· Continue training on an existing model
· Calculate and plot the variable importance
Customization
According to the algorithm, we can define our own loss function, as long as we can calculate the
first and second order gradients of the loss function.

Define $grad = \partial_{\hat{y}^{t-1}} l$ and $hess = \partial^2_{\hat{y}^{t-1}} l$. We can optimize the loss function if we can calculate
these two values.
Customization
We can rewrite the logloss for the $i$-th data point as

$$L = y_i\log(p_i) + (1 - y_i)\log(1 - p_i)$$

The $p_i$ is calculated by applying the logistic transformation to our prediction $\hat{y}_i$.

Then the logloss is

$$L = y_i\log\frac{1}{1 + e^{-\hat{y}_i}} + (1 - y_i)\log\frac{e^{-\hat{y}_i}}{1 + e^{-\hat{y}_i}}$$
Customization
We can see that

· $grad = \frac{1}{1 + e^{-\hat{y}_i}} - y_i = p_i - y_i$
· $hess = \frac{e^{-\hat{y}_i}}{(1 + e^{-\hat{y}_i})^2} = p_i(1 - p_i)$

Next we translate them into the code.
Customization
The complete code:
logregobj = function(preds, dtrain) {
# Extract the true label from the second argument
labels = getinfo(dtrain, "label")
# apply logistic transformation to the output
preds = 1/(1 + exp(-preds))
# Calculate the 1st gradient
grad = preds - labels
# Calculate the 2nd gradient
hess = preds * (1 - preds)
# Return the result
return(list(grad = grad, hess = hess))
}
Customization
We can also customize the evaluation metric.
evalerror = function(preds, dtrain) {
# Extract the true label from the second argument
labels = getinfo(dtrain, "label")
# Calculate the error
err = as.numeric(sum(labels != (preds > 0)))/length(labels)
# Return the name of this metric and the value
return(list(metric = "error", value = err))
}
Customization
To utilize the customized objective and evaluation, we simply pass them to the arguments:
param = list(max.depth=2,eta=1,nthread = 2, silent=1,
objective=logregobj, eval_metric=evalerror)
bst = xgboost(params = param, data = train$data, label = train$label, nround = 2)
## [0] train-error:0.0465223399355136
## [1] train-error:0.0222631659757408
Prediction in Cross Validation
"Stacking" is an ensemble learning technique which takes the prediction from several models. It
is widely used in many scenarios.
One of the main concerns is to avoid overfitting. The common way is to use the prediction values from
cross validation.
XGBoost provides a convenient argument to calculate the prediction during the cross validation.
Prediction in Cross Validation
res = xgb.cv(params = param, data = train$data, label = train$label, nround = 2,
nfold=5, prediction = TRUE)
## [0] train-error:0.046522+0.001347 test-error:0.046522+0.005387
## [1] train-error:0.022263+0.000637 test-error:0.022263+0.002545
str(res)
## List of 2
## $ dt :Classes 'data.table' and 'data.frame': 2 obs. of 4 variables:
## ..$ train.error.mean: num [1:2] 0.0465 0.0223
## ..$ train.error.std : num [1:2] 0.001347 0.000637
## ..$ test.error.mean : num [1:2] 0.0465 0.0223
## ..$ test.error.std : num [1:2] 0.00539 0.00254
## ..- attr(*, ".internal.selfref")=<externalptr>
## $ pred: num [1:6513] 2.58 -1.07 -1.03 2.59 -3.03 ...
xgb.DMatrix
XGBoost has its own class of input data, xgb.DMatrix. One can convert the usual data set into
it by

dtrain = xgb.DMatrix(data = train$data, label = train$label)

It is the data structure used by the XGBoost algorithm. XGBoost preprocesses the input data and
label into an xgb.DMatrix object before feeding it to the training algorithm.

If one needs to repeat the training process on the same big data set, it is good to reuse the
xgb.DMatrix object to save preprocessing time.
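As a sketch of that reuse, the preprocessed object can also be saved to a binary file and loaded back later, assuming xgb.DMatrix.save is available in your version of the package:

xgb.DMatrix.save(dtrain, 'dtrain.buffer')
dtrain = xgb.DMatrix('dtrain.buffer')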
xgb.DMatrix
An xgb.DMatrix object contains

1. Preprocessed training data
2. Several features
   · Missing values
   · Data weights
Continue Training
Training the model for 5000 rounds is sometimes useful, but we are also taking the risk of
overfitting.

A better strategy is to train the model with fewer rounds and repeat that many times. This
enables us to observe the outcome after each step.
Continue Training
# Train with only one round
bst = xgboost(params = param, data = dtrain, nround = 1)
## [0] train-error:0.00706279748195916
# margin means the baseline of the prediction
ptrain = predict(bst, dtrain, outputmargin = TRUE)
# Set the margin information to the xgb.DMatrix object
setinfo(dtrain, "base_margin", ptrain)
## [1] TRUE
# Train based on the previous result
bst = xgboost(params = param, data = dtrain, nround = 1)
## [0] train-error:0.00122831260555811
Importance and Tree plotting
The result of XGBoost contains many trees. We can count the number of appearances of each
variable in all the trees, and use this number as the importance score.

bst = xgboost(data = train$data, label = train$label, max.depth = 2, verbose = FALSE,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
xgb.importance(train$data@Dimnames[[2]], model = bst)
## Feature Gain Cover Frequence
## 1: 28 0.67615484 0.4978746 0.4
## 2: 55 0.17135352 0.1920543 0.2
## 3: 59 0.12317241 0.1638750 0.2
## 4: 108 0.02931922 0.1461960 0.2
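The importance table can also be plotted. A short sketch (xgb.plot.importance's default bar chart may ask for the Ckmeans.1d.dp package, depending on the xgboost version):

imp = xgb.importance(train$data@Dimnames[[2]], model = bst)
xgb.plot.importance(imp)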
Importance and Tree plotting
We can also plot the trees in the model by xgb.plot.tree.
xgb.plot.tree(agaricus.train$data@Dimnames[[2]], model = bst)
Tree plotting
Early Stopping
When doing cross validation, it is usual to encounter overfitting at an early stage of the iterations.
Sometimes the prediction gets worse consistently from, say, round 300 while the total number of
iterations is 1000. To stop the cross validation process early, one can use the early.stop.round
argument in xgb.cv.
bst = xgb.cv(params = param, data = train$data, label = train$label,
nround = 20, nfold = 5,
maximize = FALSE, early.stop.round = 3)
Early Stopping
## [0] train-error:0.050783+0.011279 test-error:0.055272+0.013879
## [1] train-error:0.021227+0.001863 test-error:0.021496+0.004443
## [2] train-error:0.008483+0.003373 test-error:0.007677+0.002820
## [3] train-error:0.014202+0.002592 test-error:0.014126+0.003504
## [4] train-error:0.004721+0.002682 test-error:0.005527+0.004490
## [5] train-error:0.001382+0.000533 test-error:0.001382+0.000642
## [6] train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [7] train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [8] train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [9] train-error:0.000691+0.000645 test-error:0.000921+0.001001
## [10] train-error:0.000422+0.000582 test-error:0.000768+0.001086
## [11] train-error:0.000192+0.000429 test-error:0.000460+0.001030
## [12] train-error:0.000192+0.000429 test-error:0.000460+0.001030
## [13] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [14] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [15] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [16] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## Stopping. Best iteration: 14
Kaggle Winning Solution
Kaggle Winning Solution
To get a higher rank, one needs to push the limits of
1. Feature Engineering
2. Parameter Tuning
3. Model Ensemble
The winning solution in the recent Otto Competition is an excellent example.
Kaggle Winning Solution
They used a 3-layer ensemble learning model, including

· 33 models on top of the original data
· XGBoost, neural network and adaboost on the 33 predictions from the models and 8 engineered
  features
· Weighted average of the 3 predictions from the second step
Kaggle Winning Solution
The data for this competition is special: the meanings of the features are hidden.

For feature engineering, they generated 8 new features:

· Distances to nearest neighbours of each class
· Sum of distances of 2 nearest neighbours of each class
· Sum of distances of 4 nearest neighbours of each class
· Distances to nearest neighbours of each class in TFIDF space
· Distances to nearest neighbours of each class in T-SNE space (3 dimensions)
· Clustering features of the original dataset
· Number of non-zero elements in each row
· X (that feature was used only in NN 2nd level training)
Kaggle Winning Solution
This means a lot of work. However, this also implies they needed to try a lot of other models,
although some of them turned out to be not helpful in this competition. Their attempts include:

· A lot of training algorithms in the first level, such as
  - Vowpal Wabbit (many configurations)
  - R glm, glmnet, scikit SVC, SVR, Ridge, SGD, etc...
· Some preprocessing like PCA, ICA and FFT
· Feature Selection
· Semi-supervised learning
Influencers in Social Networks
Let's learn to use a single XGBoost model to achieve a high rank in an old competition!
The competition we choose is the Influencers in Social Networks competition.
It was a hackathon in 2013, therefore the data is small enough that we can train the
model in seconds.
Influencers in Social Networks
First let's download the data, and load them into R
train = read.csv('train.csv',header = TRUE)
test = read.csv('test.csv',header = TRUE)
y = train[,1]
train = as.matrix(train[,-1])
test = as.matrix(test)
Influencers in Social Networks
Observe the data:
colnames(train)
## [1] "A_follower_count" "A_following_count" "A_listed_count"
## [4] "A_mentions_received" "A_retweets_received" "A_mentions_sent"
## [7] "A_retweets_sent" "A_posts" "A_network_feature_1"
## [10] "A_network_feature_2" "A_network_feature_3" "B_follower_count"
## [13] "B_following_count" "B_listed_count" "B_mentions_received"
## [16] "B_retweets_received" "B_mentions_sent" "B_retweets_sent"
## [19] "B_posts" "B_network_feature_1" "B_network_feature_2"
## [22] "B_network_feature_3"
Influencers in Social Networks
Observe the data:
train[1,]
## A_follower_count A_following_count A_listed_count
## 2.280000e+02 3.020000e+02 3.000000e+00
## A_mentions_received A_retweets_received A_mentions_sent
## 5.839794e-01 1.005034e-01 1.005034e-01
## A_retweets_sent A_posts A_network_feature_1
## 1.005034e-01 3.621501e-01 2.000000e+00
## A_network_feature_2 A_network_feature_3 B_follower_count
## 1.665000e+02 1.135500e+04 3.446300e+04
## B_following_count B_listed_count B_mentions_received
## 2.980800e+04 1.689000e+03 1.543050e+01
## B_retweets_received B_mentions_sent B_retweets_sent
## 3.984029e+00 8.204331e+00 3.324230e-01
## B_posts B_network_feature_1 B_network_feature_2
## 6.988815e+00 6.600000e+01 7.553030e+01
## B_network_feature_3
Influencers in Social Networks
The data contains information on two users in a social network service. Our mission is to
determine which one is more influential than the other.
This type of data gives us some room for feature engineering.
Influencers in Social Networks
The first trick is to increase the information in the data.
Every data point can be expressed as <y, A, B>. Actually it indicates <1-y, B, A> as well.
We can simply extract this part of the information from the training set.
new.train = cbind(train[,12:22],train[,1:11])
train = rbind(train,new.train)
y = c(y,1-y)
Influencers in Social Networks
The following feature engineering steps are done on both the training and the test set. Therefore we
combine them together.
x = rbind(train,test)
Influencers in Social Networks
The next step could be calculating the ratios between the features of A and B separately:

· followers/following
· mentions received/sent
· retweets received/sent
· followers/posts
· retweets received/posts
· mentions received/posts
Influencers in Social Networks
Considering there might be zeroes, we need to smooth the ratio by a constant.
Next we can calculate the ratio with this helper function.
calcRatio = function(dat,i,j,lambda = 1) (dat[,i]+lambda)/(dat[,j]+lambda)
Influencers in Social Networks
A.follow.ratio = calcRatio(x,1,2)
A.mention.ratio = calcRatio(x,4,6)
A.retweet.ratio = calcRatio(x,5,7)
A.follow.post = calcRatio(x,1,8)
A.mention.post = calcRatio(x,4,8)
A.retweet.post = calcRatio(x,5,8)
B.follow.ratio = calcRatio(x,12,13)
B.mention.ratio = calcRatio(x,15,17)
B.retweet.ratio = calcRatio(x,16,18)
B.follow.post = calcRatio(x,12,19)
B.mention.post = calcRatio(x,15,19)
B.retweet.post = calcRatio(x,16,19)
Influencers in Social Networks
Combine the features into the data set.
x = cbind(x[,1:11],
A.follow.ratio,A.mention.ratio,A.retweet.ratio,
A.follow.post,A.mention.post,A.retweet.post,
x[,12:22],
B.follow.ratio,B.mention.ratio,B.retweet.ratio,
B.follow.post,B.mention.post,B.retweet.post)
Influencers in Social Networks
Then we can compare the difference between A and B. Because XGBoost is scale invariant,
subtraction and division are essentially the same here.
AB.diff = x[,1:17]-x[,18:34]
x = cbind(x,AB.diff)
train = x[1:nrow(train),]
test = x[-(1:nrow(train)),]
Influencers in Social Networks
Now comes the modeling part. We investigate how far we can go with a single model.

Parameter tuning is very important in this step. We can see the performance from
cross validation.
Influencers in Social Networks
Here's the xgb.cv with default parameters.
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
objective='binary:logistic', eval_metric = 'auc')
Influencers in Social Networks
We can see the trend of AUC on training and test sets.
Influencers in Social Networks
It is obvious our model severely overfits. The direct reason is simple: the default value of eta is
0.3, which is too large for this task.

Recall the parameter tuning guide: we need to decrease eta and increase nrounds based on
the result of cross validation.
Influencers in Social Networks
After some trials, we get the following set of parameters:
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1, verbose = F,
subsample = 0.8,colsample_bytree = 0.8)
Influencers in Social Networks
We can see the trend of AUC on training and test sets.
Influencers in Social Networks
Next we extract the best number of iterations.
We calculate the AUC minus the standard deviation, and choose the iteration with the largest
value.
# Columns 3 and 4 of cv.res are test.auc.mean and test.auc.std
bestRound = which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
## [1] 2442
cv.res[bestRound,]
## train.auc.mean train.auc.std test.auc.mean test.auc.std
## 1: 0.934967 0.00125 0.876629 0.002073
Influencers in Social Networks
Then we train the model with the same set of parameters:
set.seed(1024)
bst = xgboost(data = train, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1,
subsample = 0.8,colsample_bytree = 0.8)
preds = predict(bst,test,ntreelimit = bestRound)
Influencers in Social Networks
Finally we submit our solution.

This wins us a top-10 position on the leaderboard!
result = data.frame(Id = 1:nrow(test),
Choice = preds)
write.csv(result,'submission.csv',quote=FALSE,row.names=FALSE)
FAQ
 
Stat Design3 18 09
Stat Design3 18 09Stat Design3 18 09
Stat Design3 18 09stat
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...FarhanAhmade
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learningMax Kleiner
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
Metric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible packageMetric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible packageWilliam de Vazelhes
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Chapter 1 Basic Concepts
Chapter 1 Basic ConceptsChapter 1 Basic Concepts
Chapter 1 Basic ConceptsHareem Aslam
 
10 -bits_and_bytes
10  -bits_and_bytes10  -bits_and_bytes
10 -bits_and_bytesHector Garzo
 
Efficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in REfficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in RGregg Barrett
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
data frames.pptx
data frames.pptxdata frames.pptx
data frames.pptxRacksaviR
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Javascript & SQL within database management system
Javascript & SQL within database management systemJavascript & SQL within database management system
Javascript & SQL within database management systemClusterpoint
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Yao Yao
 
maXbox starter68 machine learning VI
maXbox starter68 machine learning VImaXbox starter68 machine learning VI
maXbox starter68 machine learning VIMax Kleiner
 

Similaire à Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author (20)

maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
 
Stat Design3 18 09
Stat Design3 18 09Stat Design3 18 09
Stat Design3 18 09
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
alexnet.pdf
alexnet.pdfalexnet.pdf
alexnet.pdf
 
Metric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible packageMetric-learn, a Scikit-learn compatible package
Metric-learn, a Scikit-learn compatible package
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Chapter 1 Basic Concepts
Chapter 1 Basic ConceptsChapter 1 Basic Concepts
Chapter 1 Basic Concepts
 
10 -bits_and_bytes
10  -bits_and_bytes10  -bits_and_bytes
10 -bits_and_bytes
 
Efficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in REfficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in R
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
data frames.pptx
data frames.pptxdata frames.pptx
data frames.pptx
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Javascript & SQL within database management system
Javascript & SQL within database management systemJavascript & SQL within database management system
Javascript & SQL within database management system
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
maXbox starter68 machine learning VI
maXbox starter68 machine learning VImaXbox starter68 machine learning VI
maXbox starter68 machine learning VI
 

Plus de Vivian S. Zhang

Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger RenVivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide bookVivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataVivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningVivian S. Zhang
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Vivian S. Zhang
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Vivian S. Zhang
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycVivian S. Zhang
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycVivian S. Zhang
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Vivian S. Zhang
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Vivian S. Zhang
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...Vivian S. Zhang
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...Vivian S. Zhang
 

Plus de Vivian S. Zhang (18)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
 

Dernier

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 

Dernier (20)

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 

Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author

  • 11. Basic Walkthrough. To use XGBoost to classify poisonous mushrooms, the minimum information we need to provide is:
    1. Input features: XGBoost allows a dense or sparse matrix as the input.
    2. Target variable: a numeric vector. Use integers starting from 0 for classification, or real values for regression.
    3. Objective: 'reg:linear' for regression, 'binary:logistic' for binary classification.
    4. Number of iterations: the number of trees added to the model.
  • 12. Basic Walkthrough. To run XGBoost, we can use the following command:
        bst = xgboost(data = train$data, label = train$label, nround = 2,
                      objective = "binary:logistic")
    The output is the classification error on the training data set.
        ## [0] train-error:0.000614
        ## [1] train-error:0.001228
  • 13. Basic Walkthrough. Sometimes we might want to measure the classification by 'Area Under the Curve':
        bst = xgboost(data = train$data, label = train$label, nround = 2,
                      objective = "binary:logistic", eval_metric = "auc")
        ## [0] train-auc:0.999238
        ## [1] train-auc:0.999238
  • 14. Basic Walkthrough. To predict, you can simply write
        pred = predict(bst, test$data)
        head(pred)
        ## [1] 0.2582498 0.7433221 0.2582498 0.2582498 0.2576509 0.2750908
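    The returned values are predicted probabilities, so a threshold is needed to turn them into class labels. A minimal sketch, assuming the bst model and test set from the slides above (the 0.5 cutoff is a common default, not something fixed by XGBoost):
        # Convert predicted probabilities into 0/1 labels and measure the test error
        pred = predict(bst, test$data)
        pred_label = as.numeric(pred > 0.5)
        err = mean(pred_label != test$label)
        print(paste("test error:", err))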
  • 15. Basic Walkthrough. Cross validation is an important method to measure the model's predictive power, as well as the degree of overfitting. XGBoost provides a convenient function to do cross validation in a single line of code.
        cv.res = xgb.cv(data = train$data, nfold = 5, label = train$label, nround = 2,
                        objective = "binary:logistic", eval_metric = "auc")
        ## [0] train-auc:0.998668+0.000354 test-auc:0.998497+0.001380
        ## [1] train-auc:0.999187+0.000785 test-auc:0.998700+0.001536
    Notice that the only difference in the arguments between xgb.cv and xgboost is the additional nfold parameter. To perform cross validation on a certain set of parameters, we just need to copy them to the xgb.cv function and add the number of folds.
  • 16. Basic Walkthrough. xgb.cv returns a data.table object containing the cross validation results. This is helpful for choosing the correct number of iterations.
        cv.res
        ##    train.auc.mean train.auc.std test.auc.mean test.auc.std
        ## 1:       0.998668      0.000354      0.998497     0.001380
        ## 2:       0.999187      0.000785      0.998700     0.001536
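    Since xgb.cv returns the per-iteration test scores, the number of iterations can be chosen from them directly. A minimal sketch assuming the cv.res object above; the column names follow the output shown on this slide and may differ in other versions of the package:
        # Pick the iteration with the best mean test AUC from the cross validation table
        best_nround = which.max(cv.res$test.auc.mean)
        bst = xgboost(data = train$data, label = train$label, nround = best_nround,
                      objective = "binary:logistic", eval_metric = "auc")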
  • 17. Real World Experiment
  • 18. Higgs Boson Competition. The debut of XGBoost was in the Higgs Boson competition. Tianqi introduced the tool along with a benchmark code which achieved the top 10% at the beginning of the competition. By the end of the competition, it was already the most widely used tool in that competition.
  • 19. Higgs Boson Competition. XGBoost offers the script on github. To run the script, prepare a data directory and download the competition data into this directory.
  • 20. Higgs Boson Competition. First we prepare the environment
        require(xgboost)
        require(methods)
        testsize = 550000
  • 21. Higgs Boson Competition. Then we can read in the data
        dtrain = read.csv("data/training.csv", header=TRUE)
        dtrain[33] = dtrain[33] == "s"
        label = as.numeric(dtrain[[33]])
        data = as.matrix(dtrain[2:31])
        weight = as.numeric(dtrain[[32]]) * testsize / length(label)
        sumwpos <- sum(weight * (label==1.0))
        sumwneg <- sum(weight * (label==0.0))
  • 22. Higgs Boson Competition. The data contains missing values and they are marked as -999. We can construct an xgb.DMatrix object containing the information of weight and missing.
        xgmat = xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
  • 23. Higgs Boson Competition. The next step is to set the basic parameters
        param = list("objective" = "binary:logitraw",
                     "scale_pos_weight" = sumwneg / sumwpos,
                     "bst:eta" = 0.1,
                     "bst:max_depth" = 6,
                     "eval_metric" = "auc",
                     "eval_metric" = "ams@0.15",
                     "silent" = 1,
                     "nthread" = 16)
  • 24. Higgs Boson Competition. We then start the training step
        bst = xgboost(params = param, data = xgmat, nround = 120)
  • 25. Higgs Boson Competition. Then we read in the test data
        dtest = read.csv("data/test.csv", header=TRUE)
        data = as.matrix(dtest[2:31])
        xgmat = xgb.DMatrix(data, missing = -999.0)
  • 26. Higgs Boson Competition. We can now make predictions on the test data set.
        ypred = predict(bst, xgmat)
  • 27. Higgs Boson Competition. Finally we output the prediction according to the required format. Please submit the result to see your performance :)
        rorder = rank(ypred, ties.method="first")
        threshold = 0.15
        ntop = length(rorder) - as.integer(threshold*length(rorder))
        plabel = ifelse(rorder > ntop, "s", "b")
        # idx holds the EventId column read from the test file; its definition is not shown on these slides
        outdata = list("EventId" = idx, "RankOrder" = rorder, "Class" = plabel)
        write.csv(outdata, file = "submission.csv", quote=FALSE, row.names=FALSE)
  • 28. Higgs Boson Competition. Besides the good performance, efficiency is also a highlight of XGBoost. The following plot shows the running time result on the Higgs Boson data set.
  • 29. Higgs Boson Competition. After some feature engineering and parameter tuning, one can achieve around 25th place with a single model on the leaderboard. This is an article written by a former physicist introducing his solution with a single XGBoost model: https://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/ In our post-competition attempts, we achieved 11th place on the leaderboard with a single XGBoost model.
  • 30. Model Specification
  • 31. Training Objective. To understand the other parameters, one needs to have a basic understanding of the model behind them. Suppose we have K trees; the model is \sum_{k=1}^{K} f_k, where each f_k is the prediction from a decision tree. The model is a collection of decision trees.
  • 32. Training Objective. Having all the decision trees, we make the prediction by \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), where x_i is the feature vector for the i-th data point. Similarly, the prediction at the t-th step can be defined as \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i).
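    Because the model is additive over trees, the prediction after t trees can be inspected directly. A minimal sketch assuming the bst model trained in the Basic Walkthrough; ntreelimit is the argument name used by the R package's predict method for this purpose around the time of these slides, so treat it as an assumption:
        # Prediction using only the first tree versus the full model
        pred_1tree = predict(bst, test$data, ntreelimit = 1)
        pred_full  = predict(bst, test$data)
        head(cbind(pred_1tree, pred_full))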
  • 33. Training Objective. To train the model, we need to optimize a loss function. Typically, we use
    - Root Mean Squared Error for regression: L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
    - LogLoss for binary classification: L = -\frac{1}{N} \sum_{i=1}^{N} (y_i \log(p_i) + (1 - y_i) \log(1 - p_i))
    - mlogloss for multi-class classification: L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j})
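    For concreteness, the first two losses can be written out in a few lines of R. A toy sketch with made-up vectors (y, p and y_hat are illustrative names, not from the slides):
        # Squared error and logloss, following the formulas on the slide
        y     = c(1, 0, 1, 1)          # true labels
        p     = c(0.9, 0.2, 0.7, 0.4)  # predicted probabilities
        y_hat = c(0.8, 0.1, 0.9, 0.6)  # predicted values for a regression-style target
        mse     = mean((y - y_hat)^2)
        logloss = -mean(y * log(p) + (1 - y) * log(1 - p))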
  • 34. Training Objective. Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents overfitting. Define \Omega = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, where T is the number of leaves, and w_j is the score on the j-th leaf.
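    As a sanity check, the penalty for one small tree can be computed directly from this formula. A toy sketch with made-up leaf scores and illustrative values gamma = 0, lambda = 1 (not taken from the slides):
        # Omega = gamma * T + 0.5 * lambda * sum(w_j^2) for a tree with three leaves
        w      = c(0.4, -0.3, 0.1)   # leaf scores
        gamma  = 0
        lambda = 1
        omega  = gamma * length(w) + 0.5 * lambda * sum(w^2)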
  • 35. Training Objective. Putting the loss function and regularization together, we have the objective of the model: Obj = L + \Omega, where the loss function controls the predictive power, and the regularization controls the simplicity.
  • 36. Training Objective. In XGBoost, we use gradient descent to optimize the objective. Given an objective Obj(y, \hat{y}) to optimize, gradient descent is an iterative technique which calculates \partial_{\hat{y}} Obj(y, \hat{y}) at each iteration. Then we improve \hat{y} along the direction of the gradient to minimize the objective.
  • 37. Training Objective. Recall the definition of the objective Obj = L + \Omega. For an iterative algorithm we can re-define the objective function as
        Obj^{(t)} = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)
                  = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t} \Omega(f_i)
    To optimize it by gradient descent, we need to calculate the gradient \partial_{\hat{y}_i^{(t)}} Obj^{(t)}. The performance can also be improved by considering both the first and the second order gradient \partial^2_{\hat{y}_i^{(t)}} Obj^{(t)}.
  • 38. Training Objective. Since we don't have a derivative for every objective function, we calculate the second order Taylor approximation of it
        Obj^{(t)} \simeq \sum_{i=1}^{N} [L(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \sum_{i=1}^{t} \Omega(f_i)
    where
    - g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})
    - h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})
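    For the squared error loss l(y, \hat{y}) = (y - \hat{y})^2, these two quantities have a simple closed form, which makes a small numeric check easy. A toy sketch with made-up values; the factor of 2 comes straight from the derivative, although implementations often drop constant factors:
        # First and second order gradients of the squared error loss at the previous prediction
        y          = c(1.0, 0.0, 3.0)
        y_hat_prev = c(0.5, 0.2, 2.0)
        g = 2 * (y_hat_prev - y)   # d/d y_hat of (y - y_hat)^2
        h = rep(2, length(y))      # second derivative is constant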
  • 39. Training Objective. Removing the constant terms, we get
        Obj^{(t)} = \sum_{i=1}^{n} [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
    This is the objective at the t-th step. Our goal is to find an f_t to optimize it.
  • 40. Tree Building Algorithm. The tree structure in XGBoost leads to the core problem: how can we find a tree that improves the prediction along the gradient?
  • 41. Tree Building Algorithm. Every decision tree looks like this: each data point flows to one of the leaves following the direction on each node.
  • 42. Tree Building Algorithm. The core concepts are:
    - Internal nodes: each internal node splits the flow of data points by one of the features. The condition on the edge specifies what data can flow through.
    - Leaves: data points that reach a leaf will be assigned a weight. The weight is the prediction.
  • 43. Tree Building Algorithm. Two key questions for building a decision tree are
    1. How to find a good structure?
    2. How to assign the prediction score?
    We want to solve these two problems with the idea of gradient descent.
  • 44. Tree Building Algorithm. Let us assume that we already have the solution to question 1. We can mathematically define a tree as f_t(x) = w_{q(x)}, where q(x) is a "directing" function which assigns every data point to the q(x)-th leaf. This definition describes the prediction process on a tree as:
    - Assign the data point x to a leaf by q.
    - Assign the corresponding score w_{q(x)} on the q(x)-th leaf to the data point.
  • 45. Tree Building Algorithm. Define the index set I_j = \{i \mid q(x_i) = j\}. This set contains the indices of data points that are assigned to the j-th leaf.
  • 46. Tree Building Algorithm. Then we rewrite the objective as
        Obj^{(t)} = \sum_{i=1}^{n} [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
                  = \sum_{j=1}^{T} [(\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2] + \gamma T
    Since all the data points on the same leaf share the same prediction, this form sums the prediction by leaves.
  • 47. Tree Building Algorithm. It is a quadratic problem in w_j, so it is easy to find the best w_j to optimize Obj:
        w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
    The corresponding value of Obj is
        Obj^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T
  • 48. Tree Building Algorithm. The leaf score w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} relates to
    - the first and second order gradients of the loss function, g and h
    - the regularization parameter \lambda
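    These two formulas are just sums over the gradients on each leaf, so they can be checked numerically. A toy sketch for a single leaf, with made-up g and h values and lambda = 1 (all values illustrative):
        # Optimal leaf weight and its contribution to the objective for one leaf
        g      = c(0.3, -0.2, 0.5)   # first order gradients of the points on this leaf
        h      = c(0.9, 1.0, 0.8)    # second order gradients
        lambda = 1
        w_star   = -sum(g) / (sum(h) + lambda)
        obj_leaf = -0.5 * sum(g)^2 / (sum(h) + lambda)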
  • 49. Tree Building Algorithm. Now we come back to the first question: how to find a good structure? We can further split it into two sub-questions:
    1. How to choose the feature to split on?
    2. When to stop the split?
  • 50. Tree Building Algorithm. In each split, we want to greedily find the best splitting point that can optimize the objective. For each feature:
    1. Sort the values.
    2. Scan for the best splitting point.
    3. Choose the best feature.
  • 51. Tree Building Algorithm. Now we give a definition of "the best split" by the objective. Every time we do a split, we are changing a leaf into an internal node.
  • 52. Tree Building Algorithm. Let
    - I be the set of indices of data points assigned to this node, and
    - I_L and I_R be the sets of indices of data points assigned to the two new leaves.
    Recall that the best value of the objective on the j-th leaf is
        Obj^{(t)} = -\frac{1}{2} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma
  • 53. Tree Building Algorithm. The gain of the split is
        gain = \frac{1}{2} [\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}] - \gamma
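    The greedy scan on slide 50 and the gain formula above combine into a few lines: sort one feature, try every cut point, and keep the one with the largest gain. A toy sketch on made-up vectors (x, g, h, lambda and gamma are illustrative), not the package's actual implementation:
        # Scan all split points of one feature and report the best gain
        x      = c(0.2, 1.5, 0.7, 2.3, 1.1)   # one feature
        g      = c(0.3, -0.5, 0.1, -0.8, 0.4) # first order gradients
        h      = c(1.0, 1.0, 1.0, 1.0, 1.0)   # second order gradients
        lambda = 1; gamma = 0
        score = function(gs, hs) sum(gs)^2 / (sum(hs) + lambda)
        ord = order(x)
        gains = sapply(1:(length(x) - 1), function(k) {
          left  = ord[1:k]; right = ord[(k + 1):length(x)]
          0.5 * (score(g[left], h[left]) + score(g[right], h[right]) - score(g, h)) - gamma
        })
        best = which.max(gains)   # split between the best-th and (best+1)-th sorted values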
  • 54. Tree Building Algorithm. To build a tree, we find the best splitting point recursively until we reach the maximum depth. Then we prune out the nodes with a negative gain in a bottom-up order.
  • 55. Tree Building Algorithm. XGBoost can handle missing values in the data. For each node, we guide all the data points with a missing value
    - to the left subnode, and calculate the maximum gain
    - to the right subnode, and calculate the maximum gain
    - and choose the direction with the larger gain.
    Finally every node has a "default direction" for missing values.
  • 56. Tree Building Algorithm. To sum up, the outline of the algorithm is: iterate for nround times
    - Grow the tree to the maximum depth
      - Find the best splitting point
      - Assign weights to the two new leaves
    - Prune the tree to delete nodes with negative gain
  • 57. Parameters
  • 58. Parameter Introduction. XGBoost has plenty of parameters. We can group them into
    1. General parameters, e.g. the number of threads
    2. Booster parameters, e.g. step size and regularization
    3. Task parameters, e.g. the objective and the evaluation metric
  • 59. Parameter Introduction. After the introduction of the model, we can understand the parameters provided in XGBoost. To check the parameter list, one can look into
    - the documentation of xgb.train
    - the documentation in the repository
  • 60. Parameter Introduction. General parameters:
    - nthread: number of parallel threads.
    - booster: gbtree (tree-based model) or gblinear (linear function).
  • 61. Parameter Introduction. Parameters for the tree booster:
    - eta: step size shrinkage used in the update to prevent overfitting. Range [0, 1], default 0.3.
    - gamma: minimum loss reduction required to make a split. Range [0, ∞], default 0.
  • 62. Parameter Introduction. Parameters for the tree booster:
    - max_depth: maximum depth of a tree. Range [1, ∞], default 6.
    - min_child_weight: minimum sum of instance weight needed in a child. Range [0, ∞], default 1.
    - max_delta_step: maximum delta step we allow each tree's weight estimation to be. Range [0, ∞], default 0.
  • 63. Parameter Introduction. Parameters for the tree booster:
    - subsample: subsample ratio of the training instances. Range (0, 1], default 1.
    - colsample_bytree: subsample ratio of columns when constructing each tree. Range (0, 1], default 1.
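    These tree booster parameters are all passed through the params list (or as named arguments) when training. A minimal sketch assuming the agaricus data from the Basic Walkthrough; the specific values here are illustrative, not tuned:
        # Example of setting several tree booster parameters at once
        param = list(objective = "binary:logistic", eta = 0.3, gamma = 0,
                     max_depth = 6, min_child_weight = 1,
                     subsample = 0.8, colsample_bytree = 0.8)
        bst = xgboost(params = param, data = train$data, label = train$label, nround = 10)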
  • 64. Parameter Introduction. Parameters for the linear booster:
    - lambda: L2 regularization term on weights, default 0.
    - alpha: L1 regularization term on weights, default 0.
    - lambda_bias: L2 regularization term on bias, default 0.
  • 65. Parameter Introduction. Objectives:
    - "reg:linear": linear regression, the default option.
    - "binary:logistic": logistic regression for binary classification, outputs probability.
    - "multi:softmax": multiclass classification using the softmax objective, needs num_class to be specified.
    - User specified objective.
  • 66. Parameter Introduction. Evaluation metrics:
    - "rmse"
    - "logloss"
    - "error"
    - "auc"
    - "merror"
    - "mlogloss"
    - User specified evaluation metric
  • 67. Guide on Parameter Tuning. It is nearly impossible to give a set of universally optimal parameters, or a global algorithm achieving it. The key points of parameter tuning are
    - Control overfitting
    - Deal with imbalanced data
    - Trust the cross validation
  • 68. Guide on Parameter Tuning. The "Bias-Variance Tradeoff", or the "Accuracy-Simplicity Tradeoff", is the main idea for controlling overfitting. We can group the booster specific parameters as
    - Controlling the model complexity: max_depth, min_child_weight and gamma
    - Robustness to noise: subsample, colsample_bytree
  • 69. Guide on Parameter Tuning. Sometimes the data is imbalanced among classes.
    - If you only care about the ranking order: balance the positive and negative weights by scale_pos_weight, and use "auc" as the evaluation metric.
    - If you care about predicting the right probability: you cannot re-balance the dataset; setting the parameter max_delta_step to a finite number (say 1) will help convergence.
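    For the "ranking order" case, scale_pos_weight is usually set to the ratio of negative to positive examples, as in the Higgs script earlier in the deck. A minimal sketch assuming a 0/1 label vector named label (the name is illustrative):
        # Weight positives up by the negative/positive ratio and evaluate with AUC
        spw = sum(label == 0) / sum(label == 1)
        bst = xgboost(data = train$data, label = label, nround = 10,
                      objective = "binary:logistic", eval_metric = "auc",
                      scale_pos_weight = spw)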
  • 70. Guide on Parameter Tuning. To select ideal parameters, use the result from xgb.cv.
    - Trust the score on the test folds.
    - Use early.stop.round to detect when the score on the test set keeps getting worse.
    - If overfitting is observed, reduce the step size eta and increase nround at the same time.
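    A minimal sketch of the last two points together, assuming the agaricus data; early.stop.round and maximize follow the argument names used by the R package around the time of these slides, so treat the exact names as an assumption:
        # Small eta, many rounds, and early stopping on the cross validated AUC
        cv.res = xgb.cv(data = train$data, label = train$label, nfold = 5,
                        nround = 500, objective = "binary:logistic",
                        eval_metric = "auc", eta = 0.1,
                        early.stop.round = 10, maximize = TRUE)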
  • 71. Advanced Features
  • 72. Advanced Features. There are plenty of highlights in XGBoost:
    - Customized objective and evaluation metric
    - Prediction from cross validation
    - Continue training on an existing model
    - Calculate and plot the variable importance
  • 73. Customization. According to the algorithm, we can define our own loss function, as long as we can calculate the first and second order gradients of the loss function. Define grad = \partial_{\hat{y}^{(t-1)}} l and hess = \partial^2_{\hat{y}^{(t-1)}} l. We can optimize the loss function if we can calculate these two values.
  • 74. Customization. We can rewrite the logloss for the i-th data point as L = y_i \log(p_i) + (1 - y_i) \log(1 - p_i). The p_i is calculated by applying the logistic transformation to our prediction \hat{y}_i. Then the logloss is
        L = y_i \log \frac{1}{1 + e^{-\hat{y}_i}} + (1 - y_i) \log \frac{e^{-\hat{y}_i}}{1 + e^{-\hat{y}_i}}
  • 75. Customization. We can see that
    - grad = \frac{1}{1 + e^{-\hat{y}_i}} - y_i = p_i - y_i
    - hess = \frac{e^{-\hat{y}_i}}{(1 + e^{-\hat{y}_i})^2} = p_i (1 - p_i)
    Next we translate them into the code.
  • 76-81. Customization. The complete code, built up with comments over slides 76 to 81:
        logregobj = function(preds, dtrain) {
          # Extract the true label from the second argument
          labels = getinfo(dtrain, "label")
          # apply logistic transformation to the output
          preds = 1/(1 + exp(-preds))
          # Calculate the 1st gradient
          grad = preds - labels
          # Calculate the 2nd gradient
          hess = preds * (1 - preds)
          # Return the result
          return(list(grad = grad, hess = hess))
        }
• 82–85. Customization
We can also customize the evaluation metric.
evalerror = function(preds, dtrain) {
    # Extract the true label from the second argument
    labels = getinfo(dtrain, "label")
    # Calculate the error
    err = as.numeric(sum(labels != (preds > 0)))/length(labels)
    # Return the name of this metric and the value
    return(list(metric = "error", value = err))
}
• 86. Customization
To utilize the customized objective and evaluation, we simply pass them to the arguments:
param = list(max.depth = 2, eta = 1, nthread = 2, silent = 1,
             objective = logregobj, eval_metric = evalerror)
bst = xgboost(params = param, data = train$data, label = train$label, nround = 2)
## [0]  train-error:0.0465223399355136
## [1]  train-error:0.0222631659757408
• 87. Prediction in Cross Validation
"Stacking" is an ensemble learning technique which combines the predictions from several models. It is widely used in many scenarios.
One of the main concerns is to avoid overfitting. The common way is to use the out-of-fold prediction values from cross validation.
XGBoost provides a convenient argument to calculate the predictions during cross validation.
• 88. Prediction in Cross Validation
res = xgb.cv(params = param, data = train$data, label = train$label, nround = 2,
             nfold = 5, prediction = TRUE)
## [0]  train-error:0.046522+0.001347  test-error:0.046522+0.005387
## [1]  train-error:0.022263+0.000637  test-error:0.022263+0.002545
str(res)
## List of 2
##  $ dt  :Classes 'data.table' and 'data.frame': 2 obs. of 4 variables:
##   ..$ train.error.mean: num [1:2] 0.0465 0.0223
##   ..$ train.error.std : num [1:2] 0.001347 0.000637
##   ..$ test.error.mean : num [1:2] 0.0465 0.0223
##   ..$ test.error.std  : num [1:2] 0.00539 0.00254
##   ..- attr(*, ".internal.selfref")=<externalptr>
##  $ pred: num [1:6513] 2.58 -1.07 -1.03 2.59 -3.03 ...
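The out-of-fold predictions in res$pred can then serve as a meta-feature for a second-level model. A minimal sketch, using base R's glm as a stand-in level-2 learner; the names meta_train and level2 are only for illustration:
# Treat the out-of-fold margins as a meta-feature for a level-2 model
meta_train = data.frame(xgb_margin = res$pred, label = train$label)
level2 = glm(label ~ xgb_margin, data = meta_train, family = binomial())
summary(level2)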
• 89. xgb.DMatrix
XGBoost has its own input data class, xgb.DMatrix. One can convert the usual data set into it by
dtrain = xgb.DMatrix(data = train$data, label = train$label)
It is the data structure used by the XGBoost algorithm. XGBoost preprocesses the input data and labels into an xgb.DMatrix object before feeding them to the training algorithm.
If one needs to repeat the training process on the same big data set, it is good to use the xgb.DMatrix object to save preprocessing time.
• 90. xgb.DMatrix
An xgb.DMatrix object contains
1. The preprocessed training data
2. Several pieces of extra information, such as
   · missing values
   · data weights
A small sketch of supplying this extra information follows.
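This sketch assumes the missing argument of xgb.DMatrix and the "weight" field of setinfo from the R package; the weight vector w here is made up and uniform:
# Build the DMatrix and declare which value encodes "missing"
dtrain = xgb.DMatrix(data = train$data, label = train$label, missing = NA)
# Attach per-row weights (all equal here, just to show the mechanism)
w = rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)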
• 91. Continue Training
Training the model for 5000 rounds at once is sometimes useful, but we are also taking the risk of overfitting.
A better strategy is to train the model with fewer rounds and repeat that many times.
This enables us to observe the outcome after each step.
• 92–96. Continue Training
# Train with only one round
bst = xgboost(params = param, data = dtrain, nround = 1)
## [0]  train-error:0.0465223399355136
# The margin is the baseline of the prediction
ptrain = predict(bst, dtrain, outputmargin = TRUE)
# Set the margin information on the xgb.DMatrix object
setinfo(dtrain, "base_margin", ptrain)
## [1] TRUE
# Train based on the previous result
bst = xgboost(params = param, data = dtrain, nround = 1)
## [0]  train-error:0.0222631659757408
• 97. Importance and Tree plotting
The result of XGBoost contains many trees. We can count the number of appearances of each variable in all the trees, and use this number as the importance score.
bst = xgboost(data = train$data, label = train$label, max.depth = 2, verbose = FALSE,
              eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
xgb.importance(train$data@Dimnames[[2]], model = bst)
##    Feature       Gain     Cover Frequence
## 1:      28 0.67615484 0.4978746       0.4
## 2:      55 0.17135352 0.1920543       0.2
## 3:      59 0.12317241 0.1638750       0.2
## 4:     108 0.02931922 0.1461960       0.2
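The importance table can also be visualized; a minimal sketch, assuming the xgb.plot.importance helper from the R package:
# Store the importance matrix, then draw a bar plot of it
imp = xgb.importance(train$data@Dimnames[[2]], model = bst)
xgb.plot.importance(imp)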
• 98. Importance and Tree plotting
We can also plot the trees in the model by xgb.plot.tree.
xgb.plot.tree(agaricus.train$data@Dimnames[[2]], model = bst)
• 99. Tree plotting
(The slide shows the plot produced by xgb.plot.tree.)
• 100. Early Stopping
When doing cross validation, it is common to encounter overfitting at an early stage of iteration. Sometimes the prediction gets consistently worse from round 300 while the total number of iterations is 1000.
To stop the cross validation process early, one can use the early.stop.round argument in xgb.cv.
bst = xgb.cv(params = param, data = train$data, label = train$label, nround = 20,
             nfold = 5, maximize = FALSE, early.stop.round = 3)
• 101. Early Stopping
## [0]  train-error:0.050783+0.011279  test-error:0.055272+0.013879
## [1]  train-error:0.021227+0.001863  test-error:0.021496+0.004443
## [2]  train-error:0.008483+0.003373  test-error:0.007677+0.002820
## [3]  train-error:0.014202+0.002592  test-error:0.014126+0.003504
## [4]  train-error:0.004721+0.002682  test-error:0.005527+0.004490
## [5]  train-error:0.001382+0.000533  test-error:0.001382+0.000642
## [6]  train-error:0.001228+0.000219  test-error:0.001228+0.000875
## [7]  train-error:0.001228+0.000219  test-error:0.001228+0.000875
## [8]  train-error:0.001228+0.000219  test-error:0.001228+0.000875
## [9]  train-error:0.000691+0.000645  test-error:0.000921+0.001001
## [10] train-error:0.000422+0.000582  test-error:0.000768+0.001086
## [11] train-error:0.000192+0.000429  test-error:0.000460+0.001030
## [12] train-error:0.000192+0.000429  test-error:0.000460+0.001030
## [13] train-error:0.000000+0.000000  test-error:0.000000+0.000000
## [14] train-error:0.000000+0.000000  test-error:0.000000+0.000000
## [15] train-error:0.000000+0.000000  test-error:0.000000+0.000000
## [16] train-error:0.000000+0.000000  test-error:0.000000+0.000000
## Stopping. Best iteration: 14
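With the best iteration reported as 14, one possible follow-up (not shown in the slides) is to retrain a final model with roughly that many rounds:
# Refit on the full training data, using the round count suggested by early stopping
bst = xgboost(params = param, data = train$data, label = train$label, nround = 14)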
• 102. Kaggle Winning Solution
• 103. Kaggle Winning Solution
To get a higher rank, one needs to push the limits of
1. Feature Engineering
2. Parameter Tuning
3. Model Ensembling
The winning solution in the recent Otto Competition is an excellent example.
• 104. Kaggle Winning Solution
They used a 3-layer ensemble learning model, including
· 33 models on top of the original data
· XGBoost, neural network and adaboost on the 33 predictions from those models plus 8 engineered features
· A weighted average of the 3 predictions from the second step
• 105. Kaggle Winning Solution
The data for this competition is special: the meanings of the features are hidden.
For feature engineering, they generated 8 new features:
· Distances to nearest neighbours of each class
· Sum of distances of 2 nearest neighbours of each class
· Sum of distances of 4 nearest neighbours of each class
· Distances to nearest neighbours of each class in TFIDF space
· Distances to nearest neighbours of each class in T-SNE space (3 dimensions)
· Clustering features of the original dataset
· Number of non-zero elements in each row
· X (that feature was used only in NN 2nd level training)
• 106. Kaggle Winning Solution
This means a lot of work. It also implies they needed to try many other models, although some of them turned out to be not helpful in this competition. Their attempts include:
· A lot of training algorithms in the first level, such as
  - Vowpal Wabbit (many configurations)
  - R glm, glmnet, scikit SVC, SVR, Ridge, SGD, etc.
· Some preprocessing like PCA, ICA and FFT
· Feature Selection
· Semi-supervised learning
• 107. Influencers in Social Networks
Let's learn to use a single XGBoost model to achieve a high rank in an old competition!
The competition we chose is the Influencers in Social Networks competition. It was a hackathon in 2013, so the data set is small enough that we can train the model in seconds.
• 108. Influencers in Social Networks
First let's download the data and load it into R:
train = read.csv('train.csv', header = TRUE)
test = read.csv('test.csv', header = TRUE)
y = train[, 1]
train = as.matrix(train[, -1])
test = as.matrix(test)
• 109. Influencers in Social Networks
Observe the data:
colnames(train)
##  [1] "A_follower_count"    "A_following_count"   "A_listed_count"
##  [4] "A_mentions_received" "A_retweets_received" "A_mentions_sent"
##  [7] "A_retweets_sent"     "A_posts"             "A_network_feature_1"
## [10] "A_network_feature_2" "A_network_feature_3" "B_follower_count"
## [13] "B_following_count"   "B_listed_count"      "B_mentions_received"
## [16] "B_retweets_received" "B_mentions_sent"     "B_retweets_sent"
## [19] "B_posts"             "B_network_feature_1" "B_network_feature_2"
## [22] "B_network_feature_3"
• 110. Influencers in Social Networks
Observe the data:
train[1,]
##    A_follower_count   A_following_count      A_listed_count
##        2.280000e+02        3.020000e+02        3.000000e+00
## A_mentions_received A_retweets_received     A_mentions_sent
##        5.839794e-01        1.005034e-01        1.005034e-01
##     A_retweets_sent             A_posts A_network_feature_1
##        1.005034e-01        3.621501e-01        2.000000e+00
## A_network_feature_2 A_network_feature_3    B_follower_count
##        1.665000e+02        1.135500e+04        3.446300e+04
##   B_following_count      B_listed_count B_mentions_received
##        2.980800e+04        1.689000e+03        1.543050e+01
## B_retweets_received     B_mentions_sent     B_retweets_sent
##        3.984029e+00        8.204331e+00        3.324230e-01
##             B_posts B_network_feature_1 B_network_feature_2
##        6.988815e+00        6.600000e+01        7.553030e+01
## B_network_feature_3
• 111. Influencers in Social Networks
The data contains information on two users of a social network service. Our mission is to determine which of the two is more influential.
This type of data gives us some room for feature engineering.
• 112. Influencers in Social Networks
The first trick is to increase the information in the data.
Every data point can be expressed as <y, A, B>, and it implies <1-y, B, A> as well. We can simply extract this extra information from the training set.
new.train = cbind(train[, 12:22], train[, 1:11])
train = rbind(train, new.train)
y = c(y, 1 - y)
• 113. Influencers in Social Networks
The following feature engineering steps are done on both the training and test sets, therefore we combine them together.
x = rbind(train, test)
• 114. Influencers in Social Networks
The next step could be calculating the ratios between the features of A and B separately:
· followers/following
· mentions received/sent
· retweets received/sent
· followers/posts
· retweets received/posts
· mentions received/posts
• 115. Influencers in Social Networks
Considering there might be zeroes, we need to smooth the ratios with a constant. We can then calculate the ratios with this helper function.
calcRatio = function(dat, i, j, lambda = 1) (dat[, i] + lambda)/(dat[, j] + lambda)
• 116. Influencers in Social Networks
A.follow.ratio = calcRatio(x, 1, 2)
A.mention.ratio = calcRatio(x, 4, 6)
A.retweet.ratio = calcRatio(x, 5, 7)
A.follow.post = calcRatio(x, 1, 8)
A.mention.post = calcRatio(x, 4, 8)
A.retweet.post = calcRatio(x, 5, 8)
B.follow.ratio = calcRatio(x, 12, 13)
B.mention.ratio = calcRatio(x, 15, 17)
B.retweet.ratio = calcRatio(x, 16, 18)
B.follow.post = calcRatio(x, 12, 19)
B.mention.post = calcRatio(x, 15, 19)
B.retweet.post = calcRatio(x, 16, 19)
• 117. Influencers in Social Networks
Combine the features into the data set.
x = cbind(x[, 1:11],
          A.follow.ratio, A.mention.ratio, A.retweet.ratio,
          A.follow.post, A.mention.post, A.retweet.post,
          x[, 12:22],
          B.follow.ratio, B.mention.ratio, B.retweet.ratio,
          B.follow.post, B.mention.post, B.retweet.post)
• 118. Influencers in Social Networks
Then we can compare the difference between A and B. Because XGBoost is invariant to the scale of each feature, taking the difference and taking the ratio are essentially the same here.
AB.diff = x[, 1:17] - x[, 18:34]
x = cbind(x, AB.diff)
train = x[1:nrow(train), ]
test = x[-(1:nrow(train)), ]
• 119. Influencers in Social Networks
Now comes the modeling part. We investigate how far we can go with a single model.
Parameter tuning is very important at this stage. We can see the performance from cross validation.
• 120. Influencers in Social Networks
Here's xgb.cv with the default parameters.
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
                objective = 'binary:logistic', eval_metric = 'auc')
• 121. Influencers in Social Networks
We can see the trend of AUC on the training and test sets.
• 122. Influencers in Social Networks
It is obvious our model severely overfits. The direct reason is simple: the default value of eta is 0.3, which is too large for this task.
Recalling the parameter tuning guide, we need to decrease eta and increase nrounds based on the result of cross validation.
• 123. Influencers in Social Networks
After some trials, we get the following set of parameters:
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 3000,
                objective = 'binary:logistic', eval_metric = 'auc',
                eta = 0.005, gamma = 1, lambda = 3, nthread = 8,
                max_depth = 4, min_child_weight = 1, verbose = F,
                subsample = 0.8, colsample_bytree = 0.8)
• 124. Influencers in Social Networks
We can see the trend of AUC on the training and test sets.
• 125. Influencers in Social Networks
Next we extract the best number of iterations. We calculate the test AUC minus its standard deviation, and choose the iteration with the largest value.
# Columns 3 and 4 of the result are test.auc.mean and test.auc.std
bestRound = which.max(as.matrix(cv.res)[, 3] - as.matrix(cv.res)[, 4])
bestRound
## [1] 2442
cv.res[bestRound, ]
##    train.auc.mean train.auc.std test.auc.mean test.auc.std
## 1:       0.934967       0.00125      0.876629     0.002073
• 126. Influencers in Social Networks
Then we train the model with the same set of parameters:
set.seed(1024)
bst = xgboost(data = train, label = y, nrounds = 3000,
              objective = 'binary:logistic', eval_metric = 'auc',
              eta = 0.005, gamma = 1, lambda = 3, nthread = 8,
              max_depth = 4, min_child_weight = 1,
              subsample = 0.8, colsample_bytree = 0.8)
preds = predict(bst, test, ntreelimit = bestRound)
• 127. Influencers in Social Networks
Finally we submit our solution:
result = data.frame(Id = 1:nrow(test), Choice = preds)
write.csv(result, 'submission.csv', quote = FALSE, row.names = FALSE)
This wins us a top-10 place on the leaderboard!
• 128. FAQ