GRADIENT	BOOSTING	IN	PRACTICE
A	DEEP	DIVE	INTO	XGBOOST
by Jaroslaw Szymczak, Machine Learning Scientist @ OLX Tech Hub Berlin
LET'S	START	WITH	SOME	THEORY...
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
CLASSIFICATION	PROBLEM	EXAMPLE
EXAMPLE	DECISION	TREE
BIAS	AND	VARIANCE	SOURCE	OF	ERROR
BAGGING
ADAPTIVE	BOOSTING
GRADIENT	BOOSTING
GRADIENT	DESCENT
XGBOOST
XGBOOST	CUSTOM	TREE	BUILDING	ALGORITHM
Most machine learning practitioners know at least these two measures used for tree building:
◍ entropy (information gain)
◍ Gini coefficient
XGBoost has a custom objective-based criterion for building the tree. It requires the:
◍ gradient
◍ hessian
of the objective function. E.g. for regression with squared error loss, for a single observation:
◍ gradient = residual
◍ hessian = 1
We will go further into the regularization parameters when we discuss tree tuning.
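To make the gradient / hessian requirement concrete, here is a minimal sketch (not from the slides) of a custom squared-error objective passed to the native xgb.train API; the synthetic data and parameter values are purely illustrative.

import numpy as np
import xgboost as xgb

def squared_error_objective(preds, dtrain):
    # gradient and hessian of 1/2 * (pred - label)^2, per observation
    labels = dtrain.get_label()
    grad = preds - labels        # the (signed) residual
    hess = np.ones_like(preds)   # second derivative is constant 1
    return grad, hess

rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'max_depth': 3, 'eta': 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_objective)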
XGBOOST
TIME	FOR	SOME	CODING	PRACTICE
XGBOOST CLASSIFIER WITH DEFAULT PARAMETERS
In [2]: import xgboost as xgb
clf = xgb.XGBClassifier()
clf.__dict__
Out[2]: {'_Booster': None,
'base_score': 0.5,
'colsample_bylevel': 1,
'colsample_bytree': 1,
'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 3,
'min_child_weight': 1,
'missing': nan,
'n_estimators': 100,
'nthread': -1,
'objective': 'binary:logistic',
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': 0,
'silent': True,
'subsample': 1}
DATASET	PREPARATION
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
X, y = train.data, train.target
X_test_docs, y_test = test.data, test.target
X_train_docs, X_val_docs, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
tfidf = TfidfVectorizer(min_df=0.005, max_df=0.5).fit(X_train_docs)
X_train_tfidf = tfidf.transform(X_train_docs).tocsc()
X_val_tfidf = tfidf.transform(X_val_docs).tocsc()
X_test_tfidf = tfidf.transform(X_test_docs).tocsc()
In [70]: print("X_train shape: {}".format(X_train_tfidf.get_shape()))
print("X_val shape: {}".format(X_val_tfidf.get_shape()))
print("X_test shape: {}".format(X_test_tfidf.get_shape()))
print("X_train density: {:.3f}".format(
(X_train_tfidf.nnz / np.prod(X_train_tfidf.shape))))
X_train shape: (9051, 2665)
X_val shape: (2263, 2665)
X_test shape: (7532, 2665)
X_train density: 0.023
SIMPLE	CLASSIFICATION
In [5]: clf = xgb.XGBClassifier(seed=42, nthread=1)
clf = clf.fit(X_train_tfidf, y_train,
eval_set=[(X_train_tfidf, y_train), (X_val_tfidf, y_val)],
verbose=11)
y_pred = clf.predict(X_test_tfidf)
y_pred_proba = clf.predict_proba(X_test_tfidf)
print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred)))
[0] validation_0-merror:0.588443 validation_1-merror:0.600088
[11] validation_0-merror:0.429345 validation_1-merror:0.476359
[22] validation_0-merror:0.390012 validation_1-merror:0.454264
[33] validation_0-merror:0.356093 validation_1-merror:0.443659
[44] validation_0-merror:0.330461 validation_1-merror:0.433053
[55] validation_0-merror:0.308806 validation_1-merror:0.427309
[66] validation_0-merror:0.28936 validation_1-merror:0.423774
[77] validation_0-merror:0.270909 validation_1-merror:0.41361
[88] validation_0-merror:0.256988 validation_1-merror:0.410517
[99] validation_0-merror:0.242404 validation_1-merror:0.410517
Test error: 0.462
EXPLORING SOME XGBOOST PROPERTIES
As mentioned in the meetup description, we will see that:
◍ monotonic transformations have little to no effect on the final results
◍ features can be correlated
◍ the classifier is robust to noise
REPEATING THE PROCESS WITHOUT IDF FACTOR
In [6]: tf = TfidfVectorizer(min_df=0.005, max_df=0.5, use_idf=False).fit(X_train_docs)
X_train = tf.transform(X_train_docs).tocsc()
X_val = tf.transform(X_val_docs).tocsc()
X_test = tf.transform(X_test_docs).tocsc()
clf = xgb.XGBClassifier(seed=42, nthread=1).fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=33)
y_pred = clf.predict(X_test)
print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred)))
Previous run:
[0] validation_0-merror:0.588443 validation_1-merror:0.600088
[33] validation_0-merror:0.356093 validation_1-merror:0.443659
[66] validation_0-merror:0.28936 validation_1-merror:0.423774
[99] validation_0-merror:0.242404 validation_1-merror:0.410517
[0] validation_0-merror:0.588443 validation_1-merror:0.604507
[33] validation_0-merror:0.356093 validation_1-merror:0.443659
[66] validation_0-merror:0.288034 validation_1-merror:0.419355
[99] validation_0-merror:0.241962 validation_1-merror:0.411843
Test error: 0.461
ADDING SOME CORRELATED COUNTVECTORIZER FEATURES
In [7]: cv = CountVectorizer(min_df=0.005, max_df=0.5).fit(X_train_docs)
X_train_cv = cv.transform(X_train_docs).tocsc()
X_val_cv = cv.transform(X_val_docs).tocsc()
X_test_cv = cv.transform(X_test_docs).tocsc()
X_train_corr = scipy.sparse.hstack([X_train, X_train_cv])
X_val_corr = scipy.sparse.hstack([X_val, X_val_cv])
X_test_corr = scipy.sparse.hstack([X_test, X_test_cv])
clf = xgb.XGBClassifier(seed=42, nthread=1)
clf = clf.fit(X_train_corr, y_train, eval_set=[(X_train_corr, y_train), (X_val_corr, y_val)], verbose=99)
y_pred = clf.predict(X_test_corr)
print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred)))
[0] validation_0-merror:0.588664 validation_1-merror:0.60274
[99] validation_0-merror:0.240636 validation_1-merror:0.408308
Test error: 0.464
ADDING	SOME	RANDOMNESS
In [8]: def extend_with_random(X, density=0.023):
    X_extend = scipy.sparse.random(X.shape[0], 2*X.shape[1], density=density, format='csc')
    return scipy.sparse.hstack([X, X_extend])
X_train_noise = extend_with_random(X_train)
X_val_noise = extend_with_random(X_val)
X_test_noise = extend_with_random(X_test)
clf = xgb.XGBClassifier(seed=42, nthread=1)
clf = clf.fit(X_train_noise, y_train, eval_set=[(X_train_noise, y_train), (X_val_noise, y_val)], verbose=99)
y_pred = clf.predict(X_test_noise)
print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred)))
[0] validation_0-merror:0.588443 validation_1-merror:0.604507
[99] validation_0-merror:0.221522 validation_1-merror:0.417587
Test error: 0.470
SOME XGBOOST FEATURES
◍ watchlists (that we have seen so far)
◍ early stopping
◍ full learning history that we can plot
EARLY	STOPPING	WITH	WATCHLIST
In [9]: clf = xgb.XGBClassifier(n_estimators=50000)
clf = clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_val, y_val)],
verbose=100,
early_stopping_rounds=10)
print("Best iteration: {}".format(clf.booster().best_iteration))
y_pred = clf.predict(X_test, ntree_limit=clf.booster().best_ntree_limit)
print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred)))
[0] validation_0-merror:0.588443 validation_1-merror:0.604507
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.
Will train until validation_1-merror hasn't improved in 10 rounds.
[100] validation_0-merror:0.241189 validation_1-merror:0.412285
Stopping. Best iteration:
[123] validation_0-merror:0.215114 validation_1-merror:0.401679
Best iteration: 123
Test error: 0.457
PLOTTING	THE	LEARNING	CURVES
In [10]: evals_result = clf.evals_result()
train_errors = evals_result['validation_0']['merror']
validation_errors = evals_result['validation_1']['merror']
df = pd.DataFrame([train_errors, validation_errors]).T
df.columns = ['train', 'val']
df.index.name = 'round'
df.plot(title="XGBoost learning curves", ylim=(0,0.7), figsize=(12,5))
Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x11a21c9b0>
NOW	WE'LL	DO	SOME	TUNING
TUNING	SETTING
◍ we will compare errors on the train, validation and test sets
◍ normally I would recommend cross-validation for such a task (a minimal xgb.cv sketch follows below)
◍ together with a hold-out set that you check rarely
◍ as you can overfit even when you use cross-validation
◍ slight changes in error are inconclusive about parameters; just try different random seeds and you'll see for yourself
◍ some parameters affect the training time, so it is enough that they don't make the final model worse, they don't have to improve it
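A minimal sketch (not code from the talk) of such cross-validation using xgboost's built-in xgb.cv, reusing X_train_tfidf and y_train from the dataset-preparation slide; the parameter values are only an assumption.

dtrain = xgb.DMatrix(X_train_tfidf, label=y_train)
params = {
    'objective': 'multi:softprob',  # 20 newsgroups -> multiclass
    'num_class': 20,
    'max_depth': 3,
    'eta': 0.1,
    'seed': 42,
}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    metrics='merror', early_stopping_rounds=10, seed=42)
print(cv_results[['train-merror-mean', 'test-merror-mean']].tail())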
GENERAL PARAMETERS
◍ max_depth (how deep is your tree?)
◍ learning_rate (shrinkage parameter, "speed of learning")
◍ max_delta_step (additional cap on the learning rate, needed in case of highly imbalanced classes, in which case it is suggested to try values from 1 to 10)
◍ n_estimators (number of boosting rounds)
◍ booster (as XGBoost is not only about trees, but it actually is)
◍ scale_pos_weight (in binary classification, whether to rebalance the weights of positive vs. negative samples; a common heuristic is sketched below)
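For the last two parameters, a minimal sketch (an assumption, not from the slides) of the usual heuristic for a binary, imbalanced problem: scale_pos_weight set to the ratio of negative to positive examples, together with a small max_delta_step; y_imbalanced here is a made-up label vector.

import numpy as np

y_imbalanced = np.array([0] * 950 + [1] * 50)                         # 95% negative / 5% positive
ratio = float((y_imbalanced == 0).sum()) / (y_imbalanced == 1).sum()  # = 19.0
clf = xgb.XGBClassifier(scale_pos_weight=ratio, max_delta_step=1, seed=42)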
◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here)
◍ seed / random_state (so you can reproduce your results; it will not work exactly if you use multiple threads, or on some processor / system architectures, due to the randomness introduced by floating point rounding)
◍ missing (if you want to treat something other than np.nan as missing)
◍ objective (if you want to play with the maths, just give a callable that maps objective(y_true, y_pred) -> grad, hess)
Normally you will not really tune these parameters, apart from max_depth (and potentially learning_rate, but with so many others to tune I suggest settling on a sensible value). You can modify them, e.g. the learning rate, if you think it makes sense, but you would rather not use hyperopt / sklearn grid search to find the optimal values here.
EFFECT OF MAX_DEPTH ON ERROR CHARACTERISTICS
In [31]: results = []
for max_depth in [3, 6, 9, 12, 15, 30]:
    clf = xgb.XGBClassifier(max_depth=max_depth, n_estimators=20)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'max_depth': max_depth,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
In [32]: df_max_depth = pd.DataFrame(results).set_index('max_depth').sort_index()
df_max_depth
Out[32]:
test_error train_error validation_error
max_depth
3 0.504647 0.398630 0.462218
6 0.489379 0.292785 0.444101
9 0.492167 0.219755 0.443217
12 0.495884 0.169263 0.441008
15 0.492698 0.134350 0.437914
30 0.492432 0.064192 0.439682
In [33]: df_max_depth.plot(ylim=(0,0.7), figsize=(12,5))
Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x11de96d68>
AND HOW DOES LEARNING SPEED AFFECT THE WHOLE PROCESS?
In [34]: results = []
for learning_rate in [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]:
    clf = xgb.XGBClassifier(learning_rate=learning_rate)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'learning_rate': learning_rate,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
In [35]: df_learning_rate = pd.DataFrame(results).set_index('learning_rate').sort_index()
df_learning_rate
Out[35]:
test_error train_error validation_error
learning_rate
0.05 0.479819 0.322285 0.434379
0.10 0.461365 0.241962 0.411843
0.20 0.456187 0.158436 0.395493
0.40 0.467472 0.077781 0.399470
0.60 0.484466 0.050713 0.412285
0.80 0.492698 0.039222 0.438356
1.00 0.501328 0.032925 0.443217
In [36]: df_learning_rate.plot(ylim=(0,0.55), figsize=(12,5))
Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x11ddebb38>
RANDOM	SUBSAMPLING	PARAMETERS
◍ subsample (subsample ratio of the training instances)
◍ colsample_bytree (subsample ratio of columns when constructing each tree)
◍ colsample_bylevel (as above, but per level of the tree rather than the whole tree)
SUBSAMPLE
In [37]: subsample_search_grid = np.arange(0.2, 1.01, 0.2)
results = []
for subsample in subsample_search_grid:
    clf = xgb.XGBClassifier(subsample=subsample, learning_rate=1.0)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'subsample': subsample,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
In [38]: df_subsample = pd.DataFrame(results).set_index('subsample').sort_index()
df_subsample
Out[38]:
test_error train_error validation_error
subsample
0.2 0.649628 0.273009 0.587715
0.4 0.550186 0.050271 0.488290
0.6 0.520579 0.035024 0.463102
0.8 0.517791 0.033256 0.459125
1.0 0.501328 0.032925 0.443217
In [48]: df_subsample.plot(ylim=(0,0.7), figsize=(12,5))
Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x1198b1080>
COLSAMPLE_BYTREE
In [40]: colsample_bytree_search_grid = np.arange(0.2, 1.01, 0.2)
results = []
for colsample_bytree in colsample_bytree_search_grid:
    clf = xgb.XGBClassifier(colsample_bytree=colsample_bytree, learning_rate=1.0)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'colsample_bytree': colsample_bytree,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
In [41]: df_colsample_bytree = pd.DataFrame(results).set_index('colsample_bytree').sort_index()
df_colsample_bytree
Out[41]:
test_error train_error validation_error
colsample_bytree
0.2 0.503717 0.036018 0.439240
0.4 0.508763 0.035687 0.440566
0.6 0.500000 0.034361 0.451613
0.8 0.499203 0.032041 0.453380
1.0 0.501328 0.032925 0.443217
In [42]: df_colsample_bytree.plot(ylim=(0,0.55), figsize=(12,5))
Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x11bdd4d68>
COLSAMPLE_BYLEVEL
In [43]: colsample_bylevel_search_grid = np.arange(0.2, 1.01, 0.2)
results = []
for colsample_bylevel in colsample_bylevel_search_grid:
    clf = xgb.XGBClassifier(colsample_bylevel=colsample_bylevel, learning_rate=1.0)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'colsample_bylevel': colsample_bylevel,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
In [44]: df_colsample_bylevel = pd.DataFrame(results).set_index('colsample_bylevel').sort_index()
df_colsample_bylevel
Out[44]:
test_error train_error validation_error
colsample_bylevel
0.2 0.503585 0.035134 0.436589
0.4 0.499336 0.033808 0.444985
0.6 0.497876 0.034582 0.439240
0.8 0.505975 0.032593 0.459567
1.0 0.501328 0.032925 0.443217
In [45]: df_colsample_bylevel.plot(ylim=(0,0.55), figsize=(12,5))
Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x11a21d160>
XGBOOST	OBJECTIVE	REGULARIZATION
REGULARIZATION	PARAMETERS
◍ reg_alpha (L1 regularization, L1 norm factor)
◍ reg_lambda (L2 regularization, L2 norm factor)
◍ gamma ("L0 regularization" - dependent on the number of leaves)
◍ min_child_weight - the minimum sum of hessians in a child; for RMSE regression it is simply the number of examples
◍ min_child_weight for binary classification would be a sum of second-order gradients calculated by the formula:
std::max(predt * (1.0f - predt), eps)
DIFFERENCE BETWEEN MIN_CHILD_WEIGHT AND GAMMA
◍ min_child_weight acts like a local constraint: a particular path down the tree leads to a certain number of instances with certain weights there
◍ gamma is global, it takes into account the number of all leaves in a tree
HOW	TO	TUNE	REGULARIZATION	PARAMETERS
◍ min_child_weight you can just set to a sensible value
◍ alpha, lambda and gamma, though, depend very much on your other parameters
◍ different values will be optimal for tree depth 6, and totally different ones for tree depth 15, as the number and weights of leaves will differ a lot (a sketch of scanning gamma follows below)
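A minimal sketch (not from the slides) following the same loop pattern used above for subsample and colsample, here scanning gamma; reg_alpha and reg_lambda can be scanned the same way. It reuses X_train, y_train, X_val, y_val, X_test and y_test from the earlier slides, and the grid values are only an assumption.

results = []
for gamma in [0, 0.1, 0.5, 1.0, 5.0]:
    clf = xgb.XGBClassifier(gamma=gamma, max_depth=6, seed=42)
    clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
    results.append(
        {
            'gamma': gamma,
            'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)),
            'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)),
            'test_error': 1 - accuracy_score(y_test, clf.predict(X_test))
        }
    )
df_gamma = pd.DataFrame(results).set_index('gamma').sort_index()
df_gamma.plot(ylim=(0, 0.7), figsize=(12, 5))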
FEATURE	IMPORTANCE	ANALYSIS
In [69]: # tf was our TfIdfVectorizer
# clf is our trained xgboost
type(clf.feature_importances_)
df = pd.DataFrame([tf.get_feature_names(), list(clf.feature_importances_)]).T
df.columns = ['feature_name', 'feature_score']
df.sort_values('feature_score', ascending=False, inplace=True)
df.set_index('feature_name', inplace=True)
df.iloc[:10].plot(kind='barh', legend=False, figsize=(12,5))
Out[69]: <matplotlib.axes._subplots.AxesSubplot at 0x1255a50f0>
LESSONS	LEARNED	FROM	FEATURE	ANALYSIS
◍ as you can see, most of the top features look like stopwords
◍ they can actually correspond to post length and the writing style of some contributors
◍ as general practice says, and winners of many Kaggle competitions agree:
- feature engineering is extremely important
- aim to maintain your code in such a manner that, at least initially, you can analyse the features
- there are dedicated packages for analysing feature importance in "black box" models (eli5)
- xgboost also offers more sophisticated feature analysis, though possibly only in the repository version (a small sketch follows below)
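A minimal sketch (not from the slides) of xgboost's own importance utilities hinted at in the last point; the available options depend on the installed version, so treat the importance_type values as an assumption. 'weight' counts how often a feature is used in a split, 'gain' averages the gain of those splits.

import matplotlib.pyplot as plt

xgb.plot_importance(clf, max_num_features=10)   # accepts the fitted sklearn wrapper
plt.show()

# raw per-feature scores straight from the underlying Booster
booster = clf.get_booster() if hasattr(clf, 'get_booster') else clf.booster()
gain_scores = booster.get_score(importance_type='gain')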
SUMMARY
SOME	THINGS	NOT	MENTIONED	ELSEWHERE
◍ in regression problems, the tree version of XGBoost cannot extrapolate
◍ the current documentation is not compatible with the Python package (which is quite outdated)
◍ there are some histogram-based improvements, similar to those in LightGBM, to train the models faster (a lot of issues were reported about this feature)
◍ in the 0.6a2 version from the pip repo, there is a bug in handling csr matrices
LINKS
◍ https://github.com/datitran/jupyter2slides
◍ https://github.com/dmlc/xgboost
◍ https://arxiv.org/pdf/1603.02754.pdf
QUESTIONS?
Contenu connexe

Tendances

Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboostShuai Zhang
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditMichael BENESTY
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning TechniquesBabu Priyavrat
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forestsViet-Trung TRAN
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science CompetitionsJeong-Yoon Lee
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial NetworksMustafa Yagmur
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...Balázs Hidasi
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine LearningKuppusamy P
 
ADABoost classifier
ADABoost classifierADABoost classifier
ADABoost classifierSreerajVA
 
Boosting Approach to Solving Machine Learning Problems
Boosting Approach to Solving Machine Learning ProblemsBoosting Approach to Solving Machine Learning Problems
Boosting Approach to Solving Machine Learning ProblemsDr Sulaimon Afolabi
 
Ml8 boosting and-stacking
Ml8 boosting and-stackingMl8 boosting and-stacking
Ml8 boosting and-stackingankit_ppt
 
Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat omarodibat
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic netKyusonLim
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 

Tendances (20)

Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax audit
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning Techniques
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forests
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
 
ADABoost classifier
ADABoost classifierADABoost classifier
ADABoost classifier
 
Boosting Approach to Solving Machine Learning Problems
Boosting Approach to Solving Machine Learning ProblemsBoosting Approach to Solving Machine Learning Problems
Boosting Approach to Solving Machine Learning Problems
 
Ml8 boosting and-stacking
Ml8 boosting and-stackingMl8 boosting and-stacking
Ml8 boosting and-stacking
 
Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat
 
Xgboost
XgboostXgboost
Xgboost
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic net
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 

En vedette

Machine Learning to moderate ads in real world classified's business
Machine Learning to moderate ads in real world classified's businessMachine Learning to moderate ads in real world classified's business
Machine Learning to moderate ads in real world classified's businessJaroslaw Szymczak
 
Automatic image moderation in classifieds
Automatic image moderation in classifiedsAutomatic image moderation in classifieds
Automatic image moderation in classifiedsJaroslaw Szymczak
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 
Gbm.more GBM in H2O
Gbm.more GBM in H2OGbm.more GBM in H2O
Gbm.more GBM in H2OSri Ambati
 
Automated data analysis with Python
Automated data analysis with PythonAutomated data analysis with Python
Automated data analysis with PythonGramener
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDeepak George
 
2017 holiday survey: An annual analysis of the peak shopping season
2017 holiday survey: An annual analysis of the peak shopping season2017 holiday survey: An annual analysis of the peak shopping season
2017 holiday survey: An annual analysis of the peak shopping seasonDeloitte United States
 

En vedette (10)

Machine Learning to moderate ads in real world classified's business
Machine Learning to moderate ads in real world classified's businessMachine Learning to moderate ads in real world classified's business
Machine Learning to moderate ads in real world classified's business
 
Automatic image moderation in classifieds
Automatic image moderation in classifiedsAutomatic image moderation in classifieds
Automatic image moderation in classifieds
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
Gbm.more GBM in H2O
Gbm.more GBM in H2OGbm.more GBM in H2O
Gbm.more GBM in H2O
 
Inlining Heuristics
Inlining HeuristicsInlining Heuristics
Inlining Heuristics
 
XGBoost (System Overview)
XGBoost (System Overview)XGBoost (System Overview)
XGBoost (System Overview)
 
Automated data analysis with Python
Automated data analysis with PythonAutomated data analysis with Python
Automated data analysis with Python
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
 
2017 holiday survey: An annual analysis of the peak shopping season
2017 holiday survey: An annual analysis of the peak shopping season2017 holiday survey: An annual analysis of the peak shopping season
2017 holiday survey: An annual analysis of the peak shopping season
 

Similaire à Gradient boosting in practice: a deep dive into xgboost

NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS
 
Getting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETGetting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETTomas Jansson
 
Python과 node.js기반 데이터 분석 및 가시화
Python과 node.js기반 데이터 분석 및 가시화Python과 node.js기반 데이터 분석 및 가시화
Python과 node.js기반 데이터 분석 및 가시화Tae wook kang
 
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...Dataconomy Media
 
Pythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPUPythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPUPaulo Sergio Lemes Queiroz
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleYvonne K. Matos
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...semanticsconference
 
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser Bootsma
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser BootsmaDSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser Bootsma
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser BootsmaDeltares
 
Predictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo EvertsPredictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo EvertsDatabricks
 
Intro to Machine Learning with TF- workshop
Intro to Machine Learning with TF- workshopIntro to Machine Learning with TF- workshop
Intro to Machine Learning with TF- workshopProttay Karim
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdf
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdfUsing standard libraries like stdio and sdtlib.h and using stats.h a.pdf
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdffashiongallery1
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Bootcamp Web Notes - Akshansh Chaudhary
Bootcamp Web Notes - Akshansh ChaudharyBootcamp Web Notes - Akshansh Chaudhary
Bootcamp Web Notes - Akshansh ChaudharyAkshansh Chaudhary
 

Similaire à Gradient boosting in practice: a deep dive into xgboost (20)

NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
 
Getting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETGetting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NET
 
Python과 node.js기반 데이터 분석 및 가시화
Python과 node.js기반 데이터 분석 및 가시화Python과 node.js기반 데이터 분석 및 가시화
Python과 node.js기반 데이터 분석 및 가시화
 
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
 
Pythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPUPythonbrasil - 2018 - Acelerando Soluções com GPU
Pythonbrasil - 2018 - Acelerando Soluções com GPU
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and Kaggle
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...
 
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser Bootsma
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser BootsmaDSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser Bootsma
DSD-INT 2018 Work with iMOD MODFLOW models in Python - Visser Bootsma
 
Predictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo EvertsPredictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo Everts
 
Intro to Machine Learning with TF- workshop
Intro to Machine Learning with TF- workshopIntro to Machine Learning with TF- workshop
Intro to Machine Learning with TF- workshop
 
Python Software Testing in Kytos.io
Python Software Testing in Kytos.ioPython Software Testing in Kytos.io
Python Software Testing in Kytos.io
 
PyData Paris 2015 - Track 1.2 Gilles Louppe
PyData Paris 2015 - Track 1.2 Gilles LouppePyData Paris 2015 - Track 1.2 Gilles Louppe
PyData Paris 2015 - Track 1.2 Gilles Louppe
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
ML .pptx
ML .pptxML .pptx
ML .pptx
 
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdf
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdfUsing standard libraries like stdio and sdtlib.h and using stats.h a.pdf
Using standard libraries like stdio and sdtlib.h and using stats.h a.pdf
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
GeeCON Prague 2015
GeeCON Prague 2015GeeCON Prague 2015
GeeCON Prague 2015
 
Bootcamp Web Notes - Akshansh Chaudhary
Bootcamp Web Notes - Akshansh ChaudharyBootcamp Web Notes - Akshansh Chaudhary
Bootcamp Web Notes - Akshansh Chaudhary
 

Dernier

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Dernier (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Gradient boosting in practice: a deep dive into xgboost

  • 1. GRADIENT BOOSTING IN PRACTICE A DEEP DIVE INTO XGBOOST by Jaroslaw Machine Learning Scientist Szymczak @ OLX Tech Hub Berlin
  • 3. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 4. CLASSIFICATION PROBLEM EXAMPLE by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 5. EXAMPLE DECISION TREE by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 6. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 7. BIAS AND VARIANCE SOURCE OF ERROR by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 8. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 9. BAGGING by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 10. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 11. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 12. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 13. ADAPTIVE BOOSTING by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 14. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 15. GRADIENT BOOSTING by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 16. GRADIENT DESCENT by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 17. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 18. XGBOOST by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 19. XGBOOST CUSTOM TREE BUILDING ALGORITHM Most of machine learning practitioners know at least these two measures used for tree building: ◍ entropy (information gain) ◍ gini coefficient XGBoost has a custom objective function used for building the tree. It requires: ◍ gradient ◍ hessian of the objective function. E.g. for linear regression and RMSE, for single observation: ◍ gradient = residual ◍ hessian = 1 We will go further into regularization parameters when we discuss tree tuning. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 20. XGBOOST by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 22. XGBOOST CLASSIFIER WITH DEFAULT PARAMETERS In [2]: import xgboost as xgb clf = xgb.XGBClassifier() clf.__dict__ Out[2]: {'_Booster': None, 'base_score': 0.5, 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 100, 'nthread': -1, 'objective': 'binary:logistic', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': True, 'subsample': 1} by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 23. DATASET PREPARATION train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')) test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes')) X, y = train.data, train.target X_test_docs, y_test = test.data, test.target X_train_docs, X_val_docs, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42) tfidf = TfidfVectorizer(min_df=0.005, max_df=0.5).fit(X_train) X_train_tfidf = tfidf.transform(X_train_docs).tocsc() X_val_tfidf = tfidf.transform(X_val_docs).tocsc() X_test_tfidf = tfidf.transform(X_test_docs).tocsc() by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 24. In [70]: print("X_train shape: {}".format(X_train_tfidf.get_shape())) print("X_val shape: {}".format(X_val_tfidf.get_shape())) print("X_test shape: {}".format(X_test_tfidf.get_shape())) print("X_train density: {:.3f}".format( (X_train_tfidf.nnz / np.prod(X_train_tfidf.shape)))) X_train shape: (9051, 2665) X_val shape: (2263, 2665) X_test shape: (7532, 2665) X_train density: 0.023 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 25. SIMPLE CLASSIFICATION In [5]: clf = xgb.XGBClassifier(seed=42, nthread=1) clf = clf.fit(X_train_tfidf, y_train, eval_set=[(X_train_tfidf, y_train), (X_val_tfidf, y_val)], verbose=11) y_pred = clf.predict(X_test_tfidf) y_pred_proba = clf.predict_proba(X_test_tfidf) print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred))) [0] validation_0-merror:0.588443 validation_1-merror:0.600088 [11] validation_0-merror:0.429345 validation_1-merror:0.476359 [22] validation_0-merror:0.390012 validation_1-merror:0.454264 [33] validation_0-merror:0.356093 validation_1-merror:0.443659 [44] validation_0-merror:0.330461 validation_1-merror:0.433053 [55] validation_0-merror:0.308806 validation_1-merror:0.427309 [66] validation_0-merror:0.28936 validation_1-merror:0.423774 [77] validation_0-merror:0.270909 validation_1-merror:0.41361 [88] validation_0-merror:0.256988 validation_1-merror:0.410517 [99] validation_0-merror:0.242404 validation_1-merror:0.410517 Test error: 0.462 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 26. EXPLORING SOME XGBOOST PROPERTIES As mentioned in the meetup description, we will see that: by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 27. EXPLORING SOME XGBOOST PROPERTIES As mentioned in the meetup description, we will see that: ◍ monotonic transformations have little to no effect on final results by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 28. EXPLORING SOME XGBOOST PROPERTIES As mentioned in the meetup description, we will see that: ◍ monotonic transformations have little to no effect on final results ◍ features can be correlated by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 29. EXPLORING SOME XGBOOST PROPERTIES As mentioned in the meetup description, we will see that: ◍ monotonic transformations have little to no effect on final results ◍ features can be correlated ◍ the classifier is robust to noise by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 30. REPEATING THE PROCESS WITHOUT IDF FACTOR In [6]: tf = TfidfVectorizer(min_df=0.005, max_df=0.5, use_idf=False).fit(X_train_docs) X_train = tf.transform(X_train_docs).tocsc() X_val = tf.transform(X_val_docs).tocsc() X_test = tf.transform(X_test_docs).tocsc() clf = xgb.XGBClassifier(seed=42, nthread=1).fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=33) y_pred = clf.predict(X_test) print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred))) Previous run: [0] validation_0-merror:0.588443 validation_1-merror:0.600088 [33] validation_0-merror:0.356093 validation_1-merror:0.443659 [66] validation_0-merror:0.28936 validation_1-merror:0.423774 [99] validation_0-merror:0.242404 validation_1-merror:0.410517 [0] validation_0-merror:0.588443 validation_1-merror:0.604507 [33] validation_0-merror:0.356093 validation_1-merror:0.443659 [66] validation_0-merror:0.288034 validation_1-merror:0.419355 [99] validation_0-merror:0.241962 validation_1-merror:0.411843 Test error: 0.461 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 31. ADDING SOME CORRELATED COUNTVECTORIZER FEATURES In [7]: cv = CountVectorizer(min_df=0.005, max_df=0.5).fit(X_train_docs) X_train_cv = cv.transform(X_train_docs).tocsc() X_val_cv = cv.transform(X_val_docs).tocsc() X_test_cv = cv.transform(X_test_docs).tocsc() X_train_corr = scipy.sparse.hstack([X_train, X_train_cv]) X_val_corr = scipy.sparse.hstack([X_val, X_val_cv]) X_test_corr = scipy.sparse.hstack([X_test, X_test_cv]) clf = xgb.XGBClassifier(seed=42, nthread=1) clf = clf.fit(X_train_corr, y_train, eval_set=[(X_train_corr, y_train), (X_val_corr, y_val)], verbose=99) y_pred = clf.predict(X_test_corr) print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred))) [0] validation_0-merror:0.588664 validation_1-merror:0.60274 [99] validation_0-merror:0.240636 validation_1-merror:0.408308 Test error: 0.464 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 32. ADDING SOME RANDOMNESS In [8]: def extend_with_random(X, density=0.023): X_extend = scipy.sparse.random(X.shape[0], 2*X.shape[1], density=density, format='csc') return scipy.sparse.hstack([X, X_extend]) X_train_noise = extend_with_random(X_train) X_val_noise = extend_with_random(X_val) X_test_noise = extend_with_random(X_test) clf = xgb.XGBClassifier(seed=42, nthread=1) clf = clf.fit(X_train_noise, y_train, eval_set=[(X_train_noise, y_train), (X_val_noise, y_val)], verbose=99) y_pred = clf.predict(X_test_noise) print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred))) [0] validation_0-merror:0.588443 validation_1-merror:0.604507 [99] validation_0-merror:0.221522 validation_1-merror:0.417587 Test error: 0.470 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 33. SOME XGBOOST FEATURES by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 34. SOME XGBOOST FEATURES ◍ watchlists (that we have seen so far) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 35. SOME XGBOOST FEATURES ◍ watchlists (that we have seen so far) ◍ early stopping by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 36. SOME XGBOOST FEATURES ◍ watchlists (that we have seen so far) ◍ early stopping ◍ full learning history that we can plot by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 37. EARLY STOPPING WITH WATCHLIST In [9]: clf = xgb.XGBClassifier(n_estimators=50000) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=100, early_stopping_rounds=10) print("Best iteration: {}".format(clf.booster().best_iteration)) y_pred = clf.predict(X_test, ntree_limit=clf.booster().best_ntree_limit) print("Test error: {:.3f}".format(1 - accuracy_score(y_test, y_pred))) [0] validation_0-merror:0.588443 validation_1-merror:0.604507 Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping. Will train until validation_1-merror hasn't improved in 10 rounds. [100] validation_0-merror:0.241189 validation_1-merror:0.412285 Stopping. Best iteration: [123] validation_0-merror:0.215114 validation_1-merror:0.401679 Best iteration: 123 Test error: 0.457 by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 38. PLOTTING THE LEARNING CURVES In [10]: evals_result = clf.evals_result() train_errors = evals_result['validation_0']['merror'] validation_errors = evals_result['validation_1']['merror'] df = pd.DataFrame([train_errors, validation_errors]).T df.columns = ['train', 'val'] df.index.name = 'round' df.plot(title="XGBoost learning curves", ylim=(0,0.7), figsize=(12,5)) Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x11a21c9b0> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 40. TUNING SETUP ◍ we will compare errors on the train, validation and test sets ◍ normally I would recommend cross-validation for such a task (see the sketch below) ◍ together with a hold-out set that you check rarely ◍ as you can overfit even when you use cross-validation ◍ slight changes in error are inconclusive about a parameter; just try different random seeds and you'll see for yourself ◍ some parameters mainly affect training time, so it is enough that they don't make the final model worse, they don't have to improve it by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
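A minimal sketch of what that cross-validation could look like with xgboost's built-in cv helper (the parameter values below are illustrative placeholders, not tuned recommendations):

import xgboost as xgb

# sparse matrices such as our tf-idf features can be passed to DMatrix directly
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'objective': 'multi:softmax', 'num_class': 20,
          'max_depth': 6, 'eta': 0.1, 'eval_metric': 'merror', 'seed': 42}
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    stratified=True, early_stopping_rounds=10, seed=42)
print("best cv merror: {:.3f}".format(cv_results['test-merror-mean'].min()))

The hold-out set mentioned above would stay completely outside this loop and be scored only once the parameters are fixed.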
  • 41. GENERAL PARAMETERS by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 42. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 43. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) ◍ learning_rate (shrinkage parameter, "speed of learning") by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 44. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) ◍ learning_rate (shrinkage parameter, "speed of learning") ◍ max_delta_step (additional cap on learning rate, needed in case of highly imbalanced classes, in which case it is suggested to try values from 1 to 10) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 45. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) ◍ learning_rate (shrinkage parameter, "speed of learning") ◍ max_delta_step (additional cap on learning rate, needed in case of highly imbalanced classes, in which case it is suggested to try values from 1 to 10) ◍ n_estimators (number of boosting rounds) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 46. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) ◍ learning_rate (shrinkage parameter, "speed of learning") ◍ max_delta_step (additional cap on learning rate, needed in case of highly imbalanced classes, in which case it is suggested to try values from 1 to 10) ◍ n_estimators (number of boosting rounds) ◍ booster (as XGBoost is not only about trees, but it actually is) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 47. GENERAL PARAMETERS ◍ max_depth (how deep is your tree?) ◍ learning_rate (shrinkage parameter, "speed of learning") ◍ max_delta_step (additional cap on learning rate, needed in case of highly imbalanced classes, in which case it is suggested to try values from 1 to 10) ◍ n_estimators (number of boosting rounds) ◍ booster (as XGBoost is not only about trees, but it actually is) ◍ scale_pos_weight (in binary classification, the factor used to rebalance the weights of positive vs. negative samples) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 48. ◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 49. ◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here) ◍ seed / random_state (so you can reproduce your results; reproduction will not be exact if you use multiple threads, or on some processor / system architectures, due to floating-point rounding) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 50. ◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here) ◍ seed / random_state (so you can reproduce your results; reproduction will not be exact if you use multiple threads, or on some processor / system architectures, due to floating-point rounding) ◍ missing (if you want to treat something other than np.nan as missing) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 51. ◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here) ◍ seed / random_state (so you can reproduce your results; reproduction will not be exact if you use multiple threads, or on some processor / system architectures, due to floating-point rounding) ◍ missing (if you want to treat something other than np.nan as missing) ◍ objective (if you want to play with maths, just give a callable that maps objective(y_true, y_pred) -> grad, hess) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 52. ◍ base_score (boosting starts with predicting 0.5 for all observations, you can change it here) ◍ seed / random_state (so you can reproduce your results; reproduction will not be exact if you use multiple threads, or on some processor / system architectures, due to floating-point rounding) ◍ missing (if you want to treat something other than np.nan as missing) ◍ objective (if you want to play with maths, just give a callable that maps objective(y_true, y_pred) -> grad, hess; a sketch follows below) Normally you will not really tune these parameters, apart from max_depth (and potentially learning_rate, but with so many others to tune I suggest settling on a sensible value). You can modify them, e.g. the learning rate, if you think it makes sense, but you would rather not use hyperopt / sklearn grid search to find the optimal values here. by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
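A minimal sketch of such a callable objective, assuming the scikit-learn wrapper accepts it as described above; it reproduces the squared-error case from earlier (gradient = residual, hessian = 1):

import numpy as np
import xgboost as xgb

def squared_error_objective(y_true, y_pred):
    # gradient and hessian of 0.5 * (y_pred - y_true)**2, per observation
    grad = y_pred - y_true
    hess = np.ones_like(y_pred)
    return grad, hess

reg = xgb.XGBRegressor(objective=squared_error_objective, n_estimators=50)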
  • 53. EFFECT OF MAX_DEPTH ERROR CHARACTERISTICS In [31]: results = [] for max_depth in [3, 6, 9, 12, 15, 30]: clf = xgb.XGBClassifier(max_depth=max_depth, n_estimators=20) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False) results.append( { 'max_depth': max_depth, 'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)), 'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)), 'test_error': 1 - accuracy_score(y_test, clf.predict(X_test)) } ) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 54. In [32]: df_max_depth = pd.DataFrame(results).set_index('max_depth').sort_index() df_max_depth
Out[32]:
           test_error  train_error  validation_error
max_depth
3            0.504647     0.398630          0.462218
6            0.489379     0.292785          0.444101
9            0.492167     0.219755          0.443217
12           0.495884     0.169263          0.441008
15           0.492698     0.134350          0.437914
30           0.492432     0.064192          0.439682
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 55. In [33]: df_max_depth.plot(ylim=(0,0.7), figsize=(12,5)) Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x11de96d68> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 56. AND HOW DOES LEARNING SPEED AFFECT THE WHOLE PROCESS? In [34]: results = [] for learning_rate in [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]: clf = xgb.XGBClassifier(learning_rate=learning_rate) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False) results.append( { 'learning_rate': learning_rate, 'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)), 'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)), 'test_error': 1 - accuracy_score(y_test, clf.predict(X_test)) } ) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 57. In [35]: df_learning_rate = pd.DataFrame(results).set_index('learning_rate').sort_index() df_learning_rate
Out[35]:
               test_error  train_error  validation_error
learning_rate
0.05             0.479819     0.322285          0.434379
0.10             0.461365     0.241962          0.411843
0.20             0.456187     0.158436          0.395493
0.40             0.467472     0.077781          0.399470
0.60             0.484466     0.050713          0.412285
0.80             0.492698     0.039222          0.438356
1.00             0.501328     0.032925          0.443217
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 58. In [36]: df_learning_rate.plot(ylim=(0,0.55), figsize=(12,5)) Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x11ddebb38> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 59. RANDOM SUBSAMPLING PARAMETERS ◍ subsample (subsample ratio of the training instances) ◍ colsample_bytree (subsample ratio of columns when constructing each tree) ◍ colsample_bylevel (as above, but per level of the tree rather than the whole tree) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 60. SUBSAMPLE In [37]: subsample_search_grid = np.arange(0.2, 1.01, 0.2) results = [] for subsample in subsample_search_grid: clf = xgb.XGBClassifier(subsample=subsample, learning_rate=1.0) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False) results.append( { 'subsample': subsample, 'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)), 'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)), 'test_error': 1 - accuracy_score(y_test, clf.predict(X_test)) } ) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 61. In [38]: df_subsample = pd.DataFrame(results).set_index('subsample').sort_index() df_subsample
Out[38]:
           test_error  train_error  validation_error
subsample
0.2          0.649628     0.273009          0.587715
0.4          0.550186     0.050271          0.488290
0.6          0.520579     0.035024          0.463102
0.8          0.517791     0.033256          0.459125
1.0          0.501328     0.032925          0.443217
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 62. In [48]: df_subsample.plot(ylim=(0,0.7), figsize=(12,5)) Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x1198b1080> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 63. COLSAMPLE_BYTREE In [40]: colsample_bytree_search_grid = np.arange(0.2, 1.01, 0.2) results = [] for colsample_bytree in colsample_bytree_search_grid: clf = xgb.XGBClassifier(colsample_bytree=colsample_bytree, learning_rate=1.0) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False) results.append( { 'colsample_bytree': colsample_bytree, 'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)), 'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)), 'test_error': 1 - accuracy_score(y_test, clf.predict(X_test)) } ) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 64. In [41]: df_colsample_bytree = pd.DataFrame(results).set_index('colsample_bytree').sort_index() df_colsample_bytree
Out[41]:
                  test_error  train_error  validation_error
colsample_bytree
0.2                 0.503717     0.036018          0.439240
0.4                 0.508763     0.035687          0.440566
0.6                 0.500000     0.034361          0.451613
0.8                 0.499203     0.032041          0.453380
1.0                 0.501328     0.032925          0.443217
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 65. In [42]: df_colsample_bytree.plot(ylim=(0,0.55), figsize=(12,5)) Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x11bdd4d68> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 66. COLSAMPLE_BYLEVEL In [43]: colsample_bylevel_search_grid = np.arange(0.2, 1.01, 0.2) results = [] for colsample_bylevel in colsample_bylevel_search_grid: clf = xgb.XGBClassifier(colsample_bylevel=colsample_bylevel, learning_rate=1.0) clf = clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False) results.append( { 'colsample_bylevel': colsample_bylevel, 'train_error': 1 - accuracy_score(y_train, clf.predict(X_train)), 'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val)), 'test_error': 1 - accuracy_score(y_test, clf.predict(X_test)) } ) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 67. In [44]: df_colsample_bylevel = pd.DataFrame(results).set_index('colsample_bylevel').sort_index() df_colsample_bylevel
Out[44]:
                   test_error  train_error  validation_error
colsample_bylevel
0.2                  0.503585     0.035134          0.436589
0.4                  0.499336     0.033808          0.444985
0.6                  0.497876     0.034582          0.439240
0.8                  0.505975     0.032593          0.459567
1.0                  0.501328     0.032925          0.443217
by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 68. In [45]: df_colsample_bylevel.plot(ylim=(0,0.55), figsize=(12,5)) Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x11a21d160> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 69. XGBOOST OBJECTIVE REGULARIZATION by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 70. REGULARIZATION PARAMETERS ◍ reg_alpha (L1 regularization, L1 norm factor) ◍ reg_lambda (L2 regularization, L2 norm factor) ◍ gamma ("L0 regularization" - a penalty dependent on the number of leaves) ◍ min_child_weight - minimum sum of hessians in a child; for squared-error regression it is simply the number of examples ◍ min_child_weight for binary classification would be a sum of second-order gradients calculated by the formula: std::max(predt * (1.0f - predt), eps) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 71. DIFFERENCE BETWEEN MIN_CHILD_WEIGHT AND GAMMA ◍ min_child_weight is local: a particular path down the tree leads to a particular set of instances, and their summed hessian weight in that node is what gets checked ◍ gamma is global: it penalizes the total number of leaves in the tree by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 72. HOW TO TUNE REGULARIZATION PARAMETERS ◍ min_child_weight you can just set to a sensible value ◍ alpha, lambda and gamma, though, depend very much on your other parameters ◍ different values will be optimal for tree depth 6 and totally different ones for tree depth 15, as the number and weights of leaves will differ a lot (see the sketch below) by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
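An illustrative sketch of that interaction, in the same style as the earlier tuning loops; the value grids are arbitrary placeholders, and the point is to sweep them at the max_depth you actually plan to use:

results = []
for reg_lambda in [0, 1, 5, 10]:
    for gamma in [0, 0.1, 1]:
        # fix the depth first, then search the regularization strengths around it
        clf = xgb.XGBClassifier(max_depth=6, reg_lambda=reg_lambda, gamma=gamma,
                                seed=42, nthread=1)
        clf = clf.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)
        results.append({'reg_lambda': reg_lambda, 'gamma': gamma,
                        'validation_error': 1 - accuracy_score(y_val, clf.predict(X_val))})
pd.DataFrame(results).sort_values('validation_error').head()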
  • 74. In [69]: # tf was our TfIdfVectorizer # clf is our trained xgboost type(clf.feature_importances_) df = pd.DataFrame([tf.get_feature_names(), list(clf.feature_importances_)]).T df.columns = ['feature_name', 'feature_score'] df.sort_values('feature_score', ascending=False, inplace=True) df.set_index('feature_name', inplace=True) df.iloc[:10].plot(kind='barh', legend=False, figsize=(12,5)) Out[69]: <matplotlib.axes._subplots.AxesSubplot at 0x1255a50f0> by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
  • 75. LESSONS LEARNED FROM FEATURE ANALYSIS ◍ as you can see, most of the top features look like stopwords ◍ they can actually correspond to post length and the writing style of contributors ◍ as general practice says, and winners of many Kaggle competitions agree: - feature engineering is extremely important - aim to maintain your code in such a manner that, at least initially, you can analyse the features - there are dedicated packages for analysing feature importance of "black box" models (eli5, see the sketch below) - xgboost also offers more sophisticated feature analysis, though possibly only in the repository version by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017
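As a pointer only, a minimal sketch of inspecting the model with eli5 in a notebook (assuming eli5 with its XGBoost support is installed; this is illustrative and not part of the original talk):

import eli5
# show the highest-weighted features of the trained booster,
# mapping column indices back to the vectorizer's vocabulary
eli5.show_weights(clf, feature_names=tf.get_feature_names(), top=10)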
  • 77. SOME THINGS NOT MENTIONED ELSEWHERE ◍ in regression problems, the tree-based version of XGBoost cannot extrapolate ◍ the current documentation is not compatible with the Python package (which is quite outdated) ◍ there are histogram-based improvements, similar to those in LightGBM, to train the models faster (a lot of issues were reported about this feature) ◍ in the 0.6a2 version from the pip repository, there is a bug in handling CSR matrices by Jaroslaw Szymczak, @PyData Berlin, 15th November 2017