Winning Data Science
Competitions
Some (hopefully) useful pointers
Owen Zhang
Data Scientist
A plug for myself
Current
● Chief Product Officer
Previous
● VP, Science
Kaggle ranking: 1st / 330,336 (176,181 points)
Agenda
● Structure of a Data Science Competition
● Philosophical considerations
● Sources of competitive advantage
● Some tools/techniques
● Three cases -- Amazon, Allstate, Liberty Mutual
● Apply what we learn outside of competitions
[Diagram: pyramid with Technique, Strategy, and Philosophy as its layers]
Structure of a Data Science Competition
Data Science Competitions remind us that the purpose of a
predictive model is to predict on data that we have NOT seen.
[Diagram: the data is partitioned into Training, Public LB (validation), and Private LB (holdout) sets; we build a model using the Training data to predict outcomes on the Private LB data, with the Public LB providing quick but sometimes misleading feedback.]
A little “philosophy”
● There are many ways to overfit
● Beware of “multiple comparison fallacy”
○ There is a cost to “peeking at the answer”
○ Usually the first idea (if it works) is the best
“Think” more, “try” less
Sources of Competitive Advantage (the Secret Sauce)
● Luck
● Discipline (once bitten twice shy)
○ Proper validation framework
● Effort
● (Some) Domain knowledge
● Feature engineering
● The “right” model structure
● Machine/statistical learning packages
● Coding/data manipulation efficiency
The right tool is very important
Be Disciplined + Work Hard + Learn from everyone + Luck
Good Validation is MORE IMPORTANT than a Good Model
● A simple training/validation split is NOT enough
○ When you have looked at your validation results for the Nth time, you
are effectively training your models on the validation set
● If possible, keep a “holdout” dataset that you do not touch at all during
the model-building process
○ This includes feature extraction, etc.
A Typical Modeling Project
● What if the holdout result is bad?
○ Be brave and scrap the project
[Flow: Identify Opportunity → Find/Prep Data → Split Data and Hide Holdout → Build Model → Validate Model → Test Model with Holdout → Implement Model]
Make the Validation Dataset as Realistic as Possible
● Usually this means “out-of-time” validation.
○ You are free to use “in-time” random splits to build models, tune
parameters, etc.
○ But the holdout data should be out-of-time
● Exception to the rule: cross-validation when data is extremely small
○ But keep in mind that your model won’t perform as well in reality
○ The more times you “tweak” your model, the bigger the gap.
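A minimal sketch of such an out-of-time split (the column names and cutoff date are hypothetical):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one row per day with a feature and a target.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=500, freq="D"),
    "x": rng.randn(500),
    "y": rng.randn(500),
})

# Out-of-time holdout: the most recent slice, touched only once at the end.
cutoff = pd.Timestamp("2013-01-01")
in_time = df[df["date"] < cutoff]
holdout = df[df["date"] >= cutoff]

# Within the in-time data, random splits are fine for building and tuning.
train, valid = train_test_split(in_time, test_size=0.2, random_state=0)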
Kaggle Competitions -- Typical Data Partitioning
[Diagram: four ways the data can be partitioned into Training, Public LB, and Private LB sets with features X and target Y -- in some schemes the three sets are random splits of the same time period, in others the Public LB and/or Private LB sets are separated from Training along the time axis.]
● When should we use Public LB feedback to tune our models?
Kaggle Competitions -- Use PLB as Training?
[Diagram: the same four partitioning schemes, now annotated with answers: YES, YES, MUST, NO -- whether to use the Public LB as training data depends on how the Public and Private LB sets are split from Training, especially along the time axis.]
Tools/techniques -- GBM
● My confession: I (over)use GBM
○ When in doubt, use GBM
● GBM automatically approximates
○ Non-linear transformations
○ Subtle and deep interactions
● GBM handles missing values gracefully
● GBM is invariant to monotonic transformations of
features
GBDT Hyper Parameter Tuning

Hyper Parameter | Tuning Approach | Range | Note
# of Trees | Fixed value | 100-1000 | Depending on data size
Learning Rate | Fixed => Fine Tune | [2 - 10] / # of Trees | Depending on # of trees
Row Sampling | Grid Search | [.5, .75, 1.0] |
Column Sampling | Grid Search | [.4, .6, .8, 1.0] |
Min Leaf Weight | Fixed => Fine Tune | 3 / (% of rare events) | Rule of thumb
Max Tree Depth | Grid Search | [4, 6, 8, 10] |
Min Split Gain | Fixed | 0 | Keep it 0

Best GBDT implementation today: https://github.com/tqchen/xgboost
by Tianqi Chen (U of Washington)
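As a rough translation of the table into xgboost's scikit-learn interface (a sketch; the dataset and the fixed values are illustrative, not from the talk):

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

n_trees = 500  # fixed first: 100-1000 depending on data size
model = xgb.XGBRegressor(
    n_estimators=n_trees,
    learning_rate=5.0 / n_trees,  # [2 - 10] / # of trees, then fine-tune
    min_child_weight=3,           # ~ 3 / (% of rare events), then fine-tune
    gamma=0,                      # min split gain: keep it 0
)
grid = GridSearchCV(
    model,
    param_grid={
        "subsample": [0.5, 0.75, 1.0],             # row sampling
        "colsample_bytree": [0.4, 0.6, 0.8, 1.0],  # column sampling
        "max_depth": [4, 6, 8, 10],                # max tree depth
    },
    cv=5,
)
grid.fit(X, y)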
Tools/techniques -- data preprocessing for GBDT
● High-cardinality features
○ These are very commonly encountered -- zip code, injury type,
ICD9, text, etc.
○ Convert them to numerical values with preprocessing -- out-of-fold average,
counts, etc.
○ Use Ridge regression (or similar) and
■ use the out-of-fold prediction as input to GBM
■ or blend
○ Be brave, use N-way interactions
■ I used a 7-way interaction in the Amazon competition.
● GBM with out-of-fold treatment of high-cardinality features performs
very well
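A minimal sketch of the count and out-of-fold average encodings (the data and column names are hypothetical):

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "zip": ["02139", "02139", "10001", "10001", "02139", "10001"],
    "y":   [1, 0, 1, 1, 0, 0],
})

# Count encoding: available even for holdout data.
df["zip_count"] = df.groupby("zip")["zip"].transform("count")

# Out-of-fold average response: each row's encoding comes from the
# OTHER folds, so it never sees its own target.
df["zip_oof_mean"] = 0.0
for tr_idx, te_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[tr_idx].groupby("zip")["y"].mean()
    df.loc[df.index[te_idx], "zip_oof_mean"] = (
        df.iloc[te_idx]["zip"].map(fold_means).fillna(df.iloc[tr_idx]["y"].mean())
    )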
Technical Tricks -- Stacking
● Basic idea -- use one model’s output as the next model’s input
● It is NOT a good idea to use in-sample predictions for stacking
○ The problem is over-fitting
○ The more “over-fit” Prediction 1 is, the more weight it will get in
Model 2
[Diagram: Text Features → Model 1 (Ridge Regression) → Prediction 1; Prediction 1 + Numeric Features → Model 2 (GBM) → Final Prediction]
Technical Tricks -- Stacking -- OOS / CV
● Use out-of-sample predictions
○ Take half of the training data to build Model 1
○ Apply Model 1 to the rest of the training data, and
use the output as input to Model 2
● Use cross-validation partitioning when data is limited
○ Partition the training data into K partitions
○ For each of the K partitions, compute “Prediction
1” by building a model on the OTHER partitions
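A scikit-learn sketch of the cross-validation variant, following the ridge-then-GBM pairing in the diagram above (synthetic data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X_text, y = make_regression(n_samples=500, n_features=50, random_state=0)
X_num = np.random.RandomState(0).randn(500, 5)

# Out-of-fold "Prediction 1": each row is predicted by a ridge model
# that never saw that row during training.
pred1 = cross_val_predict(Ridge(), X_text, y, cv=5)

# Model 2 takes Prediction 1 alongside the numeric features.
X_stack = np.column_stack([pred1, X_num])
model2 = GradientBoostingRegressor().fit(X_stack, y)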
Technical Tricks -- feature engineering in GBM
● GBM only APPROXIMATES interactions and non-linear
transformations
● Strong interactions benefit from being explicitly
defined
○ Especially ratios/sums/differences among
features
● GBM cannot capture complex features such as
“average sales in the previous period for this type of
product”
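For instance, two explicitly defined features of this kind (hypothetical columns; the group average is a simplified stand-in for "average sales in the previous period"):

import pandas as pd

df = pd.DataFrame({
    "loss": [100.0, 250.0], "insured_value": [1000.0, 5000.0],
    "product_type": ["A", "B"], "sales": [10.0, 20.0],
})
# Explicit ratio: a tree ensemble can only approximate this with many splits.
df["loss_ratio"] = df["loss"] / df["insured_value"]
# Group-level aggregate of the kind GBM cannot derive on its own.
df["avg_sales_by_type"] = df.groupby("product_type")["sales"].transform("mean")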
Technical Tricks -- Glmnet
● From a methodology perspective, the opposite of
GBM
● Captures (log/logistic) linear relationships
● Works with a very small # of rows (a few hundred or
even less)
● Complements GBM very well in a blend
● Needs a lot more work
○ missing values, outliers, transformations (log?),
interactions
● The sparsity assumption -- L1 vs L2
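In Python, scikit-learn's elastic net plays a role similar to R's glmnet; a sketch on synthetic data, where l1_ratio moves between the L2 (ridge) and L1 (lasso, sparse) extremes:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=300, n_features=40, n_informative=5, random_state=0)
# Cross-validate over both the penalty strength and the L1/L2 mix.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print(enet.l1_ratio_, (enet.coef_ != 0).sum(), "non-zero coefficients")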
Technical Tricks -- Text mining
● tau package in R
● Python’s sklearn
● An L2 penalty is a must
● N-grams work well
● Don’t forget the “trivial features”: length of text,
number of words, etc.
● Many “text-mining” competitions on Kaggle are
actually dominated by structured fields -- KDD2014
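A minimal sklearn sketch combining n-grams, an L2 penalty, and the "trivial features" (toy documents):

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great product, works well", "terrible, broke in a week",
        "well made, great value", "a terrible waste"]
y = [1, 0, 1, 0]

tfidf = TfidfVectorizer(ngram_range=(1, 2))  # word 1-2 grams
X_text = tfidf.fit_transform(docs)
# "Trivial" features: text length and word count.
trivial = np.array([[len(d), len(d.split())] for d in docs], dtype=float)
X = sp.hstack([X_text, sp.csr_matrix(trivial)])
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)  # L2 penalty is a must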
Technical Tricks -- Blending
● All models are wrong, but some are useful (George
Box)
○ The hope is that they are wrong in different ways
● When in doubt, use an average blender
● Beware of the temptation to overfit the public leaderboard
○ Use public LB + training CV
● The strongest individual model does not necessarily
make the best blend
○ Sometimes intentionally built weak models are good blending
candidates -- Liberty Mutual Competition
Technical Tricks -- blending continued
● Try to build “diverse” models
○ Different tools -- GBM, Glmnet, RF, SVM, etc.
○ Different model specifications -- linear,
lognormal, Poisson, 2-stage, etc.
○ Different subsets of features
○ Subsampled observations
○ Weighted/unweighted
○ …
● But, do not “peek at answers” (at least not too much)
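A minimal average blender over diverse models (synthetic data; tuning the blend weights against training CV, as suggested above, is left out):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Diverse tools, hopefully wrong in different ways.
models = [GradientBoostingRegressor(), RandomForestRegressor(), ElasticNet()]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]
blend = np.mean(preds, axis=0)  # when in doubt, use an average blender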
Apply what we learn outside of competitions
● Competitions give us really good models, but we also need to
○ Select the right problem and structure it correctly
○ Find good (at least useful) data
○ Make sure models are used the right way
Competitions help us
● Understand how much “signal” exists in the data
● Identify flaws in data or data creation process
● Build generalizable models
● Broaden our technical horizon
● …
Case 1 -- Amazon User Access competition
● One of the most popular competitions on Kaggle to date
○ 1687 teams
● Use anonymized features to predict whether an employee access
request would be granted or denied
● All categorical features
○ Resource ID / Mgr ID / User ID / Dept ID …
○ Many features have high cardinality
● But I want to use GBM
Case 1 -- Amazon User Access competition
● Encode categorical features using observation counts
○ This is even available for holdout data!
● Encode categorical features using average response
○ Average all but one (example on next slide)
○ Add noise to the training features
● Build different kinds of trees + ENET
○ GBM + ERT + ENET + RF + GBM2 + ERT2
● I didn’t know about VW (or similar) at the time; otherwise I might have gotten
better results.
● https://github.com/owenzhang/Kaggle-AmazonChallenge2013
Case 1 -- Amazon User Access competition
“Leave-one-out” encoding of categorical features:
Split | User ID | Y | mean(Y) | random | Exp_UID
Training | A1 | 0 | .667 | 1.05 | 0.70035
Training | A1 | 1 | .333 | .97 | 0.32301
Training | A1 | 1 | .333 | .98 | 0.32634
Training | A1 | 0 | .667 | 1.02 | 0.68034
Test | A1 | - | .5 | 1 | .5
Test | A1 | - | .5 | 1 | .5
Training | A2 | 0 | | |
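A sketch of this encoding with multiplicative noise (the noise range is illustrative): each training row gets the mean of Y over the other rows with the same ID, times a random factor; test rows get the plain per-ID training mean.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
train = pd.DataFrame({"user_id": ["A1"] * 4 + ["A2"], "y": [0, 1, 1, 0, 0]})
test = pd.DataFrame({"user_id": ["A1", "A1"]})

grp = train.groupby("user_id")["y"]
s, n = grp.transform("sum"), grp.transform("count")
# Leave-one-out mean: average of all OTHER rows with the same ID.
loo = (s - train["y"]) / (n - 1).clip(lower=1)
train["exp_uid"] = loo * rng.uniform(0.95, 1.05, len(train))  # add noise
test["exp_uid"] = test["user_id"].map(grp.mean())  # plain average for test rows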
Case 2 -- Allstate User Purchase Option Prediction
● Predict final purchased product options based on earlier
transactions.
○ 7 correlated targets
● This turns out to be very difficult because:
○ The evaluation criterion is all-or-nothing: all 7 predictions
need to be correct
○ The baseline “last quoted” is very hard to beat:
■ Last quoted: 53.269%
■ #3 (me): 53.713% (+0.444%)
■ #1 solution: 53.743% (+0.474%)
● Key challenges -- capture the correlation, and don't lose to the
baseline
Case 2 -- Allstate User Purchase Option Prediction
● Dependency -- chained models
○ First build a stand-alone model for F
○ Then a model for G, given F
○ F => G => B => A => C => E => D
○ “Free” models first, “dependent” models later
○ At training time, use the actual values
○ At prediction time, use the most likely predicted values
● Not losing to the baseline -- 2-stage models
○ One model predicts which prediction to use: the chained prediction,
or the baseline
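A sketch of the chaining idea for two targets (synthetic data; the classifier and the dependence of G on F are placeholders):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
F = (X[:, 0] > 0).astype(int)
G = ((X[:, 1] + F) > 0.5).astype(int)  # G depends on F

# "Free" model first: F from the base features only.
model_f = GradientBoostingClassifier().fit(X, F)
# Dependent model: at training time, condition on the ACTUAL F.
model_g = GradientBoostingClassifier().fit(np.column_stack([X, F]), G)

# At prediction time, feed the most likely predicted F into the G model.
f_hat = model_f.predict(X)
g_hat = model_g.predict(np.column_stack([X, f_hat]))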
Case 3 -- Liberty Mutual fire loss prediction
DATA OVERVIEW
● ~1 million insurance records
● 300 variables:
○ target: the transformed ratio of loss to total insured value
○ id: a unique identifier of the data set
○ dummy: nuisance variable used to control the model, but not a predictor
○ var1 – var17: a set of normalized variables representing policy characteristics
○ crimeVar1 – crimeVar9: normalized crime rate variables
○ geodemVar1 – geodemVar37: normalized geodemographic variables
○ weatherVar1 – weatherVar236: normalized weather station variables
FEATURE ENGINEERING
● Broke the feature set into 4 components (32 features in total)
● Created a surrogate ID based on identical crime, geodemographic, and weather
variables
○ Policy Characteristics -- 30 features: all policy characteristics features (17);
V4 split into 2 levels (8); computed ratios of certain features; surrogate ID
combined with subsets of policy vars
○ Geodemographics -- 1 feature: derived from PCA trained on scaled vars
○ Weather -- 1 feature: derived from elastic net trained on scaled variables
○ Crime Rate -- 0 features
FINAL SOLUTION SUMMARY
[Diagram: raw data is split (via var4) into Policy, Weather, Geodem, and Crime components plus the surrogate ID, giving 25 policy features, 1 weather feature = Enet(Weather), 4 count features = Count(ID * 4 subsets of policy features), and 1 geo-demo feature = PCA(Geodem). Model pipelines feeding a Weighted Average Blend:
● 31 features + ratio → R(glmnet) elastic net
● Select 28 features + CrimeVar3, downsample 20K obs(y==0), one-hot encoded categorical + scaled numerical → DataRobot: RF, ExtraTrees, GLM
● y2 = min(y, cap), downsample 10K obs(y==0) → R(gbm) LambdaMART
● y2 = min(y, cap), full sample]
Useful Resources
● http://www.kaggle.com/competitions
● http://www.kaggle.com/forums
● http://statweb.stanford.edu/~tibs/ElemStatLearn/
● http://scikit-learn.org/
● http://cran.r-project.org/
● https://github.com/JohnLangford/vowpal_wabbit/wiki
● ….