Mastering Machine Learning with Competitions
1.
2. Jeong-Yoon Lee, Ph.D.
Sr. Applied Machine Learning Scientist, Microsoft
Science Advisor, Neofect, Conversion Logic
KDD Cup 2012, 2015 Winner, Top 10, Kaggle 2015
KDD Cup 2018 Co-Chair, ACM SIGKDD
OneML Organizing Committee, Microsoft
3. ML Competitions
Since 1997
2006 - 2009
Since 2010
For the latest list of competitions, see https://github.com/iphysresearch/DataSciComp
Started in 8/2018
23. Feature Engineering
Types | Note
Numerical | Log, Log2(1 + x), Box-Cox, Normalization, Binning
Categorical | One-hot-encoding, Label-encoding, Count, Weight-of-Evidence
Text | Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram
Timeseries/Sensor data | Descriptive Statistics, Derivatives, FFT, MFCC, ERP
Network Graph | Degree, Closeness, Betweenness, PageRank
Numerical/Timeseries | Convert to categorical features using RF/GBM
Dimensionality Reduction | PCA, SVD, Autoencoder, Hashing Trick
Interaction | Addition/subtraction/multiplication/division. Hashing Trick
* More comprehensive overview on feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
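A few of the numerical and categorical transforms listed above can be sketched in plain Python. This is a minimal illustration rather than the author's code: the helper names (log1p_transform, label_encode, one_hot_encode, count_encode) and the toy columns are hypothetical.

```python
import math
from collections import Counter


def log1p_transform(values):
    """Compress a right-skewed, non-negative numerical feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def label_encode(values):
    """Map each category to an integer id in order of first appearance."""
    ids = {}
    return [ids.setdefault(v, len(ids)) for v in values]


def one_hot_encode(values):
    """Return (sorted categories, one 0/1 indicator row per value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def count_encode(values):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


# Hypothetical toy columns, made up for illustration.
prices = [0, 9, 99, 999]
colors = ["red", "blue", "red", "green"]

log_prices = log1p_transform(prices)
color_ids = label_encode(colors)
color_categories, color_one_hot = one_hot_encode(colors)
color_counts = count_encode(colors)
```

In a competition setting, the counts and category lists would normally be computed on the training data and then reused for the test data, so that both are encoded the same way.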
24. Diverse Algorithms
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks/Deep Learning | Keras, MXNet, PyTorch, CNTK | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM, fastFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
25. A Tale of Two Algorithms
Aspect | GBM | Deep Learning
Highlight | No. 1 winning algorithm at most machine learning competitions | Most popular algorithm across media, industry, and academia
Base algorithm | Decision Tree (Morgan & Sonquist 1963) | Perceptron (Rosenblatt 1958)
Use cases | Structured, categorical data | Image, speech, natural language data
Crucial step | Feature engineering | Architecture design. Finding pre-trained models
Open source tools | LightGBM, XGBoost, CatBoost, H2O | Keras, PyTorch, TensorFlow, CNTK, MXNet, Caffe
26. Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
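The stratified split described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name stratified_kfold, the seed, and the toy labels (a 20% dropout rate) are hypothetical and not taken from the slides. In practice a library routine such as scikit-learn's StratifiedKFold would normally be used.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Assign each sample index to one of n_splits folds, class by class.

    Dealing each class's shuffled indices round-robin keeps every fold
    close to the overall class ratio and to the same size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[(offset + i) % n_splits].append(idx)
        offset += len(indices)
    return folds


# Hypothetical labels: 1 = dropout, 0 = no dropout.
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels, n_splits=5)

for k, valid_idx in enumerate(folds):
    held_out = set(valid_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
    print(f"fold {k}: train={len(train_idx)} valid={len(valid_idx)} dropout rate={rate:.2f}")
```

Each fold serves once as the validation set while the other four are used for training, so every sample receives exactly one out-of-fold prediction.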
27.
28. Ensemble - Stacking
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
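The slide gives only the title, so the following is a sketch of the general idea of stacking, not the author's pipeline. Two base models produce out-of-fold predictions, and a deliberately simple level-1 model (a single blending weight fit by least squares) is trained on those predictions. All names (fit_line, fit_nearest, out_of_fold, fit_meta) and the toy data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Base model 1: least-squares line. Returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Base model 2: 1-nearest-neighbour regressor. Returns a predict function."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, n_splits=5):
    """Predict every sample with a model that never saw it during fitting."""
    oof = [0.0] * len(xs)
    for k in range(n_splits):
        train = [i for i in range(len(xs)) if i % n_splits != k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(k, len(xs), n_splits):  # indices held out in fold k
            oof[i] = model(xs[i])
    return oof


def fit_meta(a, b, ys):
    """Level-1 model: weight w minimising the squared error of w*a + (1-w)*b."""
    num = sum((ai - bi) * (yi - bi) for ai, bi, yi in zip(a, b, ys))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return num / den


def mse(pred, ys):
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)


# Hypothetical toy data: a noisy quadratic, made up for illustration.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x * x + rng.gauss(0, 1) for x in xs]

oof_line = out_of_fold(fit_line, xs, ys)
oof_nn = out_of_fold(fit_nearest, xs, ys)
w = fit_meta(oof_line, oof_nn, ys)
stacked = [w * a + (1 - w) * b for a, b in zip(oof_line, oof_nn)]

print(f"line={mse(oof_line, ys):.3f} nn={mse(oof_nn, ys):.3f} stacked={mse(stacked, ys):.3f}")
```

Training the level-1 model on out-of-fold predictions keeps it from rewarding a base model that merely memorized its training rows. The stacked error printed here is measured on the same predictions the weight was fit on, so it is optimistic; a separate holdout or a second round of cross validation would give a fairer estimate.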
36. Resources
캐글뽀개기 (Korean for "Cracking Kaggle")
Introduction to Machine Learning for Coders
Practical Deep Learning for Coders
How to Win a Data Science Competition
Winning Tips on Machine Learning Competitions
Feature Engineering
mlwave.com