SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Featurizing log data
before XGBoost
Xavier Conort
Thursday, August 20, 2015 @
● XuetangX, a Chinese MOOC learning platform initiated
by Tsinghua University,
● launched online on Oct 10th, 2013.
● more than 100 Chinese courses and over 260
international courses
● high dropout rate
The competition host
● challenge: predict whether a user will drop a course
within next 10 days based on his or her prior activities.
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows):
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out
Problem to solve
Log data 5890
objects
Team
Chief Product Officer Chief Data Scientist
Data Scientist Data Scientist
(O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of
our time was spent here.
● delegated Modeling part to DataRobot to:
○ find best algorithm (with XGboost as a winner!)
○ model text features
○ tune hyperparameters
○ experiment different feature sets and blend 8 XGBoost
using different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
○ SVD on 3grams
○ DataRobot text mining solution
● 20 first components of SVD on user x object
NB: removed duplicated log info and used training + test
sets to build most features
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio between unique log counts over log counts
● unique log counts by event (nagivate, access,
problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10
days and 30 days before)
● sequence number of enrollment in that course
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last logs
● difference between mean log time and mid point
between first and last log
● log interval stats (mean, 90, 99 and 100 quantiles)
Enrollment entropy features
enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before last
logs
● object (when event == problem)
● chapter ids
Example of entropy feature
- log(weekday_log_count / enrollment_log_count) *
weekday_log_count / enrollment_log_count
Sum => weekday_entropy[enrollment_id==1]
1.589988
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated sequences as 4 text variables. Ran for each
○ svd on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
& Nystroem SVM on (tuned) n-grams
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time
stats
● enrollment count
● binary indicator whether user signed up for each of
the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
User entropy features
user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
+ Nystroem SVM on (tuned) n-grams
How we got to the TOP3
● entropy features mentioned before
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign_up counts and day entropy in the next
10 days after end of course
○ time to sign up for new course
○ time until the next log for same user
added ~0.001 to AUC (vs
less powerful features)
added ~0.002 to AUC
XGBoost
Thank you!

Contenu connexe

Tendances

Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsDarius Barušauskas
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature EngineeringAlice Zheng
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data scienceAkira Shibata
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillationNAVER Engineering
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkKnoldus Inc.
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Edureka!
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? HackerEarth
 
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...Simplilearn
 
【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration
【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration
【DL輪読会】Free Lunch for Few-shot Learning: Distribution CalibrationDeep Learning JP
 

Tendances (20)

Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
 
Temporal based Recommendation System
Temporal based Recommendation SystemTemporal based Recommendation System
Temporal based Recommendation System
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillation
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ?
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
Decision Tree In R | Decision Tree Algorithm | Data Science Tutorial | Machin...
 
【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration
【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration
【DL輪読会】Free Lunch for Few-shot Learning: Distribution Calibration
 
Autoencoder
AutoencoderAutoencoder
Autoencoder
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 

En vedette

Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions odsc
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHackerEarth
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovationHackerEarth
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case StudyHackerEarth
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation systemHackerEarth
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionJeong-Yoon Lee
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEoHackerEarth
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarthHackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?HackerEarth
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command lineSharat Chikkerur
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT MinistryJeong-Yoon Lee
 
Tda presentation
Tda presentationTda presentation
Tda presentationHJ van Veen
 
Vowpal Wabbit
Vowpal WabbitVowpal Wabbit
Vowpal Wabbitodsc
 

En vedette (20)

Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Xgboost
XgboostXgboost
Xgboost
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growth
 
Work - LIGHT Ministry
Work - LIGHT MinistryWork - LIGHT Ministry
Work - LIGHT Ministry
 
6 rules of enterprise innovation
6 rules of enterprise innovation6 rules of enterprise innovation
6 rules of enterprise innovation
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case Study
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Kill the wabbit
Kill the wabbitKill the wabbit
Kill the wabbit
 
No-Bullshit Data Science
No-Bullshit Data ScienceNo-Bullshit Data Science
No-Bullshit Data Science
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation system
 
USC LIGHT Ministry Introduction
USC LIGHT Ministry IntroductionUSC LIGHT Ministry Introduction
USC LIGHT Ministry Introduction
 
Menstrual Health Reader - mEo
Menstrual Health Reader - mEoMenstrual Health Reader - mEo
Menstrual Health Reader - mEo
 
Druva Casestudy - HackerEarth
Druva Casestudy - HackerEarthDruva Casestudy - HackerEarth
Druva Casestudy - HackerEarth
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?How to assess & hire Java developers accurately?
How to assess & hire Java developers accurately?
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT Ministry
 
Tda presentation
Tda presentationTda presentation
Tda presentation
 
Vowpal Wabbit
Vowpal WabbitVowpal Wabbit
Vowpal Wabbit
 

Similaire à Featurizing log data before XGBoost

Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overviewjoeyrobert
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Lviv Startup Club
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
 
Agile Testing Analytics
Agile Testing AnalyticsAgile Testing Analytics
Agile Testing AnalyticsQASymphony
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...GeeksLab Odessa
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data ModelingMatthew Dennis
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with DialoguesJinho Choi
 
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentTSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentHarry McLaren
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestBerker Kozan
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Technical Debt Management
Technical Debt ManagementTechnical Debt Management
Technical Debt ManagementMark Niebergall
 
Agile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationAgile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationTechWell
 
Pytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsPytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsAndreu Vallbona Plazas
 
B.sc CSIT 2nd semester C++ unit-1
B.sc CSIT  2nd semester C++ unit-1B.sc CSIT  2nd semester C++ unit-1
B.sc CSIT 2nd semester C++ unit-1Tekendra Nath Yogi
 
Interop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeInterop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeJeremy Schulman
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 

Similaire à Featurizing log data before XGBoost (20)

Job Queues Overview
Job Queues OverviewJob Queues Overview
Job Queues Overview
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
Agile Testing Analytics
Agile Testing AnalyticsAgile Testing Analytics
Agile Testing Analytics
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
 
Application Metrics
Application MetricsApplication Metrics
Application Metrics
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with Dialogues
 
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk DevelopmentTSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
TSTAS, the Life of a Splunk Trainer and using DevOps in Splunk Development
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Technical Debt Management
Technical Debt ManagementTechnical Debt Management
Technical Debt Management
 
Agile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationAgile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful Information
 
Pytest - testing tips and useful plugins
Pytest - testing tips and useful pluginsPytest - testing tips and useful plugins
Pytest - testing tips and useful plugins
 
B.sc CSIT 2nd semester C++ unit-1
B.sc CSIT  2nd semester C++ unit-1B.sc CSIT  2nd semester C++ unit-1
B.sc CSIT 2nd semester C++ unit-1
 
Interop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough CodeInterop 2015: Hardly Enough Theory, Barley Enough Code
Interop 2015: Hardly Enough Theory, Barley Enough Code
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 

Plus de DataRobot

How to Understand a DataRobot Model
How to Understand a DataRobot ModelHow to Understand a DataRobot Model
How to Understand a DataRobot ModelDataRobot
 
Artificial Intelligence in Banking: An Infographic
Artificial Intelligence in Banking: An InfographicArtificial Intelligence in Banking: An Infographic
Artificial Intelligence in Banking: An InfographicDataRobot
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringDataRobot
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R PackageDataRobot
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 

Plus de DataRobot (6)

How to Understand a DataRobot Model
How to Understand a DataRobot ModelHow to Understand a DataRobot Model
How to Understand a DataRobot Model
 
Artificial Intelligence in Banking: An Infographic
Artificial Intelligence in Banking: An InfographicArtificial Intelligence in Banking: An Infographic
Artificial Intelligence in Banking: An Infographic
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
 
DataRobot R Package
DataRobot R PackageDataRobot R Package
DataRobot R Package
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 

Dernier

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Dernier (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Featurizing log data before XGBoost

  • 1. Featurizing log data before XGBoost Xavier Conort Thursday, August 20, 2015 @
  • 2. ● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University, ● launched online on Oct 10th, 2013. ● more than 100 Chinese courses and over 260 international courses ● high dropout rate The competition host
  • 3. ● challenge: predict whether a user will drop a course within next 10 days based on his or her prior activities. ● data: ○ enrollment_train (120K rows) / enrollment_test (80K rows): ■ Columns: enrollment_id, username, course_id ○ log_train / log_test ■ Columns: enrollment_id, time, source, event, object ○ object ■ Columns: course_id, module_id, category, children, start ○ truth_train ■ Columns: enrollment_id, dropped_out Problem to solve
  • 5. Team Chief Product Officer Chief Data Scientist Data Scientist Data Scientist (O. Zhang)
  • 6. How we worked as a Team ● worked separately on feature engineering. 90% of our time was spent here. ● delegated Modeling part to DataRobot to: ○ find best algorithm (with XGboost as a winner!) ○ model text features ○ tune hyperparameters ○ experiment different feature sets and blend 8 XGBoost using different sets ○ communicate results
  • 7. Feature engineering techniques used ● counts ● time statistics (min, mean, max, diff) ● entropy ● sequences treated as text on which we ran ○ SVD on 3grams ○ DataRobot text mining solution ● 20 first components of SVD on user x object NB: removed duplicated log info and used training + test sets to build most features
  • 8. How to build efficient features in R
  • 9. Key course features ● course_id ● first log time ● enrollment counts ● unique log counts ● mean time interval
  • 10. Key enrollment count features ● log counts ● unique log counts ● ratio between unique log counts over log counts ● unique log counts by event (nagivate, access, problem, video, page_close, discussion, wiki) ● unique log counts before end of course (5 days, 10 days and 30 days before) ● sequence number of enrollment in that course
  • 11. Key enrollment time stats ● log time stats (min, mean, max) ● gap between first and last log of enrollment ● gap between enrollment first log and course first log ● gap between enrollment last log and course last logs ● difference between mean log time and mid point between first and last log ● log interval stats (mean, 90, 99 and 100 quantiles)
  • 12. Enrollment entropy features enrollment entropy over ● days ● weekdays ● fraction (4) of weekdays ● hours of the day ● hours of the day for the last 1/3/7 days before last logs ● object (when event == problem) ● chapter ids
  • 13. Example of entropy feature - log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count Sum => weekday_entropy[enrollment_id==1] 1.589988
  • 14. Enrollment sequence features ● for each enrollment_id, built sequences of ○ weekdays ○ objects ■ all objects / 'problem' and 'video' objects only ○ events ● treated sequences as 4 text variables. Ran for each ○ svd on 3 grams => first 10 components ○ DataRobot stacked predictions from logistic regr. & Nystroem SVM on (tuned) n-grams
  • 15. Extract of enrollment object sequences
  • 17. DataRobot on Object 1-2 grams
  • 18. Key user count features and time stats ● enrollment count ● binary indicator whether user signed up for each of the 38 courses ● unique log count ● mean log time interval ● sequence number of enrollment for that user
  • 19. User entropy features user entropy over ● days ● weekdays ● fraction (4) of weekdays ● hours of the day
  • 20. User sequence features ● for each user, built sequences of ○ weekdays ○ chapter_ids ○ events ● treated them as 3 text variables. Ran ○ SVD on 3 grams => first 10 components ○ DataRobot stacked predictions from logistic regr. + Nystroem SVM on (tuned) n-grams
  • 21. How we got to the TOP3 ● entropy features mentioned before ● exploited info in ○ log count in the 5 / 10 / 20 days after end of course ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course ○ time to sign up for new course ○ time until the next log for same user added ~0.001 to AUC (vs less powerful features) added ~0.002 to AUC