4. Key elements for a successful car journey
Car
Destination
Crude oil
Refined oil: process oil into more useful products such as gasoline
A successful journey
5. Key elements for a successful data science journey
Car = Modelling engine
Machine learning solutions are increasingly replacing traditional statistical approaches; they can automate the modelling process and produce world-class predictive accuracy without much effort
Destination = Outcome
a well-defined outcome to predict and a well-defined process for using it to solve business problems
Crude oil = Raw data
increased volume, and the capacity to handle terabytes of data
Refined oil = Feature engineering
the talent to extract from raw data the information that models can use
open source programming
social network of coders
automated solutions
8. Example 1:
● Hosted by Practice Fusion, a cloud-based electronic health record platform for doctors and patients
● Challenge: given a de-identified data set of patient electronic health records, build a model to determine who has a diabetes diagnosis
● Data:
○ 17 tables containing 4 years of medical record history
20. Example 2:
Hosted by XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
Challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities.
Data:
enrollment_train (120K rows) / enrollment_test (80K rows)
Columns: enrollment_id, username, course_id
log_train / log_test
Columns: enrollment_id, time, source, event, object
object
Columns: course_id, module_id, category, children, start
truth_train
Columns: enrollment_id, dropped_out
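To make the table layout concrete, here is a minimal sketch of how the enrollment, log and truth tables could be combined into one row per enrollment. The sample rows, the function name `build_flat_rows` and the `n_log_rows` column are made up for illustration; only the column names come from the slide, and this is not the code used in the competition.

```python
from collections import Counter

# Made-up rows that follow the column layouts listed on the slide.
enrollment_train = [
    {"enrollment_id": 1, "username": "u1", "course_id": "c1"},
    {"enrollment_id": 2, "username": "u2", "course_id": "c1"},
]
log_train = [
    {"enrollment_id": 1, "time": "2014-06-01T10:00:00", "source": "browser",
     "event": "access", "object": "o1"},
    {"enrollment_id": 1, "time": "2014-06-01T10:05:00", "source": "browser",
     "event": "video", "object": "o2"},
    {"enrollment_id": 2, "time": "2014-06-02T12:00:00", "source": "server",
     "event": "access", "object": "o1"},
]
truth_train = [
    {"enrollment_id": 1, "dropped_out": 0},
    {"enrollment_id": 2, "dropped_out": 1},
]


def build_flat_rows(enrollments, logs, truth):
    """Return one dict per enrollment: ids, a log-row count, and the label."""
    # Count log rows per enrollment_id (hypothetical first feature).
    rows_per_enrollment = Counter(row["enrollment_id"] for row in logs)
    # Look up the dropped_out label by enrollment_id.
    label = {row["enrollment_id"]: row["dropped_out"] for row in truth}
    return [
        {
            **enrollment,
            "n_log_rows": rows_per_enrollment[enrollment["enrollment_id"]],
            "dropped_out": label[enrollment["enrollment_id"]],
        }
        for enrollment in enrollments
    ]


FLAT = build_flat_rows(enrollment_train, log_train, truth_train)
```

Every table shares `enrollment_id`, so that key is the natural grain for the flat file; features derived from the logs are aggregated to it before being joined to the label.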
21. We applied the same recipes to log data
5890 objects
and generated a flat file with hundreds of features!
22. Techniques we used in
… to describe courses, enrollments and students from log data:
counts
time statistics (min, mean, max, diff)
entropy
sequences treated as text, on which we ran SVD and logistic regression on 3-grams
first 20 components of SVD on the user × object matrix
More can be found at http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
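The first few recipes in this list can be sketched in plain Python. The block below is an illustration under stated assumptions, not the authors' implementation: the log rows are made up, the feature names (`n_events`, `gap_mean`, `event_entropy`, `3gram_...`) are hypothetical, and "diff" is interpreted here as the gap between consecutive events. It computes counts, time statistics, event entropy and 3-gram counts per enrollment; the SVD and logistic regression steps that would consume the 3-gram counts are not shown.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime

# Made-up log rows in the (enrollment_id, time, source, event, object) layout.
LOG = [
    (1, "2014-06-01T10:00:00", "browser", "access", "a"),
    (1, "2014-06-01T10:05:00", "browser", "video", "b"),
    (1, "2014-06-03T09:00:00", "server", "access", "a"),
    (1, "2014-06-03T09:30:00", "browser", "problem", "c"),
    (2, "2014-06-02T12:00:00", "server", "access", "a"),
    (2, "2014-06-02T12:10:00", "server", "access", "a"),
]


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


def featurize(log):
    """Return {enrollment_id: feature dict} built from raw log rows."""
    by_enrollment = defaultdict(list)
    for enrollment_id, time, _source, event, obj in log:
        by_enrollment[enrollment_id].append(
            (datetime.fromisoformat(time), event, obj)
        )

    features = {}
    for enrollment_id, rows in by_enrollment.items():
        rows.sort(key=lambda r: r[0])  # chronological order
        times = [r[0] for r in rows]
        events = [r[1] for r in rows]
        # Gaps (in seconds) between consecutive events.
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        row = {
            "n_events": len(rows),
            "n_objects": len({r[2] for r in rows}),
            "span_seconds": (times[-1] - times[0]).total_seconds(),
            "gap_min": min(gaps) if gaps else 0.0,
            "gap_mean": sum(gaps) / len(gaps) if gaps else 0.0,
            "gap_max": max(gaps) if gaps else 0.0,
            "event_entropy": entropy(events),
        }
        # Treat the event sequence as text and count its 3-grams.
        trigrams = Counter(zip(events, events[1:], events[2:]))
        for gram, count in trigrams.items():
            row["3gram_" + ">".join(gram)] = count
        features[enrollment_id] = row
    return features


FEATURES = featurize(LOG)
```

Each recipe yields a handful of columns per enrollment; applying the same recipes at the course and user level is one way the feature count could grow into the hundreds.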
23. Key takeaways
Machine learning (ML) can automatically produce world-class predictive accuracy
But feature engineering is still an art that requires a lot of creativity, business insight, curiosity and effort
Be careful! An infinite number of features can be generated. Start with winning recipes (steal them from others and make up your own), then iterate with new recipes, ideas and external data. Stop when you no longer get much additional accuracy
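The "iterate, then stop" advice can be written as a simple loop. This is a minimal sketch under assumptions: `add_recipes_until_flat`, `min_gain` and the toy score table are hypothetical, and `evaluate` stands in for whatever validation procedure (for example cross-validated accuracy) is actually used.

```python
def add_recipes_until_flat(recipes, evaluate, min_gain=0.001):
    """Add feature recipes in order; stop once the score gain drops below min_gain.

    `evaluate` takes a list of recipe names and returns a validation score
    (higher is better).
    """
    kept = []
    best = evaluate(kept)
    for recipe in recipes:
        score = evaluate(kept + [recipe])
        if score - best < min_gain:
            break  # not much additional accuracy: stop here
        kept.append(recipe)
        best = score
    return kept, best


# Made-up validation scores, for illustration only.
TOY_SCORES = {
    (): 0.50,
    ("counts",): 0.80,
    ("counts", "time_stats"): 0.85,
    ("counts", "time_stats", "entropy"): 0.8502,
}


def toy_evaluate(kept):
    """Look up the made-up score for a given list of recipes."""
    return TOY_SCORES[tuple(kept)]


KEPT, BEST = add_recipes_until_flat(
    ["counts", "time_stats", "entropy", "ngrams"], toy_evaluate
)
```

Stopping at the first recipe that fails to help is the simplest reading of the rule; a variant could skip that recipe and keep trying the remaining ones.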