Recently I gave a talk at UC Berkeley on transitioning from academia to industry in Machine Learning and Data Science roles. Most of the slides draw on my own transition from astrophysicist to machine learning expert. I hope this will be useful to many. Feedback is welcome!
6. The Grand Innovation Workflow
● Identify Problem
○ Understand what is important to the business
○ Deep data dives
○ Visualizations
○ Communicate to stakeholders
○ Sometimes top-down, sometimes ground-up idea generation
● Prepare Data
○ Slice/dice/massage data
○ Work with data teams to ensure data integrity
○ Make sure the data tables/feeds that you need are stood up
○ Offline/online data integrity
● Build Models
○ Prototype features
○ Modeling extremes: from out-of-the-box Logistic Regression and GBMs to adapting an emergent idea from a recent paper!
○ Set up offline training pipeline
○ Monitor offline metrics
● Implement in Production
○ Design the experiment/hypothesis/cell structure
○ Integrate your models with the production systems (code review, load testing)
○ Hook up with the testing platform
● Test Hypotheses
○ Read results of experiments to determine significance
○ Slice and dice the online data to determine if your test affected the intended audience
○ If results are flat, rinse and repeat!
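To make "read results of experiments to determine significance" concrete, here is a minimal sketch of a two-sided two-proportion z-test comparing conversion rates in two test cells. This is one common choice of test, not necessarily the method used in the talk; the function name, argument names, and counts are all made up for illustration, and the sketch assumes both cells are large enough for the normal approximation and that the pooled rate is strictly between 0 and 1.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between two cells.

    conv_* are success counts and n_* are the number of users in each cell.
    All names here are illustrative assumptions.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both cells convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts, purely for illustration.
z, p_value = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A result near the chosen significance threshold is a reason to look at the sliced data and the experiment design before drawing a conclusion.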
7. Workflow diagram repeated from slide 6, captioned: In some companies, this is a data scientist
8. Workflow diagram repeated from slide 6, captioned: In some other companies, this is a data scientist
9. Workflow diagram repeated from slide 6, captioned: Yet in some other companies, this is a data scientist
10. Workflow diagram repeated from slide 6, captioned: At Netflix, this is broadly what I do
12. Workflow diagram repeated from slide 6, captioned: SQL, Spark (Scala), PySpark, Python (pandas), Hive, AWS S3
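As one illustration of slicing and dicing data with SQL, the sketch below aggregates a tiny made-up table by test cell and device. It uses Python's built-in `sqlite3` module as a stand-in for a real warehouse; the table name, column names, and rows are all hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical table of per-user test outcomes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, cell TEXT, device TEXT, converted INTEGER)"
)
rows = [
    (1, "control", "tv", 0),
    (2, "control", "mobile", 1),
    (3, "treatment", "tv", 1),
    (4, "treatment", "tv", 1),
    (5, "treatment", "mobile", 0),
    (6, "control", "tv", 0),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Slice by test cell and device to see where any effect is concentrated.
query = """
    SELECT cell, device, COUNT(*) AS users, AVG(converted) AS conversion_rate
    FROM events
    GROUP BY cell, device
    ORDER BY cell, device
"""
summary = conn.execute(query).fetchall()
for cell, device, users, rate in summary:
    print(f"{cell:<10} {device:<7} users={users} rate={rate:.2f}")
conn.close()
```

The same GROUP BY pattern carries over to Hive or Spark SQL, although dialect details differ.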
13. Workflow diagram repeated from slide 6, captioned: Matplotlib, Tableau, Vega, Plotly, custom JavaScript (d3)
14. Workflow diagram repeated from slide 6, captioned: Hive, S3, APIs in Flask/Django/Java
15. Workflow diagram repeated from slide 6, captioned: Python, scikit-learn, Jupyter notebooks, TensorFlow/Keras, XGBoost, SparkML/Scala, Zeppelin ...
16. Workflow diagram repeated from slide 6, captioned: Docker, company-specific platforms
17. Workflow diagram repeated from slide 6, captioned: Java, Scala, in some cases Python, company specific
23. ● Perseverance
● Ability to pick up new technical skills
● Presentation skills
● Some quantitative visualization skills
● Ability to distil technical research in related areas and adapt it to the problem at hand
● If you are from a quantitative and experimental field:
○ Mathematical abilities
○ Knowledge of basic statistics: error analysis, experiment design
○ Some exposure to parameter estimation and Bayesian inference
○ Some ability to write code
○ Some exposure to general machine learning
● Learning from failure: most A/B tests fail, and so do experiments in academia
● Writing papers, technical blogs, etc.
25. ● Being a good listener
● Asking questions
● Understanding and articulating the business value of your technical pursuit
● Writing clean, maintainable code with documentation and unit tests
● Ability to collaborate cross-functionally, across teams and cultures
● Admitting that “Good enough” is better than perfect
● Coping with quick project timelines
● Documenting, sharing, getting early input on projects
● Dealing with live, large, and exceptionally dirty datasets.
● Understanding that research in industry is results-driven, not publication-driven
● Stepping out of your focus area and seeing your problem in the bigger context of where your company is headed
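As a small illustration of "clean, maintainable code with documentation and unit tests", the sketch below pairs a documented helper with tests written using Python's standard `unittest` module. The function `conversion_rate` and its validation rules are hypothetical, chosen only to show the pattern.

```python
import unittest


def conversion_rate(conversions, users):
    """Return conversions / users as a float.

    Raises ValueError when users is not positive or the counts are
    inconsistent, so bad upstream data fails loudly instead of producing
    a misleading rate.
    """
    if users <= 0:
        raise ValueError("users must be positive")
    if not 0 <= conversions <= users:
        raise ValueError("conversions must be between 0 and users")
    return conversions / users


class ConversionRateTest(unittest.TestCase):
    def test_typical_value(self):
        self.assertAlmostEqual(conversion_rate(25, 100), 0.25)

    def test_rejects_empty_cell(self):
        with self.assertRaises(ValueError):
            conversion_rate(0, 0)

    def test_rejects_inconsistent_counts(self):
        with self.assertRaises(ValueError):
            conversion_rate(5, 3)


# Run the tests explicitly so the snippet also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConversionRateTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real repository the tests would usually live in their own module and run through a test runner in continuous integration.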
27. Landing the First Job!
● Fill in your basic skills gaps
○ Databases, SQL, Spark familiarity
○ Data structures, algorithms/CS 101
○ Get really strong in one language; Python (pandas, scikit ecosystem) is highly recommended
○ Good coding practices: documentation, modular code, unit tests
● Amp up your ML knowledge
○ Your friends: online courses and open datasets!
○ Do mini projects on ML, esp. deep learning and reinforcement learning. Get creative!
○ Get a rock-solid foundation in basic stats
○ Kaggle competitions
● Create an online presence
○ GitHub repo so recruiters can look at your code
○ Put your hobby projects online
○ Write a blog post on something new you learned
○ Follow/contribute to Stack Overflow
● Improve soft skills
○ Identify weaknesses in communication skills and work on them
○ Pick up speaking engagements at meetups, at your university, and at conferences such as PyData
○ Do collaborative projects with people who are also transitioning
● Interview prep
○ Practise whiteboarding and collaborative coding on CoderPad
○ Standard books like Cracking the Coding Interview; Glassdoor
○ Go for some “dry run” interviews
○ Do background research on the company: be inquisitive, ask questions
Keep at it!