Parameter Server approaches for online learning at Twitter allow models to be updated continuously based on new data and improve predictions in real-time. Version 1.0 decouples training and prediction to increase efficiency. Version 2.0 scales training by distributing it across servers. Version 3.0 will scale large complex models by sharding models and features across multiple servers. These approaches enable Twitter to perform online learning on massive datasets and complex models in real-time.
Parameter Server Approach for Online Learning at Twitter
1. Parameter Server Approach for
Online Learning @ Twitter
Joe Xie, Yong Wang and Yue Lu
ML Infra Group, Ads Prediction Team
Oct 10, 2017
2. Outline
• Background
– Online learning
– Challenges
• Parameter Server Approaches
– v1.0 Decouple the training and prediction
– v2.0 Scale the training
– v3.0 Scale the model
• Future Directions
4. Twitter is Realtime
• Twitter is all about real-time: news, events, trends,
hashtags.
– Users interest and intent change in realtime.
– Context changes in realtime.
– New advertisers, new campaigns are added in realtime.
• ML is increasingly at the core of everything we build at
Twitter
– ML model dynamically adapts to changes spanning as short as a few
hours even minutes
5. Real time:
Time
Model
Data Stream
Prediction Stream
Time
Model
Data Stream
Prediction Stream
Online Learning Offline Learning
Learning Phase Training Phase Serving Phase
ReadWriteRead &
Write
Read &
Write
6. Real time – Online Learning
Architecture
Simple and efficient for Ads Prediction and
Moments Relevance production services
7. Challenges
• Network fanout
– The same traffic stream is sent many times over to each prediction
instance, wasting network bandwidth.
• Limit to training traffic size
–Online training throughput is currently limited by the capacity (CPU /
Network bandwidth) of a single mesos worker
• Limit to model size
– All model are hosted within the memory for each instance.
9. Model Architecture
Raw Features
Raw Features Feature Crosses Decision Tree
(e.g., XGBoost...)
Neural Network
(e.g., Torch,
TensorFlow...)
...
Distributed Large-scale Online Logistic Regression
(Parameter Server)
● Fully explore the feature interaction
w/o training latency constraint.
● The feature interactions don’t
change frequently historically.
● Flexible architecture with new model
structure & external machine
learning framework.
10. 20X training data
- Parameter server v2.0 to scale the
training traffic
10X features+algo complexity
- Parameter server v3.0 to scale the
model size
10X prediction qps
- Parameter server v1.0 to decouple
the training and prediction requests
Parameter Server Approaches
12. Parameter Server v1.0
• Separated training service
–Take training traffic to generate incremental model update
• New observation service
– Consume incremental model update
– Evaluate training traffic for model quality assurance
• Separated prediction service
– Consume incremental model update
– Serve the prediction request
13. Parameter Server v1.0
• Launched into ads engagement
prediction models.
– Mesos Efficiency: 40% reduction in CPU cores
required.
– Network Efficiency: 60% reduction in fan-out
messages required.
14. Parameter Server v2.0
Parameter
Server
Mo
del
Instance of
Prediction
Service Mo
del
Training
Workers
Training
Traffic
Observation
Service
Observation
Service
Observation
Worker
NO downsamplingPull
Push/Pull
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
M
od
el
Instance of
Prediction
ServicePrediction
Workers
Pull
Model
ModelModel
Model
Through
■ New architecture to
distribute the training
20X Training data
Higher model quality
Dispatch
Workers
Dispatch
Workers
Dispatch
Workers
Downsampling
Prediction
Requests
15. Parameter Server v2.0
• New dispatch service
–Take un-sampled training traffic and dispatch to training service
• Updated training service
–Take training traffic and produce updates for parameter service
–Receive model update from parameter service
• New parameter service
– Aggregate the updates from training services
– Send model update to training / observation / prediction services
16. Parameter Server v2.0
• Launched into ads engagement
prediction models.
• First version using simple model-average
aggregation.
–20x training capacity
–xx% model quality gain
17. Parameter Server v3.0
Mo
del
Instance of
Prediction
Service Mo
del
Training
Workers
Training
Traffic
Observation
Service
Observation
Service
Observation
Worker
NO downsamplingPull
Push/Pull
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
M
od
el
Instance of
Prediction
ServicePrediction
Workers
Pull
Model
ModelModel
Model
Dispatch
Workers
Dispatch
Workers
Dispatch
Workers
Downsampling
Prediction
RequestsParameter
Server
Parameter
Server
Parameter
Server
Model
Through
■ New architecture for
model / feature sharding
More complex model
Higher model quality
18. Parameter Server v3.0
• Updated parameter service (In progress)
–Model sharding: Parameter instance hosts single model instead of
multiple models.
•xx% model quality gain in experimentation.
–Feature sharding: Parameter instance hosts partial of single model.