Thank you, Jon. Good morning, everybody. Thank you for coming to our talk. I realize that this is the last talk before lunch, so we shall keep it short and leave 5 minutes for Q&A. My colleague Vibhav and I will be presenting our methodology behind forecasting the life of a healthy service. If you have any questions, please feel free to ask at the end of the presentation, reach out to us afterwards, or tweet us at any time; our Twitter handles are mentioned on the slide. Both Vibhav and I are part of the Capacity and Performance Engineering team at Twitter. The team is chartered to forecast capacity demand, to create performance and scalability models for our services, and to come up with the capacity budget. In this regard, we work with various teams such as supply, finance, etc.
I shall kick off the talk with the motivation behind the metric Days in Green, and then Vibhav shall walk you through the details of the methodology behind DIG. Finally, I shall wrap up with the lessons learned.
Last month Mary Meeker from KPCB posted her annual report on Internet trends. Two highlights of the report were that mobile now accounts for 25% of total web usage, and that mobile data traffic growth has been accelerating, up 81%, with several drivers such as video behind that growth. A key aspect of mobile Internet usage is real-time content consumption: be it a selfie, as was the case during the Oscars earlier this year; be it the elections in India, the largest democracy in the world; or be it events such as the Asiana plane crash or the unfortunate mudslide in Afghanistan.
Capacity planning plays a key role at Twitter for a variety of reasons (which I believe also hold true in general). First of all, we need to carry out capacity planning to support organic growth. In a recent SEC filing, we reported that Twitter has over 255M monthly active users. Further, we need to plan to support growing user engagement. Also, in an agile product development environment such as Twitter's, new features and new products are A/B tested or rolled out at a rapid pace; a systematic capacity planning methodology is needed to support this. Last but not least, capacity planning is needed to support peak traffic. In a recent report, Cisco reported that the mobile busy hour was 66% higher than the average hour in 2013 and is expected to climb to 83% by 2018. It is well known that people flock to Twitter during events such as the Super Bowl, the Oscars, the UEFA Champions League and the ongoing FIFA World Cup; these events induce significant traffic peaks.
A straightforward approach towards capacity planning would be to throw hardware at the problem. Operationally, this approach is clearly not desirable.
Broadly speaking, the objectives of capacity planning are twofold. First, guard against under-allocation, so as to avoid any impact on the performance and availability of a given service; poor performance and poor availability both adversely impact the end-user experience. Second, guard against over-allocation, as it directly relates to operational efficiency. The plot here exemplifies underutilization, wherein the red line corresponds to the maximum utilization for a given SLA and the green line corresponds to actual utilization.
There are two approaches to capacity planning. Under a reactive approach, additional capacity is added on an as-needed basis. However, this may result in poor user experience, which would in turn adversely impact the bottom line. Clearly, the reactive approach is undesirable. At Twitter, we employ a proactive approach to capacity planning. Further, we use statistical models to determine the capacity ask on a per-service basis. Depending on the context, we use business metrics such as TPS (Tweets per second) or photos per second, or system metrics, for forecasting.
Capacity planning is non-trivial due to a variety of other factors, such as a rapidly evolving product landscape, which in turn changes the performance profile of a given service. Also, organic growth imposes new capacity asks.
Also, the approach should be scalable. At Twitter, we have a service-oriented architecture with hundreds of services, and we monitor millions of time series on a daily basis. Last but not least, the methodology needs to be automated. Vibhav shall now walk you through the details of our approach to forecasting the life of a healthy service.
One of the most fundamental questions that we deal with in capacity planning is understanding when a service is expected to go over its capacity limit. Most approaches to this are reactive in nature, for example when you run into performance problems because the service is not able to handle the traffic being thrown at it. So we have come up with the idea of a DIG number associated with each service, which tells us the number of days left before the service runs out of capacity.
Computing DIG involves three steps. First, determine the driving resource, i.e., the metric that constrains the service, e.g., CPU, disk, etc. Second, determine the capacity threshold T, the maximum value of the driving resource at which the service will remain healthy. Third, generate a time series for the metric and forecast its value into the future. Here we set up the notion of a Green Zone and a Red Zone: the Green Zone is the region of the graph before the driving resource exceeds threshold T, and the Red Zone is the region after. We then calculate DIG, Days In Green, as the distance between the last point in the raw time series and the start of the Red Zone, expressed as the number of days the service is expected to remain in the Green Zone.
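The DIG calculation itself reduces to walking the forecast until it crosses the threshold. A minimal sketch (function and sample numbers are illustrative, not our production code):

```python
def days_in_green(forecast, threshold):
    """Return the number of forecast days before the metric first
    exceeds the capacity threshold (the start of the Red Zone).
    If the forecast never crosses it, the service stays green for
    the whole forecast horizon."""
    for day, value in enumerate(forecast, start=1):
        if value > threshold:
            return day - 1  # last day still in the Green Zone
    return len(forecast)

# Hypothetical daily p99 CPU forecast (%) with threshold T = 70
forecast = [62, 64, 65, 67, 69, 71, 73]
print(days_in_green(forecast, 70))  # -> 5
```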
As stated before, the first step in determining DIG is to determine the driving resource and its capacity threshold for the service.
Load Test – The threshold can be determined in production using canaries, by manipulating the amount of traffic served to them (more accurate), or in a load test environment where we replay production traffic.
I would like to point out that all graphs we show from this point onwards are generated from real production data.
The graph shows what is commonly called a hockey-stick curve. We plot CPU on the x-axis and latency on the y-axis from a real production load test. We observe that until CPU reaches threshold T, the relationship between CPU and latency is linear; after this point, latency degrades non-linearly, so we choose T as our CPU threshold.
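One way to automate picking T from such a curve is to fit a line through the low-load points and take the last CPU value whose latency is still close to that line. This is a rough heuristic of my own, not our exact method; the tolerance and the "first third is linear" assumption are both hypothetical:

```python
def knee_threshold(cpu, latency, tolerance=1.5):
    """Pick a capacity threshold T from a load-test hockey stick:
    least-squares fit a line through the first third of the points
    (assumed linear), then return the last CPU value whose latency
    is within `tolerance` times the linear prediction."""
    n = max(2, len(cpu) // 3)
    xs, ys = cpu[:n], latency[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    threshold = cpu[0]
    for x, y in zip(cpu, latency):
        if y <= tolerance * (slope * x + intercept):
            threshold = x  # latency still roughly linear here
        else:
            break          # non-linear degradation begins
    return threshold

print(knee_threshold([10, 20, 30, 40, 50, 60, 70, 80],
                     [5, 10, 15, 20, 25, 30, 60, 120]))  # -> 60
```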
So, depending on the service characteristics, we may choose different metric thresholds, e.g., CPU at 70% or disk at 80%. If we want to use an application metric we can use RPS too, but keep in mind that an RPS threshold can change as the underlying code changes.
Once we have the threshold for the driving resource, the next step in determining DIG is to generate and analyze the time series for that resource. In order to generate suitable time series, we need to answer two important questions.
Granularity – Should we use daily, hourly, or minutely data? Since we want a long-term forecast, we choose daily data. We then need to choose a daily summary value that gives us a stable time series while staying as accurate as possible. Because of the nature of Twitter's traffic, daily peaks are not very stable and are prone to spikes, so we want a value that is close to the daily peak but has a low standard deviation. If you look at this table, I have compared different percentile values of a metric over a certain number of days. The mean of the daily maxes is 57.7 and the standard deviation is 3.29; the mean of the daily p99 is 54.7 with a standard deviation of 2.49; the values for p95 are 53.1 and 2.4, respectively. There are 1440 minutes in a day: p99 removes only 14.4 minutes of the day from consideration, against 72 minutes for p95, while being almost as stable. So in this case p99 is the better metric to use. We also assume 7-day seasonality, since we see different patterns on different days of the week.
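The percentile comparison above can be sketched with the standard library alone. The helper names and the simple nearest-rank percentile are illustrative choices, not our production pipeline:

```python
import statistics

def daily_value(minutely, pct=99):
    """Collapse one day of minutely samples into a single value: the
    given percentile (nearest-rank). For p99 on a 1440-minute day,
    only the worst ~14.4 minutes are ignored."""
    data = sorted(minutely)
    idx = min(len(data) - 1, int(round(pct / 100 * len(data))) - 1)
    return data[max(idx, 0)]

def stability(daily_values):
    """Mean and (population) standard deviation of the per-day series.
    We prefer the summary whose mean stays close to the daily max but
    whose standard deviation is low."""
    return statistics.mean(daily_values), statistics.pstdev(daily_values)
```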
Duration – How long should the time series be? It has been observed that to generate a statistically significant model, we need a minimum of 30 data points. A time series of 90 days gives us enough data points to discern a reasonable trend; due to frequent changes in the service profile, anything beyond 90 days would probably be stale data.
Model Fitting – Once we have built the time series, we fit a statistical model to it and then forecast to generate the DIG number.
The simplest form of model fitting is linear regression, which I am sure everyone has used at some point. It is very good at capturing trend but does not fit a time series that shows seasonality well. As you can see from the graph, the trend is decently captured, but the R² value, which tells us how well the model fits the data set (on a scale from 0 to 1, with 1 being a perfect fit), is low at 0.56, which doesn't give us good confidence in our forecast. This model also doesn't give any weight to recent data, which would give us a better idea of recent trend changes.
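For reference, here is a self-contained sketch of fitting a line to a daily series and computing R² (the fit-quality measure just mentioned); the function name and API are my own:

```python
def linear_fit_r2(ys):
    """Ordinary least-squares line through a series (x = day index),
    returning (slope, intercept, R^2). R^2 near 1 means the line
    explains the data well; the talk's example had R^2 = 0.56."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r2
```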
The other model that is quite common is the polynomial model, which fits better than a linear model, with an R² value of 0.62, but as you can see does a poor job of forecasting. It is also seasonality-unaware, like the linear model.
There are other techniques, like splines, which are really good for fitting but not for forecasting, because they tend to overfit the data.
Holt-Winters is widely used to model time series with seasonality and trend, but both are modeled implicitly, and the parameters are hard to control and automate.
And then there is ARIMA, which is what we use to model our time series and generate DIG.
ARIMA stands for Auto-Regressive Integrated Moving Average. The model takes three parameters, p, d, and q, where p is the AR order, d is the integrated (differencing) order, and q is the MA order. The equations shown here are for the AR and MA components of the model; the I component is evaluated by differencing the time series to de-trend it. I will not go into the details of the math, as it is beyond the scope of this talk.
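For reference, the textbook forms of these components are as follows (standard notation; the slide's exact equations may differ):

```latex
% AR(p): the value depends linearly on its own p previous values
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
% MA(q): the value depends on the q previous forecast errors
X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
% I(d): difference the series d times to remove trend, e.g. for d = 1:
X'_t = X_t - X_{t-1}
```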
The most interesting properties of ARIMA are its ability to explicitly model seasonality and trend based on the parameters we pass it. And, as mentioned, the d parameter allows modeling of time series that are not stationary in nature, by de-trending the series through differencing.
Another interesting property of ARIMA is that if it is not able to detect seasonality, it degenerates into a linear fit, which can still be useful.
Here we see an ARIMA model fit to a time series generated by a service. With nice seasonality and a noticeable trend, ARIMA does a very good job of fitting the time series, as shown by the blue dashed line, and is able to forecast the same pattern, as shown by the green dashed line at the end.
So it should be pretty easy to take an off-the-shelf ARIMA package and fit any time series, right? Not quite.
What we showed in the previous case was an ideal time series. Most time series that we see are not stable enough to model easily.
We would therefore now like to discuss some of the characteristics of the time series that we notice in our data.
As can be seen in the graph, time series can have anomalies: data points with values much higher than those around them, which show up as spikes on the graph. Anomalies can be both positive and negative, and can have a big impact on the model. Depending on where the anomalies are located, they can skew the model; a really spiky graph is likely to produce really bad forecasts.
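Our actual anomaly detector is an in-house algorithm (presented at USENIX HotCloud'14); purely as a stand-in, a crude detector can flag points that sit far from a robust center of the series:

```python
import statistics

def flag_anomalies(series, k=3.0):
    """Flag indices more than k standard deviations from the series
    median (the median resists the spikes better than the mean).
    A simple illustrative stand-in, not the in-house algorithm."""
    med = statistics.median(series)
    sd = statistics.pstdev(series)
    return [i for i, v in enumerate(series) if abs(v - med) > k * sd]
```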
Due to changes in the service profile, new features, etc., we can also have breakouts, where there is a sudden jump or drop in the metric over a sustained period of time. We observe different types of breakouts. For example, if there is a code deployment that makes the service more expensive and increases CPU utilization from 40% to 50%, we observe a mean shift. If, in another case, there is a new feature that is seeing increased adoption but has not stabilized yet, we might see a ramp-up.
Breakouts make it hard for ARIMA to detect trends and produce a stationary time series for forecasting.
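To make the mean-shift idea concrete, here is a toy detector that compares the means of adjacent windows; this is my illustrative proxy for breakout detection, not the algorithm we run in production, and the window and shift sizes are hypothetical:

```python
import statistics

def detect_mean_shift(series, window=7, min_shift=5.0):
    """Return an index at or near which the mean of the next `window`
    points jumps by more than `min_shift` versus the previous window,
    or None if no such shift exists. A crude proxy for the mean-shift
    breakouts described in the talk."""
    for i in range(window, len(series) - window + 1):
        before = statistics.mean(series[i - window:i])
        after = statistics.mean(series[i:i + window])
        if abs(after - before) > min_shift:
            return i
    return None
```

Once a shift is found, the fit would use only `series[i:]`, matching the observation that the post-breakout series models much better.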
One of the interesting things you notice in this graph is the change in the shape of the seasonality after the breakout. If we consider only the time series after the breakout, it yields a much better ARIMA model than the entire time series does.
For some time series we also see weak seasonality, as well as different seasonality patterns throughout the series. In the presence of weak seasonality, anomalies can have a strong effect on the model.
As I said before, the patterns that we see in the time series can be due to multiple reasons, including but not limited to the multiple deployments we do throughout the day, changes in traffic patterns, collection issues, etc.
We would now like to walk you through the process of fitting the ARIMA model in light of the issues described previously, using the example time series shown here.
As we can see, this is a very interesting case. We observe a breakout at the beginning of the graph, three different trend lines, a lot of anomalies, and weak seasonality patterns throughout the time series.
We can also see that ARIMA seems to fit the time series and seems aware of the trend and seasonality patterns, as shown by the blue dashed line, which plots the fitted data.
Let's see what the forecast looks like.
As you can see, not very good. Because of the multiple trends and anomalies, the model is not able to successfully de-trend the time series, and it gives us a flat linear forecast, which basically means it is not able to forecast this time series. At this point I would also like to introduce the idea of confidence bands: the space between the top blue dotted line and the bottom brown dotted line, with the green dotted line as the mean of the two. Since we are using a 95% confidence interval, the forecast tells us there is a 95% probability of the forecasted value lying between these two bands. As we can see here, the confidence band is very wide, so it is hard to make a good judgment on the quality of the forecast. The lower confidence band actually shows a downward trend here.
So how many Days In Green does this service have according to the forecast? DIG is the number of days until the mean forecast hits threshold T; its confidence band runs from the day the upper band hits T to the day the lower band does (or the end of the forecast horizon). If we assume each box in the grid is about 5 days, the DIG is 40, with a confidence band of 10-40 days, where 40 is the maximum number of forecast days.
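Reading DIG and its band off the three forecast curves can be sketched as follows (illustrative names and toy numbers, assuming the band is reported as pessimistic-to-optimistic days):

```python
def dig_with_band(mean_fc, upper_fc, lower_fc, threshold):
    """Return (DIG, (pessimistic, optimistic)): DIG from the mean
    forecast, the pessimistic bound from where the upper confidence
    band first crosses the threshold, and the optimistic bound from
    where the lower band does (or the forecast horizon if never)."""
    def cross(fc):
        for day, v in enumerate(fc, start=1):
            if v > threshold:
                return day - 1
        return len(fc)
    return cross(mean_fc), (cross(upper_fc), cross(lower_fc))
```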
So, in order to improve our model, we first apply our breakout algorithm to the raw time series. As we can see, the algorithm is able to detect a mean shift in the series, and we remove the data before the shift from the fit, as can be seen from the dashed fit line.
As we can see, this makes a huge difference in the forecast. The time series is now stable enough for ARIMA to de-trend it and create a stationary series. The seasonality is also visible in the forecast and its confidence bands, and the lower confidence band now shows an upward trend, unlike in the previous case. So in this case the service has 35 Days In Green, with a confidence band of 2-45.
However, we still see the limitations of the model fitting. It remains susceptible to anomalies, like the big anomaly at the start of the fitted time series. That anomaly actually flattens the forecast a little, and we also see a relatively wide confidence band.
We next apply our anomaly detection algorithm to the time series produced by the breakout step, and we observe that we have been able to eliminate the big anomaly. This reduces the weight of that data point on the model and gives more weight to the later trend in the time series, which narrows the confidence band and improves the accuracy of the forecast. In this case the service has 25 Days In Green, with a confidence band of 2-40.
We would now like to compare the DIG numbers for the three time series discussed earlier. The purple line at the bottom is the forecast for the raw time series, the magenta line in the middle is the forecast with breakout detection, and the dark blue line is the forecast with both breakout and anomaly detection.
The DIG number we report is the one forecast by the dark blue line, since it has the smallest confidence band, and we can see it has significantly improved the forecast over the other two time series. Based on the numbers shown previously, the service has 25 Days In Green, which is an improvement of 10 days over the DIG for the time series with just the breakout algorithm, and 15 days over the raw time series.
From a capacity perspective, in this case it would lead us to add capacity early, thus saving us from potential performance issues.
In other situations, if we report a larger DIG number, then it can lead to potential cost savings, since we know that we can add capacity at a later date.
And that, in my view, is the significance of DIG.
Next, we would like to look at some of the boundary conditions and limitations of this approach. One of the major boundary conditions we have found with ARIMA is false seasonality at the end of a time series, where ARIMA treats a change in pattern as seasonality and tries to forecast it. As we can see, the confidence band widens pretty quickly.
As we have mentioned before, the time series we see are not the most stable, and we make a best effort to remove breakouts and anomalies; however, the remaining time series may not provide a strong enough sense of seasonality or trend for the model to generate a good forecast. In these cases we may need to look at a longer time series.
Again, some time series are not stable enough to forecast, and breakout or anomaly detection may not be able to make them forecastable. Here we will need alternative methods of forecasting, which we are currently exploring.
We have already deployed DIG in production and are using it to monitor hundreds of services in a fully automated fashion. Our current algorithm uses CPU data from our services, but we will be extending it to other metrics as well. We are currently using it to monitor DR compliance across multiple data centers, to detect services that get close to the DR threshold, and to allocate capacity accordingly.
One of the interesting outcomes of this work has been the idea of Utilization Based Allocation, where we are trying to forecast when capacity is needed and how much. The idea is to forecast the value of the metric at a certain time in the future and use that forecasted value to calculate the expected capacity.
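As a back-of-the-envelope sketch of that idea (the function, its parameters, and the even-load assumption are all hypothetical simplifications):

```python
import math

def expected_capacity(current_hosts, forecast_util, threshold):
    """Utilization Based Allocation sketch: if the driving metric is
    forecast to reach forecast_util (%) on the current fleet, scale
    the host count so utilization lands back at the threshold.
    Assumes load spreads evenly across hosts."""
    needed = math.ceil(current_hosts * forecast_util / threshold)
    return max(needed, current_hosts)  # never plan below today's fleet
```

For example, 100 hosts forecast to hit 84% CPU against a 70% threshold would call for 120 hosts.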
To summarize the lessons we have learned: it is of paramount importance to check data fidelity with respect to the presence of anomalies and the absence of seasonality. Further, no forecasting model is perfect, so it is critical to assess the forecasting error and continuously refine the model, particularly when the incoming data is dynamic in nature, as is the case at Twitter.
Days In Green: Forecasting the Life of a Healthy Service @Twitter
Days In Green (DIG):
Forecasting the life of a healthy service
Vibhav Garg, Arun Kejariwal
Capacity and Performance Engineering @ Twitter
25% of total web usage 
Mobile data traffic: 81%, accelerating growth 
 http://www.kpcb.com/file/kpcb-internet-trends-2014 (May 2014) VG, AK 4
Capacity & Performance
• Organic growth
Over 255M monthly active users 
• Evolving product landscape
• Handle Peak Traffic
Mobile Busy Hour Is 66% Higher Than Average Hour in 2013, 83% by 2018
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white_paper_c11-520862.html
Systematic Capacity Planning
o Adversely impact user experience
o Adversely impacts bottom line
Check poor scalability
Adversely impact user experience
Systematic Capacity Planning (contd.)
Rapidly evolving product landscape
Changes services’ performance profile
• Scalable Approach
Service Oriented Architecture
100s of services
Millions of metrics [1,2]
http://strataconf.com/strata2014/public/schedule/detail/32431
DIG: Days in Green
Statistically determine the # of days for which a service is expected to stay healthy
Determine driving resource
Determine capacity threshold T
Generate a time series and forecast
DIG - # days before the service is expected to exceed T
• Determining Capacity Thresholds
Driving resource differs
Replay production traffic
CPU at 70%
Disk utilization at 80%
RPS at X requests/sec
• Time Series Analysis
• Long term forecast
o Which value?
• Close to the daily peak but low standard deviation (σ)
o Assume 7 day seasonality
o 30-90 days
Percentile Duration Mean σ
100 (Max) 57.7 3.29
99 14.4 mins 54.7 2.49
95 72 mins 53.1 2.4
• Model fitting
Captures trend well
Does not fit well for seasonal time series
No weightage to recent data
R2 = 0.56
• Model fitting
Fits better than linear, not good for forecasting
R2 = 0.62
• Model fitting
Widely used for curve fitting
Tend to overfit data
Not suitable for forecasting
Triple Exponential Smoothing (Holt Winters)
Good for fit and forecasting
Trend and seasonality modeled implicitly
• Auto-Regressive Integrated Moving Average
(p, d, q)
Explicitly models seasonality and trend
Applicable to non-stationary time series
Worst Case degenerates to linear fit
Moving Average component
Moving Average order
• Model Fitting
ARIMA in action
Captures underlying trend
Are we good? Not quite!
• Time Series Characteristics
• Time series characteristics
o Mean shift
o Ramp up
o Positive, Negative
• Time series characteristics
Various reasons (but not limited to)
Changes in traffic
• Curve fitting with ARIMA
Trend and seasonality aware
What does the DIG forecast look like?
• ARIMA Forecast
Not a good forecast because of multiple trends and anomalies
Wide confidence band
40 Days In Green with Confidence band of 10-40
• ARIMA Forecast with breakout(s) eliminated
35 Days In Green with a Confidence Band of 2-40
o Wide confidence band
o Susceptible to anomalies
• ARIMA Forecast with Breakout and Anomaly eliminated
25 Days In Green with a Confidence Band of 2-40
Narrow confidence band
• DIG Comparison
With breakout and anomaly detection
Raw - BO
Raw – BO- Anomaly
“Quality” of data: Poor forecasts
Idiosyncratic patterns: Poor forecasts
• Current Status – Deployed in Production
Hundreds of services
Fully automated for CPU, extending to other metrics
Combine data from multiple datacenters
Detect services that are close to DR threshold
• Future Work
Utilization Based Allocation
• Anomaly Detection
Algorithm developed in-house
Presented at USENIX HotCloud’14
Wrapping up & Lessons learned
• DIG: Days In Green
Proactively assess future health of a service
Modeling and forecasting: ARIMA
Anomaly and Breakout removal
Hard to get a stable time series
Organic growth, New products, Behavioral aspect
Exploring advanced data cleansing techniques
Improve Breakout and Anomaly Detection
• Piyush Kumar, Capacity Engineer
• Winston Lee, Capacity Engineer
• Owen Vallis Jr & Jordan Hochenbaum, Ex Interns
• Nicholas James, Intern
• Management team
Join the Flock
• We are hiring!!
Contact us: @ativilambit, @arun_kejariwal
Like problem solving? Like challenges? Be at the cutting edge. Make an impact.