This presentation describes how Manulife improved the efficiency and reliability of their Pivotal Cloud Foundry (PCF) platforms. Key changes included implementing a scheduler to stop non-critical apps on weekends, switching from the internal to an external blobstore, moving Diego cell VMs to more cost-effective types, and tuning various foundation configurations. These changes resulted in estimated annual savings of ~$40,000 from scheduling, ~$21,500 from the external blobstore, and ~$1.06 million from the Diego cell and foundation changes, for total savings of roughly $1 million per year, well above the originally projected $730,000.
How Manulife Saved Over $730K by Improving PCF Efficiency and Reliability
1. Start Counting
How we unlocked platform efficiency and reliability, while saving over $730,000
David Wu, Ph.D - Senior Staff Solutions Architect - VMware Tanzu Labs
David Filippelli - Lead Site Reliability Engineer - Manulife
Alvin Kwame Coch - Senior Site Reliability Engineer - Manulife
2nd September 2021
(On-slide graphic: "$1 Million")
2. Content
- PCF @ Manulife
- Efficiency vs Reliability
- Improving Efficiency & Reliability
- Tallying The Savings
- References
- Acknowledgements
- Q & A
3. PCF @ Manulife
- 7 PCF Foundations in Azure North America and Azure Asia
- Total: ~8,000 application instances (AIs) and 540 Diego cells across all foundations
- Problem Statement:
- How can we improve the efficiency and reliability of our platforms while saving money?
4. Efficiency vs Reliability
- Efficiency - Make the platform run better, making platform engineers' lives easier
- Optimize resources used
- Reduce cost and time to do something
- Improve monitoring strategies
- Reliability - Deliver an exceptional customer first experience
- Increase application uptime
- Increase service availability
- Improve Recovery Point Objective (RPO)
5. Improving Efficiency & Reliability
- Diego Cell Scheduler
- Switching TAS Internal blobstore to External blobstore
- Changing and Tuning Diego Cell VM types
- Foundation Configuration Tuning
6. Improving Efficiency & Reliability – Diego Cell Scheduler
- Manulife Scheduler App
- Developer self-service to subscribe which apps to stop/start and when (day of week and time)
- Incentive for developers: save costs on chargeback
7. Improving Efficiency & Reliability - Diego Cell Scheduler
- Dev and Sandbox environments are not fully utilized on weekends, from Friday evening until
Monday morning.
- Costs are incurred by both app teams and the platform team. How can we save money?
- A Concourse pipeline queries Diego cell memory utilization after the apps are stopped, to determine how
many Diego cells to scale down by. The original Diego cell count is restored via pipeline on
Monday morning.
(On-slide chart: Diego cell count vs. day of week, dipping each weekend and restored every Monday.)
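The scale-down arithmetic in such a pipeline can be sketched as a small shell function. This is an illustrative sketch, not Manulife's actual pipeline; the function name, the 20% headroom default, and the 3-cell HA floor are assumptions.

```shell
#!/bin/bash
# Sketch: given total memory in use (MB) across remaining apps and the
# usable memory per Diego cell (MB), compute how many cells to keep,
# including a headroom percentage. All names/values are illustrative.
cells_to_keep() {
  local used_mb=$1 per_cell_mb=$2 headroom_pct=${3:-20}
  # Inflate usage by the headroom, then round up to whole cells.
  local needed_mb=$(( used_mb * (100 + headroom_pct) / 100 ))
  local cells=$(( (needed_mb + per_cell_mb - 1) / per_cell_mb ))
  # Always keep at least 3 cells (e.g. one per AZ) for HA.
  (( cells < 3 )) && cells=3
  echo "$cells"
}

# Example: 180 GB still in use after weekend stops, ~25 GB usable per cell.
cells_to_keep $(( 180 * 1024 )) $(( 25 * 1024 ))
```

In a real pipeline the `used_mb` input would come from the Diego cell memory-utilization query, and the result would feed the BOSH/Ops Manager scale-down step.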
8. Improving Efficiency & Reliability - Internal blobstore to External blobstore
- Using internal blobstore == NFS running on VM
- Cost of VM+persistent disk > cost of Azure blob storage (GRS, us-east1, hot) [1,2,3]
- Decrease platform upgrade times while increasing availability
- NFS VM is not HA [4]
- Large NFS persistent disks leads to long upgrade times
- Long upgrade times lead to dev outages and potential issues with auto scaling/healing.
- Decrease backup time and reduce outage (cloud controller [CC] lock)
- e.g. Sandbox: 3 hrs backup, 1hr 14 min CC lock → 13 min backup, 1 min CC lock
- Locked CC leads to dev outages (no app pushes, no delete, no autoscale)
- Possible to do more backups to meet or improve the Recovery Point Objective (RPO)
9. Improving Efficiency & Reliability - Internal blobstore to External blobstore
(On-slide timeline of the migration:)
- Prepare blob storage and required processes:
- 4 x blob storage containers: Buildpacks, Images, Droplets, Resources
- Azure service endpoints speed up access to blob storage
- Set up firewalls; ensure the CC and Windows Diego cells (for buildpacks) can access blob storage
- First migration, 5 days before switchover:
- Use rclone [5] to transfer blob objects from the NFS internal blobstore in PCF
- Check for routing issues and slow transfer speeds
- Incremental migration, 1 day before switchover:
- Copy only newer blobs; this will be much faster
- Then perform the full switchover
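The rclone passes for the four containers might look like the sketch below. The NFS blobstore path, the directory naming, and the `azureblob` remote name are assumptions (configure the remote first with `rclone config`); `rclone sync` only transfers new or changed objects, which is what makes the incremental pass 1 day before switchover so much faster.

```shell
#!/bin/bash
# Sketch of the rclone-based migration. Paths, directory names, and the
# remote name are assumptions, not the authors' actual configuration.
# DRY_RUN=1 prints the commands instead of running them.
SRC_ROOT="/var/vcap/store/shared"   # NFS internal blobstore (assumed path)
DEST_REMOTE="azureblob"             # rclone remote for Azure blob storage
CONTAINERS="buildpacks images droplets resources"

migrate() {
  local c
  for c in $CONTAINERS; do
    # 'sync' only transfers new/changed blobs, so re-runs are incremental.
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "rclone sync ${SRC_ROOT}/cc-${c} ${DEST_REMOTE}:${c}"
    else
      rclone sync "${SRC_ROOT}/cc-${c}" "${DEST_REMOTE}:${c}" --transfers 16
    fi
  done
}

DRY_RUN=1 migrate
```

The same invocation serves both the first migration and the incremental one; only the transfer volume differs.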
10. Improving Efficiency & Reliability - Internal blobstore to External blobstore
Performing the full switchover (on-slide diagram: NFS, CC, and the 4 x blob storage containers;
Azure service endpoints speed up access to blob storage):
1. Conduct cf push performance timing tests* and record the results. Use candidate apps on the
platform, e.g. Linux + Windows. Do at least 2 cf pushes for cache.
2. Lock down the Cloud Controller using the cf CLI. No app pushes will be possible.
3. Perform a final copy migration from NFS to the blob storage containers.
4. Take a snapshot of the NFS persistent disk. (The original NFS disk will remain orphaned for
5 days after switchover.)
5. Configure and apply changes to the PCF foundation for the switchover.
6. Perform at least 2 performance tests and compare the results with step 1. Check for network issues.
7. Unlock the Cloud Controller using the cf CLI.
* https://github.com/dawu415/PCFToolkit/tree/master/tests/cfpush
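The timing tests in steps 1 and 6 can be scripted roughly as below; the linked PCFToolkit repo contains the authors' real harness, so treat this as a minimal sketch. The app name and push count are placeholders.

```shell
#!/bin/bash
# Sketch: time N cf pushes and report the median duration in seconds.
# The app name and count are placeholders; see the PCFToolkit repo
# linked on the slide for the authors' actual test harness.
median() {
  # Median of whitespace-separated numbers on stdin.
  tr ' ' '\n' | sort -n | awk '{a[NR]=$1} END {
    if (NR % 2) print a[(NR+1)/2];
    else print (a[NR/2] + a[NR/2+1]) / 2 }'
}

time_pushes() {
  local n=$1 app=$2 times="" start end
  for _ in $(seq "$n"); do
    start=$(date +%s)
    cf push "$app" >/dev/null        # first push warms the cache
    end=$(date +%s)
    times="$times $((end - start))"
  done
  echo $times | median   # unquoted to collapse the leading space
}

# Usage (run before and after switchover, then compare medians):
#   time_pushes 3 timing-app
```

Comparing medians rather than single runs smooths out cache effects, which is why the slide calls for at least 2 pushes per test.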
11. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
- Optimize utilization of VM resources. e.g., memory & disk
- Memory optimized VMs - Esv3 series [6]
- Cost of Dsv2 (default) > cost of Esv3 VMs [1]
- Reduce disk space use
- Potential to decrease VM count
- More VM memory in Esv3: Dsv2 - 28 GB RAM vs Esv3 - 32 GB RAM
- Increased app density per cell (don’t over do it)
- Improve reliability by tuning diego cells
- Ensure cells have sufficient resources to support current needs and future growth
12. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
Tuning diego cells:
- Questions we want quantifiable answers to
- How many diego cells do I need to fit x AIs?
- Are we under-provisioned or appropriately over-provisioned on diego cells?
- What’s the minimum disk space per cell? Azure charges based on upper tier.
- Information and statistics we need to know
- What else is running on the cells, and how much memory does it need? e.g. anti-virus
- How many AIs, average Memory per AI used and disk quota per AI used
- Can use cf api to get a snapshot of raw information.
- cf applist script from Rakutentech (https://github.com/rakutentech/cf-tools/blob/master/cf-applist.sh)
./cf-applist.sh -s Instances -f Name,State,Instances,Memory,Disk_quota > applist_<env>.txt
13. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
- CF capacity behaviour in TAS container and architecture docs [7,8]
- Use our capacity planning spreadsheet:
→ Worksheets at https://github.com/dawu415/PCFToolkit
→ Get all AI memory used and disk quota raw snapshot
→ Get average AI memory and disk quota used. Values will be in MB.
→ Divide by 1024 to get GB and input into the worksheet.
- Capacity information:
- Build capacity monitoring dashboards and alerts
- Understand usage behaviours and do simulations
e.g. what if 25% of app instances switch to higher disk use?
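The averaging step can be done with a short awk pass over the applist snapshot. The column layout below (instances in column 3, memory-per-AI in MB in column 4) is an assumption about the output format; adjust the field numbers to the actual file.

```shell
#!/bin/bash
# Sketch: compute average memory per AI (in GB) from an applist snapshot.
# Assumes one app per line with instance count in column 3 and per-AI
# memory in MB in column 4 -- adjust to the real cf-applist.sh layout.
avg_ai_memory_gb() {
  awk '{ ai += $3; mem += $3 * $4 }
       END { if (ai > 0) printf "%.2f\n", mem / ai / 1024 }' "$1"
}

# Example with a fabricated snapshot file:
cat > /tmp/applist_sample.txt <<'EOF'
app-a started 2 1024
app-b started 4 2048
EOF
avg_ai_memory_gb /tmp/applist_sample.txt
```

The resulting GB figure is what feeds the capacity-planning worksheet; the same pattern works for disk quota by switching the column.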
14. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
Additional things to consider:
- VM Disk and Memory Capacity
- For on-prem and Availability Zone (AZ) customers:
- Ensure 1/N extra IaaS memory and disk capacity to cover the failure of 1 AZ in an N-AZ setup.
- Factor additional resource per cell reserved for addons
- Warning level in your monitoring.
- CPU vs Memory
- If possible, avoid packing too many AIs into a single diego cell.
- Too many means longer evacuation time leading to longer recovery and VM update times.
- Too many can also hinder application performance, since CPU share is allocated based on AI memory.
15. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
Changing VM Types:
- Ensure you have historical metrics for CPU, memory and disk bandwidth usage
- > 30 days is preferable, covering seasonal periods of heavy usage
- IaaS level may have historical data to help with this.
- If not enough data, can use blue-green algorithm.
- Does the new VM type support your needs?
- Correlate heavy usage peak metrics to
- CPU load
- Memory utilization and Disk IOPS
- Network egress bandwidth
- Write disk IOPS test scripts using FIO [9] to verify. An empty BOSH VM can be provisioned to run the tests.*
*https://www.starkandwayne.com/blog/hey-bosh-gimme-a-vm/
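A FIO check of a candidate VM type's disk might look like the following; the job parameters (4k random write, 60 s, 4 GB file, libaio) are illustrative choices, not the authors' actual test, and the target directory assumes a BOSH VM's ephemeral disk mount.

```shell
#!/bin/bash
# Sketch of an FIO random-write IOPS test for a candidate Diego cell disk.
# Job parameters are illustrative; run on an empty BOSH VM of the
# candidate type and compare IOPS/latency against the current type.
TARGET_DIR=${1:-/var/vcap/data}   # ephemeral disk mount on a BOSH VM

fio_args() {
  echo "--name=diego-cell-iops --directory=$TARGET_DIR \
--rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
--iodepth=32 --numjobs=4 --size=4G --runtime=60 --time_based \
--group_reporting"
}

run_iops_test() {
  fio $(fio_args)
}

# run_iops_test   # uncomment on the test VM
```

Running the identical job on the old and new VM types gives a like-for-like IOPS and latency comparison before committing to the switch.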
16. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
- How do we actually change the VM type?
- Change VM type in operations manager/platform automation configuration
- Opsman 2.10.3 supports new generation of Azure VM types (without Availability Sets) [10]
- Otherwise add these in automation/manual via opsman api
- But wait… the IaaS might not allow the change, or may constrain what can be changed
- Azure: Cannot switch Availability Sets(AS) sitting in old HW clusters.
- Switch AS only possible when 1 VM in it. e.g. diego cell AS, only 1 diego cell VM.
- Before using the algorithm, check your
- IaaS VM quotas and subnets have capacity
- Capacity for at least double your current TAS Diego cell count plus all existing iso seg Diego cells.
- Firewall is setup.
- We used the same subnet as the existing Diego cells to avoid creating new firewall rules, since the setup is
temporary for us.
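Setting the VM type through the Ops Manager API can be outlined with `om curl` as below. The GUIDs are placeholders and the payload shape should be copied from a GET of the same `resource_config` endpoint first, so treat this as a sketch rather than a recipe.

```shell
#!/bin/bash
# Sketch: set the Diego cell instance type via the Ops Manager API.
# PRODUCT_GUID/JOB_GUID are placeholders -- look them up via
# /api/v0/staged/products and .../jobs; copy the full payload shape
# from a GET of the resource_config endpoint before PUTting.
PRODUCT_GUID="cf-0123456789abcdef"      # placeholder
JOB_GUID="diego_cell-0123456789abcdef"  # placeholder
NEW_VM_TYPE="Standard_E4s_v3"

resource_config_payload() {
  printf '{"instance_type": {"id": "%s"}}' "$NEW_VM_TYPE"
}

apply_vm_type() {
  om curl \
    --path "/api/v0/staged/products/${PRODUCT_GUID}/jobs/${JOB_GUID}/resource_config" \
    --request PUT \
    --data "$(resource_config_payload)"
}

# apply_vm_type   # then apply changes via om/platform automation to deploy
```

Nothing takes effect until changes are applied, which is what makes this safe to stage alongside the blue/green switchover described on the next slide.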
17. Improving Efficiency & Reliability - Changing and Tuning Diego Cell VM types
- Zero Downtime Blue/Green Diego cell VM Switchover algorithm
- Use isolation segments to extend diego pool (replicate tile with unique name).
- Share same GoRouters as TAS.
- Leave the segment name blank to ‘extend’ the pool of TAS diego cells [11]
- Use the segment name to extend the pool of an existing iso seg diego cell
(On-slide diagram, with thanks to D. Stevenson for the initial discussion and idea:)
1. Initial: TAS Diego cells on Dsv2 (blue - current running apps).
2. Set up an iso seg tile with Esv3 Diego cells extending the TAS Diego pool (white - no running apps).
3. Scale down the TAS Diego cells; apps auto-relocate to the new Esv3 cells (green - running apps in new
Esv3). Adjust max-in-flight for Diego cells to speed up the scale down, but beware of and monitor BBS load.
4. With the TAS Diego cell count at 1, convert the TAS Diego cell VM type to Esv3, then scale the TAS
Diego cells back up.
5. Scale down the iso seg Diego cells to 0; apps auto-relocate to the new Esv3 TAS Diego cells.
6. Delete the iso seg tile. The TAS Diego cells are now switched to Esv3.
18. Improving Efficiency & Reliability - Foundation Configuration Tuning
What we are trying to achieve:
- Optimize utilization of VM resources.
- Decrease VM count where resources are over-provisioned
- Increase VM count to improve platform health/HA
19. Improving Efficiency & Reliability - Foundation Configuration Tuning
- Get your monitoring ready
- Will need to monitor platform metrics to help guide decisions
- Build dashboards based on KPIs [12, 13, 14, 15]
- Cloud Controller KPIs
- BBS CPU Load and Memory, Diego Cells: CPU load, memory, disk capacity
- UAA
- Loggregator: Doppler, Traffic controller and nozzles
- Overall platform health
- MySQL Server: CPU Load, Memory and Disk
- Go Router: CPU Load, Requests Per Second
20. Improving Efficiency & Reliability - Foundation Configuration Tuning
- Table out the Changes
- Maintain a record of where you are vs where you want to be
- Maintain notes and comments of decisions
- Helps review cost benefit decisions on a bigger picture
Sample only, not
real data
Get worksheet at https://github.com/dawu415/PCFToolkit
21. Improving Efficiency & Reliability - Foundation Configuration Tuning
- General VM metrics [12]
- CPU load < 80% - 90% ( < 60% GoRouter )
- Memory Utilization < 80%
- Persistent Disk < 80%-90%
22. Improving Efficiency & Reliability - Foundation Configuration Tuning
- Loggregator [15]
- Understand the loggregator architecture and use the loggregator guide to assist [16]
- Doppler maximum effectiveness using horizontal VM scaling is 40 VMs (v1 + v2 configuration)
- Maintain a 2:1:1 ratio (doppler : traffic controller : nozzle)
- Check for dropped messages in doppler, connection loss of TC and high resource utilization of
nozzles. Should be part of monitoring
(On-slide diagram: Loggregator Agents → Dopplers → Traffic Controllers/RLP → Nozzles.)
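The 2:1:1 rule above can be expressed as a tiny sizing helper: given the doppler count, it returns the traffic controller and nozzle counts to pair with it. Purely illustrative; the rounding-up behavior for odd doppler counts is an assumption.

```shell
#!/bin/bash
# Sketch: apply the 2:1:1 doppler : traffic-controller : nozzle ratio.
# Rounds up so an odd doppler count still gets enough TCs/nozzles.
loggregator_sizing() {
  local dopplers=$1
  local tcs=$(( (dopplers + 1) / 2 ))   # ceil(dopplers / 2)
  echo "dopplers=$dopplers traffic_controllers=$tcs nozzles=$tcs"
}

loggregator_sizing 10
```

This keeps scaling decisions mechanical while the monitoring (dropped messages, TC connection loss, nozzle utilization) tells you when to increase the doppler count in the first place.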
24. Improving Efficiency & Reliability - Foundation Configuration Tuning
- Cloud Controller [14]
- CPU load < 80% - 90% ( < 60% GoRouter )
- Memory Utilization < 80%
- Persistent Disk < 80% - 90%
- Determine your usage pattern plus headroom and simulate it with parallel cf pushes in a
script. https://github.com/dawu415/PCFToolkit/tree/master/tests/cfpush
- Exercises the Cloud Controller API when you can't really test it in dev. Run in parallel: time cf push, time cf
delete. Review Cloud Controller metrics. Abnormally slow response times indicate scaling issues
with the CC or its workers.
- Review and integrate changes into dev environment. Monitor cloud controller metrics.
- Understand the nozzles that query the CC API to get app names to insert into log.
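The parallel-push simulation can be structured like the sketch below; the PCFToolkit repo linked above holds the authors' actual scripts. The helper launches N copies of any command in the background and reports the wall time, so the same function drives both `cf push` and `cf delete`; the app name prefix is a placeholder.

```shell
#!/bin/bash
# Sketch: run N copies of a command in parallel and report wall time.
# Used with 'cf push'/'cf delete' to exercise the Cloud Controller API;
# the app-name prefix is a placeholder.
run_parallel() {
  local n=$1; shift
  local start=$(date +%s) i
  for i in $(seq "$n"); do
    "$@" "load-test-app-$i" &        # each copy gets a unique app name
  done
  wait
  echo "$(( $(date +%s) - start ))"  # elapsed seconds
}

# Usage against a dev foundation:
#   run_parallel 10 cf push       # then watch CC CPU/memory and latency
#   run_parallel 10 cf delete -f
```

Comparing elapsed time and CC metrics across increasing values of N surfaces the point at which the Cloud Controller or its workers need scaling.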
26. Tallying the Savings

Change                            Savings               Notes
Diego cell scheduling             ~$40,000.00 p.a.      ~30 VMs/foundation deleted and recreated per weekend
Switching blobstore to external   ~$21,500.00 p.a.
Switch Diego cell VMs + tuning    ~$1.06 million p.a.   Savings from 2 North America foundations
Total                             ~$1 million p.a.      Originally calculated to be ~$730k, but actual ~$1 million
28. Acknowledgements
We would like to thank the following people for their involvement and efforts to make things happen:
- Piotr Chomiak
- Richard Garro
- Dan Buchko
- John Calabrese
- David Stevenson
- Michael Chung
- Kelvin Li
- Jonathan Leung
- Lok Wong
- Haydon Ryan
- John Tan