SlideShare a Scribd company logo
1 of 63
Download to read offline
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How Netflix Tunes
EC2 Instances for
Performance
B r e n d a n G r e g g , P e r f o r m a n c e a n d O S E n g i n e e r i n g T e a m
C M P 3 2 5
N o v e m b e r 2 8 , 2 0 1 7
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix performance and operating systems team
•  Evaluate technology
-  Instance types, Amazon Elastic Compute Cloud (EC2) options
•  Recommendations and best practices
-  Instance kernel tuning, assist app tuning
•  Develop performance tools
-  Develop tools for observability and analysis
•  Project support
-  New database, programming language, software change
•  Incident response
-  Performance issues, scalability issues
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1.  Instance selection
2.  Amazon EC2 features
3.  Kernel tuning
4.  Methodologies
5.  Observability
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Warnings
•  This is what’s in our medicine cabinet
•  Consider these “best before: 2018”
•  Take only if prescribed by a performance engineer
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Instance selection
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The Netflix cloud
Many application workloads: Compute, storage, caching…
S3
EC2
Cassandra
EVCache
Applications
(services)
ELB
Elasticsearch
SQSSES
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix AWS environment
•  Elastic Load Balancing
allows real load testing
1.  Single instance canary, then,
2.  Auto scaling group
•  Much better than micro-
benchmarking alone, which
is error prone
…
ASG-v011
…
ASG-v010
ASG Cluster
prod1
Canary
ELB
Instance
Instance
Instance
Instance
Instance
Instance
Instance
Instance
Instance
Instance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Current generation instances
•  Families:
-  m4: General purpose
•  Balanced
-  c5: Compute-optimized
•  Latest CPUs, lowest price/compute perf
-  i3, d2: Storage-optimized
•  SSD large capacity storage
-  r4, x1: Memory optimized
•  Lowest cost/Gbyte
-  p2, g3, f1: Accelerated computing
•  GPUs, FPGAs…
•  Types: Range from medium to 16x large+, depending on family
•  Netflix uses over 30 different instance types
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix instance type selection
A.  Flow chart
B.  By-resource
C.  Brute force
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
A. Instance selection flow chart
Start
i3
Need large
disk capacity?
Disk I/O
bound?
Can
cache?
Select memory to
cache working set
Find best
balance
d2
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
B. By-resource approach
1.  Determine bounding resource
-  E.g.: CPU, disk I/O, or network I/O
-  Found using:
•  Estimation (expertise)
•  Resource observability with an existing real workload
•  Resource observability with a benchmark or load test (experimentation)
2.  Choose instance type for the bounding resource
-  If disk I/O, consider caching, and a memory-optimized type
-  We have tools to aid this choice: Nomogram Visualization
This focuses on optimizing a given workload
More efficiency can be found by adjusting the workload to suit instance types
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nomogram Visualization tool
2. Select
resources
3. From any
resource,
see types
and cost
1. Select
instance
families
(cost redacted)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
C. Brute force choice
1.  Run load test on ALL instance types
-  Optionally, different workload configurations as well
2.  Measure throughput
-  And check for acceptable latency
3.  Calculate price/performance for all types
4.  Choose most efficient type
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Latency requirements
•  Check for an acceptable latency distribution when
optimizing for price/performance
Headroom UnacceptableAcceptable
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix instance type re-selection
A.  Usage
B.  Cost
C.  Variance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
A. Instance usage
•  Older instance types can be identified, analyzed, and upgraded
to newer types
Types
(redacted)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
B. Instance cost
•  Also checked regularly. Tuning the price in price/perf.
Cost per hour
Breakdowns
Details (redacted)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
C. Instance variance
•  An instance type may be resource-constrained only occasionally,
or after warmup, or a code change
•  Continually monitor performance, analyze variance/outliers
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. Amazon EC2 features
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
EC2 virtualization
slide updated after talk. see: http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Networking SR-IOV
•  AWS "enhanced networking"
-  Uses SR-IOV: Single Root I/O Virtualization
-  PCIe device provides virtualized instances
-  Some instance types, VPC only
•  "Bare metal" network access
-  Higher network throughput, reduced RTT and jitter
-  ixgbe driver types: Up to 10 Gbps
-  ena driver types: Up to 25 Gbps
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Storage SR-IOV
•  New in 2017, first used by i3s
•  Should be called "enhanced storage"
-  Some instance types only
-  Accesses NVMe attached storage (faster transport than SATA)
-  Uses VT-d for I/O virtualization
•  "Bare metal" disk access
-  i3.16xl can exceed 3 million IOPS
https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3. Kernel tuning
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kernel tuning
•  Typically 1-30% wins, for average performance
-  Adds up to significant savings for the Netflix cloud
•  Bigger wins when reducing latency outliers
•  Deploying tuning:
-  Generic performance tuning is baked into our base AMI
-  Experimental tuning is a package add-on (nflx-kernel-tunables)
-  Workload-specific tuning is configured in application AMIs
-  Remember to tune the workload with the tunables
•  We run Ubuntu Linux
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tuning targets
1.  CPU scheduler
2.  Virtual memory
3.  Huge pages
4.  NUMA
5.  File System
6.  Storage I/O
7.  Networking
8.  Hypervisor (Xen)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. CPU scheduler
•  Tunables:
-  Scheduler class, priorities, migration latency, tasksets…
•  Usage:
-  Some apps benefit from reducing migrations using taskset(1), numactl(8),
cgroups, and tuning sched_migration_cost_ns
-  Some Java apps have benefited from SCHED_BATCH, to reduce context
switching. E.g.:
# schedtool –B PID
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. Virtual memory
•  Tunables:
-  Swappiness, overcommit, OOM behavior…
•  Usage:
-  Swappiness is set to zero to disable swapping and favor ditching the file
system page cache first to free memory. (This tunable doesn’t make much
difference, as swap devices are usually absent.)
vm.swappiness = 0 # from 60
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3. Huge pages
•  Tunables:
-  Explicit huge page usage, transparent huge pages (THPs)
-  Using 2 or 4 Mbytes, instead of 4k, should reduce various CPU overheads and
improve MMU page translation cache reach
•  Usage:
-  THPs (enabled in later Ubuntu kernels) depending on the workload and CPUs,
sometimes improve perf on HVM instances (~5% lower CPU), but sometimes
hurt perf (~25% higher CPU during %usr, and more during %sys refrag)
-  We switched it back to madvise:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
4. NUMA
•  Tunables:
-  NUMA balancing
•  Usage:
-  On multi-NUMA systems (largest instances) and earlier kernels (around 3.13),
NUMA page rebalance was too aggressive, and could consume 60% CPU alone.
-  We disable it. Will re-enable/tune later.
kernel.numa_balancing = 0
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
5. File system
•  Tunables:
-  Page cache flushing behavior, file system type and its own tunables (e.g., ZFS
on Linux)
•  Usage:
-  Page cache flushing is tuned to provide a more even behavior: Background
flush earlier, aggressive flush later
-  Access timestamps disabled, and other options depending on the FS
vm.dirty_ratio = 80 # from 40
vm.dirty_background_ratio = 5 # from 10
vm.dirty_expire_centisecs = 12000 # from 3000
mount -o defaults,noatime,discard,nobarrier …
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
6. Storage I/O
•  Tunables:
-  Read ahead size, number of in-flight requests, I/O scheduler, volume stripe
width…
•  Usage:
-  Some workloads, e.g., Cassandra, can be sensitive to read ahead size
-  SSDs can perform better with the “noop” scheduler (if not default already)
-  Tuning md chunk size and stripe width to match workload
/sys/block/*/queue/rq_affinity 2
/sys/block/*/queue/scheduler noop
/sys/block/*/queue/nr_requests 256
/sys/block/*/queue/read_ahead_kb 256
mdadm –chunk=64 ...
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
7. Networking
•  Tunables:
-  TCP buffer sizes, TCP backlog, device backlog, TCP reuse…
•  Usage:
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_abort_on_overflow = 1 # maybe
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
8. Hypervisor (Xen)
•  Tunables:
-  PV/HVM (baked into AMI)
-  Kernel clocksource. From slow to fast: hpet, xen, tsc
•  Usage:
-  We’ve encountered a Xen clocksource regression in the past (Ubuntu Trusty).
Fixed by tuning clocksource to TSC (although beware of clock drift).
-  Best case example (so far): CPU usage reduced by 30%, and average app
latency reduced by 43%.
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
4. Methodologies
Techniques of performance analysis
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. RPS, CPU 2. Volume
6. Load avg
3. Instances 4. Scaling
5. CPU/RPS
7. Java heap 8. ParNew
9. Latency 10. 99th tile
Checklists: e.g., Netflix perf vitals
dashboard
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analysis perspectives
Application
System libraries
System calls
Kernel
Devices
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
USE Method
•  For every hardware and
software resource, check:
1.  Utilization
2.  Saturation
3.  Errors
•  Resource constraints show as saturation or high utilization
-  Resize or change instance type
-  Investigate tunables for the resource
•  The USE Method poses questions to answer
Resource
utilization
(%)X
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
On-CPU and off-CPU analysis
State	transi*on	diagram	
Can be analyzed using:
•  On-CPU: Sampling
•  Off-CPU: Scheduler tracing
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
5. Observability
Finding, quantifying, and confirming tunables
Discovering system wins (5-25%’s) and application wins (2-10x’s)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Statistical tools
•  vmstat, pidstat, sar, etc., used mostly normally
$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b) 08/18/2014 _x86_64_ (8 CPU)
09:10:43 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
09:10:44 PM lo 14.00 14.00 1.34 1.34 0.00 0.00 0.00
09:10:44 PM eth0 4114.00 4186.00 4537.46 28513.24 0.00 0.00 0.00
09:10:43 PM active/s passive/s iseg/s oseg/s
09:10:44 PM 21.00 4.00 4107.00 22511.00
09:10:43 PM atmptf/s estres/s retrans/s isegerr/s orsts/s
09:10:44 PM 0.00 0.00 36.00 0.00 1.00
[…]
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Host perf analysis in 60s
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
1.  	uptime
2.  	dmesg | tail
3.  	vmstat 1
4.  	mpstat -P ALL 1
5.  	pidstat 1
6.  	iostat -xz 1
7.  	free -m
8.  	sar -n DEV 1
9.  	sar -n TCP,ETCP 1
10.  	top
load	averages	
kernel	errors	
overall	stats	by	*me	
CPU	balance	
process	usage	
disk	I/O	
memory	usage	
network	I/O	
TCP	stats	
check	overview
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
System profilers
•  perf
-  Standard Linux profiler. In the Linux source tree.
-  Interval sampling, CPU performance counter events.
-  User and kernel static and dynamic tracing.
•  perf CPU flame graphs:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 49 -ag -- sleep 30
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Java
(Broken stacks:
No frame pointer)
Kernel
(C)
JVM
(C++)
AWS re:Invent
2014
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Java
Kernel
(C)
JVM
(C++)
User
(C)
AWS re:Invent
2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tracing Tools: ftrace
•  Part of the Linux kernel
-  First added in 2.6.27 (2008), and enhanced in later releases
-  Already available in all Netflix Linux instances
•  Front-end tools aid usage: perf-tools collection
-  https://github.com/brendangregg/perf-tools
-  Unsupported hacks: see WARNINGs
-  Also see the trace-cmd front-end, as well as perf
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ftrace tool: iosnoop
# /apps/perf-tools/bin/iosnoop –ts
Tracing block I/O. Ctrl-C to end.
STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms
5982800.302061 5982800.302679 supervise 1809 W 202,1 17039600 4096 0.62
5982800.302423 5982800.302842 supervise 1809 W 202,1 17039608 4096 0.42
5982800.304962 5982800.305446 supervise 1801 W 202,1 17039616 4096 0.48
5982800.305250 5982800.305676 supervise 1801 W 202,1 17039624 4096 0.43
[…]
# /apps/perf-tools/bin/iosnoop –h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
-d device # device string (eg, "202,1)
-i iotype # match type (eg, '*R*' for all reads)
-n name # process name to match on I/O issue
-p PID # PID to match on I/O issue
-Q # include queueing time in LATms
-s # include start time of I/O (s)
-t # include completion time of I/O (s)
[…]
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tracing tools: perf
# perf record –e skb:consume_skb –ag -- sleep 10
# perf report
[...]
74.42% swapper [kernel.kallsyms] [k] consume_skb
|
--- consume_skb
arp_process
arp_rcv
__netif_receive_skb_core
__netif_receive_skb
netif_receive_skb
virtnet_poll
net_rx_action
__do_softirq
irq_exit
do_IRQ
ret_from_intr
[…]
Summarizing stack traces for a
tracepoint
perf can do many things, it is
hard to pick just one example
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tracing tools: BPF
•  Enhanced Berkeley Packet Filter (BPF, aka eBPF)
-  Safe, efficient, advanced, production tracing. Best on Linux 4.9+.
BPF
bytecode
Observability Program Kernel
tracepoints
kprobes
uprobes
BPF
maps
per-event
data
statistics
verifier
output
static tracing
dynamic tracing
async
copy
perf events
sampling, PMCs
BPF
program
event config
attach
load
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
BPF: tcplife
# /usr/share/bcc/tools/tcplife
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
2509 java 100.82.34.63 8078 100.82.130.159 12410 0 0 5.44
2509 java 100.82.34.63 8078 100.82.78.215 55564 0 0 135.32
2509 java 100.82.34.63 60778 100.82.207.252 7001 0 13 15126.87
2509 java 100.82.34.63 38884 100.82.208.178 7001 0 0 15568.25
2509 java 127.0.0.1 4243 127.0.0.1 42166 0 0 0.61
12030 upload-mes 127.0.0.1 34020 127.0.0.1 8078 11 0 3.38
12030 upload-mes 127.0.0.1 21196 127.0.0.1 7101 0 0 12.61
3964 mesos-slav 127.0.0.1 7101 127.0.0.1 21196 0 0 12.64
12021 upload-sys 127.0.0.1 34022 127.0.0.1 8078 372 0 15.28
2509 java 127.0.0.1 8078 127.0.0.1 34022 0 372 15.31
2235 dockerd 100.82.34.63 13730 100.82.136.233 7002 0 4 18.50
2235 dockerd 100.82.34.63 34314 100.82.64.53 7002 0 8 56.73
[...]
Dynamic tracing of TCP set state only; does not trace send/receive
https://github.com/iovisor/bcc includes other TCP tools
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hardware counters
•  Model Specific Registers (MSRs)
-  Basic details: Timestamp clock, temperature, power
-  Some are available in Amazon EC2
•  Performance Monitoring Counters (PMCs)
-  Advanced details: Cycles, stall cycles, cache misses…
-  Availability depends on instance type: either none, some, or all
•  Root cause CPU usage at the cycle level
-  E.g., higher CPU usage due to more memory stall cycles
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
MSRs
•  Can be used to verify real CPU clock rate
-  Can vary with turboboost. Important to know for perf comparisons.
-  Tool from https://github.com/brendangregg/msr-cloud-tools:
ec2-guest# ./showboost
CPU MHz : 2500
Turbo MHz : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...
TIME C0_MCYC C0_ACYC UTIL RATIO MHz
06:11:35 6428553166 7457384521 51% 116% 2900
06:11:40 6349881107 7365764152 50% 115% 2899
06:11:45 6240610655 7239046277 49% 115% 2899
[...]
Real CPU MHz
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PMCs: Architectural
•  Some instance types (e.g., m4.16xl) support the PMC
architectural set:
http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PMCs: All
•  All PMCs are available on this c5.18xl:
# perf stat -d -a -- sleep 5
Performance counter stats for 'system wide':
360195.435103 cpu-clock (msec) # 71.454 CPUs utilized
38,733 context-switches # 0.108 K/sec
504 cpu-migrations # 0.001 K/sec
861,393 page-faults # 0.002 M/sec
2,275,234,239 cycles # 0.006 GHz
191,859,050,716 instructions # 84.32 insn per cycle
38,989,119,249 branches # 108.244 M/sec
152,913,791 branch-misses # 0.39% of all branches
40,262,604,776 L1-dcache-loads # 111.780 M/sec
283,924,939 L1-dcache-load-misses # 0.71% of all L1-dcache hits
[...]
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix Atlas
•  Cloud-wide and instance monitoring:
Application
Metrics
Presentation
Interactive
graph
Summary
statistics
Region
Time range
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix Atlas
•  All metrics in one system
•  System metrics:
-  CPU usage, disk I/O, memory…
•  Application metrics:
-  Requests completed, latency percentiles, errors…
•  Filters/breakdowns by region,
application, ASG, metric, instance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix Vector
•  Real-time per-second instance metrics:
Utilization
Per-device
Saturation
Errors
Breakdowns
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vector on-demand flame graphs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vector
•  Given an instance, analyze low-level performance
•  On-demand flame graphs
-  CPU, off-CPU, context switch, IPC, page fault, disk I/O
-  These use perf or BPF
•  Quick
-  GUI-driven root cause analysis
•  Scalable
-  Other teams can use it easily
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Summary
1.  Instance selection
2.  Amazon EC2 features
3.  Kernel tuning
4.  Methodologies
5.  Observability
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
References & links
•  Amazon EC2:
-  http://aws.amazon.com/ec2/instance-types/
-  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
-  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html
-  https://www.slideshare.net/AmazonWebServices/cmp402-amazon-ec2-instances-deep-dive
-  http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
•  Netflix on EC2:
-  http://www.slideshare.net/cpwatson/cpn302-yourlinuxamioptimizationandperformance
-  http://www.brendangregg.com/blog/2014-09-27/from-clouds-to-roots.html
-  http://techblog.cloudperf.net/2016/05/2-million-packets-per-second-on-public.html
-  http://techblog.cloudperf.net/2017/04/3-million-storage-iops-on-aws-cloud.html
•  Performance Analysis:
-  http://www.brendangregg.com/linuxperf.html
-  http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
-  https://github.com/iovisor/bcc https://github.com/brendangregg/perf-tools
-  https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools
-  http://www.brendangregg.com/USEmethod/use-linux.html
-  https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
-  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
-  https://github.com/brendangregg/FlameGraph
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix talks @ re:Invent
Monday
10:45am ARC208:Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian)
12:15pm SID206: Best Practices for Managing Security on AWS (MGM)
Tuesday
10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian)
11:30am CMP325: How Netflix Tunes EC2 Instances for Performance (Venetian)
Wednesday
11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian)
12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian)
1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian)
1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM)
1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian)
4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM)
Thursday
12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo)
12:15pm DAT308: Codex: Conditional Modules Strike Back (Venetian)
12:55pm CMP309: How Netflix Encodes at Scale (Venetian)
5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria)
Friday
8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria)
10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!
B r e n d a n G r e g g , N e t fl i x P e r f o r m a n c e a n d O p e r a t i n g S y s t e m s T e a m
@ b r e n d a n g r e g g
C M P 3 2 5

More Related Content

What's hot

[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...OpenStack Korea Community
 
Google Cloud Platform monitoring with Zabbix
Google Cloud Platform monitoring with ZabbixGoogle Cloud Platform monitoring with Zabbix
Google Cloud Platform monitoring with ZabbixMax Kuzkin
 
MySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software TestMySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software TestI Goo Lee
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3 AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3 Amazon Web Services Korea
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon Web Services
 
MySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKMySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKYoungHeon (Roy) Kim
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요Jo Hoon
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on LinuxEtsuji Nakai
 
AWS Direct Connect & VPN's - Pop-up Loft Tel Aviv
AWS Direct Connect & VPN's - Pop-up Loft Tel AvivAWS Direct Connect & VPN's - Pop-up Loft Tel Aviv
AWS Direct Connect & VPN's - Pop-up Loft Tel AvivAmazon Web Services
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerYongseok Oh
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsBrendan Gregg
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법Open Source Consulting
 
Secrets of Performance Tuning Java on Kubernetes
Secrets of Performance Tuning Java on KubernetesSecrets of Performance Tuning Java on Kubernetes
Secrets of Performance Tuning Java on KubernetesBruno Borges
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OGeorge Cao
 
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon Web Services
 

What's hot (20)

[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 
Google Cloud Platform monitoring with Zabbix
Google Cloud Platform monitoring with ZabbixGoogle Cloud Platform monitoring with Zabbix
Google Cloud Platform monitoring with Zabbix
 
MySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software TestMySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software Test
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3 AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3
AWS로 게임 기반 다지기 - 김병수, 박진성 :: AWS Game Master 온라인 세미나 #3
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
 
MySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKMySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELK
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on Linux
 
AWS Direct Connect & VPN's - Pop-up Loft Tel Aviv
AWS Direct Connect & VPN's - Pop-up Loft Tel AvivAWS Direct Connect & VPN's - Pop-up Loft Tel Aviv
AWS Direct Connect & VPN's - Pop-up Loft Tel Aviv
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
 
Secrets of Performance Tuning Java on Kubernetes
Secrets of Performance Tuning Java on KubernetesSecrets of Performance Tuning Java on Kubernetes
Secrets of Performance Tuning Java on Kubernetes
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
 
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
 

Similar to How Netflix Tunes EC2 Instances for Performance

How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017Amazon Web Services
 
High Performance Computing on AWS
High Performance Computing on AWSHigh Performance Computing on AWS
High Performance Computing on AWSAmazon Web Services
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Amazon Web Services
 
CMP207_High Performance Computing on AWS
CMP207_High Performance Computing on AWSCMP207_High Performance Computing on AWS
CMP207_High Performance Computing on AWSAmazon Web Services
 
Amazon Elastic File System (EFS) for File Storage
Amazon Elastic File System (EFS) for File StorageAmazon Elastic File System (EFS) for File Storage
Amazon Elastic File System (EFS) for File StorageAmazon Web Services
 
Amazon EC2 and Amazon VPC Hands-On Workshop
Amazon EC2 and Amazon VPC Hands-On WorkshopAmazon EC2 and Amazon VPC Hands-On Workshop
Amazon EC2 and Amazon VPC Hands-On WorkshopAmazon Web Services
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Amazon Web Services
 
Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017Amazon Web Services
 
Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practicesKanika Gera
 
Technical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate OttawaTechnical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate OttawaAmazon Web Services
 
Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Amazon Web Services
 
AWS Storage Tiering for Enterprise Workloads
AWS Storage Tiering for Enterprise WorkloadsAWS Storage Tiering for Enterprise Workloads
AWS Storage Tiering for Enterprise WorkloadsTom Laszewski
 
Relational Database Services on AWS - Bill Baldwin
Relational Database Services on AWS - Bill BaldwinRelational Database Services on AWS - Bill Baldwin
Relational Database Services on AWS - Bill BaldwinAmazon Web Services
 
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...Amazon Web Services
 
CMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesCMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesAmazon Web Services
 
Module 2: AWS Foundational Services - AWSome Day Online Conference
Module 2: AWS Foundational Services - AWSome Day Online ConferenceModule 2: AWS Foundational Services - AWSome Day Online Conference
Module 2: AWS Foundational Services - AWSome Day Online ConferenceAmazon Web Services
 
Module 2 AWS Foundational Services - AWSome Day Online Conference
Module 2 AWS Foundational Services - AWSome Day Online Conference Module 2 AWS Foundational Services - AWSome Day Online Conference
Module 2 AWS Foundational Services - AWSome Day Online Conference Amazon Web Services
 

Similar to How Netflix Tunes EC2 Instances for Performance (20)

How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
 
High Performance Computing on AWS
High Performance Computing on AWSHigh Performance Computing on AWS
High Performance Computing on AWS
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
 
CMP207_High Performance Computing on AWS
CMP207_High Performance Computing on AWSCMP207_High Performance Computing on AWS
CMP207_High Performance Computing on AWS
 
Amazon Elastic File System (EFS) for File Storage
Amazon Elastic File System (EFS) for File StorageAmazon Elastic File System (EFS) for File Storage
Amazon Elastic File System (EFS) for File Storage
 
Amazon EC2 and Amazon VPC Hands-On Workshop
Amazon EC2 and Amazon VPC Hands-On WorkshopAmazon EC2 and Amazon VPC Hands-On Workshop
Amazon EC2 and Amazon VPC Hands-On Workshop
 
EC2 Foundations - Laura Thomson
EC2 Foundations - Laura ThomsonEC2 Foundations - Laura Thomson
EC2 Foundations - Laura Thomson
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
 
Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017
 
Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practices
 
Cost Optimization at Scale
Cost Optimization at ScaleCost Optimization at Scale
Cost Optimization at Scale
 
Technical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate OttawaTechnical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate Ottawa
 
Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319
 
AWS Storage Tiering for Enterprise Workloads
AWS Storage Tiering for Enterprise WorkloadsAWS Storage Tiering for Enterprise Workloads
AWS Storage Tiering for Enterprise Workloads
 
Relational Database Services on AWS - Bill Baldwin
Relational Database Services on AWS - Bill BaldwinRelational Database Services on AWS - Bill Baldwin
Relational Database Services on AWS - Bill Baldwin
 
Amazon RDS_Deep Dive - SRV310
Amazon RDS_Deep Dive - SRV310 Amazon RDS_Deep Dive - SRV310
Amazon RDS_Deep Dive - SRV310
 
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
 
CMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 InstancesCMP301_Deep Dive on Amazon EC2 Instances
CMP301_Deep Dive on Amazon EC2 Instances
 
Module 2: AWS Foundational Services - AWSome Day Online Conference
Module 2: AWS Foundational Services - AWSome Day Online ConferenceModule 2: AWS Foundational Services - AWSome Day Online Conference
Module 2: AWS Foundational Services - AWSome Day Online Conference
 
Module 2 AWS Foundational Services - AWSome Day Online Conference
Module 2 AWS Foundational Services - AWSome Day Online Conference Module 2 AWS Foundational Services - AWSome Day Online Conference
Module 2 AWS Foundational Services - AWSome Day Online Conference
 

More from Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 

More from Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 

Recently uploaded

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

How Netflix Tunes EC2 Instances for Performance

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How Netflix Tunes EC2 Instances for Performance B r e n d a n G r e g g , P e r f o r m a n c e a n d O S E n g i n e e r i n g T e a m C M P 3 2 5 N o v e m b e r 2 8 , 2 0 1 7
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix performance and operating systems team •  Evaluate technology -  Instance types, Amazon Elastic Compute Cloud (EC2) options •  Recommendations and best practices -  Instance kernel tuning, assist app tuning •  Develop performance tools -  Develop tools for observability and analysis •  Project support -  New database, programming language, software change •  Incident response -  Performance issues, scalability issues
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1.  Instance selection 2.  Amazon EC2 features 3.  Kernel tuning 4.  Methodologies 5.  Observability
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Warnings •  This is what’s in our medicine cabinet •  Consider these “best before: 2018” •  Take only if prescribed by a performance engineer
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Instance selection
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Netflix cloud Many application workloads: Compute, storage, caching… S3 EC2 Cassandra EVCache Applications (services) ELB Elasticsearch SQSSES
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix AWS environment •  Elastic Load Balancing allows real load testing 1.  Single instance canary, then, 2.  Auto scaling group •  Much better than micro- benchmarking alone, which is error prone … ASG-v011 … ASG-v010 ASG Cluster prod1 Canary ELB Instance Instance Instance Instance Instance Instance Instance Instance Instance Instance
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Current generation instances •  Families: -  m4: General purpose •  Balanced -  c5: Compute-optimized •  Latest CPUs, lowest price/compute perf -  i3, d2: Storage-optimized •  SSD large capacity storage -  r4, x1: Memory optimized •  Lowest cost/Gbyte -  p2, g3, f1: Accelerated computing •  GPUs, FPGAs… •  Types: Range from medium to 16x large+, depending on family •  Netflix uses over 30 different instance types
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix instance type selection A.  Flow chart B.  By-resource C.  Brute force
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. A. Instance selection flow chart Start i3 Need large disk capacity? Disk I/O bound? Can cache? Select memory to cache working set Find best balance d2
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. B. By-resource approach 1.  Determine bounding resource -  E.g.: CPU, disk I/O, or network I/O -  Found using: •  Estimation (expertise) •  Resource observability with an existing real workload •  Resource observability with a benchmark or load test (experimentation) 2.  Choose instance type for the bounding resource -  If disk I/O, consider caching, and a memory-optimized type -  We have tools to aid this choice: Nomogram Visualization This focuses on optimizing a given workload More efficiency can be found by adjusting the workload to suit instance types
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nomogram Visualization tool 2. Select resources 3. From any resource, see types and cost 1. Select instance families (cost redacted)
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. C. Brute force choice 1.  Run load test on ALL instance types -  Optionally, different workload configurations as well 2.  Measure throughput -  And check for acceptable latency 3.  Calculate price/performance for all types 4.  Choose most efficient type
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Latency requirements •  Check for an acceptable latency distribution when optimizing for price/performance Headroom UnacceptableAcceptable
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix instance type re-selection A.  Usage B.  Cost C.  Variance
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. A. Instance usage •  Older instance types can be identified, analyzed, and upgraded to newer types Types (redacted)
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. B. Instance cost •  Also checked regularly. Tuning the price in price/perf. Cost per hour Breakdowns Details (redacted)
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. C. Instance variance •  An instance type may be resource-constrained only occasionally, or after warmup, or a code change •  Continually monitor performance, analyze variance/outliers
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2. Amazon EC2 features
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. EC2 virtualization slide updated after talk. see: http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Networking SR-IOV •  AWS "enhanced networking" -  Uses SR-IOV: Single Root I/O Virtualization -  PCIe device provides virtualized instances -  Some instance types, VPC only •  "Bare metal" network access -  Higher network throughput, reduced RTT and jitter -  ixgbe driver types: Up to 10 Gbps -  ena driver types: Up to 25 Gbps
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage SR-IOV •  New in 2017, first used by i3s •  Should be called "enhanced storage" -  Some instance types only -  Accesses NVMe attached storage (faster transport than SATA) -  Uses VT-d for I/O virtualization •  "Bare metal" disk access -  i3.16xl can exceed 3 million IOPS https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3. Kernel tuning
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kernel tuning •  Typically 1-30% wins, for average performance -  Adds up to significant savings for the Netflix cloud •  Bigger wins when reducing latency outliers •  Deploying tuning: -  Generic performance tuning is baked into our base AMI -  Experimental tuning is a package add-on (nflx-kernel-tunables) -  Workload-specific tuning is configured in application AMIs -  Remember to tune the workload with the tunables •  We run Ubuntu Linux
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tuning targets 1.  CPU scheduler 2.  Virtual memory 3.  Huge pages 4.  NUMA 5.  File System 6.  Storage I/O 7.  Networking 8.  Hypervisor (Xen)
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. CPU scheduler •  Tunables: -  Scheduler class, priorities, migration latency, tasksets… •  Usage: -  Some apps benefit from reducing migrations using taskset(1), numactl(8), cgroups, and tuning sched_migration_cost_ns -  Some Java apps have benefited from SCHED_BATCH, to reduce context switching. E.g.: # schedtool –B PID
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2. Virtual memory •  Tunables: -  Swappiness, overcommit, OOM behavior… •  Usage: -  Swappiness is set to zero to disable swapping and favor ditching the file system page cache first to free memory. (This tunable doesn’t make much difference, as swap devices are usually absent.) vm.swappiness = 0 # from 60
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3. Huge pages •  Tunables: -  Explicit huge page usage, transparent huge pages (THPs) -  Using 2 or 4 Mbytes, instead of 4k, should reduce various CPU overheads and improve MMU page translation cache reach •  Usage: -  THPs (enabled in later Ubuntu kernels) depending on the workload and CPUs, sometimes improve perf on HVM instances (~5% lower CPU), but sometimes hurt perf (~25% higher CPU during %usr, and more during %sys refrag) -  We switched it back to madvise: # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 4. NUMA •  Tunables: -  NUMA balancing •  Usage: -  On multi-NUMA systems (largest instances) and earlier kernels (around 3.13), NUMA page rebalance was too aggressive, and could consume 60% CPU alone. -  We disable it. Will re-enable/tune later. kernel.numa_balancing = 0
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 5. File system •  Tunables: -  Page cache flushing behavior, file system type and its own tunables (e.g., ZFS on Linux) •  Usage: -  Page cache flushing is tuned to provide a more even behavior: Background flush earlier, aggressive flush later -  Access timestamps disabled, and other options depending on the FS vm.dirty_ratio = 80 # from 40 vm.dirty_background_ratio = 5 # from 10 vm.dirty_expire_centisecs = 12000 # from 3000 mount -o defaults,noatime,discard,nobarrier …
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 6. Storage I/O •  Tunables: -  Read ahead size, number of in-flight requests, I/O scheduler, volume stripe width… •  Usage: -  Some workloads, e.g., Cassandra, can be sensitive to read ahead size -  SSDs can perform better with the “noop” scheduler (if not default already) -  Tuning md chunk size and stripe width to match workload /sys/block/*/queue/rq_affinity 2 /sys/block/*/queue/scheduler noop /sys/block/*/queue/nr_requests 256 /sys/block/*/queue/read_ahead_kb 256 mdadm –chunk=64 ...
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 7. Networking •  Tunables: -  TCP buffer sizes, TCP backlog, device backlog, TCP reuse… •  Usage: net.core.somaxconn = 1024 net.core.netdev_max_backlog = 5000 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_wmem = 4096 12582912 16777216 net.ipv4.tcp_rmem = 4096 12582912 16777216 net.ipv4.tcp_max_syn_backlog = 8096 net.ipv4.tcp_slow_start_after_idle = 0 net.ipv4.tcp_tw_reuse = 1 net.ipv4.ip_local_port_range = 10240 65535 net.ipv4.tcp_abort_on_overflow = 1 # maybe
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 8. Hypervisor (Xen) •  Tunables: -  PV/HVM (baked into AMI) -  Kernel clocksource. From slow to fast: hpet, xen, tsc •  Usage: -  We’ve encountered a Xen clocksource regression in the past (Ubuntu Trusty). Fixed by tuning clocksource to TSC (although beware of clock drift). -  Best case example (so far): CPU usage reduced by 30%, and average app latency reduced by 43%. echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 4. Methodologies Techniques of performance analysis
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. RPS, CPU 2. Volume 6. Load avg 3. Instances 4. Scaling 5. CPU/RPS 7. Java heap 8. ParNew 9. Latency 10. 99th tile Checklists: e.g., Netflix perf vitals dashboard
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analysis perspectives Application System libraries System calls Kernel Devices
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. USE Method •  For every hardware and software resource, check: 1.  Utilization 2.  Saturation 3.  Errors •  Resource constraints show as saturation or high utilization -  Resize or change instance type -  Investigate tunables for the resource •  The USE Method poses questions to answer Resource utilization (%)X
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On-CPU and off-CPU analysis State transi*on diagram Can be analyzed using: •  On-CPU: Sampling •  Off-CPU: Scheduler tracing
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 5. Observability Finding, quantifying, and confirming tunables Discovering system wins (5-25%’s) and application wins (2-10x’s)
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Statistical tools •  vmstat, pidstat, sar, etc., used mostly normally $ sar -n TCP,ETCP,DEV 1 Linux 3.2.55 (test-e4f1a80b) 08/18/2014 _x86_64_ (8 CPU) 09:10:43 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s 09:10:44 PM lo 14.00 14.00 1.34 1.34 0.00 0.00 0.00 09:10:44 PM eth0 4114.00 4186.00 4537.46 28513.24 0.00 0.00 0.00 09:10:43 PM active/s passive/s iseg/s oseg/s 09:10:44 PM 21.00 4.00 4107.00 22511.00 09:10:43 PM atmptf/s estres/s retrans/s isegerr/s orsts/s 09:10:44 PM 0.00 0.00 36.00 0.00 1.00 […]
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Host perf analysis in 60s http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html 1.  uptime 2.  dmesg | tail 3.  vmstat 1 4.  mpstat -P ALL 1 5.  pidstat 1 6.  iostat -xz 1 7.  free -m 8.  sar -n DEV 1 9.  sar -n TCP,ETCP 1 10.  top load averages kernel errors overall stats by *me CPU balance process usage disk I/O memory usage network I/O TCP stats check overview
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System profilers •  perf -  Standard Linux profiler. In the Linux source tree. -  Interval sampling, CPU performance counter events. -  User and kernel static and dynamic tracing. •  perf CPU flame graphs: # git clone https://github.com/brendangregg/FlameGraph # cd FlameGraph # perf record -F 49 -ag -- sleep 30 # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Java (Broken stacks: No frame pointer) Kernel (C) JVM (C++) AWS re:Invent 2014
  • 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Java Kernel (C) JVM (C++) User (C) AWS re:Invent 2017
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tracing Tools: ftrace •  Part of the Linux kernel -  First added in 2.6.27 (2008), and enhanced in later releases -  Already available in all Netflix Linux instances •  Front-end tools aid usage: perf-tools collection -  https://github.com/brendangregg/perf-tools -  Unsupported hacks: see WARNINGs -  Also see the trace-cmd front-end, as well as perf
  • 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ftrace tool: iosnoop # /apps/perf-tools/bin/iosnoop –ts Tracing block I/O. Ctrl-C to end. STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms 5982800.302061 5982800.302679 supervise 1809 W 202,1 17039600 4096 0.62 5982800.302423 5982800.302842 supervise 1809 W 202,1 17039608 4096 0.42 5982800.304962 5982800.305446 supervise 1801 W 202,1 17039616 4096 0.48 5982800.305250 5982800.305676 supervise 1801 W 202,1 17039624 4096 0.43 […] # /apps/perf-tools/bin/iosnoop –h USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration] -d device # device string (eg, "202,1) -i iotype # match type (eg, '*R*' for all reads) -n name # process name to match on I/O issue -p PID # PID to match on I/O issue -Q # include queueing time in LATms -s # include start time of I/O (s) -t # include completion time of I/O (s) […]
  • 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tracing tools: perf # perf record –e skb:consume_skb –ag -- sleep 10 # perf report [...] 74.42% swapper [kernel.kallsyms] [k] consume_skb | --- consume_skb arp_process arp_rcv __netif_receive_skb_core __netif_receive_skb netif_receive_skb virtnet_poll net_rx_action __do_softirq irq_exit do_IRQ ret_from_intr […] Summarizing stack traces for a tracepoint perf can do many things, it is hard to pick just one example
  • 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Tracing tools: BPF •  Enhanced Berkeley Packet Filter (BPF, aka eBPF) -  Safe, efficient, advanced, production tracing. Best on Linux 4.9+. BPF bytecode Observability Program Kernel tracepoints kprobes uprobes BPF maps per-event data statistics verifier output static tracing dynamic tracing async copy perf events sampling, PMCs BPF program event config attach load
  • 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BPF: tcplife # /usr/share/bcc/tools/tcplife PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS 2509 java 100.82.34.63 8078 100.82.130.159 12410 0 0 5.44 2509 java 100.82.34.63 8078 100.82.78.215 55564 0 0 135.32 2509 java 100.82.34.63 60778 100.82.207.252 7001 0 13 15126.87 2509 java 100.82.34.63 38884 100.82.208.178 7001 0 0 15568.25 2509 java 127.0.0.1 4243 127.0.0.1 42166 0 0 0.61 12030 upload-mes 127.0.0.1 34020 127.0.0.1 8078 11 0 3.38 12030 upload-mes 127.0.0.1 21196 127.0.0.1 7101 0 0 12.61 3964 mesos-slav 127.0.0.1 7101 127.0.0.1 21196 0 0 12.64 12021 upload-sys 127.0.0.1 34022 127.0.0.1 8078 372 0 15.28 2509 java 127.0.0.1 8078 127.0.0.1 34022 0 372 15.31 2235 dockerd 100.82.34.63 13730 100.82.136.233 7002 0 4 18.50 2235 dockerd 100.82.34.63 34314 100.82.64.53 7002 0 8 56.73 [...] Dynamic tracing of TCP set state only; does not trace send/receive https://github.com/iovisor/bcc includes other TCP tools
  • 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hardware counters •  Model Specific Registers (MSRs) -  Basic details: Timestamp clock, temperature, power -  Some are available in Amazon EC2 •  Performance Monitoring Counters (PMCs) -  Advanced details: Cycles, stall cycles, cache misses… -  Availability depends on instance type: either none, some, or all •  Root cause CPU usage at the cycle level -  E.g., higher CPU usage due to more memory stall cycles
  • 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MSRs •  Can be used to verify real CPU clock rate -  Can vary with turboboost. Important to know for perf comparisons. -  Tool from https://github.com/brendangregg/msr-cloud-tools: ec2-guest# ./showboost CPU MHz : 2500 Turbo MHz : 2900 (10 active) Turbo Ratio : 116% (10 active) CPU 0 summary every 5 seconds... TIME C0_MCYC C0_ACYC UTIL RATIO MHz 06:11:35 6428553166 7457384521 51% 116% 2900 06:11:40 6349881107 7365764152 50% 115% 2899 06:11:45 6240610655 7239046277 49% 115% 2899 [...] Real CPU MHz
  • 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PMCs: Architectural •  Some instance types (e.g., m4.16xl) support the PMC architectural set: http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
  • 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PMCs: All •  All PMCs are available on this c5.18xl: # perf stat -d -a -- sleep 5 Performance counter stats for 'system wide': 360195.435103 cpu-clock (msec) # 71.454 CPUs utilized 38,733 context-switches # 0.108 K/sec 504 cpu-migrations # 0.001 K/sec 861,393 page-faults # 0.002 M/sec 2,275,234,239 cycles # 0.006 GHz 191,859,050,716 instructions # 84.32 insn per cycle 38,989,119,249 branches # 108.244 M/sec 152,913,791 branch-misses # 0.39% of all branches 40,262,604,776 L1-dcache-loads # 111.780 M/sec 283,924,939 L1-dcache-load-misses # 0.71% of all L1-dcache hits [...]
  • 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix Atlas •  Cloud-wide and instance monitoring: Application Metrics Presentation Interactive graph Summary statistics Region Time range
  • 56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix Atlas •  All metrics in one system •  System metrics: -  CPU usage, disk I/O, memory… •  Application metrics: -  Requests completed, latency percentiles, errors… •  Filters/breakdowns by region, application, ASG, metric, instance
  • 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix Vector •  Real-time per-second instance metrics: Utilization Per-device Saturation Errors Breakdowns
  • 58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Vector on-demand flame graphs
  • 59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Vector •  Given an instance, analyze low-level performance •  On-demand flame graphs -  CPU, off-CPU, context switch, IPC, page fault, disk I/O -  These use perf or BPF •  Quick -  GUI-driven root cause analysis •  Scalable -  Other teams can use it easily
  • 60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary 1.  Instance selection 2.  Amazon EC2 features 3.  Kernel tuning 4.  Methodologies 5.  Observability
  • 61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References & links •  Amazon EC2: -  http://aws.amazon.com/ec2/instance-types/ -  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html -  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html -  https://www.slideshare.net/AmazonWebServices/cmp402-amazon-ec2-instances-deep-dive -  http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html •  Netflix on EC2: -  http://www.slideshare.net/cpwatson/cpn302-yourlinuxamioptimizationandperformance -  http://www.brendangregg.com/blog/2014-09-27/from-clouds-to-roots.html -  http://techblog.cloudperf.net/2016/05/2-million-packets-per-second-on-public.html -  http://techblog.cloudperf.net/2017/04/3-million-storage-iops-on-aws-cloud.html •  Performance Analysis: -  http://www.brendangregg.com/linuxperf.html -  http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html -  https://github.com/iovisor/bcc https://github.com/brendangregg/perf-tools -  https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools -  http://www.brendangregg.com/USEmethod/use-linux.html -  https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 -  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java -  https://github.com/brendangregg/FlameGraph
  • 62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix talks @ re:Invent Monday 10:45am ARC208:Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian) 12:15pm SID206: Best Practices for Managing Security on AWS (MGM) Tuesday 10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian) 11:30am CMP325: How Netflix Tunes EC2 Instances for Performance (Venetian) Wednesday 11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian) 12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian) 1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian) 1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM) 1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian) 4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM) Thursday 12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo) 12:15pm DAT308: Codex: Conditional Modules Strike Back (Venetian) 12:55pm CMP309: How Netflix Encodes at Scale (Venetian) 5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria) Friday 8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria) 10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
  • 63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you! B r e n d a n G r e g g , N e t fl i x P e r f o r m a n c e a n d O p e r a t i n g S y s t e m s T e a m @ b r e n d a n g r e g g C M P 3 2 5