Using Linux perf
at Netflix
Brendan Gregg
Senior Performance Architect
Sep 2017
Case Study: ZFS is eating my CPU
•  Easy to debug using Netflix Vector & flame graphs
•  How I expected it to look:
Case Study: ZFS is eating my CPU (cont.)
•  How it really looked:
Application (truncated)
38% kernel time (why?)
Case Study: ZFS is eating my CPU (cont.)
•  ZFS ARC (adaptive replacement cache) reclaim.
•  But… ZFS is not in use. No pools, datasets, or ARC buffers.
•  CPU time is in random entropy, picking which (empty) list to evict.

Bug: https://github.com/zfsonlinux/zfs/issues/6531
Zoomed:
Agenda
1.  Why Netflix Needs Linux Profiling
2.  perf Basics
3.  CPU Profiling & Gotchas
–  Stacks (gcc, Java)
–  Symbols (Node.js, Java)
–  Guest PMCs
–  PEBS
–  Overheads
4.  perf Advanced
1. Why Netflix Needs Linux Profiling
Understand CPU usage quickly and completely
Quickly	
Flame Graphs
Heat Maps
…
Eg, Netflix Vector (self-service UI):
Completely
[CPU Flame Graph, annotated: Java, Kernel (C), JVM (C++), User (C)]
Why Linux perf?
•  Available
–  Linux, open source
•  Low overhead
–  Tunable sampling, ring buffers
•  Accurate
–  Application-based samplers don't know what's really RUNNING; eg, Java and epoll
•  No blind spots
–  See user, library, kernel with CPU sampling
–  With some work: hardirqs & SMI as well
•  No sample skew
–  Unlike Java safepoint skew
Why is this so important?
•  We typically scale microservices based on %CPU
–  Small %CPU improvements can mean big $avings
•  CPU profiling is used by many activities
–  Explaining regressions in new software versions
–  Incident response
–  3rd party software evaluations
–  Identify performance tuning targets
–  Part of CPU workload characterization
•  perf does lots more, but we spend ~95% of our time looking
at CPU profiles, and 5% on everything else
–  With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
CPU profiling should be easy, but…
JIT runtimes
no frame pointers
no debuginfo
stale symbol maps
container namespaces
…
2.	perf	Basics
perf	(aka	"perf_events")	
•  The	official	Linux	profiler	
–  In	the	linux-tools-common	package	
–  Source	code	&	docs	in	Linux:	tools/perf	
•  Supports	many	profiling/tracing	features:	
–  CPU	Performance	Monitoring	Counters	(PMCs)	
–  Statically defined tracepoints
–  User and kernel dynamic tracing
–  Kernel line and local variable tracing
–  Efficient in-kernel counts and filters
–  Stack tracing, libunwind
–  Code annotation
•  Some bugs in the past; has been stable for us
perf_events
ponycorn
A Multitool of Subcommands
# perf
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
c2c Shared Data C2C/HITM Analyzer.
config Get and set variables in a configuration file.
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
ftrace simple wrapper for kernel's ftrace functionality
inject Filter to augment the events stream with additional information
kallsyms Searches running kernel for symbols
kmem Tool to trace/measure kernel memory properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
probe Define new dynamic tracepoints
trace strace inspired tool
See 'perf help COMMAND' for more information on a specific command.
(from Linux 4.13)
perf	Basic	Workflow	
1.  	list	->	find	events	
2.  	stat	->	count	them	
3.  	record->	write	event	data	to	file	
4.  	report	->	browse	summary	
5.  	script	->	event	dump	for	post	processing
Basic	Workflow	Example	
# perf list sched:*
[…]
sched:sched_process_exec [Tracepoint event]
[…]
# perf stat -e sched:sched_process_exec -a -- sleep 10
Performance counter stats for 'system wide':
19 sched:sched_process_exec
10.001327817 seconds time elapsed
# perf record -e sched:sched_process_exec -a -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ]
# perf report -n --stdio
# Children Self Samples Trace output
# ........ ........ ............ .................................................
4.76% 4.76% 1 filename=/bin/bash pid=7732 old_pid=7732
|
---_start
return_from_SYSCALL_64
do_syscall_64
sys_execve
do_execveat_common.isra.35
[…]
# perf script
sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729
44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so)
[…]
1.  found	an	event	of	interest	
	
2.  19	per	10	sec	is	a	very	low	
rate,	so	safe	to	record	
	
3.  21	samples	captured	
	
4.  summary	style	may	be	
sufficient,	or,	
5.  script output in time order
perf stat/record Format
•  These have three main parts: action, event, scope.
•  e.g., profiling on-CPU stack traces:
perf record -F 99 -a -g -- sleep 10
Event: 99 Hertz
Scope: all CPUs
Action: record stack traces
Note: sleep 10 is a dummy command to set the duration
perf Actions
•  Count events (perf stat …)
–  Uses an efficient in-kernel counter, and prints the results
•  Sample events (perf record …) (example below)
–  Records details of every event to a dump file (perf.data)
•  Timestamp, CPU, PID, instruction pointer, …
–  This incurs higher overhead, relative to the rate of events
–  Include the call graph (stack trace) using -g
•  Other actions include:
–  List events (perf list)
–  Report from a perf.data file (perf report)
–  Dump a perf.data file as text (perf script)
–  top style profiling (perf top)
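For example, counting vs. sampling the same tracepoint (a sketch; sched:sched_switch is just an illustrative event):
# Count context switches in-kernel, system-wide, for 10 seconds (low overhead):
perf stat -e sched:sched_switch -a -- sleep 10
# Record every context switch with stack traces to perf.data (higher overhead):
perf record -e sched:sched_switch -a -g -- sleep 10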
perf Events
•  Custom timers
–  e.g., 99 Hertz (samples per second)
•  Hardware events
–  CPU Performance Monitoring Counters (PMCs)
•  Tracepoints
–  Statically defined in software
•  Dynamic tracing
–  Created using uprobes (user) or kprobes (kernel)
–  Can do kernel line tracing with local variables (needs kernel debuginfo)
(examples of each below)
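For example, one event of each type (a sketch; tcp_sendmsg is only an illustrative kernel function):
# Timer: sample at 99 Hertz, system-wide, for 10 seconds:
perf record -F 99 -a -- sleep 10
# Hardware event (PMC): count CPU cycles:
perf stat -e cycles -a -- sleep 10
# Tracepoint: count scheduler context switches:
perf stat -e sched:sched_switch -a -- sleep 10
# Dynamic tracing: create a kprobe, record it, then remove it:
perf probe --add tcp_sendmsg
perf record -e probe:tcp_sendmsg -a -- sleep 10
perf probe --del tcp_sendmsg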
perf	Events:	Map
perf	Events:	List	
# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
[…]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
[…]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
[…]
skb:kfree_skb [Tracepoint event]
skb:consume_skb [Tracepoint event]
skb:skb_copy_datagram_iovec [Tracepoint event]
net:net_dev_xmit [Tracepoint event]
net:net_dev_queue [Tracepoint event]
net:netif_receive_skb [Tracepoint event]
net:netif_rx [Tracepoint event]
[…]
perf	Scope	
•  System-wide:	all	CPUs	(-a)	
•  Target	PID	(-p PID)	
•  Target	command	(…)	
•  Specific	CPUs	(-c …)	
•  User-level	only	(<event>:u)	
•  Kernel-level	only	(<event>:k)	
•  A	custom	filter	to	match	variables	(--filter …)	
•  This	cgroup	(container)	only	(--cgroup	…)
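For example, combining scope options (a sketch; PID 1234 is hypothetical):
# Sample stacks for one PID only, at 99 Hertz, for 10 seconds:
perf record -F 99 -g -p 1234 -- sleep 10
# Sample user-level cycles only, system-wide:
perf record -F 99 -e cycles:u -a -- sleep 10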
One-Liners: Listing Events
# Listing all currently known events:
perf list
# Searching for "sched" tracepoints:
perf list | grep sched
# Listing sched tracepoints:
perf list 'sched:*'
Dozens	of	perf	one-liners:	
http://www.brendangregg.com/perf.html#OneLiners
One-Liners: Counting Events
# CPU counter statistics for the specified command:
perf stat command
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Detailed CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -dp PID
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Count system calls for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Show system calls by process, refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
One-Liners:	Profiling	Events	
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
perf record -F 99 -ag -- sleep 10
# Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 secs:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
One-Liners: Reporting
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# List all raw events from perf.data:
perf script
# List all raw events from perf.data, with customized fields:
perf script -f comm,tid,pid,time,cpu,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs debuginfo):
perf annotate --stdio
3.	CPU	Profiling
CPU Profiling
•  Record stacks at a timed interval: simple and effective
–  Pros: Low (deterministic) overhead
–  Cons: Coarse accuracy, but usually sufficient
[Diagram: threads A and B over time, on-CPU and off-CPU (blocked in a syscall), with stack samples taken at a timed interval capturing the on-CPU stacks]
perf	Record	
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
1.40% 162 java [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--63.21%-- try_to_wake_up
| |
| |--63.91%-- default_wake_function
| | |
| | |--56.11%-- __wake_up_common
| | | __wake_up_locked
| | | ep_poll_callback
| | | __wake_up_common
| | | __wake_up_sync_key
| | | |
| | | |--59.19%-- sock_def_readable
[…78,000 lines truncated…]
Sampling	full	
stack	traces	
at	99	Hertz
perf Reporting
•  perf	report	summarizes	by	combining	common	paths	
•  Previous	output	truncated	78,000	lines	of	summary	
•  The	following	is	what	a	mere	8,000	lines	looks	like…
perf	report
…	as	a	Flame	Graph
Flame Graphs
•  Flame Graphs:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random, or hue can be a dimension
•  e.g., software type, or difference between two profiles for
non-regression testing ("differential flame graphs")
–  interpretation: top edge is on-CPU, beneath it is ancestry
•  Just a Perl program to convert perf stacks into SVG
–  Includes JavaScript: open in a browser for interactivity
•  Easy to get working: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
flamegraph.pl Options
$ flamegraph.pl --help
USAGE: flamegraph.pl [options] infile > outfile.svg
--title TEXT # change title text
--subtitle TEXT # second level title (optional)
--width NUM # width of image (default 1200)
--height NUM # height of each frame (default 16)
--minwidth NUM # omit smaller functions (default 0.1 pixels)
--fonttype FONT # font type (default "Verdana")
--fontsize NUM # font size (default 12)
--countname TEXT # count type label (default "samples")
--nametype TEXT # name type label (default "Function:")
--colors PALETTE # set color palette. choices are: hot (default), mem,
# io, wakeup, chain, java, js, perl, red, green, blue,
# aqua, yellow, purple, orange
--hash # colors are keyed by function name hash
--cp # use consistent palette (palette.map)
--reverse # generate stack-reversed flame graph
--inverted # icicle graph
--negate # switch differential hues (blue<->red)
--notes TEXT # add notes comment in SVG (for debugging)
--help # this message
eg,
flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
perf Flame Graph Workflow (Linux 2.6+)
Typical workflow:
1. perf list: list events
2. perf stat: count events
3. perf record: capture stacks (writes perf.data)
4. perf script: dump profile (or perf report for a text UI summary)
5. stackcollapse-perf.pl | flamegraph.pl: flame graph visualization
perf Flame Graph Workflow (Linux 4.5+)
Typical workflow:
1. perf list: list events
2. perf stat: count events
3. perf record: capture stacks (writes perf.data)
4. perf report -g folded: folded summary (or perf report for a text UI, perf script for a dump)
5. awk | flamegraph.pl: flame graph visualization
Flame Graph Optimizations
Linux 2.6: perf record writes samples to perf.data; perf script reads the samples and writes text; stackcollapse-perf.pl produces folded output; flamegraph.pl renders the flame graph.
Linux 4.5: perf record writes samples to perf.data; perf report -g folded reads the samples and emits a folded report; awk reformats the folded output; flamegraph.pl renders the flame graph.
Linux 4.9 (not perf): profile.py counts stacks in kernel context (BPF) and emits folded output directly for flamegraph.pl.
Gotchas
When	we've	tried	to	use	perf	
•  Stacks	don't	work	(missing)	
•  Symbols	don't	work	(hex	numbers)	
•  Instruction profiling looks bogus
•  PMCs don't work in VM guests
•  Containers break things
•  Overhead	is	too	high
How to really get started
1.  Get "perf" to work
2.  Get stack walking to work
3.  Fix symbol translation
4.  Get IPC to work
5.  Test perf under load
Install linux-tools-common and
linux-tools-`uname -r` packages;
Or compile in the Linux source:
tools/perf
The "gotchas"…
Gotcha #1 Broken Stacks
1.  Take a CPU profile
2.  Run perf report
3.  If stacks are often < 3 frames, or don't reach "thread start" or
"main", they are probably broken. Fix them.
perf record -F 99 -a -g -- sleep 30
perf report -n --stdio
Identifying Broken Stacks
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
|
--3.22%-- rpl_re_search
match_regex
do_subst
execute_program
process_files
main
__libc_start_main
28.10% 146 sed libc-2.19.so [.] re_search_internal
|
--- re_search_internal
|
|--12.25%-- 0x3
| 0x100007
broken	
not	broken
Identifying Broken Stacks
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
|
|--1.33%-- 0x40a4a1
| |
| |--60.01%-- 0x40659a
| | 0x408dd8
| | 0x408ed1
| | 0x402689
| | 0x7fa1cd08aec5
78.50% 409 sed libc-2.19.so [.] 0x00000000000dd7d4
|
|--3.65%-- 0x7f2516d5d10d
|
|--2.19%-- 0x7f2516d0332f
|
|--1.22%-- 0x7f2516cffbd2
|
|--1.22%-- 0x7f2516d5d5ad
broken	
probably	not	broken	
missing	symbols,	but	
that's	another	problem
Broken	Stacks	Flame	Graph	
Broken	Java	stacks	
(missing	frame	pointer)	
Java	==	green	
system	==	red	
C++	==	yellow
Fixing Broken Stacks
•  Either:
A.  Fix frame pointer-based stack walking (the default)
–  Pros: simple, supports any system stack walker
–  Cons: might cost a little extra CPU to make available
B.  Use libunwind and DWARF: perf record -g dwarf (example below)
–  Pros: more debug info
–  Cons: not on older kernels, and inflates instance size
–  … there's also ORC on the latest kernels
C.  Application support
–  https://github.com/jvm-profiling-tools/async-profiler
•  Our current preference is (A), but (C) is also promising
–  So how do we fix the frame pointer…
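For option (B), a sketch of DWARF-based stack walking (newer perf versions spell it --call-graph dwarf; test the overhead and perf.data size first):
perf record -F 99 -a --call-graph dwarf -- sleep 10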
gcc	-fno-omit-frame-pointer	
•  Once upon a time,	x86	had	fewer	registers,	and	the	frame	
pointer	register	was	reused	for	general	purpose	to	improve	
performance.	This	breaks	system	stack	walking.	
•  gcc	provides	-fno-omit-frame-pointer to	fix	this	
–  Please	make	this	the	default	in	gcc!
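For example, recompiling an application with frame pointers (a sketch; myapp.c is a hypothetical source file):
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c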
Java	-XX:+PreserveFramePointer	
•  I	hacked	frame	pointers	in	the	JVM	(JDK-8068945)	and	Oracle	rewrote	
it	as	-XX:+PreserveFramePointer.	Lets	perf	do	FP	stack	walks	of	Java.	
	
•  Costs	some	overhead	to	use.	Usually	<1%.	Rare	cases	10%.	
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Involved changes like this:
fixing x86-64 function
prologues
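For example, launching the JVM so perf can walk Java frames (a sketch; the jar name is hypothetical):
java -XX:+PreserveFramePointer -jar myapp.jar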
Broken	Java	Stacks	
•  Check	with	"perf	script"	to	see	
stack	samples	
•  These	are	1	or	2	levels	deep	(junk	
values)	
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
Fixed	Java	Stacks	
•  With	-XX:+PreserveFramePointer	
stacks	are	full,	and	go	all	the	way	to	
start_thread()	
•  This	is	what	the	CPUs	are	really	
running:	inlined	frames	are	not	
present	
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
Fixed	Stacks	Flame	Graph	
Java	
(no	symbols)
Gotcha	#2	Missing	Symbols	
•  Missing	symbols	should	be	obvious	in	perf	report/script:	
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
broken	
not	broken
Fixing Symbols
•  For installed packages:
A.  Add a -dbgsym package, if available
B.  Recompile from source
•  For JIT (Java, Node.js, …):
A.  Create a /tmp/perf-PID.map file. perf already looks for this
•  Map format is "START SIZE symbolname" (hypothetical example below)
B.  Or use a symbol logger. Eg, tools/perf/jvmti/.
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
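For illustration, a hypothetical /tmp/perf-PID.map excerpt (hex start address, hex size, symbol name; real files are written by a symbol logger or agent):
7f4179200000 180 Ljava/lang/String;::hashCode
7f41792001a0 2f0 Interpreter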
Java Symbols
•  perf-map-agent
–  Agent attaches and writes the map file on demand (previous versions attached on Java
start, and wrote continually)
–  https://github.com/jvm-profiling-tools/perf-map-agent
(was https://github.com/jrudolph/perf-map-agent)
•  Automation: jmaps
–  We use scripts to find Java processes and dump their map files, paying attention to file
ownership etc
–  https://github.com/brendangregg/FlameGraph/blob/master/jmaps
–  Needs to run as close as possible to the profile, to minimize symbol churn
# perf record -F 99 -a -g -- sleep 30; jmaps
Java Flame Graph: Stacks & Symbols
[CPU Flame Graph, annotated: Java, Kernel (C), JVM (C++), User (C)]
flamegraph.pl --color=java
Java: Inlining
A.  Disabling inlining:
–  -XX:-Inline
–  Many more Java frames
–  80% slower (in this case)
–  May not be necessary: inlined flame
graphs often make enough sense
–  Or tune -XX:MaxInlineSize and
-XX:InlineSmallCode to reveal more
frames, without costing much perf: can
even go faster!
B.  Symbol agents can uninline
–  perf-map-agent unfoldall
–  We sometimes need and use this
No inlining
Node.js: Stacks & Symbols
•  Frame pointer stacks work
•  Symbols currently via a logger
–  --perf-basic-prof: everything. We found it can log over 1 Gbyte/day.
–  --perf-basic-prof-only-functions: tries to only log symbols we care about (example below).
•  perf may not use the most recent symbol in the log
–  We tidy logs before using them:
https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl
•  Future V8 versions may support on-demand symbol dumps
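For example, running Node.js with the function-only symbol logger and then profiling it (a sketch; app.js is hypothetical):
node --perf-basic-prof-only-functions app.js &
perf record -F 99 -g -p $! -- sleep 10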
Gotcha #3 Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>
16	NOPs	in	a	loop	
	
Let's profile instructions
to	see	which	are	hot!	
	
(have	I	lost	my	mind?)
Instruction Profiling
•  Even distribution (A)? Or something else?
[Four candidate distributions shown: (A), (B), (C), (D)]
Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>
Go home instruction pointer, you're drunk
PEBS
•  I believe this is due to parallel and out-of-order execution of
micro-ops: the sampled IP is the resumption instruction, not
what is currently executing. And skid.
•  PEBS may help: Intel's Precise Event Based Sampling
•  perf_events has support:
–  perf record -e cycles:pp (example below)
–  The 'p' can be specified multiple times:
•  0 - SAMPLE_IP can have arbitrary skid
•  1 - SAMPLE_IP must have constant skid
•  2 - SAMPLE_IP requested to have 0 skid
•  3 - SAMPLE_IP must have 0 skid
–  … from tools/perf/Documentation/perf-list.txt
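For example, precise cycle sampling with PEBS (a sketch; support depends on the processor and perf version):
perf record -F 99 -a -e cycles:pp -- sleep 10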
Gotcha	#4	VM	Guests	
•  Using	PMCs	from	most	VM	guests:	
# perf stat -a -d sleep 5
Performance counter stats for 'system wide':
10003.718595 task-clock (msec) # 2.000 CPUs utilized [100.00%]
323 context-switches # 0.032 K/sec [100.00%]
17 cpu-migrations # 0.002 K/sec [100.00%]
233 page-faults # 0.023 K/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
<not supported> L1-dcache-loads
<not supported> L1-dcache-load-misses
<not supported> LLC-loads
<not supported> LLC-load-misses
5.001607197 seconds time elapsed
VM Guest PMCs
•  Without PMCs, %CPU is ambiguous. We need IPC.
–  Can't measure instructions per cycle (IPC), cache hits/misses, MMU/TLB events, etc.
•  Is fixable: eg, Xen can enable PMCs (vpmu boot option)
–  I added vpmu support for subsets, eg, vpmu=arch for the Intel architectural set (7 PMCs only)
–  http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
–  Now available on the largest AWS EC2 instance types (IPC example below)
architectural set
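For example, once PMCs are exposed, IPC can be measured directly (a sketch):
perf stat -e cycles,instructions -a -- sleep 10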
VM Guest MSRs
•  Model Specific Registers (MSRs) may be exposed when PMCs are not
•  Better than nothing. Can solve some issues.
–  showboost is from my msr-cloud-tools collection (on github)
# ./showboost
CPU MHz : 2500
Turbo MHz : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...
TIME C0_MCYC C0_ACYC UTIL RATIO MHz
17:28:03 4226511637 4902783333 33% 116% 2900
17:28:08 4397892841 5101713941 35% 116% 2900
17:28:13 4550831380 5279462058 36% 116% 2900
17:28:18 4680962051 5429605341 37% 115% 2899
17:28:23 4782942155 5547813280 38% 115% 2899
[...]
VM	Guest	PEBS	
•  Not	possible	yet	in	Xen	
–  please	fix	
•  Ditto for LBR, BTS, processor trace
Gotcha	#5	Containers	
•  perf	from	the	host	can't	find	symbol	files	in	different	mount	
namespaces	
•  We currently work around it
–  http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
•  Should	be	fixed	in	4.14	
–  Krister	Johansen's	patches
Gotcha #6 Overhead
•  Overhead is relative to the rate of events instrumented
•  perf stat does in-kernel counts: relatively low overhead
•  perf record writes perf.data, which has slightly higher
CPU overhead, plus file system and disk I/O
•  Test before use
–  In the lab
–  Run perf stat to understand the rate, before perf record (example below)
•  Also consider --filter, to filter events in-kernel
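For example (a sketch): check the event rate with an in-kernel count first, then record only if the rate looks safe:
perf stat -e block:block_rq_insert -a -- sleep 10
perf record -e block:block_rq_insert -a -g -- sleep 10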
4.	perf	Advanced
perf	for	Tracing	Events
# perf record -e block:block_rq_insert -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]
# perf script
[…]
java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
[…]
Tracepoints	
include/trace/events/block.h: java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
DECLARE_EVENT_CLASS(block_rq,
[...]
TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->rwbs, __entry->bytes, __get_str(cmd),
(unsigned long long)__entry->sector,
__entry->nr_sector, __entry->comm)
format string    process PID [CPU] timestamp: eventname:
kernel	source	
may	be	the	
only	docs	
Also	see:	cat	/sys/kernel/debug/tracing/events/block/block_rq_insert/format
One-Liners: Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Trace all context-switches with stack traces, for 1 second:
perf record -e context-switches -ag -- sleep 1
# Trace CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
One-Liners:	Dynamic	Tracing	
# Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use --del):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# List currently available dynamic probes:
perf probe -l
One-Liners:	Advanced	Dynamic	Tracing	
# Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) var is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
•  Kernel debuginfo is an onerous requirement for the Netflix cloud
•  We can use registers instead (as above). But which registers?
The Rosetta Stone of Registers
# perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
[…]
Added new event:
Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8
probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sendmsg -aR sleep 1
One server instance with kernel debuginfo, and -nv (dry run, verbose):
# perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sendmsg -aR sleep 1
All other instances (of the same kernel version):
Copy-n-paste!	
Masami Hiramatsu was investigating a way to better automate this
perf Visualizations: Block I/O Latency Heat Map
•  We automated this for analyzing disk I/O latency issues
http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
[Heat maps shown: SSD I/O (fast, with queueing); HDD I/O (random, modes)]
There's still a lot more to perf…
•  Using PMCs
•  perf scripting interface
•  perf + eBPF
•  perf sched
•  perf timechart
•  perf trace
•  perf c2c (new!)
•  perf ftrace (new!)
•  …
Links & References
•  perf_events
•  Kernel source: tools/perf/Documentation
•  https://perf.wiki.kernel.org/index.php/Main_Page
•  http://www.brendangregg.com/perf.html
•  http://web.eece.maine.edu/~vweaver/projects/perf_events/
•  Mailing list: http://vger.kernel.org/vger-lists.html#linux-perf-users
•  perf-tools: https://github.com/brendangregg/perf-tools
•  PMU tools: https://github.com/andikleen/pmu-tools
•  perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
•  Java frame pointer patch
•  http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
•  https://bugs.openjdk.java.net/browse/JDK-8068945
•  Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
•  Methodology: http://www.brendangregg.com/methodology.html
•  Flame graphs: http://www.brendangregg.com/flamegraphs.html
•  Heat maps: http://www.brendangregg.com/heatmaps.html
•  eBPF: http://lwn.net/Articles/603983/
Thank You
–  Questions?
–  http://www.brendangregg.com
–  http://slideshare.net/brendangregg
–  bgregg@netflix.com
–  @brendangregg

 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Dernier (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Kernel Recipes 2017: Using Linux perf at Netflix

  • 1. Using Linux perf at Netflix Brendan Gregg Senior Performance Architect Sep 2017
  • 4. Case Study: ZFS is eating my CPU (cont.) •  ZFS ARC (adaptive replacement cache) reclaim. •  But… ZFS is not in use. No pools, datasets, or ARC buffers. •  CPU time is in random entropy, picking which (empty) list to evict. Bug: https://github.com/zfsonlinux/zfs/issues/6531 Zoomed:
  • 5.
  • 6. Agenda 1.  Why Netflix Needs Linux Profiling 2.  perf Basics 3.  CPU Profiling & Gotchas –  Stacks (gcc, Java) –  Symbols (Node.js, Java) –  Guest PMCs –  PEBS –  Overheads 4.  perf Advanced
  • 8. Understand CPU usage quickly and completely
  • 11. Why Linux perf? •  Available –  Linux, open source •  Low overhead –  Tunable sampling, ring buffers •  Accurate –  Application-based samplers don't know what's really RUNNING; eg, Java and epoll •  No blind spots –  See user, library, kernel with CPU sampling –  With some work: hardirqs & SMI as well •  No sample skew –  Unlike Java safepoint skew
  • 12. Why is this so important •  We typically scale microservices based on %CPU –  Small %CPU improvements can mean big $avings •  CPU profiling is used by many activities –  Explaining regressions in new software versions –  Incident response –  3rd party software evaluations –  Identify performance tuning targets –  Part of CPU workload characterization •  perf does lots more, but we spend ~95% of our time looking at CPU profiles, and 5% on everything else –  With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
  • 13. CPU profiling should be easy, but… JIT runtimes no frame pointers no debuginfo stale symbol maps container namespaces …
  • 15. perf (aka "perf_events") •  The official Linux profiler –  In the linux-tools-common package –  Source code & docs in Linux: tools/perf •  Supports many profiling/tracing features: –  CPU Performance Monitoring Counters (PMCs) –  Statically defined tracepoints –  User and kernel dynamic tracing –  Kernel line and local variable tracing –  Efficient in-kernel counts and filters –  Stack tracing, libunwind –  Code annotation •  Some bugs in the past; has been stable for us perf_events ponycorn
  • 16. A Multitool of Subcommands # perf usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] The most commonly used perf commands are: annotate Read perf.data (created by perf record) and display annotated code archive Create archive with object files with build-ids found in perf.data file bench General framework for benchmark suites buildid-cache Manage build-id cache. buildid-list List the buildids in a perf.data file c2c Shared Data C2C/HITM Analyzer. config Get and set variables in a configuration file. data Data file related processing diff Read perf.data files and display the differential profile evlist List the event names in a perf.data file ftrace simple wrapper for kernel's ftrace functionality inject Filter to augment the events stream with additional information kallsyms Searches running kernel for symbols kmem Tool to trace/measure kernel memory properties kvm Tool to trace/measure kvm guest os list List all symbolic event types lock Analyze lock events mem Profile memory accesses record Run a command and record its profile into perf.data report Read perf.data (created by perf record) and display the profile sched Tool to trace/measure scheduler properties (latencies) script Read perf.data (created by perf record) and display trace output stat Run a command and gather performance counter statistics test Runs sanity tests. timechart Tool to visualize total system behavior during a workload top System profiling tool. probe Define new dynamic tracepoints trace strace inspired tool See 'perf help COMMAND' for more information on a specific command. from Linux 4.13
  • 17. perf Basic Workflow 1.  list -> find events 2.  stat -> count them 3.  record -> write event data to file 4.  report -> browse summary 5.  script -> event dump for post processing
  • 18. Basic Workflow Example # perf list sched:* […] sched:sched_process_exec [Tracepoint event] […] # perf stat -e sched:sched_process_exec -a -- sleep 10 Performance counter stats for 'system wide': 19 sched:sched_process_exec 10.001327817 seconds time elapsed # perf record -e sched:sched_process_exec -a -g -- sleep 10 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ] # perf report -n --stdio # Children Self Samples Trace output # ........ ........ ............ ................................................. 4.76% 4.76% 1 filename=/bin/bash pid=7732 old_pid=7732 | ---_start return_from_SYSCALL_64 do_syscall_64 sys_execve do_execveat_common.isra.35 […] # perf script sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729 44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux) 44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux) 203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux) acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux) c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so) […] 1.  found an event of interest 2.  19 per 10 sec is a very low rate, so safe to record 3.  21 samples captured 4.  summary style may be sufficient, or, 5.  script output in time order
  • 19. perf stat/record Format •  These have three main parts: action, event, scope. •  e.g., profiling on-CPU stack traces: perf record -F 99 -a -g -- sleep 10 Event: 99 Hertz Scope: all CPUs Action: record stack traces Note: sleep 10 is a dummy command to set the duration
  • 20. perf Actions •  Count events (perf stat …) –  Uses an efficient in-kernel counter, and prints the results •  Sample events (perf record …) –  Records details of every event to a dump file (perf.data) •  Timestamp, CPU, PID, instruction pointer, … –  This incurs higher overhead, relative to the rate of events –  Include the call graph (stack trace) using -g •  Other actions include: –  List events (perf list) –  Report from a perf.data file (perf report) –  Dump a perf.data file as text (perf script) –  top style profiling (perf top)
  • 21. perf Events •  Custom timers –  e.g., 99 Hertz (samples per second) •  Hardware events –  CPU Performance Monitoring Counters (PMCs) •  Tracepoints –  Statically defined in software •  Dynamic tracing –  Created using uprobes (user) or kprobes (kernel) –  Can do kernel line tracing with local variables (needs kernel debuginfo)
  • 23. perf Events: List # perf list List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] […] cpu-clock [Software event] task-clock [Software event] page-faults OR faults [Software event] context-switches OR cs [Software event] cpu-migrations OR migrations [Software event] […] L1-dcache-loads [Hardware cache event] L1-dcache-load-misses [Hardware cache event] L1-dcache-stores [Hardware cache event] […] skb:kfree_skb [Tracepoint event] skb:consume_skb [Tracepoint event] skb:skb_copy_datagram_iovec [Tracepoint event] net:net_dev_xmit [Tracepoint event] net:net_dev_queue [Tracepoint event] net:netif_receive_skb [Tracepoint event] net:netif_rx [Tracepoint event] […]
  • 24. perf Scope •  System-wide: all CPUs (-a) •  Target PID (-p PID) •  Target command (…) •  Specific CPUs (-c …) •  User-level only (<event>:u) •  Kernel-level only (<event>:k) •  A custom filter to match variables (--filter …) •  This cgroup (container) only (--cgroup …)
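For example, the scope options combine with the usual action and event (a sketch in the style of the one-liners that follow; PID is a placeholder):
  # Sample user-level cycles for one PID only, at 99 Hertz, for 10 seconds:
  perf record -F 99 -e cycles:u -p PID -g -- sleep 10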
  • 25. One-Liners: Listing Events # Listing all currently known events: perf list # Searching for "sched" tracepoints: perf list | grep sched # Listing sched tracepoints: perf list 'sched:*' Dozens of perf one-liners: http://www.brendangregg.com/perf.html#OneLiners
  • 26. One-Liners: Counting Events # CPU counter statistics for the specified command: perf stat command # CPU counter statistics for the entire system, for 5 seconds: perf stat -a sleep 5 # Detailed CPU counter statistics for the specified PID, until Ctrl-C: perf stat -dp PID # Various CPU last level cache statistics for the specified command: perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command # Count system calls for the specified PID, until Ctrl-C: perf stat -e 'syscalls:sys_enter_*' -p PID # Count block device I/O events for the entire system, for 10 seconds: perf stat -e 'block:*' -a sleep 10 # Show system calls by process, refreshing every 2 seconds: perf top -e raw_syscalls:sys_enter -ns comm
  • 27. One-Liners: Profiling Events # Sample on-CPU functions for the specified command, at 99 Hertz: perf record -F 99 command # Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds: perf record -F 99 -p PID -g -- sleep 10 # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds: perf record -F 99 -ag -- sleep 10 # Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs: perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5 # Sample CPU stack traces, once every 100 last level cache misses, for 5 secs: perf record -e LLC-load-misses -c 100 -ag -- sleep 5 # Sample on-CPU kernel instructions, for 5 seconds: perf record -e cycles:k -a -- sleep 5 # Sample on-CPU user instructions, for 5 seconds: perf record -e cycles:u -a -- sleep 5
  • 28. One-Liners: Reporting # Show perf.data in an ncurses browser (TUI) if possible: perf report # Show perf.data with a column for sample count: perf report -n # Show perf.data as a text report, with data coalesced and percentages: perf report --stdio # List all raw events from perf.data: perf script # List all raw events from perf.data, with customized fields: perf script -f comm,tid,pid,time,cpu,event,ip,sym,dso # Dump raw contents from perf.data as hex (for debugging): perf script -D # Disassemble and annotate instructions with percentages (needs debuginfo): perf annotate --stdio
  • 30. CPU Profiling •  Record stacks at a timed interval: simple and effective –  Pros: Low (deterministic) overhead –  Cons: Coarse accuracy, but usually sufficient [diagram: threads A and B alternating on-CPU and off-CPU (block, interrupt, syscall) over time, with a stack sample taken at each interval]
  • 31. perf Record # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf report -n --stdio 1.40% 162 java [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--63.21%-- try_to_wake_up | | | |--63.91%-- default_wake_function | | | | | |--56.11%-- __wake_up_common | | | __wake_up_locked | | | ep_poll_callback | | | __wake_up_common | | | __wake_up_sync_key | | | | | | | |--59.19%-- sock_def_readable […78,000 lines truncated…] Sampling full stack traces at 99 Hertz
  • 35. Flame Graphs •  Flame Graphs: –  x-axis: alphabetical stack sort, to maximize merging –  y-axis: stack depth –  color: random, or hue can be a dimension •  e.g., software type, or difference between two profiles for non-regression testing ("differential flame graphs") –  interpretation: top edge is on-CPU, beneath it is ancestry •  Just a Perl program to convert perf stacks into SVG –  Includes JavaScript: open in a browser for interactivity •  Easy to get working http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a -g -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
  • 36. flamegraph.pl Options $ flamegraph.pl --help USAGE: flamegraph.pl [options] infile > outfile.svg --title TEXT # change title text --subtitle TEXT # second level title (optional) --width NUM # width of image (default 1200) --height NUM # height of each frame (default 16) --minwidth NUM # omit smaller functions (default 0.1 pixels) --fonttype FONT # font type (default "Verdana") --fontsize NUM # font size (default 12) --countname TEXT # count type label (default "samples") --nametype TEXT # name type label (default "Function:") --colors PALETTE # set color palette. choices are: hot (default), mem, # io, wakeup, chain, java, js, perl, red, green, blue, # aqua, yellow, purple, orange --hash # colors are keyed by function name hash --cp # use consistent palette (palette.map) --reverse # generate stack-reversed flame graph --inverted # icicle graph --negate # switch differential hues (blue<->red) --notes TEXT # add notes comment in SVG (for debugging) --help # this message eg, flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
  • 37. perf Flame Graph Workflow (Linux 2.6+) perf stat perf record perf report perf script count events capture stacks text UI dump profile stackcollapse-perf.pl perf.data flame graph visualization perf list list events Typical Workflow flamegraph.pl
  • 38. perf Flame Graph Workflow (Linux 4.5+) perf report perf report -g folded text UI summary awk flame graph visualization flamegraph.pl perf script dump perf stat perf record count events capture stacks perf.data perf list list events Typical Workflow
  • 39. Flame Graph Optimizations Linux 2.6 perf record perf script capture stacks write text stackcollapse-perf.pl flamegraph.pl perf.data write samples reads samples folded output perf record perf report -g folded capture stacks folded report awk flamegraph.pl perf.data write samples reads samples folded output Linux 4.5 count stacks (BPF) folded output flamegraph.pl profile.py Linux 4.9 not perf
  • 41. When we've tried to use perf •  Stacks don't work (missing) •  Symbols don't work (hex numbers) •  Instruction profiling looks bogus •  PMCs don't work in VM guests •  Containers break things •  Overhead is too high
  • 42. How to really get started 1.  Get "perf" to work 2.  Get stack walking to work 3.  Fix symbol translation 4.  Get IPC to work 5.  Test perf under load Install the linux-tools-common and linux-tools-`uname -r` packages; Or compile in the Linux source: tools/perf The "gotchas"…
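A minimal install sketch, assuming an Ubuntu/Debian-style distro (package names differ elsewhere):
  sudo apt-get install linux-tools-common linux-tools-$(uname -r)
  # or build from a matching kernel source tree:
  cd linux/tools/perf && make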
  • 43. Gotcha #1 Broken Stacks 1.  Take a CPU profile 2.  Run perf report 3.  If stacks are often < 3 frames, or don't reach "thread start" or "main", they are probably broken. Fix them. perf record -F 99 -a -g -- sleep 30 perf report -n --stdio
  • 44. Identifying Broken Stacks | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main | --3.22%-- rpl_re_search match_regex do_subst execute_program process_files main __libc_start_main 28.10% 146 sed libc-2.19.so [.] re_search_internal | --- re_search_internal | |--12.25%-- 0x3 | 0x100007 broken not broken
  • 45. Identifying Broken Stacks |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5 | |--1.33%-- 0x40a4a1 | | | |--60.01%-- 0x40659a | | 0x408dd8 | | 0x408ed1 | | 0x402689 | | 0x7fa1cd08aec5 78.50% 409 sed libc-2.19.so [.] 0x00000000000dd7d4 | |--3.65%-- 0x7f2516d5d10d | |--2.19%-- 0x7f2516d0332f | |--1.22%-- 0x7f2516cffbd2 | |--1.22%-- 0x7f2516d5d5ad broken probably not broken missing symbols, but that's another problem
  • 47. Fixing Broken Stacks •  Either: A.  Fix frame pointer-based stack walking (the default) –  Pros: simple, supports any system stack walker –  Cons: might cost a little extra CPU to make available B.  Use libunwind and DWARF: perf record -g dwarf –  Pros: more debug info –  Cons: not on older kernels, and inflates instance size –  … there's also ORC on the latest kernel C.  Application support –  https://github.com/jvm-profiling-tools/async-profiler •  Our current preference is (A), but (C) is also promising –  So how do we fix the frame pointer…
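A hedged example of option B, using the long-form spelling accepted by newer perf versions:
  # DWARF-based user stack unwinding (copies stack data; larger perf.data files):
  perf record -F 99 -a --call-graph dwarf -- sleep 10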
  • 48. gcc -fno-omit-frame-pointer •  Once upon a time, x86 had fewer registers, and the frame pointer register was reused for general purpose to improve performance. This breaks system stack walking. •  gcc provides -fno-omit-frame-pointer to fix this –  Please make this the default in gcc!
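A build sketch (flags shown are examples; adapt to your compiler and build system):
  # Keep the frame pointer so perf can walk stacks:
  gcc -O2 -g -fno-omit-frame-pointer -o myapp myapp.c
  # or for a make-based build:
  CFLAGS='-O2 -g -fno-omit-frame-pointer' make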
  • 49. Java -XX:+PreserveFramePointer •  I hacked frame pointers in the JVM (JDK-8068945) and Oracle rewrote it as -XX:+PreserveFramePointer. Lets perf do FP stack walks of Java. •  Costs some overhead to use. Usually <1%. Rare cases 10%. --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04… +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 … @@ -5236,6 +5236,7 @@ // We always push rbp, so that on return to interpreter rbp, will be // restored correctly and we can correct the stack. push(rbp); + mov(rbp, rsp); // Remove word for ebp framesize -= wordSize; --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … [...] Involved changes like this: fixing x86-64 function prologues
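A usage sketch (myapp.jar is a hypothetical application; the flag shipped in later JDK 8 updates):
  java -XX:+PreserveFramePointer -jar myapp.jar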
  • 50. Broken Java Stacks •  Check with "perf script" to see stack samples •  These are 1 or 2 levels deep (junk values) # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) […]
  • 51. Fixed Java Stacks •  With -XX:+PreserveFramePointer stacks are full, and go all the way to start_thread() •  This is what the CPUs are really running: inlined frames are not present # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
  • 53. Gotcha #2 Missing Symbols •  Missing symbols should be obvious in perf report/script: 12.06% 62 sed sed [.] re_search_internal | --- re_search_internal | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main 71.79% 334 sed sed [.] 0x000000000001afc1 | |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5 broken not broken
  • 54. Fixing Symbols •  For installed packages: A.  Add a -dbgsym package, if available B.  Recompile from source •  For JIT (Java, Node.js, …): A.  Create a /tmp/perf-PID.map file. perf already looks for this •  Map format is "START SIZE symbolname" B.  Or use a symbol logger. Eg tools/perf/jvm/. # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […]
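A hypothetical /tmp/perf-PID.map excerpt, to illustrate the "START SIZE symbolname" layout (hex start address, hex size, then the symbol; these entries are made up):
  7fd301861e00 180 Lcom/example/Request;::handle
  7fd3016c2200 40 Interpreter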
  • 55. Java Symbols •  perf-map-agent –  Agent attaches and writes the map file on demand (previous versions attached on Java start, and wrote continually) –  https://github.com/jvm-profiling-tools/perf-map-agent (was https://github.com/jrudolph/perf-map-agent) •  Automation: jmaps –  We use scripts to find Java processes and dump their map files, paying attention to file ownership etc –  https://github.com/brendangregg/FlameGraph/blob/master/jmaps –  Needs to run as close as possible to the profile, to minimize symbol churn # perf record -F 99 -a -g -- sleep 30; jmaps
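A manual sketch of dumping one map without jmaps (script name as documented in the perf-map-agent repo; verify in your checkout, and run as the same user as the JVM; myapp.jar is hypothetical):
  cd perf-map-agent
  ./bin/create-java-perf-map.sh $(pgrep -f myapp.jar)   # writes /tmp/perf-PID.map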
  • 57. Java: Inlining A.  Disabling inlining: –  -XX:-Inline –  Many more Java frames –  80% slower (in this case) –  May not be necessary: inlined flame graphs often make enough sense –  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode to reveal more frames, without costing much perf: can even go faster! B.  Symbol agents can uninline –  perf-map-agent unfoldall –  We sometimes need and use this No inlining
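A hedged example of option A (myapp.jar is hypothetical; expect it to run slower):
  java -XX:+PreserveFramePointer -XX:-Inline -jar myapp.jar
MaxInlineSize and InlineSmallCode can be tuned instead of disabling inlining outright, but suitable values depend on the workload and JVM version.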
  • 58. Node.js: Stacks & Symbols •  Frame pointer stacks work •  Symbols currently via a logger –  --perf-basic-prof: everything. We found it can log over 1 Gbyte/day. –  --perf-basic-prof-only-functions: tries to only log symbols we care about. •  perf may not use the most recent symbol in the log –  We tidy logs before using them: https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl •  Future v8's may support on-demand symbol dumps
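A profiling sketch using those flags (app.js is a hypothetical application):
  node --perf-basic-prof-only-functions app.js &
  perf record -F 99 -g -p $! -- sleep 30
  perf script > out.stacks   # symbols resolve via the /tmp/perf-PID.map the logger wrote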
  • 59. Gotcha #3 Instruction Profiling # perf annotate -i perf.data.noplooper --stdio Percent | Source code & Disassembly of noplooper -------------------------------------------------------- : Disassembly of section .text: : : 00000000004004ed <main>: 0.00 : 4004ed: push %rbp 0.00 : 4004ee: mov %rsp,%rbp 20.86 : 4004f1: nop 0.00 : 4004f2: nop 0.00 : 4004f3: nop 0.00 : 4004f4: nop 19.84 : 4004f5: nop 0.00 : 4004f6: nop 0.00 : 4004f7: nop 0.00 : 4004f8: nop 18.73 : 4004f9: nop 0.00 : 4004fa: nop 0.00 : 4004fb: nop 0.00 : 4004fc: nop 19.08 : 4004fd: nop 0.00 : 4004fe: nop 0.00 : 4004ff: nop 0.00 : 400500: nop 21.49 : 400501: jmp 4004f1 <main+0x4> 16 NOPs in a loop Let's profile instructions to see which are hot! (have I lost my mind?)
  • 61. Instruction Profiling # perf annotate -i perf.data.noplooper --stdio Percent | Source code & Disassembly of noplooper -------------------------------------------------------- : Disassembly of section .text: : : 00000000004004ed <main>: 0.00 : 4004ed: push %rbp 0.00 : 4004ee: mov %rsp,%rbp 20.86 : 4004f1: nop 0.00 : 4004f2: nop 0.00 : 4004f3: nop 0.00 : 4004f4: nop 19.84 : 4004f5: nop 0.00 : 4004f6: nop 0.00 : 4004f7: nop 0.00 : 4004f8: nop 18.73 : 4004f9: nop 0.00 : 4004fa: nop 0.00 : 4004fb: nop 0.00 : 4004fc: nop 19.08 : 4004fd: nop 0.00 : 4004fe: nop 0.00 : 4004ff: nop 0.00 : 400500: nop 21.49 : 400501: jmp 4004f1 <main+0x4> Go home instruction pointer, you're drunk
  • 62. PEBS •  I believe this is due to parallel and out-of-order execution of micro-ops: the sampled IP is the resumption instruction, not what is currently executing. And skid. •  PEBS may help: Intel's Precise Event Based Sampling •  perf_events has support: –  perf record -e cycles:pp –  The 'p' can be specified multiple times: •  0 - SAMPLE_IP can have arbitrary skid •  1 - SAMPLE_IP must have constant skid •  2 - SAMPLE_IP requested to have 0 skid •  3 - SAMPLE_IP must have 0 skid –  … from tools/perf/Documentation/perf-list.txt
  • 63. Gotcha #4 VM Guests •  Using PMCs from most VM guests: # perf stat -a -d sleep 5 Performance counter stats for 'system wide': 10003.718595 task-clock (msec) # 2.000 CPUs utilized [100.00%] 323 context-switches # 0.032 K/sec [100.00%] 17 cpu-migrations # 0.002 K/sec [100.00%] 233 page-faults # 0.023 K/sec <not supported> cycles <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend <not supported> instructions <not supported> branches <not supported> branch-misses <not supported> L1-dcache-loads <not supported> L1-dcache-load-misses <not supported> LLC-loads <not supported> LLC-load-misses 5.001607197 seconds time elapsed
  • 64. VM Guest PMCs •  Without PMCs, %CPU is ambiguous. We need IPC. –  Can't measure instructions per cycle (IPC), cache hits/misses, MMU/TLB events, etc. •  Is fixable: eg, Xen can enable PMCs (vpmu boot option) –  I added vpmu support for subsets, eg, vpmu=arch for Intel architectural set (7 PMCs only) –  http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html –  Now available on the largest AWS EC2 instance types architectural set
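Once PMCs are exposed, IPC can be checked directly, eg:
  # instructions / cycles gives IPC; both are in the architectural set:
  perf stat -e cycles,instructions -a -- sleep 10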
  • 65. VM Guest MSRs •  Model Specific Registers (MSRs) may be exposed when PMCs are not •  Better than nothing. Can solve some issues. –  showboost is from my msr-cloud-tools collection (on github) # ./showboost CPU MHz : 2500 Turbo MHz : 2900 (10 active) Turbo Ratio : 116% (10 active) CPU 0 summary every 5 seconds... TIME C0_MCYC C0_ACYC UTIL RATIO MHz 17:28:03 4226511637 4902783333 33% 116% 2900 17:28:08 4397892841 5101713941 35% 116% 2900 17:28:13 4550831380 5279462058 36% 116% 2900 17:28:18 4680962051 5429605341 37% 115% 2899 17:28:23 4782942155 5547813280 38% 115% 2899 [...]
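A quick-start sketch (repository named on the slide; script location assumed to be the repo root):
  git clone https://github.com/brendangregg/msr-cloud-tools
  cd msr-cloud-tools
  sudo ./showboost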
  • 67. Gotcha #5 Containers •  perf from the host can't find symbol files in different mount namespaces •  We currently work around it –  http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/ •  Should be fixed in 4.14 –  Krister Johansen's patches
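A workaround sketch along the lines of that blog post, assuming a Java app in a Docker container (container name, paths, and the jmaps location are placeholders; CPID and HPID stand for the container-side and host-side PIDs of the same JVM):
  docker exec myapp /path/to/jmaps                        # write /tmp/perf-CPID.map inside the container
  docker cp myapp:/tmp/perf-CPID.map /tmp/perf-HPID.map   # copy out, renamed to the host PID
  perf script > out.stacks                                # host perf can now resolve the Java symbols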
  • 68. Gotcha #6 Overhead •  Overhead is relative to the rate of events instrumented •  perf stat does in-kernel counts: relatively low overhead •  perf record writes perf.data, which has slightly higher CPU overhead, plus file system and disk I/O •  Test before use –  In the lab –  Run perf stat to understand rate, before perf record •  Also consider --filter, to filter events in-kernel
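For example, count first, then record with an in-kernel filter (event and filter syntax follow the later one-liners):
  # cheap in-kernel count to gauge the event rate:
  perf stat -e block:block_rq_insert -a -- sleep 10
  # if the rate looks safe, record, filtering in-kernel to large I/O only:
  perf record -e block:block_rq_insert --filter 'nr_sector > 200' -ag -- sleep 10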
  • 71. # perf record -e block:block_rq_insert -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ] # perf script […] java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java] java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java] java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java] java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java] […] Tracepoints include/trace/events/block.h: java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java] DECLARE_EVENT_CLASS(block_rq, [...] TP_printk("%d,%d %s %u (%s) %llu + %u [%s]", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs, __entry->bytes, __get_str(cmd), (unsigned long long)__entry->sector, __entry->nr_sector, __entry->comm) format string process PID [CPU] timestamp: eventname: kernel source may be the only docs Also see: cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format
  • 72. One-Liners: Static Tracing # Trace new processes, until Ctrl-C: perf record -e sched:sched_process_exec -a # Trace all context-switches with stack traces, for 1 second: perf record -e context-switches -ag -- sleep 1 # Trace CPU migrations, for 10 seconds: perf record -e migrations -a -- sleep 10 # Trace all connect()s with stack traces (outbound connections), until Ctrl-C: perf record -e syscalls:sys_enter_connect -ag # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C: perf record -e block:block_rq_insert -ag # Trace all block device issues and completions (has timestamps), until Ctrl-C: perf record -e block:block_rq_issue -e block:block_rq_complete -a # Trace all block completions, of size at least 100 Kbytes, until Ctrl-C: perf record -e block:block_rq_complete --filter 'nr_sector > 200' # Trace all block completions, synchronous writes only, until Ctrl-C: perf record -e block:block_rq_complete --filter 'rwbs == "WS"' # Trace all block completions, all types of writes, until Ctrl-C: perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"' # Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C: perf record -e 'ext4:*' -o /tmp/perf.data -a
  • 73. One-Liners: Dynamic Tracing # Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional): perf probe --add tcp_sendmsg # Remove the tcp_sendmsg() tracepoint (or use --del): perf probe -d tcp_sendmsg # Add a tracepoint for the kernel tcp_sendmsg() function return: perf probe 'tcp_sendmsg%return' # Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo): perf probe -V tcp_sendmsg --externs # Show available line probes for tcp_sendmsg() (needs debuginfo): perf probe -L tcp_sendmsg # Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo): perf probe 'tcp_sendmsg:81 seglen' # Add a tracepoint for do_sys_open() with the filename as a string (debuginfo): perf probe 'do_sys_open filename:string' # Add a tracepoint for myfunc() return, and include the retval as a string: perf probe 'myfunc%return +0($retval):string' # Add a tracepoint for the user-level malloc() function from libc: perf probe -x /lib64/libc.so.6 malloc # List currently available dynamic probes: perf probe -l
  • 74. One-Liners: Advanced Dynamic Tracing # Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific): perf probe 'tcp_sendmsg %ax %dx %cx' # Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register: perf probe 'tcp_sendmsg bytes=%cx' # Trace previously created probe when the bytes (alias) var is greater than 100: perf record -e probe:tcp_sendmsg --filter 'bytes > 100' # Add a tracepoint for tcp_sendmsg() return, and capture the return value: perf probe 'tcp_sendmsg%return $retval' # Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo): perf probe 'tcp_sendmsg size' # Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo): perf probe 'tcp_sendmsg size sk->__sk_common.skc_state' # Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo): perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a •  Kernel debuginfo is an onerous requirement for the Netflix cloud •  We can use registers instead (as above). But which registers?
  • 75. The Rosetta Stone of Registers # perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state' […] Added new event: Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8 probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state) You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1 One server instance with kernel debuginfo, and -nv (dry run, verbose): # perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8' Failed to find path of kernel module. Added new event: probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8) You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1 All other instances (of the same kernel version): Copy-n-paste! Masami Hiramatsu was investigating a way to better automate this
  • 77. There's still a lot more to perf… •  Using PMCs •  perf scripting interface •  perf + eBPF •  perf sched •  perf timechart •  perf trace •  perf c2c (new!) •  perf ftrace (new!) •  …
  • 78. Links & References •  perf_events •  Kernel source: tools/perf/Documentation •  https://perf.wiki.kernel.org/index.php/Main_Page •  http://www.brendangregg.com/perf.html •  http://web.eece.maine.edu/~vweaver/projects/perf_events/ •  Mailing list http://vger.kernel.org/vger-lists.html#linux-perf-users •  perf-tools: https://github.com/brendangregg/perf-tools •  PMU tools: https://github.com/andikleen/pmu-tools •  perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html •  Java frame pointer patch •  http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html •  https://bugs.openjdk.java.net/browse/JDK-8068945 •  Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html •  Methodology: http://www.brendangregg.com/methodology.html •  Flame graphs: http://www.brendangregg.com/flamegraphs.html •  Heat maps: http://www.brendangregg.com/heatmaps.html •  eBPF: http://lwn.net/Articles/603983/
  • 79. Thank You –  Questions? –  http://www.brendangregg.com –  http://slideshare.net/brendangregg –  bgregg@netflix.com –  @brendangregg