This document discusses network performance on Intel server platforms. It gives an overview of packet I/O basics (receive and transmit processing), describes how Data Direct I/O (DDIO) reduces memory accesses from I/O, and relates PCIe bandwidth capability to packet size. It then examines Ethernet packet rates and the CPU processing budget needed to sustain different packet sizes and throughput levels, and concludes with the IPv4 forwarding capacity of Intel platforms over the years.
3. Network Platforms Group 3
Intel Server Board S2600C0
• 6 PCIe 3.0 slots, including 4 x16 slots & 1 x8 slot, and 1 x4 2.0 slot
• Dual Intel® Xeon® E5-2600 v3, v4 CPUs, 135W/150W max
• 16 LR/U/R-DIMMs at up to 1600 MHz with ECC
5. TRANSFORMING NETWORKING & STORAGE
Rx Overview
INTEL INTERNAL ONLY
1. CPU writes the Rx descriptor with a buffer address
2. NIC reads the Rx descriptor to get the buffer address
3. NIC writes the Rx packet to the buffer address
4. NIC writes the Rx descriptor
5. CPU reads the Rx descriptor (polling)
6. CPU processes the Rx descriptor
[Figure: Rx data path. Labels: Memory, PCIe, RX D, TX D, BUF, LLC, Cores; arrows numbered 1 to 6 correspond to the steps listed above.]
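The six Rx steps can be sketched as a toy descriptor ring. This is a minimal Python simulation under stated assumptions, not driver or NIC code: the names `RxDescriptor`, `RxRing`, `post_buffers`, `nic_receive`, and `cpu_poll` are hypothetical, and the `done` flag stands in for whatever status field a real NIC writes back in step 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RxDescriptor:
    buf_addr: int = 0      # step 1: filled in by the CPU
    length: int = 0        # step 4: filled in by the NIC
    done: bool = False     # step 4: status written back by the NIC (assumed flag)


class RxRing:
    """Toy model of an Rx descriptor ring shared by a CPU and a NIC."""

    def __init__(self, size: int, buf_size: int = 2048):
        self.size = size
        self.buf_size = buf_size
        self.desc = [RxDescriptor() for _ in range(size)]
        self.memory = {}       # buffer address -> packet bytes
        self.nic_head = 0      # next descriptor the NIC will fill
        self.cpu_tail = 0      # next descriptor the CPU will poll

    def post_buffers(self) -> None:
        # Step 1: CPU writes each Rx descriptor with a buffer address.
        for i, d in enumerate(self.desc):
            d.buf_addr, d.length, d.done = i * self.buf_size, 0, False

    def nic_receive(self, packet: bytes) -> bool:
        d = self.desc[self.nic_head]
        if d.done:                              # ring full: CPU has not caught up
            return False
        addr = d.buf_addr                       # step 2: NIC reads the descriptor
        self.memory[addr] = packet              # step 3: NIC writes packet to buffer
        d.length, d.done = len(packet), True    # step 4: NIC writes the descriptor
        self.nic_head = (self.nic_head + 1) % self.size
        return True

    def cpu_poll(self) -> Optional[bytes]:
        d = self.desc[self.cpu_tail]            # step 5: CPU reads descriptor (polling)
        if not d.done:
            return None
        packet = self.memory[d.buf_addr][:d.length]   # step 6: CPU processes it
        d.done = False                          # hand the descriptor back to the NIC
        self.cpu_tail = (self.cpu_tail + 1) % self.size
        return packet
```

With a two-entry ring, a third `nic_receive` call fails until `cpu_poll` has consumed a packet, which is the back-pressure a polling core has to stay ahead of.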
6. TRANSFORMING NETWORKING & STORAGE
Tx Overview
1. CPU writes the packet data
2. CPU writes the Tx descriptor
3. NIC reads the Tx descriptor to get the buffer address
4. NIC reads the Tx packet from the buffer address
5. NIC writes the Tx descriptor
6. CPU reads the Tx descriptor
[Figure: Tx data path. Labels: Memory, PCIe, RX D, TX D, BUF, LLC, Cores; arrows numbered 1 to 6 correspond to the steps listed above.]
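The Tx steps follow the same handshake in the opposite direction. The sketch below models a single Tx descriptor in Python; `TxDescriptor`, `cpu_send`, `nic_transmit`, and `cpu_reclaim` are hypothetical names, and the `ready`/`done` flags are assumed stand-ins for the ownership and completion fields of a real descriptor format.

```python
from dataclasses import dataclass


@dataclass
class TxDescriptor:
    buf_addr: int = 0
    length: int = 0
    ready: bool = False   # set by the CPU: descriptor is valid (step 2)
    done: bool = False    # set by the NIC: packet has been sent (step 5)


def cpu_send(memory: dict, desc: TxDescriptor, addr: int, payload: bytes) -> None:
    memory[addr] = payload                        # step 1: CPU writes data
    desc.buf_addr, desc.length = addr, len(payload)
    desc.ready, desc.done = True, False           # step 2: CPU writes Tx descriptor


def nic_transmit(memory: dict, desc: TxDescriptor, wire: list) -> bool:
    if not desc.ready:
        return False
    addr, length = desc.buf_addr, desc.length     # step 3: NIC reads Tx descriptor
    wire.append(memory[addr][:length])            # step 4: NIC reads packet from buffer
    desc.ready, desc.done = False, True           # step 5: NIC writes Tx descriptor
    return True


def cpu_reclaim(memory: dict, desc: TxDescriptor) -> bool:
    if not desc.done:                             # step 6: CPU reads Tx descriptor
        return False
    memory.pop(desc.buf_addr, None)               # buffer is free for reuse
    desc.done = False
    return True
```

The CPU may only reuse the buffer after step 6 reports completion, which is why `cpu_reclaim` returns `False` until `nic_transmit` has run.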
7. TRANSFORMING NETWORKING & STORAGE
Data Path Technologies: DDIO
Introduced with Intel® Xeon® processor E5-2600
Reduces memory accesses from I/O on local socket
• I/O data (descriptors, packets) ingress and egress directly from last-level cache
• No memory bandwidth consumed (until LRU eviction from cache)
RMW (partial line writes) merged in cache
• Packets that aren’t 64B aligned cause a read-modify-write when written to DRAM directly
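To see when a packet write produces partial lines, the arithmetic can be sketched as follows. This is an illustration only: `line_writes` is a hypothetical helper, and the 64-byte line size is taken from the 64B alignment mentioned above. Each partial line counted here is one that would need a read-modify-write if written straight to DRAM.

```python
CACHE_LINE = 64  # bytes; the 64B granularity named in the text


def line_writes(addr: int, length: int, line: int = CACHE_LINE):
    """Return (full_lines, partial_lines) touched by writing `length` bytes at `addr`."""
    if length <= 0:
        return 0, 0
    end = addr + length                 # one past the last byte written
    first = addr // line                # index of the first line touched
    last = (end - 1) // line            # index of the last line touched
    total = last - first + 1
    partial = 0
    if addr % line != 0:                # write starts in the middle of a line
        partial += 1
    # Write ends in the middle of a line; avoid counting a single line twice.
    if end % line != 0 and (last != first or addr % line == 0):
        partial += 1
    return total - partial, partial
```

For example, a 64-byte packet at a 64B-aligned address gives `(1, 0)`, while the same packet placed 2 bytes into a line gives `(0, 2)`: two partial line writes and no full ones.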
Disclaimer: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more
complete information visit http://www.intel.com/performance.
8. Network Platforms Group 8
I/O Bandwidth
PCI-E bandwidth capability varies depending on packet size
• Smaller packet sizes (64B to 256B) place a heavier requirement on bandwidth
A single PCI-E x8 Gen 3 slot can sustain 40GbE line rate @ 256B
Slot            NIC config   64B line rate      256B line rate
PCI-E Gen2 x8   2x10GbE      80% (23.8 MPPS)    100%
PCI-E Gen3 x8   2x10GbE      100%               100%
PCI-E Gen3 x8   4x10GbE      80% (~47 MPPS)     100%
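The MPPS figures quoted for the 80% cases can be cross-checked with a few lines of arithmetic. The sketch below assumes 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap) and uses the standard PCIe signalling rates and encodings (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3). The function names are hypothetical, and the model deliberately leaves out TLP headers and descriptor traffic, which are the per-packet costs that make small packets harder to sustain.

```python
def pcie_raw_gbps(gen: int, lanes: int) -> float:
    """Raw per-direction PCIe link bandwidth after line encoding, in Gb/s."""
    rate_gt, encoding = {2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}[gen]
    return rate_gt * encoding * lanes


def line_rate_mpps(link_gbps: float, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Ethernet line rate in Mpps; overhead_bytes is the assumed wire overhead."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e3 / bits_per_frame


if __name__ == "__main__":
    for gen in (2, 3):
        print(f"Gen{gen} x8 raw: {pcie_raw_gbps(gen, 8):.1f} Gb/s per direction")
    for ports in (2, 4):
        full = line_rate_mpps(ports * 10, 64)
        print(f"{ports}x10GbE @ 64B: {full:.2f} Mpps at line rate, {0.8 * full:.1f} Mpps at 80%")
```

Under these assumptions, 80% of 2x10GbE at 64B comes to about 23.8 Mpps and 80% of 4x10GbE to about 47.6 Mpps, in line with the table.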
10. Network Platforms Group
Ethernet packet rates
[Figure: packets per second (y-axis, 0 to 160,000,000) for 10GbE, 40GbE, and 100GbE. Annotated points: 6.72 ns, 148.8 Mpkts/s; 22 ns, ~45 Mpkts/s; ~42.5 ns, ~23.5 Mpkts/s.]
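The annotated points can be reproduced from first principles. The sketch below assumes each frame carries 20 extra bytes on the wire (7 preamble, 1 start-of-frame delimiter, 12 inter-frame gap); under that assumption the three annotations match 100GbE at 64B, 256B, and 512B frames. The slide itself does not label the packet sizes, so that mapping is an inference, and the function names are hypothetical.

```python
WIRE_OVERHEAD = 20  # bytes per frame: preamble + SFD + inter-frame gap (assumed)


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second on a link of `link_bps` bits per second."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD) * 8)


def ns_per_packet(link_bps: float, frame_bytes: int) -> float:
    """Time between back-to-back frame arrivals at line rate, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    for size in (64, 256, 512):
        mpps = packets_per_second(100e9, size) / 1e6
        print(f"100GbE @ {size}B: {mpps:.1f} Mpkts/s, {ns_per_packet(100e9, size):.2f} ns/packet")
```

At 64B this gives 148.8 Mpkts/s and 6.72 ns per packet, which is the time window in which each arriving packet has to be handled to keep up with line rate.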
11. Network Platforms Group
CPU processing budget (instructions) …
Cycle Budget for processing 100G at various packet sizes on a 1S system running at 2.2 GHz
[Figure: cycles to process 100G worth of packets (y-axis, 0 to 3000) versus core count per socket (x-axis, 0 to 20), one series per packet size (64, 128, 256, 512, 1024). Annotated values: 228 c/p, 403 c/p, 751 c/p, 1447 c/p, 2840 c/p.]
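The cycles-per-packet budget is the aggregate cycle supply divided by the packet arrival rate. The sketch below is one way to compute it, assuming 20 bytes of per-frame wire overhead; `cycles_per_packet` is a hypothetical helper. As an observation rather than a statement of the author's method: the five annotated values are reproduced by an aggregate of 34e9 cycles per second, for example 17 cores at 2.0 GHz. The slides state 2.2 GHz in the chart title and 18 cores at 2 GHz in the takeaways, and do not say which core count and frequency produced the plotted numbers.

```python
def cycles_per_packet(cores: int, ghz: float, link_gbps: float,
                      frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Cycles available per packet when `cores` cores share a link at line rate."""
    # Packet arrival rate at line rate (overhead_bytes is an assumption).
    pps = link_gbps * 1e9 / ((frame_bytes + overhead_bytes) * 8)
    # Total cycles per second across all cores, spread over the arriving packets.
    return cores * ghz * 1e9 / pps


if __name__ == "__main__":
    # 17 cores at 2.0 GHz is an assumed configuration that matches the chart labels.
    for size in (64, 128, 256, 512, 1024):
        print(f"{size:>5}B: {cycles_per_packet(17, 2.0, 100, size):.0f} c/p")
```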
12. Network Platforms Group 12
Takeaways
At 256B, an 18C CPU running at 2 GHz can satisfy 100 GbE throughput as long as we stay within 751 cycles/packet
• At 512B, the budget is 1447 cycles/packet
If we run at an instructions-per-clock (IPC) rate of ~2
• 256B = 1502 instructions
• 512B = ~2894 instructions
If the IPC is 2.5 …
• 256B = 1877 instructions
• 512B = 3617 instructions
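The instruction counts above are the cycle budget multiplied by the IPC. A minimal sketch, with `instruction_budget` as a hypothetical helper; truncating to whole instructions is an assumption that happens to match the figures quoted for IPC 2.5.

```python
def instruction_budget(cycles_per_packet: int, ipc: float) -> int:
    """Instructions that fit in the per-packet cycle budget at a given IPC."""
    return int(cycles_per_packet * ipc)   # truncate to whole instructions


if __name__ == "__main__":
    for size, cycles in ((256, 751), (512, 1447)):
        for ipc in (2.0, 2.5):
            print(f"{size}B, IPC {ipc}: {instruction_budget(cycles, ipc)} instructions")
```

The budget covers everything done per packet, so any cycles spent stalled on memory lower the effective IPC and shrink the instruction count accordingly.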
13. Network Platforms Group 13
Capacity of the platform
IPv4 L3 Forwarding Performance of 64-Byte Packets
* Other names and brands may be claimed as the property of others.
Broadwell EP System Configuration
Hardware
  Platform            SuperMicro® - X10DRX
  CPU                 Intel® Xeon® Processor E5-2658 v4
  Chipset             Intel® C612 chipset
  Sockets             2
  Cores per Socket    14 (28 threads)
  LL Cache            30 MB
  QPI/DMI             9.6 GT/s
  PCIe                Gen3 x8
  Memory              DDR4 2400 MHz, 1Rx4 8GB (total 64GB), 4 Channel per Socket
  NIC                 10 x Intel® Ethernet CNA XL710-QDA2, PCI-Express Gen3 x8 Dual Port 40 GbE Ethernet NIC (1x40G/card)
  NIC Mbps            40,000
  BIOS                BIOS version: 1.0c (02/12/2015)
Software
  OS                  Debian 8.0
  Kernel version      3.18.2
  Other               DPDK 2.2.0
[Figure: L3 forwarding performance (MPPS) and throughput (Gbps) by year]
Year (platform)   L3 Fwd Performance (MPPS)   Throughput (Gbps)
2010 (2S WMR)     55                          37
2011 (1S SNB)     80.1                        53.8
2012 (2S SNB)     164.9                       110.8
2013 (2S IVB)     255                         171.4
2014 (2S HSW)     279.9                       187.2
2015 (2S BDW)     346.7                       233
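The Gbps labels relate to the MPPS values through the on-wire size of a 64-byte frame. The sketch below assumes 20 bytes of wire overhead per frame, giving 672 bits per packet; `mpps_to_gbps` is a hypothetical helper. Under that assumption five of the six Gbps labels are reproduced to the printed precision; the 2014 value computes to about 188.1 Gbps against the chart's 187.2 Gbps, and the slide does not say how its Gbps figures were derived.

```python
def mpps_to_gbps(mpps: float, frame_bytes: int = 64, overhead_bytes: int = 20) -> float:
    """Convert a packet rate in Mpps to on-wire throughput in Gb/s."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8   # 672 bits for 64B frames
    return mpps * bits_per_frame / 1e3


if __name__ == "__main__":
    results = {"2010": 55, "2011": 80.1, "2012": 164.9,
               "2013": 255, "2014": 279.9, "2015": 346.7}
    for year, mpps in results.items():
        print(f"{year}: {mpps} Mpps -> {mpps_to_gbps(mpps):.1f} Gbps")
```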