A very good presentation introducing NVM Express, the technology that will surely be the (near-) future interface for SSD "disks". Goodbye SAS and SATA, welcome PCI Express in servers (and client systems).
9. 9
Why PCI Express* for SSDs?
Added PCI Express* SSD Benefits
• Even better performance
• Increased Data Center CPU I/O: 40 PCI Express lanes per CPU
• Even lower latency
• No external IOC means lower power (~10W) & cost (~$15)
10. 10
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
11. 11
Client PCI Express* SSD Considerations
• Form Factors?
• Attach to CPU or PCH?
• PCI Express* x2 or x4?
• Path to NVM Express?
• What about battery life?
• Thermal concerns?
Trending well, but hurdles remain
13. 13
Card-based PCI Express* SSD Options

                          M.2 Socket 2     M.2 Socket 3
SATA                      Yes, Shared      Yes, Shared
PCIe x2                   Yes              Yes
PCIe x4                   No               Yes
Comms Support?            Yes              No
Ref Clock                 Required         Required
Max “Up to” Performance   2 GB/s           4 GB/s
Bottom Line               Flexibility      Performance

M.2 defines single- or double-sided SSDs in 5 lengths, and 2 SSD host sockets (Host Socket 2 and Host Socket 3; devices use B & M keying slots).
22x80mm DS recommended for capacity; 22x42mm SS recommended for size & weight.
Industry alignment for M.2 length will
lower costs and accelerate transitions
15. 15
PCI Express* SSD Connector Options

                          SATA Express*        SFF-8639
SATA*                     Yes                  Yes
PCIe                      x2                   x2 or x4
Host Mux                  Yes                  No
Ref Clock                 Optional             Required
EMI                       SRIS                 Shielding
Height                    7mm                  15mm
Max “Up to” Performance   2 GB/s               4 GB/s
Bottom Line               Flexibility & Cost   Performance

SATA Express*: flexibility for HDD. SFF-8639: best performance.
Separate Refclk Independent SSC (SRIS) removes clocks from cables, reducing emissions & the cost of shielding.
Alignment on connectors for PCI Express* SSDs will lower costs and accelerate transitions.
Use an M.2 interface without cables for
x4 PCI Express* performance, and lower cost
19. 19
Many Options to Connect PCI Express* SSDs
• SSD can attach to the Processor (Gen 3.0) or the Chipset (Gen 2.0 today, Gen 3.0 in future)
• SSD uses PCIe x1, x2 or x4
• Driver interface can be AHCI or NVM Express
Chipset attached PCI Express* Gen 2.0 x2 SSDs provide
~2x SATA 6Gbps performance today
20. 20
PCI Express* Gen 3.0, x4 SSDs with NVM Express
provide even better SSD performance tomorrow
21. 21
Intel® Rapid Storage Technology 13.x
Intel® RST driver support for PCI Express* Storage coming in 2014
PCI Express* Storage + Intel® RST driver delivers power, performance and responsiveness across innovative form factors in 2014 platforms: detachables, convertibles, all-in-ones; mainstream & performance segments
Intel® Rapid Storage Technology (Intel® RST)
22. 22
Client SATA* vs. PCI Express* SSD Power Management

Activity                    Device State     SATA/AHCI State   SATA I/O Ready   Power Example   PCIe Link State   Time to Register Read   PCIe I/O Ready
Active                      D0/D1/D2         Active            NA               ~500mW          L0                NA                      ~60 µs
Light Active                D0/D1/D2         Partial           10 µs            ~450mW          L1.2 (~5mW)       < 150 µs                ~5 ms
Idle                        D0/D1/D2         Slumber           10 ms            ~350mW          L1.2 (~5mW)       < 150 µs                ~5 ms
Pervasive Idle / Lid down   D3_hot           DevSlp            50-200 ms        ~15mW           L1.2              < 500 µs                ~100 ms
Off                         D3_cold / RTD3   Off               < 1 s            0W              L3                ~100 ms                 ~300 ms

PCIe link state transitions are autonomous (no host software intervention required).
D3_cold/off, L1.2, autonomous transitions & two-step resume improve PCI Express* SSD battery life
23. 23
Client PCI Express* (PCIe) SSD
Peak Power Challenges
• Max Power: 100% Sequential Writes
• SATA*: ~3.5W @ ~400MB/s
• x2 PCIe 2.0: up to 2x (7W)
• x4 PCIe 3.0: up to ~15W [2]
[Chart: SATA 128K Sequential Write Power, Compressible Data, QD=32 [1]; per-drive (1-5) and average power in watts, with max values marked]
1. Data collected using Agilent* DC Power Analyzer N6705B. System configuration: Intel® Core™ i7-3960X (15MB L3 Cache, 3.3GHz) on Intel Desktop Board DX79SI, AMD* Radeon HD 6990
and driver 8.881.0.0, BIOS SIX791OJ.86A.0193.2011.0809.1137, Intel INF 9.1.2.1007, Memory 16GB (4X4GB) Triple-channel Samsung DDR3-1600, Microsoft* Windows* 7 MSAHCI storage
driver, Microsoft Windows 7 Ultimate 64-bit Build 7600 with SP1, Various SSDs. Results have been estimated based on internal Intel analysis and are provided for informational purposes only.
Any difference in system hardware or software design or configuration may affect actual performance. For more information go to http://www.intel.com/performance
2. M.2 Socket 3 has nine 3.3V supply pins, each capable of 0.5A for a total power capability of 14.85W
Attention needed for power supply, thermals, and benchmarking
Source: Intel
[Image: M.2 SSD mounted on a motherboard with thermal interface material]
24. 24
Client PCI Express* SSD Accelerators
• The client ecosystem is ready:
Implement PCI Express* SSDs now!
• Use 42mm & 80mm length M.2 for client PCIe SSD
• Implement L1.2 and extend RTD3 software support
for optimal battery life
• Use careful power supply & thermal design
• High performance desktop and workstations can
consider SFF-8639 data center SSDs for PCI
Express* x4 performance today
Drive PCI Express* client adoption with
specification alignment and careful design
25. 25
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
26. 26
2.5” Enterprise SFF-8639
PCI Express* SSDs
The path to mainstream: innovators begin shipping
2.5” enterprise PCI Express* SSDs!
Image sources: Samsung*, Micron*, and Dell*
27. 27
Datacenter PCI Express* SSD
Considerations
• Form Factor?
• Implementation options?
• Hot plug or remove?
• Traditional RAID?
• Thermal/peak power?
• Management?
Developments are on the way
28. 28
PCI Express* Enterprise SSD Form Factor
• SFF-8639 supports 4
pluggable device types
• Host slots can be designed to
accept more than one type of
device
• Use PRSNT#, IfDet#, and
DualPortEn# pins for device
Presence Detect and device
type decoding
SFF-8639 enables multi-capable hosts
29. 29
SFF-8639 Connection Topologies
• Interconnect standards currently in process
• 2 & 3 connector designs
• “Beyond the scope of this specification” is a common phrase for standards currently in development
Source: “PCI Express SFF-8639 Module Specification”, Rev. 0.3
Meeting PCI Express* 3.0 jitter budgets for 3-connector designs is non-trivial. Consider active signal conditioning to accelerate adoption.
30. 30
Solution Example – 5 Connectors
PCI Express* (PCIe) signal
retimers & switches are
available from multiple sources
Images: Dell* Poweredge* R720* PCIe drive interconnect.
Contact PLX* or IDT* for more information on retimers or switches
Retimer or Switch
Active signal conditioning enables
SFF-8639 solutions with more connectors
31. 31
Hot-Plug Use Cases
• Hot Add & Remove are software managed events
• During boot, the system must prepare for hot-plug:
– Configure PCI Express* Slot Capability registers
– Enable and register for hot plug events to higher level
storage software (e.g., RAID or tiering software)
– Pre-allocate slot resources (Bus IDs, interrupts, memory
regions) using ACPI* tables
Existing BIOS and Windows*/Linux* OS are
prepared to support PCI Express* Hot-Plug today
32. 32
Surprise Hot-Remove
• Random device failure or operator error
can result in surprise removal during I/O
• Storage controller driver and the software
stack are required to be robust for such cases
• Storage controller driver must check for Master Abort
– On all reads to the device, the driver checks whether the register value is FFFF_FFFFh
– If the data is FFFF_FFFFh, the driver reads another register that is expected to contain zeroes to verify the device is still present (see the sketch below)
• Time order of removal notification is unknown (e.g., the storage controller driver via Master Abort, the PCI bus driver via a Presence Change interrupt, or RAID software may signal removal first)
Surprise Hot-Remove requires careful software design
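A minimal sketch of the Master Abort check described above, assuming a hypothetical memory-mapped register layout; the offsets and names here are illustrative, not taken from the NVMe or PCIe specifications:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical MMIO register offsets; a real driver would use the
 * controller's actual register map. */
#define REG_STATUS   0x1C   /* register being polled by the driver */
#define REG_VERSION  0x08   /* register known to contain zero bits */

/* Read a 32-bit device register through a volatile MMIO pointer. */
static uint32_t reg_read32(volatile uint32_t *bar0, uint32_t offset)
{
    return bar0[offset / sizeof(uint32_t)];
}

/* Returns true if the device still appears to be present.
 * A PCIe read to a removed device completes with all 1s (Master Abort),
 * so FFFF_FFFFh is ambiguous: confirm with a second register that is
 * expected to contain zero bits. */
static bool device_present(volatile uint32_t *bar0, uint32_t *status_out)
{
    uint32_t status = reg_read32(bar0, REG_STATUS);

    if (status != 0xFFFFFFFFu) {
        *status_out = status;
        return true;              /* normal completion */
    }

    /* Value could be legitimate data or a Master Abort: check a register
     * that can never legitimately read back as all 1s. */
    if (reg_read32(bar0, REG_VERSION) == 0xFFFFFFFFu)
        return false;             /* surprise removal: escalate to the hot-remove path */

    *status_out = status;         /* the all-1s value was real data */
    return true;
}
```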
33. 33
RAID for PCI Express* SSDs?
• Software RAID is a hardware-redundant solution that enables Highly Available (HA) systems today with PCI Express* (PCIe) SSDs
• Multiple copies of application images (redundant resources)
• Open cloud infrastructure that supports data redundancy with software implementations, such as Ceph* object storage
[Diagram: storage pool with data striped across rows and replicated across nodes]
Hardware RAID for PCIe SSDs is still under development
34. 34
Data Center PCI Express* (PCIe) SSD
Peak Power Challenges
• Max Power: 100% Sequential Writes
• Larger capacities have high concurrency and consume the most power (up to 25W! [2])
• Power varies >40% depending on capacity and workload
• Consider UL touch safety standards when planning airflow designs or slot power limits [3]
1. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may
affect actual performance. For more information go to http://www.intel.com/performance
2. PCI Express* “Enterprise SSD Form Factor” specification requires 2.5” SSD maximum continuous power of <25W
3. See PCI Express* Base Specification, Revision 3.0, Section 6.9 for more details on Slot Power Limit Control
Attention needed for power supply, thermals, and SAFETY
Source: Intel
[Chart: Modeled PCI Express* SSD Power [1], in watts, for large vs. small capacities under 100% sequential write, 50/50 and 70/30 sequential read/write, and 100% sequential read workloads]
35. 35
PCI Express* SSDs Enclosure
Management
• SSD Form Factor Specification (www.ssdformfactor.org) defines hot-plug indicator uses and Out-of-Band management
• PCI Express* Base Specification Rev. 3.0 defines enclosure indicators and registers intended for Hot-Plug management support (registers: Device Capabilities, Slot Capabilities, Slot Control, Slot Status)
• SFF-8485 standard defines the SGPIO enclosure management interface
Standardize PCI Express* SSD enclosure management
36. 36
Data Center PCI Express*(PCIe) SSD
Accelerators
• The data center ecosystem is capable:
Implement PCI Express* SSDs now!
• Prove out system implementations with design-in 2.5” PCIe SSDs
• Understand Hot-Plug capabilities of your device,
system and OS
• Design thermal solutions with safety in mind
• Collaborate on PCI Express SSD enclosure
management standards
Drive PCI Express* data center adoption through
education, collaboration, and careful software design
37. 37
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
38. 38
PCI Express* for Data Center/Enterprise SSDs
• PCI Express* (PCIe) is a great interface for SSDs
– Stunning performance: 1 GB/s per lane (PCIe Gen3 x1)
– With PCIe scalability: 8 GB/s per device (PCIe Gen3 x8) or more
– Lower latency: platform + adapter latency drops from 10 µsec to 3 µsec
– Lower power: no external SAS IOC saves 7-10 W
– Lower cost: no external SAS IOC saves ~$15
– PCIe lanes off the CPU: 40 Gen3 lanes (80 in a dual-socket system)
• HOWEVER, there is NO standard driver
PCIe SSD vendors today include Fusion-io*, Micron*, LSI*, Virident*, Marvell*, Intel, and OCZ*.
PCIe SSDs are emerging in Data Center/Enterprise,
co-existing with SAS & SATA depending on application
39. 39
Next Generation NVM Technology
Family and defining switching characteristics:
• Phase Change Memory: energy (heat) converts the material between crystalline (conductive) and amorphous (resistive) phases
• Magnetic Tunnel Junction (MTJ): switching of a magnetic resistive layer by spin-polarized electrons
• Electrochemical Cells (ECM): formation / dissolution of a “nano-bridge” by electrochemistry
• Binary Oxide Filament Cells: reversible filament formation by oxidation-reduction
• Interfacial Switching: oxygen vacancy drift/diffusion induced barrier modulation
[Diagram: scalable resistive memory element; Resistive RAM NVM options; cross point array in backend layers with an ~4λ² cell, wordlines, memory element, and selector device]
Many candidate next generation NVM technologies.
Offer ~ 1000x speed-up over NAND.
40. 40
Fully Exploiting Next Generation NVM
• With Next Generation NVM, the NVM is no longer the bottleneck
– Need optimized platform storage interconnect
– Need optimized software storage access methods
NVM Express is the interface architected for
NAND today and next generation NVM
41. 41
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
42. 42
Technical Basics
• All parameters for a 4KB command fit in a single 64B command (see the sketch below)
• Supports deep queues (64K commands per queue, up to 64K queues)
• Supports MSI-X and interrupt steering
• Streamlined & simple command set (13 required commands)
• Optional features to address target segment (Client, Enterprise, etc.)
– Enterprise: End-to-end data protection, reservations, etc.
– Client: Autonomous power state transitions, etc.
• Designed to scale for next generation NVM, agnostic to NVM type used
http://www.nvmexpress.org/
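To make the 64-byte command format concrete, here is a simplified sketch of a submission queue entry layout based on the common fields of the public NVM Express specification; field names are abbreviated for illustration:

```c
#include <stdint.h>

/* Simplified 64-byte NVMe submission queue entry (SQE).
 * Field grouping follows the spec's common command format (CDW0..CDW15). */
struct nvme_sqe {
    /* Command Dword 0 */
    uint8_t  opcode;       /* e.g., 0x02 = Read for NVM I/O commands */
    uint8_t  flags;        /* fused operation / PRP vs. SGL selection */
    uint16_t command_id;   /* echoed back in the completion entry */

    uint32_t nsid;         /* namespace identifier */
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t metadata;     /* metadata pointer (if used) */
    uint64_t prp1;         /* data pointer: PRP entry 1 (or SGL) */
    uint64_t prp2;         /* data pointer: PRP entry 2 (or SGL) */
    uint32_t cdw10;        /* command-specific, e.g. starting LBA (low)  */
    uint32_t cdw11;        /* command-specific, e.g. starting LBA (high) */
    uint32_t cdw12;        /* command-specific, e.g. number of LBAs - 1  */
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
```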
43. 43
Queuing Interface
Command Submission & Processing
[Diagram: Submission Queue and Completion Queue rings in host memory with host-maintained head/tail pointers; the NVMe controller exposes Submission Queue Tail and Completion Queue Head doorbell registers; numbered arrows 1-8 trace the command flow described below]
Command Submission
1. Host writes command to
Submission Queue
2. Host writes updated
Submission Queue tail
pointer to doorbell
Command Processing
3. Controller fetches
command
4. Controller processes
command
44. 44
Queuing Interface
Command Completion
Command Completion
5. Controller writes
completion to
Completion Queue
6. Controller generates
MSI-X interrupt
7. Host processes
completion
8. Host writes updated
Completion Queue head
pointer to doorbell
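A host-side sketch of the submission and completion flow above, using plain memory to stand in for the controller's doorbell registers and for the controller's completion write; queue depth, struct names, and the simulated "controller" step are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 16          /* illustrative; real queues may hold up to 64K entries */

struct nvme_sqe { uint8_t opcode; uint8_t flags; uint16_t command_id; uint8_t rest[60]; };
struct nvme_cqe { uint32_t dw0; uint32_t dw1; uint16_t sq_head; uint16_t sq_id;
                  uint16_t command_id; uint16_t status_and_phase; };

static struct nvme_sqe sq[QUEUE_DEPTH];     /* submission queue in host memory */
static struct nvme_cqe cq[QUEUE_DEPTH];     /* completion queue in host memory */
static uint32_t sq_tail, cq_head;           /* host-maintained pointers */
static volatile uint32_t sq_tail_doorbell;  /* stand-in for controller MMIO doorbell */
static volatile uint32_t cq_head_doorbell;

/* Steps 1-2: host writes the command into the SQ and rings the tail doorbell. */
static void submit_command(const struct nvme_sqe *cmd)
{
    sq[sq_tail] = *cmd;                          /* 1. write 64B command */
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH;
    sq_tail_doorbell = sq_tail;                  /* 2. ring SQ tail doorbell */
}

/* Steps 5-8: host consumes a completion entry and updates the CQ head doorbell.
 * (Steps 3-4, fetch and execute, happen inside the controller.) */
static void process_completion(void)
{
    struct nvme_cqe *cqe = &cq[cq_head];         /* 5-6. written by controller, MSI-X raised */
    printf("command %u completed, status 0x%x\n",
           (unsigned)cqe->command_id, (unsigned)(cqe->status_and_phase >> 1));
    cq_head = (cq_head + 1) % QUEUE_DEPTH;
    cq_head_doorbell = cq_head;                  /* 8. ring CQ head doorbell */
}

int main(void)
{
    struct nvme_sqe read_cmd = { .opcode = 0x02, .command_id = 7 };  /* NVM Read */
    submit_command(&read_cmd);
    /* Simulate the controller completing the command: */
    cq[0] = (struct nvme_cqe){ .command_id = 7, .sq_head = 1, .status_and_phase = 1 };
    process_completion();
    return 0;
}
```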
45. 45
Simple Command Set – Optimized for NVM
Admin Commands
Create I/O Submission Queue
Delete I/O Submission Queue
Create I/O Completion Queue
Delete I/O Completion Queue
Get Log Page
Identify
Abort
Set Features
Get Features
Asynchronous Event Request
Firmware Activate (optional)
Firmware Image Download (opt)
Format NVM (optional)
Security Send (optional)
Security Receive (optional)
NVM I/O Commands
Read
Write
Flush
Write Uncorrectable (optional)
Compare (optional)
Dataset Management (optional)
Write Zeroes (optional)
Reservation Register (optional)
Reservation Report (optional)
Reservation Acquire (optional)
Reservation Release (optional)
Only 10 Admin and 3 I/O commands required
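For reference, a small C sketch of the opcode values the public NVM Express specification assigns to the required commands listed above (enum names are illustrative; the numeric values are as published in the spec):

```c
/* Required NVMe Admin command opcodes (NVM Express 1.x) */
enum nvme_admin_opcode {
    NVME_ADMIN_DELETE_SQ       = 0x00,
    NVME_ADMIN_CREATE_SQ       = 0x01,
    NVME_ADMIN_GET_LOG_PAGE    = 0x02,
    NVME_ADMIN_DELETE_CQ       = 0x04,
    NVME_ADMIN_CREATE_CQ       = 0x05,
    NVME_ADMIN_IDENTIFY        = 0x06,
    NVME_ADMIN_ABORT           = 0x08,
    NVME_ADMIN_SET_FEATURES    = 0x09,
    NVME_ADMIN_GET_FEATURES    = 0x0A,
    NVME_ADMIN_ASYNC_EVENT_REQ = 0x0C,
};

/* Required NVM I/O command opcodes */
enum nvme_io_opcode {
    NVME_IO_FLUSH = 0x00,
    NVME_IO_WRITE = 0x01,
    NVME_IO_READ  = 0x02,
};
```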
46. 46
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
47. 47
Driver Development on Major OSes
• Windows*: Windows* 8.1 and Windows* Server 2012 R2 include a native driver; open source driver in collaboration with OFA
• Linux*: Stable OS driver since Linux* kernel 3.10
• Unix: FreeBSD driver upstream
• Solaris*: Solaris driver will ship in S12
• VMware*: vmklinux driver certified release in 1H 2014
• UEFI: Open source driver available on SourceForge
Native OS drivers already available, with more coming!
48. 48
Windows* Open Source Driver Update
• Release 1 (Q2 2012): 64-bit support on Windows* 7 and Windows Server 2008 R2; mandatory features
• Release 1.1 (Q4 2012): added 64-bit support on Windows 8; public IOCTLs and Windows 8 Storport updates
• Release 1.2 (Aug 2013): added 64-bit support on Windows Server 2012; signed executable drivers
• Release 1.3 (March 2014): hibernation on boot drive; NUMA group support in core enumeration
• Release 1.4 (Oct 2014): WHQL certification; Drive Trace feature, WVI command processing; migrate to VS2013, WDK8.1
Four major open source releases since 2012.
Contributors include Huawei*, PMC-Sierra*, Intel, LSI* & SanDisk*
https://www.openfabrics.org/resources/developer-tools/nvme-windows-development.html
49. 49
Linux* Driver Update
Recent Features
• Stable since Linux* 3.10; latest driver in 3.14
• Surprise hot plug/remove
• Dynamic partitioning
• Deallocate (i.e., Trim support)
• 4KB sector support (in addition to 512B)
• MSI support (in addition to MSI-X)
• Disk I/O statistics
Linux OS distributors’ support
• RHEL 6.5 and Ubuntu 13.10 have native drivers
• RHEL 7.0, Ubuntu 14.04 LTS and SLES 12 will have the latest native drivers
• SuSE is testing an external driver package for SLES11 SP3
Works in progress: power management, end-to-end data protection, sysfs manageability & NUMA
Devices are exposed as block devices, e.g. /dev/nvme0n1
50. 50
FreeBSD Driver Update
• NVM Express* (NVMe) support is upstream in the head and
stable/9 branches
• FreeBSD 9.2 released in September is the first official release
with NVMe support
FreeBSD NVMe modules:
• nvme: core NVMe driver
• nvd: NVMe/block layer shim
• nvmecontrol: user space utility, including firmware update
51. 51
Solaris* Driver Update
• Current Status from Oracle* team
- Fully compliant with 1.0e spec
- Direct block interfaces bypassing complex SCSI code path
- NUMA optimized queue/interrupt allocation
- Support x86 and SPARC platform
- A command line tool to monitor and configure the controller
- Delivered to S12 and S11 Update 2
• Future Development Plans
- Boot & install on SPARC and X86
- Surprise removal support
- Shared hosts and multi-pathing
52. 52
VMware Driver Update
• vmklinux-based driver development is complete
– First release in mid-Oct, 2013
– Public release will be 1H, 2014
• A native VMware* NVMe driver is available for end
user evaluations
• VMware’s I/O Vendor Partner Program (IOVP) offers
members a comprehensive set of tools, resources
and processes needed to develop, certify and release
software modules, including device drivers and
utility libraries for VMware ESXi
53. 53
UEFI Driver Update
• The UEFI 2.4 specification available at www.UEFI.org contains
updates for NVM Express* (NVMe)
• An open source version of an NVMe driver for UEFI is available
at nvmexpress.org/resources
“AMI is working with vendors
of NVMe devices and plans for
full BIOS support of NVMe in
2014.”
Sandip Datta Roy
VP BIOS R&D, AMI
NVMe boot support with UEFI will start percolating
releases from Independent BIOS Vendors in 2014
54. 54
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
55. 55
NVMe Promoters
“Board of Directors”
Technical
Workgroup
Queueing Interface
Admin Command Set
NVMe I/O Command Set
Driver Based Management
Current spec version: NVMe 1.1
Management Interface
Workgroup
In-Band (PCIe) and Out-of-Band (SMBus)
PCIe SSD Management
First specification will be Q3, 2014
NVM Express Organization Architected for Performance
59. 59
NVM Express 1.1 Overview
• The NVM Express 1.1 specification, published in October of 2012, adds
additional optional client and Enterprise features
Power Optimizations
• Autonomous Power State
Transitions
Command Enhancements
• Scatter Gather List support
• Active Namespace Reporting
• Persistent Features Across
Power States
• Write Zeroes Command
Multi-path Support
• Reservations
• Unique Identifier per Namespace
• Subsystem Reset
61. 61
Reservations
• In some multi-host environments, like Windows* clusters, reservations
may be used to coordinate host access
• NVMe 1.1 includes a simplified reservations mechanism that is
compatible with implementations that use SCSI reservations
• What is a reservation? Enables two or more hosts to coordinate
access to a shared namespace.
– A reservation may allow Host A and Host B access, but disallow Host C
[Diagram: an NVM Subsystem exposing a shared namespace (NSID 1) through four NVM Express controllers; Host A attaches via Controllers 1 and 2 (Host ID = A), Host B via Controller 3 (Host ID = B), and Host C via Controller 4 (Host ID = C)]
62. 62
Power Optimizations
• NVMe 1.1 added the Autonomous Power State Transition feature for
client power focused implementations
• Without software intervention, the NVMe controller transitions to a
lower power state after a certain idle period
– Idle period prior to transition programmed by software
Example Power States

Power State   Operational?   Max Power   Entrance Latency   Exit Latency
0             Yes            4 W         10 µs              10 µs
1             No             10 mW       10 ms              5 ms
2             No             1 mW        15 ms              30 ms

Example transitions: Power State 0 to Power State 1 after 50 ms idle; Power State 1 to Power State 2 after 500 ms idle
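As a rough illustration of how a driver might encode the example above for the Autonomous Power State Transition feature (Set Features, Feature ID 0Ch), here is a sketch that builds the per-power-state entries; the bit layout (target power state in bits 7:3, idle time in milliseconds in bits 31:8) follows my reading of the NVMe 1.1 feature data structure, so treat these details as assumptions to verify against the specification:

```c
#include <stdint.h>
#include <string.h>

#define NVME_FEAT_APST 0x0C   /* Autonomous Power State Transition feature ID */

/* Encode one APST entry: when idle in the current state for idle_ms,
 * transition to power state 'target_ps'.
 * Assumed layout (verify against the spec): bits 7:3 = Idle Transition
 * Power State, bits 31:8 = Idle Time Prior to Transition (ms). */
static uint64_t apst_entry(uint8_t target_ps, uint32_t idle_ms)
{
    return ((uint64_t)(target_ps & 0x1F) << 3) |
           ((uint64_t)(idle_ms & 0xFFFFFF) << 8);
}

int main(void)
{
    /* 256-byte feature data: one 8-byte entry per power state (up to 32). */
    uint64_t apst_table[32];
    memset(apst_table, 0, sizeof(apst_table));

    /* Mirror the slide's example: leave PS0 after 50 ms idle (go to PS1),
     * and leave PS1 after 500 ms idle (go to PS2). PS2 is the deepest state. */
    apst_table[0] = apst_entry(1, 50);
    apst_table[1] = apst_entry(2, 500);

    /* A real driver would now issue Set Features (FID 0x0C) with CDW11 bit 0
     * set to enable APST and this table as the data payload. */
    (void)NVME_FEAT_APST;
    return 0;
}
```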
63. 63
Continuing to Advance NVM Express
• NVM Express continues to add features to meet the needs of
client and Enterprise market segments as they evolve
• The Workgroup is defining features for the next revision of the
specification, expected ~ middle of 2014
Features for Next Revision
Namespace Management
Management Interface
Live Firmware Update
Power Optimizations
Enhanced Status Reporting
Events for Namespace Changes
…
Get involved – join the NVMe Workgroup
nvmexpress.org
64. 64
Agenda
• Why PCI Express* (PCIe) for SSDs?
– PCIe SSD in Client
– PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs?
– Overview NVMe
– Driver ecosystem update
– NVMe technology developments
• Deploying PCIe SSD with NVMe
65. 65
Considerations of PCI Express* SSD with
NVM Express, NVMe SSD
• NVMe driver assistant?
• S.M.A.R.T/Management?
• Performance scalability?
• PCIe SSD vs SATA SSDs?
• PCIe SSD grades?
• Software optimizations?
NVMe SSDs are on the way to the data center
66. 66
PCI Express* SSD vs Multi SATA* SSDs
SATA SSD advantages
• Mature hardware RAID/adapter for management of SSDs
• Mature technology/ecosystem for SSDs
• Cost & performance balance
Quick Performance Comparison
• Random WRITE IOPS: 6 x S3700 = one PCIe SSD 1.6TB (4 lanes, Gen3)
• Random READ IOPS: ~8 x S3700 = 1 x PCIe SSD
Mixed use of PCIe and SATA SSDs
• Hot-pluggable 2.5” PCIe SSD has the same maintenance advantage as SATA SSD
• TCO: balance of performance and cost
Performance of 6~8 Intel S3700 SSDs is close to 1x PCIe SSD
4K random workloads (IOPS)
Measurements made on Hanlan Creek (Intel S5520HC) system with two Intel Xeon X5560@ 2.93GHz and 12GB (per CPU) Mem running
RHEL6.4 O/S, Intel S3700 SATA Gen3 SSDs are connected to LSI* HBA 9211, NVMe SSD is under development, data collected by FIO* tool
[Chart: IOPS at 100% read, 50% read, and 0% read, 6x 800GB Intel S3700 vs. 1x NVMe 1600GB]
68. 68
Selections of PCI Express* SSD with NVM
Express, NVMe SSD
• High Endurance Technology (HET) PCIe SSD: applications with intensive random write workloads, typically a high percentage of small-block random writes, such as critical databases, OLTP…
• Middle Tier PCIe SSD: applications that need random write performance and endurance, but much less than HET PCIe SSDs; typical workloads are <70% random writes
• Low cost PCIe SSD: same read performance as above, but roughly 1/10th of HET write performance and endurance; for applications with highly intensive read workloads, such as search engines
Application determines cost and performance
70. 70
Optimizations of PCI Express* SSD with
NVM Express, NVMe SSD
NVMe Administration
Controller capability/identify
NVMe features
Asynchronous Event
NVMe logs
Optional IO Command
Data Set management (Trim)
74. 74
Optimizations of PCI Express* SSD with
NVM Express, NVMe SSD
NVMe IO Threaded structure
Understand the number of logical CPU cores in your system
Write multi-threaded application programs
No need to handle rq_affinity
75. 75
Optimizations of PCI Express* SSD with
NVM Express, NVMe SSD
Write NVMe friendly applications
76. 76
Optimizations of PCI Express* SSD with
NVM Express (cont.)
IOPS performance
• Choose a higher number of threads ( < min(number of system CPU cores, SSD controller maximum allocated queues))
• Choose a low queue depth for each thread (asynchronous IO)
• Avoid using a single thread with a much higher queue depth (QD), especially for small transfer blocks
• Example: for 4K random read on one drive in a system with 8 CPU cores, use 8 threads with queue depth (QD)=16 per thread instead of a single thread with QD=128 (see the fio sketch below)
Latency
• Lower QD for better latency
• For intensive random writes, there is a sweet spot of threads & QD that balances performance and latency
• Example: 4K random write in an 8-core system, threads=8, sweet-spot QD is 4 to 6
Sequential vs Random workload
• Multi-threaded sequential workloads may turn into random workloads at the SSD side
Use Multi-Threads with Low Queue Depth
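The 8-thread, QD=16 example above could be expressed as an fio job file along the following lines (fio is the tool cited for the measurements in this deck; the device path, runtime, and exact option values are illustrative):

```ini
; 4K random read with 8 jobs at queue depth 16 (illustrative values)
[global]
ioengine=libaio
direct=1
filename=/dev/nvme0n1
runtime=60
time_based

; one job per CPU core in the 8-core example, low queue depth per thread
[randread-4k]
rw=randread
bs=4k
numjobs=8
iodepth=16
group_reporting
```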
77. 77
NVM Express (NVMe) Driver beyond NVMe
Specification
NVMe Linux driver is open source
• Driver Assisted Striping
– A dual-core NVMe controller where each core maintains a separate NAND array and striped LBA ranges (like RAID 0): e.g., core 0 owns LBA 0-255, 512-767, 1024-…; core 1 owns LBA 256-511, 768-1023, …
– The driver can enforce that all commands fall within a single stripe, ensuring maximum performance (see the sketch below)
• Contribute to the NVMe driver
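A toy sketch of the striping arithmetic implied above, assuming 256-LBA stripes alternating between the two controller cores; the stripe size and helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LBAS_PER_STRIPE 256   /* per the slide: LBA 0-255 -> core 0, LBA 256-511 -> core 1, ... */
#define NUM_CORES       2

/* Which controller core owns the stripe containing this LBA? */
static unsigned core_for_lba(uint64_t lba)
{
    return (unsigned)((lba / LBAS_PER_STRIPE) % NUM_CORES);
}

/* Does a command of 'nlb' logical blocks starting at 'slba' stay inside one stripe?
 * A driver-assisted-striping policy could split commands that cross the boundary. */
static bool fits_in_one_stripe(uint64_t slba, uint32_t nlb)
{
    return (slba / LBAS_PER_STRIPE) == ((slba + nlb - 1) / LBAS_PER_STRIPE);
}

int main(void)
{
    printf("LBA 300 is owned by core %u\n", core_for_lba(300));          /* core 1 */
    printf("128-block I/O at LBA 200 fits in one stripe: %d\n",
           fits_in_one_stripe(200, 128));                                /* 0: crosses a boundary */
    return 0;
}
```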
78. 78
S.M.A.R.T and Management
• Use PCIe in-band commands to get the SSD SMART log (NVMe log): statistical data, status, warnings, temperature, endurance indicator (see the sketch below)
• Use Out-of-Band SMBus to access the VPD EEPROM and vendor information
• Use the Out-of-Band SMBus temperature sensor for closed-loop thermal controls (fan speed)
NVMe Standardizes S.M.A.R.T. on PCIe SSD
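As an illustration of the in-band path, here is a sketch of reading the SMART / Health Information log page (log identifier 02h) through the Linux NVMe driver's admin passthrough ioctl; the header location and struct fields follow my recollection of the kernel interface (older kernels expose them via <linux/nvme.h>), so verify against your kernel before relying on this:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>   /* older kernels: <linux/nvme.h> */

int main(void)
{
    uint8_t smart[512];                    /* SMART / Health log page is 512 bytes */
    struct nvme_admin_cmd cmd;

    int fd = open("/dev/nvme0", O_RDONLY); /* admin commands go to the controller node */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cmd, 0, sizeof(cmd));
    memset(smart, 0, sizeof(smart));
    cmd.opcode   = 0x02;                   /* Get Log Page */
    cmd.nsid     = 0xFFFFFFFF;             /* controller-wide log */
    cmd.addr     = (uint64_t)(uintptr_t)smart;
    cmd.data_len = sizeof(smart);
    /* CDW10: number of dwords to return (0-based) in bits 27:16, log ID in bits 7:0 */
    cmd.cdw10    = ((sizeof(smart) / 4 - 1) << 16) | 0x02;

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("ioctl"); close(fd); return 1; }

    /* A few fields from the SMART log (offsets per the NVMe specification) */
    printf("critical warning: 0x%02x\n", smart[0]);
    printf("composite temperature (Kelvin): %u\n", smart[1] | (smart[2] << 8));
    printf("percentage used: %u%%\n", smart[5]);

    close(fd);
    return 0;
}
```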
79. 79
Scalability of Multi-PCI Express* SSDs with
NVM Express
Performance on 4 PCIe SSDs = Performance on 1 PCIe SSD X 4
Advantage of NVM Express threaded and MSI-X structure!
[Charts: 100% random read and 100% random write bandwidth (GB/s) at 4K, 8K, 16K and 64K transfer sizes, for 1x, 2x and 4x NVMe 1600GB SSDs]
Measurements made on Intel system with two Intel Xeon™ CPU E5-2680 v2@ 2.80GHz and 32GB Mem running RHEL6.5 O/S, NVMe SSD is
under development, data collected by FIO* tool, numJob=30, queue depth (QD)=4 (read), QD=1 (write), libaio.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance
tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist
you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
80. 80
PCI Express* SSD with NVM Express (NVMe
SSD) deployments
Source: Geoffrey Moore, Crossing the Chasm
SSDs are a disruptive technology, approaching “The Chasm”
Adoption success relies on clear benefit, simplification, and ease of use
81. 81
Summary
• PCI Express* SSD enables lower latency and further
alleviates the IO bottleneck
• NVM Express is the interface architected for PCI
Express* SSD, NAND Flash of today and next
generation NVM of tomorrow
• Promote and adopt PCIe SSDs with NVMe as the mainstream technology and get ready for the next generation of NVM
83. 83
Risk Factors
The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the
future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,”
“intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements.
Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many
factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual
results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the
important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from
Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and
competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including
order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a
risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect
product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high
percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to
forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and
market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing
programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological
developments and to incorporate new features into its products. The gross margin percentage could vary significantly from
expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying
products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and
associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials
or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and
intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in
countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters,
infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and
compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's
products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures.
Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published
specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and
other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include
monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business
practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual
property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the
company’s most recent reports on Form 10-Q, Form 10-K and earnings release.
Rev. 1/16/14