1. FlashGrid, NVMe and 100Gbit Infiniband – The Exadata Killer?
CEO, DATABASE-SERVERS.COM
2. Agenda
Goal / Objectives
Key components
Kernel, protocols, hardware
FlashGrid stack vs. Exadata
How Exadata works
How FlashGrid works
Benchmarks
Single server
RAC
Conclusion
Q&A
3. Introduction
About Myself
Married with children
Linux since 1998
Oracle since 2000
OCM & OCP
Chairman of NGENSTOR since 2014
CEO of DATABASE-SERVERS since 2015
4. Introduction
About DATABASE-SERVERS
Founded in 2015
Custom license-optimized whitebox servers
Standard Operating Environment Toolkit for Oracle/Red Hat/SUSE Linux
CLX Cloud Systems based on OVM/LXC stack
Performance kits for all brands (HP/Oracle/Dell/Lenovo)
Watercooling
Overclocking
HBA/SSD upgrades
Tuning/Redesign of Oracle engineered systems (ODA, Exadata)
Storage extension with NGENSTOR arrays
Performance kits
5. Goal / Objectives
Requirements
Maximize I/O throughput and random I/O capability with the lowest possible CPU usage on the database server
Use only commodity hardware and technologies available today
No closed-source components such as the Exadata Storage Server software
Stay as close as possible to the standard Oracle software stack
6. Key components: Linux Kernel
Why Oracle has the UEK3/4 Kernel
The current Linux distributions are focused on stability and certification matrices:
Kernel versions are frozen and features are backported slowly and selectively.
Oracle needs more frequent updates in selected areas for its engineered systems’ performance, especially in the following areas:
Infiniband Stack (OFED)
Network and Block I/O layer
Oracle’s Solution
Compile a newer/patched version of the Linux mainline kernel and ship it with its RHEL-compatible distribution, Oracle Linux
7. Key components: Linux Kernel
UEK3’s most important new feature
Multiple SCSI/block command queues
The legacy Linux storage stack doesn’t scale:
~250,000 to 500,000 IOPS per LUN
~1,000,000 IOPS per HBA
High completion latency
High lock contention and cache line bouncing
Bad NUMA scaling
The request layer can’t handle high-IOPS or low-latency devices
SCSI drivers are tied into the request framework
SCSI-MQ/BLK-MQ are replacements for the request layer
10. Key components: Protocols
Infiniband / RDMA
Oracle uses Infiniband as the strategic interconnect component in all of its engineered systems
Main purposes:
Lower CPU utilization due to hardware offload
Lower latency for small messages such as RAC interconnect traffic (solves scalability issues)
Oracle created the RDS (Reliable Datagram Sockets) protocol, which runs on top of Infiniband/RDMA
When you connect to an Exadata Storage Server, you open an RDS socket ;-)
The distributed database approach used by Oracle via iDB relies on this framework
14. Key components: Protocols
Non-Volatile Memory Express (NVMe)
NVM Express is a standardized high-performance software interface for PCIe SSDs
Lower latency: direct connection to the CPU
Scalable performance: ~1 GB/s per PCIe lane – 4 GB/s, 8 GB/s, … in one SSD
Industry standards: NVM Express and PCI Express (PCIe) 3.0
Oracle Exadata Storage Servers X5 use NVMe SSDs
Intel DC P3600
2.6 GB/s read
270,000 IOPS @ 8k
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-p3600-series.html
16. Key components: Hardware
In order to keep up with Exadata we need:
Preferably database servers with the same or similar specs as Oracle’s
Intel Xeon E5 v3 for X5-2-like performance
Intel Xeon E7 v3 for X5-8-like performance
A high-speed network card
Preferably with hardware protocol offloading
Determine the best operation mode (=> see next slide)
High bandwidth
Low latency
NVMe drives
Internal or external
18. Key Components: Summary
Cooking a decent Exadata killer requires the right ingredients:
Kernel 3.18+ for BLK-MQ / SCSI-MQ support (e.g. UEK3)
Decent database servers
RDMA-capable high-speed network with hardware protocol offloading
High-speed flash drives
RDS support linked properly into your Oracle Home
Clusterware and RAC Support for RDS over Infiniband (Doc ID 751343.1)
19. FlashGrid Stack vs. Exadata
General workload categorization
OLTP
Driving number is transactions per second / IOPS
Small random I/Os (typically 8k)
Many random IOPS => a lot of CPU burns away
Latency is important
DWH
Driving number is processing/elapsed time
High sequential throughput (GB/s)
Large, merged I/Os => low CPU, high adapter saturation
20. FlashGrid Stack vs. Exadata
Common problems
Database server per core efficiency
A lot of waits/interrupts for network/storage
Adapter remote bandwidth issues
Even 2x 40 Gbit is not enough to move terabytes of data to the RDBMS engine
I/O Subsystem bottlenecks
Storage array cannot provide enough bandwidth or IOPS
21. FlashGrid Stack vs. Exadata
Common stack
Oracle Linux
Exadata X5: Oracle Linux 6.7
FlashGrid: Oracle Linux 7.1 or higher
Oracle Grid Infrastructure and ASM
Both recommend 12c
Infiniband / RDMA
Exadata X5: QDR (2x 40 Gbit cards)
FlashGrid: multiple card vendors supported (Mellanox, Chelsio, Intel, Solarflare)
22. FlashGrid Stack vs. Exadata
How Exadata works (simplified)
«Distributed Database»
Idea from the 90s (central DB, remote DB over a DB link)
Does anyone recall this and the DRIVING_SITE hint? (see the sketch below)
Work can be split/offloaded to remote databases (e.g. joins)
[Diagram: Exadata database server (acting as the client) connected to multiple Exadata Storage Servers]
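As a reminder of that pattern, a minimal DRIVING_SITE sketch (the table names, the warehouse_db link and the aliases are illustrative, not from the deck):

-- Without the hint, remote rows are pulled to the local DB and joined there.
-- DRIVING_SITE(o) pushes execution to the site of orders@warehouse_db, so only
-- the (smaller) joined result travels back over the DB link – conceptually what
-- Exadata does when it offloads work to the storage cells.
SELECT /*+ DRIVING_SITE(o) */
       c.cust_name,
       o.order_total
FROM   customers           c,
       orders@warehouse_db o
WHERE  c.cust_id = o.cust_id;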
23. FlashGrid Stack vs. Exadata
Core advantages of Exadata Storage Cells
We can distribute work amongst them in the same way we
would within a distributed database using DB-Links
We can use multiple data processing engines
Oracle instance on the DB Server
Engines on the Exadata Storage Cells
We save CPU and bandwidth on the effective database server
We have to transfer and process less data, as it is «pre-processed»
and also often pre-cached in the Storage Cells cache structures
But:
We have to license the Exadata Storage Cells
We have quite a vendor lock-in due the Exadata Storage Cell’s
unique architecture and proprietary IP
25. FlashGrid Stack vs. Exadata
How FlashGrid works 2/3
Hooks into your existing Oracle Grid Infrastructure stack
Basically a shared-nothing storage cluster
Exports local NVMe drives via iSCSI (over either Infiniband or TCP/IP)
Uses Oracle ASM to mirror the local disks exported to all nodes
Creates a mapping for each server to use its local NVMe drives for reads instead of going over the network
Think of it like setting an ASM preferred mirror read per server (see the sketch below)
Scales using Oracle RAC
The more nodes with local disks, the more aggregate bandwidth
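Conceptually (a minimal sketch of the ASM side only, not FlashGrid’s actual automation; the diskgroup name, failure group names and device paths are placeholders):

-- Mirror each node's exported NVMe drives into one normal-redundancy diskgroup,
-- with one failure group per node.
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP node1 DISK '/dev/placeholder/node1_nvme0'
  FAILGROUP node2 DISK '/dev/placeholder/node2_nvme0';

-- On each node's ASM instance, prefer reads from the node-local failure group;
-- writes still go to all mirror copies over the network.
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.NODE1' SID = '+ASM1';
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.NODE2' SID = '+ASM2';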
27. FlashGrid Stack vs. Exadata
Core advantages of FlashGrid
Open source
Considerably cheaper than Exadata
Scales out using Oracle RAC, just like Exadata
We are not bound to the restrictions of an engineered system in terms of hardware and software combinations
100Gbit Infiniband? => yes
Linux Containers with Oracle 12c? => yes
But:
Has no query offloading (yet) like Exadata Storage Cells
28. Benchmarks
Setup
Tool = CALIBRATE_IO (DBMS_RESOURCE_MANAGER)
Official Oracle tool
No warm-up/pre-runs to avoid caching
Not perfect, but good enough to measure IOPS/MBps (a minimal invocation is sketched after this list)
https://db-blog.web.cern.ch/blog/luca-canali/2014-05-closer-look-calibrateio
Oracle engineered system
Oracle Exadata X5-2 Quarter Rack, HC
Single x86 system
HP ML 350 G9
2x Xeon E5-2699 v3 (18 cores, like the X5-2)
3x HP P840 SAS RAID controllers with 4 GB cache, 48x OCZ Intrepid SATA SSDs (Test 1+3)
Remote access via iSCSI over RDMA (Test 2)
6x Intel P3608 NVMe drives (Test 4)
FlashGrid
2x HP ML 350 G9
2x Xeon E5-2699 v3 (18 cores, like the X5-2)
6x Intel P3608 NVMe drives
2x Infiniband 100 Gbit adapters
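For reference, a minimal CALIBRATE_IO invocation (the disk count and latency target below are placeholder values, not the exact settings used in these tests; asynchronous I/O and TIMED_STATISTICS must be enabled):

SET SERVEROUTPUT ON
DECLARE
  l_iops    PLS_INTEGER;
  l_mbps    PLS_INTEGER;
  l_latency PLS_INTEGER;
BEGIN
  -- num_physical_disks and max_latency (ms) should match the tested hardware;
  -- 6 drives / 10 ms are assumptions for this sketch.
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO(
    num_physical_disks => 6,
    max_latency        => 10,
    max_iops           => l_iops,
    max_mbps           => l_mbps,
    actual_latency     => l_latency);
  DBMS_OUTPUT.PUT_LINE('max_iops = ' || l_iops);
  DBMS_OUTPUT.PUT_LINE('max_mbps = ' || l_mbps);
  DBMS_OUTPUT.PUT_LINE('actual_latency = ' || l_latency);
END;
/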
30. Benchmarks
HP ML 350 G9, 48x SATA disks 1/3
SCSI-MQ=off Hyperthreading=off
Limited by per port speed/controller
and number of controllers
31. Benchmarks
HP ML 350 G9, 48x SATA disks 2/3
SCSI-MQ=on Hyperthreading=on
Limited by NIC count and port speed
32. Benchmarks
HP ML 350 G9, 48x SATA disks 3/3
SCSI-MQ=on Hyperthreading=on
Limited by per port speed/controller
and number of controllers
33. Benchmarks
HP ML 350 G9, NVMe drives
SCSI-MQ=on Hyperthreading=on
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 56
NVMe drives, 7 storage cells and 4
RAC nodes
34. Benchmarks
FlashGrid, 2xHP ML 350 G9, NVMe drives
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 112
NVMe drives, 14 storage cells and 8
RAC nodes
To match the read performance of an Exadata Full Rack EF, we need to scale out to 8x HP ML 350 G9 RAC nodes
35. Conclusion
Having chased Oracle Exadata’s performance for quite some years, I can conclude that:
Commodity servers
Can keep up with Oracle engineered systems thanks to the newest network and flash technology
FlashGrid
Offers excellent raw performance
Is simpler to maintain
Has a lower TCO
Is good enough for the majority of clients
Oracle Exadata
Still has a few areas where it offers unmatched performance thanks to its proprietary IP, even though its value is declining