FlashGrid, NVMe and 100Gbit
Infiniband – The Exadata Killer?
CEO, DATABASE-SERVERS.COM
Agenda
Goal / Objectives
Key components
Kernel, protocols, hardware
FlashGrid stack vs. Exadata
How Exadata works
How FlashGrid works
Benchmarks
Single server
RAC
Conclusion
Q&A
Introduction
About Myself
Married with children
Linux since 1998
Oracle since 2000
OCM & OCP
Chairman of NGENSTOR since 2014
CEO of DATABASE-SERVERS since 2015
Introduction
About DATABASE-SERVERS
Founded in 2015
Custom license-optimized whitebox servers
Standard Operating Environment Toolkit for Oracle/Redhat/Suse Linux
CLX Cloud Systems based on OVM/LXC stack
Performance kits for all brands (HP/Oracle/Dell/Lenovo)
Watercooling
Overclocking
HBA/SSD upgrades
Tuning/Redesign of Oracle engineered systems (ODA, Exadata)
Storage extension with NGENSTOR arrays
Performance kits
Goal / Objectives
Requirements
Maximized I/O throughput and random I/O capability at
the lowest possible CPU cost on the database server
Use only commodity hardware and technologies available
today
No closed source components like Exadata Storage Server
software
Stay as close as possible to the standard Oracle
software stack
Key components: Linux Kernel
Why Oracle has the UEK3/4 Kernel
The current Linux distributions are focused on stability and
certification matrices:
Kernel versions are frozen and features slowly/selectively
backported.
Oracle needs more frequent updates in selected areas for its
engineered systems' performance, especially in the following areas:
Infiniband Stack (OFED)
Network and Block I/O layer
Oracle’s Solution
Compile a newer/patched version of the Linux mainline kernel
against their CentOS fork called Oracle Linux
Key components: Linux Kernel
UEK3 most important new feature
Multiple SCSI/Block command queues
The Linux storage stack doesn't scale:
~ 250,000 to 500,000 IOPS per LUN
~ 1,000,000 IOPS per HBA
High completion latency
High lock contention and cache line bouncing
Bad NUMA scaling
The request layer can't handle high IOPS or low latency devices
SCSI drivers are tied into the request framework
SCSI-MQ/BLK-MQ are replacements for the request layer
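To check whether this actually applies on a given box, the multi-queue switch can be inspected from sysfs; a minimal sketch, assuming a grub2-based Oracle Linux install with the stock scsi_mod module:

# Check whether the multi-queue SCSI path is compiled in and enabled (Y = active)
cat /sys/module/scsi_mod/parameters/use_blk_mq

# Enable it at boot by adding "scsi_mod.use_blk_mq=Y" to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then regenerate the config and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg

# blk-mq devices expose their hardware dispatch queues under /sys/block/<dev>/mq
ls /sys/block/sda/mq/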
Key components: Linux Kernel
Command chain structure: old vs. new
[Diagram: old single-queue vs. new multi-queue command chain]
Key components: Linux Kernel
IOPS performance
Key components: Protocols
Infiniband / RDMA
Oracle uses Infiniband as the strategic interconnect
component in all of its engineered systems
Main purposes:
Lower CPU utilization due to hardware offload
Lower latency for small messages like RAC interconnect (solves
scalability issues)
Oracle created the RDS (Reliable Datagram Sockets) protocol, which
runs on top of Infiniband/RDMA
When you connect to an Exadata Storage Server, you open an RDS
Socket ;-)
The distributed database approach used by Oracle via iDB relies
on this framework
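A quick way to see whether the RDS transport is available outside Exadata is to load the module and use the rds-tools utilities; a minimal sketch, assuming the rds-tools package is installed and an IPoIB peer address (placeholder below):

# Load the RDS-over-RDMA transport and confirm the modules registered
modprobe rds_rdma
lsmod | grep '^rds'

# rds-tools are the RDS equivalents of netstat/ping
rds-info                    # list RDS sockets, connections and counters
rds-ping 192.168.10.2       # placeholder: peer IPoIB address on the fabric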
Key components: Protocols
Components used by Oracle engineered systems
Key components: Protocols
What Oracle says about RDS
Key components: Protocols
TCP/IP vs. Infiniband
[Diagram: TCP/IP stack vs. Infiniband/RDMA stack]
Key components: Protocols
Non-Volatile Memory Express (NVMe)
NVM Express is a standardized high performance software
interface for PCIe SSDs
Lower latency: Direct connection to CPU
Scalable performance: 1 GB/s per lane – 4 GB/s, 8 GB/s, … in one
SSD
Industry standards: NVM Express and PCI Express (PCIe) 3.0
Oracle Exadata Storage Servers X5 use NVMe SSD
DC P3600
2.6 GB/s read
270,000 IOPS @ 8k
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-p3600-series.html
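The drives themselves can be enumerated and checked with the standard nvme-cli utility before any benchmarking; a minimal sketch, device names are examples:

# List all NVMe controllers/namespaces with model, capacity and firmware
nvme list

# Controller details (queue counts, power states) for a single drive
nvme id-ctrl /dev/nvme0

# NVMe namespaces are native multi-queue block devices
ls /sys/block/nvme0n1/mq/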
Key components: Protocols
Performance comparison
[Diagram: traditional SCSI stack vs. NVMe stack]
Key components: Hardware
In order to keep up with Exadata we need:
Preferably database servers with the same or similar
specs as Oracle's
Intel Xeon E5-V3 for X5-2-like performance
Intel Xeon E7-V3 for X5-8-like performance
A high speed network card
Preferably with hardware protocol offloading
Determine best operation mode (=> see next slide)
High bandwidth
Low latency
NVMe Drives
Internal or external
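Before tuning anything it is worth confirming from the OS that the card really is RDMA-capable and running at the expected speed, and that the NVMe drives are visible on the PCIe bus; a minimal sketch using the usual libibverbs/infiniband-diags tools:

# RDMA-capable devices with port state, active width and speed
ibv_devinfo | egrep 'hca_id|state|active_width|active_speed|link_layer'

# Link rate as seen by the fabric (e.g. "Rate: 100" for 4x EDR)
ibstat

# NVMe drives on the PCIe bus
lspci | grep -i 'non-volatile memory'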
Key components: Hardware
Network card operation modes compared
[Diagram: iSCSI over RDMA (iSER) vs. iSCSI hardware offload (ASIC)]
Key Components: Summary
Cooking a decent Exadata killer requires the right
ingredients:
Kernel 3.18+ for BLK-MQ / SCSI-MQ support (e.g. UEK3)
Decent database servers
RDMA capable high speed network with hardware protocol
offloading
High speed flash drives
RDS support linked properly into your Oracle Home (see the relink
sketch after this list)
Clusterware and RAC Support for RDS Over Infiniband (Doc ID
751343.1)
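The relink step mentioned above is a one-liner against the RDBMS makefile; a sketch following MOS Doc ID 751343.1, to be run as the Oracle software owner on every node (skgxpinfo then reports the active IPC transport):

# Relink the Oracle binaries against the RDS IPC library
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_rds ioracle

# Verify which cluster IPC transport the binary now uses (expect "rds" instead of "udp")
$ORACLE_HOME/bin/skgxpinfo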
FlashGrid Stack vs. Exadata
General workload categorization
OLTP
Driving number is transactions per second / IOPS
Small random IOPS (typically 8k)
Many random IOPS => a lot of CPU burns away
Latency is important
DWH
Driving number is processing/elapsed time
High sequential throughput (GB/s)
Large, merged I/O => low CPU, high adapter saturation
FlashGrid Stack vs. Exadata
Common problems
Database server per core efficiency
A lot of waits/interrupts for network/storage
Adapter remote bandwidth issues
Even 2x40 Gbit are not enough to move terabytes of data to the
RDBMS engine
I/O Subsystem bottlenecks
Storage array cannot provide enough bandwidth or IOPS
FlashGrid Stack vs. Exadata
Common stack
Oracle Linux
Exadata X5: Oracle Linux 6.7
FlashGrid: Oracle Linux 7.1 or higher
Oracle Grid Infrastructure and ASM
Both recommend 12c
Infiniband / RDMA
Exadata X5: uses QDR (2x40Gbit Cards)
FlashGrid: multiple card vendors supported (Mellanox, Chelsio,
Intel, Solarflare)
FlashGrid Stack vs. Exadata
How Exadata works (simplified)
«Distributed Database»
Idea from the 90's (central DB, remote DB over DB-Link)
Anyone recall this and the DRIVING_SITE hint? (see the query sketch below)
Work can be split/offloaded to remote databases (e.g. joins)
[Diagram: Exadata database server (acting as client) connected to two Exadata Storage Servers]
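For anyone who does not recall the DRIVING_SITE hint: it tells the optimizer to execute a distributed join at the remote database so only the result set travels back, which is conceptually what Exadata does with its storage cells. A minimal sketch, with made-up table and database-link names:

# Hypothetical distributed join: DRIVING_SITE pushes execution to the remote
# database so only the aggregated result travels back over the network
sqlplus -s scott/tiger <<'SQL'
SELECT /*+ DRIVING_SITE(o) */ c.cust_name, SUM(o.amount) AS total
FROM   customers c,
       orders@remote_db o      -- remote_db is a placeholder database link
WHERE  o.cust_id = c.cust_id
GROUP  BY c.cust_name;
SQL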
FlashGrid Stack vs. Exadata
Core advantages of Exadata Storage Cells
We can distribute work amongst them in the same way we
would within a distributed database using DB-Links
We can use multiple data processing engines
Oracle instance on the DB Server
Engines on the Exadata Storage Cells
We save CPU and bandwidth on the effective database server
We have to transfer and process less data, as it is «pre-processed»
and also often pre-cached in the Storage Cells cache structures
But:
We have to license the Exadata Storage Cells
We have quite a vendor lock-in due to the Exadata Storage Cell's
unique architecture and proprietary IP
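How much work the cells actually save can be read from the offload-related statistics; a minimal sketch, the two statistic names exist on Exadata but the exact set varies by release:

# Compare bytes eligible for offload with bytes actually shipped back to the host
sqlplus -s / as sysdba <<'SQL'
SELECT name, ROUND(value/1024/1024) AS mb
FROM   v$sysstat
WHERE  name IN ('cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan');
SQL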
FlashGrid Stack vs. Exadata
How FlashGrid works 1/3
FlashGrid Stack vs. Exadata
How FlashGrid works 2/3
Hooks into your existing Oracle Grid Infrastructure stack
Basically a shared-nothing storage cluster
Exports local NVMe Drives via iSCSI (either Infiniband or TCP/IP)
Uses Oracle ASM to mirror the local disks exported to all nodes
Creates a mapping for each server to use its local NVMe drives for
reads instead of going over the network.
Think of it like setting ASM preferred mirror read per server (see the
sketch after this list)
Scales using Oracle RAC
The more nodes with local disks, the more aggregate bandwidth
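The per-server preferred-read idea maps onto a standard ASM parameter; a minimal sketch for a two-node cluster with disk group DATA and failure groups named after the nodes (names are placeholders, FlashGrid's own tooling sets this up for you):

# On each ASM instance, prefer the failure group built from that node's local NVMe drives
sqlplus -s / as sysasm <<'SQL'
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.NODE1' SCOPE=BOTH SID='+ASM1';
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.NODE2' SCOPE=BOTH SID='+ASM2';
SQL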
FlashGrid Stack vs. Exadata
How FlashGrid works 3/3
FlashGrid Stack vs. Exadata
Core advantages of FlashGrid
Open source
Considerably cheaper than Exadata
Scale up using Oracle RAC just like Exadata
We are not bound to the restrictions of an engineered
system in terms of hard- and software combination
100Gbit Infiniband? => yes
Linux Containers with Oracle 12c? => yes
But:
No query offloading (yet), unlike Exadata Storage Cells
Benchmarks
Setup
Tool = CALIBRATE_IO (an invocation sketch follows this list)
Official Oracle Tool
No warm up/pre-runs to avoid caching
Not perfect, but good enough to measure IOPS/MB/s
https://db-blog.web.cern.ch/blog/luca-canali/2014-05-closer-look-calibrateio
Oracle Engineered system
Oracle Exadata X5-2 Quarter Rack, HC
Single x86 system
HP ML 350 G9
2x Xeon E5-V3 2699 (18 core, like X5-2)
3x HP P840 SAS RAID Controller with 4 GB cache, 48 OCZ Intrepid SATA SSDs (Test1+3)
Remote access via iSCSI over RDMA (Test2)
6x Intel P3608 NVMe drives connected (Test4)
FlashGrid
2x HP ML 350 G9
2x Xeon E5-V3 2699 (18 core, like X5-2)
6x Intel P3608 NVMe drives connected
2x Infiniband 100 Gbit Adapter
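The CALIBRATE_IO runs can be driven entirely from the shell; a minimal sketch, the disk count and latency target are assumptions that should match the rig under test (48 SSDs here), and asynchronous I/O plus timed_statistics must be enabled for the procedure to run:

# Run DBMS_RESOURCE_MANAGER.CALIBRATE_IO and print the results
sqlplus -s / as sysdba <<'SQL'
SET SERVEROUTPUT ON
DECLARE
  l_max_iops    PLS_INTEGER;
  l_max_mbps    PLS_INTEGER;
  l_latency_ms  PLS_INTEGER;
BEGIN
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO(
    num_physical_disks => 48,   -- assumption: matches the 48-SSD single-server rig
    max_latency        => 10,   -- target latency in milliseconds
    max_iops           => l_max_iops,
    max_mbps           => l_max_mbps,
    actual_latency     => l_latency_ms);
  DBMS_OUTPUT.PUT_LINE('max_iops='||l_max_iops||
                       '  max_mbps='||l_max_mbps||
                       '  actual_latency_ms='||l_latency_ms);
END;
/
SQL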
Benchmarks
Oracle Exadata X5-2 Quarter Rack, HC
No cache, no performance for high capacity Storage Cells …
Benchmarks
HP ML 350 G9, 48x SATA disks 1/3
SCSI-MQ=off Hyperthreading=off
Limited by per port speed/controller
and number of controllers
Benchmarks
HP ML 350 G9, 48x SATA disks 2/3
SCSI-MQ=on Hyperthreading=on
Limited by NIC count and port speed
Benchmarks
HP ML 350 G9, 48x SATA disks 3/3
SCSI-MQ=on Hyperthreading=on
Limited by per port speed/controller
and number of controllers
Benchmarks
HP ML 350 G9, NVMe drives
SCSI-MQ=on Hyperthreading=on
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 56
NVMe drives, 7 storage cells and 4
RAC nodes
Benchmarks
FlashGrid, 2xHP ML 350 G9, NVMe drives
Limited by number of NVMe drives
and RAC Nodes. Exadata uses 112
NVMe drives, 14 storage cells and 8
RAC nodes
To match the read performance of an Exadata Full Rack EF, we need
to scale out to 8x HP ML 350 G9 RAC nodes
Conclusion
Having chased Oracle Exadata’s performance for quite
some years I can conclude that:
Commodity servers
Can keep up with Oracle engineered systems thanks to the newest
network and flash technology
FlashGrid
Offers excellent raw performance
Is simpler to maintain
Has cheaper TCO
Is just good enough for the majority of clients
Oracle Exadata
Still has a few areas where it offers unmatched performance thanks
to its proprietary IP, even though its value is declining
Thanks to our partners
Q&A
Contact
elgreco@linux.com
efstathios.efstathiou@database-servers.com
Thank You