LUG 2014

Presentation slides for Lustre User Group 2014 (http://opensfs.org/lug14/).


  1. 1. Lustre 2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE. Hitoshi Sato*1, Shuichi Ihara*2, Satoshi Matsuoka*1 (*1 Tokyo Institute of Technology, *2 DataDirect Networks Japan)
  2. 2. Tokyo Tech's TSUBAME2.0 (Nov. 1, 2010): "The Greenest Production Supercomputer in the World."
     • [Slide graphic comparing TSUBAME 2.0 with earlier development: 32 nm / 40 nm process; memory bandwidth figures of >400 GB/s, >1.6 TB/s, >12 TB/s, and >600 TB/s at different levels of the system; 80 Gbps node network bandwidth and 220 Tbps network bisection bandwidth; power figures of ~1 kW, 35 kW, and 1.4 MW max.]
  3. 3. TSUBAME2 System Overview
     • Storage: 11 PB total (7 PB HDD, 4 PB tape, 200 TB SSD), attached over the QDR InfiniBand network (QDR IB (×4) × 20 and × 8 links, 10 GbE × 2); volume groups of 1.2 PB and 3.6 PB.
       – Parallel file system volumes (Lustre): "Global Work Space" #1–#3 (/work0, /work1, /gscr, /data0, /data1) on SFA10k #1–#5.
       – Home volumes on SFA10k #6: GPFS #1–#4 with cNFS/Clustered Samba for home directories, plus NFS/CIFS/iSCSI served by BlueARC for system applications.
       – 2.4 PB HDD + GPFS + tape backup; local SSDs on the compute nodes.
     • Thin nodes: 1408 × HP ProLiant SL390s G7 (32 nodes × 44 racks); CPU: 2 × Intel Westmere-EP 2.93 GHz, 6 cores each = 12 cores/node; GPU: 3 × NVIDIA Tesla K20X per node; Mem: 54 GB (96 GB); SSD: 2 × 60 GB = 120 GB (2 × 120 GB = 240 GB).
     • Medium nodes: 24 × HP ProLiant DL580 G7; CPU: 4 × Intel Nehalem-EX 2.0 GHz, 8 cores each = 32 cores/node; GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070; Mem: 128 GB; SSD: 4 × 120 GB = 480 GB.
     • Fat nodes: 10 × HP ProLiant DL580 G7; CPU: 4 × Intel Nehalem-EX 2.0 GHz, 8 cores each = 32 cores/node; GPU: NVIDIA Tesla S1070; Mem: 256 GB (512 GB); SSD: 4 × 120 GB = 480 GB.
     • Compute nodes total: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD.
     • Interconnect: full-bisection optical QDR InfiniBand; core: 12 × Voltaire Grid Director 4700 (324 QDR ports); edge: 179 × Voltaire Grid Director 4036 (36 QDR ports) and 6 × Grid Director 4036E (34 QDR ports + 2 × 10 GbE ports).
  4. 4. TSUBAME2 System Overview (same diagram as the previous slide), annotated with I/O workload characteristics: the Lustre work space volumes see mostly read I/O (data-intensive applications, parallel workflows, parameter surveys); the home volumes provide home storage for the compute nodes, cloud-based campus storage services, and backup; fine-grained read/write I/O (checkpoints, temporary files) goes to the node-local SSDs.
  5. 5. Towards TSUBAME 3.0
     • Interim upgrade from TSUBAME2.0 to 2.5 (Sept. 10, 2013): replaced the TSUBAME2.0 compute nodes' Fermi GPUs (3 × 1408 = 4224 GPUs), NVIDIA Fermi M2050 to Kepler K20X, raising the SFP/DFP peak from 4.8 PF / 2.4 PF to 17 PF / 5.7 PF.
     • TSUBAME-KFC: a TSUBAME3.0 prototype system with advanced cooling for next-generation supercomputers; 40 oil-submerged compute nodes, 160 NVIDIA Kepler K20X, 80 Ivy Bridge Xeons, FDR InfiniBand, 210.61 TFlops in total.
     • Storage design is also a key issue!
  6. 6. Existing Storage Problems
     • Small I/O: the PFS has a throughput-oriented design and a master-worker configuration, so metadata references become a performance bottleneck and IOPS are low (1k–10k IOPS).
     • I/O contention: the PFS is shared across compute nodes, so multiple applications contend for I/O and performance degrades.
     • [Lustre I/O monitoring chart: IOPS over time, showing small I/O operations from user applications, a period where PFS operation went down, and the later recovery of IOPS.] A minimal monitoring sketch appears after the slide list.
  7. 7. Requirements for the Next-Generation I/O Subsystem (1)
     • Basic requirements:
       – TB/s sequential bandwidth for traditional HPC applications.
       – Extremely high IOPS to cope with applications that generate massive amounts of small I/O (e.g., graph processing).
       – Some application workflows have different I/O requirements for each workflow component.
     • Reduction/consolidation of PFS I/O resources is needed while still achieving TB/s performance:
       – We cannot simply use a larger number of I/O servers and drives to reach ~TB/s throughput.
       – Many constraints: space, power, budget, etc.; performance per watt and performance per rack unit are crucial.
  8. 8. Requirements for the Next-Generation I/O Subsystem (2)
     • New approaches:
       – TSUBAME2.0 pioneered the use of local flash storage as a high-IOPS alternative to an external PFS.
       – Tiered and hybrid storage environments, combining (node-) local flash with an external PFS (a read-path sketch appears after the slide list).
     • Industry status:
       – High-performance, high-capacity flash (and other new semiconductor devices) is becoming available at reasonable cost.
       – New approaches/interfaces for using high-performance devices (e.g., NVM Express).
  9. 9. Key Requirements for the I/O System
     • Electrical power and space are still challenging:
       – Reduce the number of OSSs/OSTs while keeping high I/O performance and large storage capacity.
       – How to maximize Lustre performance.
     • Understand the new types of flash storage devices:
       – Do they bring benefits to Lustre? How should they be used?
       – Does an MDT on SSD help?
  10. 10. Lustre 2.x Performance Evaluation
     • Maximize OSS/OST performance for large sequential I/O: single OSS and OST performance; 1 MB vs. 4 MB RPC; not only peak but also sustained performance.
     • Small I/O to a shared file: 4 KB random read access to Lustre.
     • Metadata performance: CPU impact? Do SSDs help?
  11. 11. Throughput per OSS Server
     • [Charts: single OSS write and read throughput (MB/s) vs. number of clients (1–16) for no bonding, AA-bonding, and AA-bonding with CPU binding.]
     • Single FDR LNET bandwidth: 6.0 GB/s; dual IB FDR LNET bandwidth: 12.0 GB/s.
  12. 12. Performance comparison (1 MB vs. 4 MB RPC, IOR FPP, asyncIO)
     • 4 × OSS, 280 × NL-SAS; 14 clients and up to 224 processes. (An IOR sweep sketch appears after the slide list.)
     • [Charts: IOR write and read throughput (MB/s), FPP, transfer size 4 MB, at 28/56/112/224 processes, comparing 1 MB and 4 MB RPC.]
  13. 13. Lustre Performance Degradation with a Large Number of Threads: OSS Standpoint
     • 1 × OSS, 80 × NL-SAS; 20 clients and up to 960 processes.
     • [Charts: IOR FPP 1 MB write and read throughput (MB/s) vs. number of processes for RPC = 4 MB and RPC = 1 MB, with callouts of 4.5x, 30%, and 55%.]
  14. 14. Lustre Performance Degradation with a Large Number of Threads: SSDs vs. Nearline Disks
     • IOR (FPP, 1 MB) to a single OST; 20 clients and up to 640 processes.
     • [Charts: write and read performance per OST as a percentage of peak (maximum of each dataset = 100%) vs. number of processes, for 7.2K SAS with 1 MB RPC, 7.2K SAS with 4 MB RPC, and SSD with 1 MB RPC.]
  15. 15. Lustre 2.4 4K Random Read Performance (with 10 SSDs)
     • 2 × OSS, 10 × SSD (2 × RAID5), 16 clients.
     • Create large files (FPP and SSF), then run 4 KB random read access. (A single-process sketch of this access pattern appears after the slide list.)
     • [Chart: 4K random read IOPS from 1 node/32 processes up to 16 nodes/512 processes, for FPP and SSF with POSIX and MPI-IO.]
  16. 16. Conclusions: Lustre with SSDs
     • SSD pools in Lustre (or SSDs in general): SSDs change the behavior of pools, as "seek" latency is no longer an issue.
     • Application scenarios: very consistent performance, independent of the I/O size or the number of concurrent I/Os; very high random-access rates to a single shared file (millions of IOPS in a large file system with enough clients).
     • With SSDs the bottleneck is no longer the device, but the RAID array (RAID stack) and the file system: very high random read IOPS in Lustre is possible, but only if the metadata workload is limited (i.e., random I/O to a single shared file).
     • We will investigate more benchmarks of Lustre on SSDs.
  17. 17. Metadata Performance Improvements
     • Very significant improvements since Lustre 2.3: increased performance for both unique- and shared-directory metadata, and almost linear scaling for most metadata operations with DNE. (A simple metadata-timing sketch appears after the slide list.)
     • [Charts: ops/sec for directory creation/stat/removal and file creation/stat/removal. Left: Lustre 1.8.8 vs. 2.4.2 for operations in a shared directory (1 × MDS, 16 clients, 768 Lustre exports via multiple mount points). Right: single MDS vs. dual MDS (DNE) for operations in unique directories, using the same storage resources and only adding MDS resources.]
  18. 18. Metadata Performance: CPU Impact
     • Metadata performance is highly dependent on CPU performance: even modest differences in CPU frequency or memory speed can have a significant impact.
     • [Charts: file creation ops/sec vs. number of processes, in a unique directory and in a shared directory, comparing 3.6 GHz and 2.4 GHz CPUs.]
  19. 19. Conclusions: Metadata Performance
     • Significant improvements for both shared- and unique-directory metadata performance since Lustre 2.3: gains across all metadata workloads, especially removals. SSDs add to performance, but the impact is limited (and probably mostly due to latency rather than the IOPS capability of the device).
     • Important considerations for metadata benchmarks: they are very sensitive to the underlying hardware; client-side limitations matter when evaluating metadata performance and scaling; only directory/file stats scale with the number of processes per node, while other metadata workloads do not appear to scale with the number of processes on a single node.
  20. 20. Summary
     • Lustre today: significant performance increases in object storage and metadata; much better spindle efficiency (now approaching the physical drive limit); client performance, not the server side, is quickly becoming the limit for small-file and random-I/O performance.
     • Consider alternative scenarios: a PFS combined with fast local devices (or even local instances of the PFS), or a PFS combined with a global acceleration/caching layer.
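
The "Lustre I/O Monitoring" chart on slide 6 plots IOPS over time on the shared PFS. Below is a minimal client-side sketch of that kind of monitoring, assuming a Lustre client where `lctl get_param llite.*.stats` is available; the parameter path, the output format, and the crude "sum all operations" metric are assumptions for illustration, not the tooling used for the slide.

```python
#!/usr/bin/env python3
"""Rough client-side Lustre ops/sec monitor (illustrative sketch only)."""
import subprocess
import time

def read_op_counters():
    """Return {operation: total sample count} summed over all llite mounts."""
    out = subprocess.run(["lctl", "get_param", "-n", "llite.*.stats"],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        # Typical counter line: "<op> <samples> samples [<unit>] ..."
        if len(fields) >= 3 and fields[2] == "samples":
            counters[fields[0]] = counters.get(fields[0], 0) + int(fields[1])
    return counters

def monitor(interval=5.0):
    """Print the aggregate operation rate every `interval` seconds."""
    prev = read_op_counters()
    while True:
        time.sleep(interval)
        cur = read_op_counters()
        delta = sum(cur.get(op, 0) - prev.get(op, 0) for op in cur)
        print(f"{time.strftime('%H:%M:%S')}  ~{delta / interval:.0f} ops/sec")
        prev = cur

if __name__ == "__main__":
    monitor()
```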
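Slide 8 proposes tiered and hybrid storage that combines node-local flash with an external PFS. A minimal sketch of one such read path follows, assuming hypothetical mount points /ssd/cache (node-local flash) and /work0 (Lustre); the staging policy is illustrative and not from the talk.

```python
import os
import shutil

LOCAL_CACHE = "/ssd/cache"    # hypothetical node-local flash mount
LUSTRE_ROOT = "/work0"        # hypothetical Lustre work space

def open_cached(relpath, mode="rb"):
    """Open a file from local flash if cached, otherwise stage it in from Lustre.

    Subsequent small or random re-reads then hit the SSD instead of the shared PFS.
    """
    cached = os.path.join(LOCAL_CACHE, relpath)
    if not os.path.exists(cached):
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        # One bulk sequential read from the PFS, which is what it is good at.
        shutil.copyfile(os.path.join(LUSTRE_ROOT, relpath), cached)
    return open(cached, mode)

# Example: repeated small reads served from the local SSD copy.
# with open_cached("dataset/graph.bin") as f:
#     chunk = f.read(4096)
```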
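Slides 11 and 12 sweep client/process counts and compare 1 MB vs. 4 MB RPC with IOR. The following sketch shows how such a sweep could be driven; it assumes `mpirun` and `ior` are in PATH, a writable directory on the Lustre mount, and that the client RPC size is switched with `lctl set_param osc.*.max_pages_per_rpc` (server-side large-I/O tuning is assumed to be done separately). This is not the authors' actual test harness.

```python
#!/usr/bin/env python3
"""Sketch of an IOR sweep in the spirit of slides 11-12 (illustrative only)."""
import subprocess

TESTFILE = "/work0/bench/ior.dat"   # hypothetical path on the Lustre work space

def set_rpc_size(pages):
    # Client-side RPC size: 256 pages = 1 MB, 1024 pages = 4 MB on 4 KiB pages.
    # Must run as root on every client in a real test.
    subprocess.run(["lctl", "set_param", f"osc.*.max_pages_per_rpc={pages}"], check=True)

def run_ior(nproc, xfersize="4m", blocksize="4g", file_per_process=True):
    cmd = ["mpirun", "-np", str(nproc), "ior",
           "-w", "-r",                        # write phase then read phase
           "-t", xfersize, "-b", blocksize,   # transfer size and per-process block size
           "-o", TESTFILE]
    if file_per_process:
        cmd.append("-F")                      # FPP instead of a single shared file
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    for pages, label in [(256, "1MB RPC"), (1024, "4MB RPC")]:
        set_rpc_size(pages)
        for nproc in (28, 56, 112, 224):      # process counts from slide 12
            print(f"--- {label}, {nproc} processes ---")
            run_ior(nproc)
```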
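Slide 15 measures 4 KB random reads against large pre-created files using IOR (POSIX and MPI-IO) from 16 clients. The single-process sketch below only reproduces the access pattern, aligned 4 KiB reads at random offsets, against a hypothetical large file on the Lustre mount.

```python
#!/usr/bin/env python3
"""Single-process 4 KiB random-read loop (access-pattern sketch only)."""
import os
import random
import time

PATH = "/work0/bench/random.dat"   # hypothetical large pre-created file on Lustre
IO_SIZE = 4096
DURATION = 10.0                    # seconds to run

def random_read_iops():
    fd = os.open(PATH, os.O_RDONLY)
    try:
        blocks = os.fstat(fd).st_size // IO_SIZE
        done, start = 0, time.time()
        while time.time() - start < DURATION:
            offset = random.randrange(blocks) * IO_SIZE   # aligned random offset
            os.pread(fd, IO_SIZE, offset)
            done += 1
        return done / (time.time() - start)
    finally:
        os.close(fd)

if __name__ == "__main__":
    print(f"{random_read_iops():.0f} IOPS (single process)")
```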
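Slides 17 and 18 report metadata ops/sec (creation, stat, removal) measured across many clients and mount points, mdtest-style. The sketch below times the same three operations from a single process against one hypothetical directory, just to make the measured quantities concrete; it is not the benchmark used for the slides.

```python
#!/usr/bin/env python3
"""mdtest-style create/stat/unlink timing from one process (illustrative sketch)."""
import os
import time

TARGET_DIR = "/work0/bench/mdtest.0"   # hypothetical shared or unique directory under test
NFILES = 10000

def timed(label, func):
    start = time.time()
    func()
    elapsed = time.time() - start
    print(f"{label:10s} {NFILES / elapsed:10.0f} ops/sec")

def main():
    os.makedirs(TARGET_DIR, exist_ok=True)
    names = [os.path.join(TARGET_DIR, f"f.{i}") for i in range(NFILES)]

    timed("creation", lambda: [open(n, "w").close() for n in names])
    timed("stat",     lambda: [os.stat(n) for n in names])
    timed("removal",  lambda: [os.unlink(n) for n in names])

if __name__ == "__main__":
    main()
```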