Debugging Slow Buffered Reads to the Lustre Filesystem
By Robert Roy, Senior Staff Engineer
February 2016
In this deck from the 2016 Stanford HPC Conference, Robert Roy from Seagate Technologies presents: Debugging Slow Buffered Reads to the Lustre File System.

"Buffered read performance under Lustre has been inexplicably slow when compared to writes or even direct IO reads. A balanced FDR-based Object Storage Server can easily saturate the network or backend disk storage using o_direct based IO. However, buffered IO reads remain at 80% of write bandwidth. In this presentation we will characterize the problem, discuss how it was debugged, and present the proposed resolution. The format will be a presentation followed by Q&A."

Learn more: http://seagate.com

See more talks from the Stanford HPC Conference: http://insidehpc.com/2016-stanford-hpc-conference-video-gallery/

  1. February 2016. Debugging Slow Buffered Reads to the Lustre Filesystem. By Robert Roy, Senior Staff Engineer.
  2. The Problem: direct IO reads are better than buffered IO.
     Seagate CS9000 with 4M RPCs:
     ›  Reads, buffered: ~3.5 GB/s per OST
     ›  Reads, o_direct: ~4.5 GB/s per OST
     ›  Writes, buffered: ~4.5 GB/s per OST
     More clients do not produce more bandwidth, which suggests a server-side limit. But the data path on the server side is the same for o_direct and buffered IO, which points back to the client. Buffered IO uses the page cache, which is populated by readahead, so client-side readahead is the suspect.
  3. The Root Cause: readahead requests never ramp up to 4M RPCs.

     [rroy@rroy-vm-wireshark ~]$ tshark -r buffered_1node_1thread.cap.gz -Tfields \
         -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
         -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
         -R lustre.niobuf_remote | head -10
     172.19.62.138 172.19.55.5 1903 0 1048576
     172.19.62.138 172.19.55.5 1903 1048576 2097152
     172.19.62.138 172.19.55.5 1903 3145728 1048576
     172.19.62.138 172.19.55.5 1903 4194304 1048576
     172.19.62.138 172.19.55.5 1903 5242880 2097152
     172.19.62.138 172.19.55.5 1903 7340032 1048576
     172.19.62.138 172.19.55.5 1903 8388608 1048576
     172.19.62.138 172.19.55.5 1903 9437184 2097152
     172.19.62.138 172.19.55.5 1903 11534336 1048576
     172.19.62.138 172.19.55.5 1903 12582912 1048576
     ...
     172.19.62.138 172.19.55.5 1903 1685061632 1048576
     172.19.62.138 172.19.55.5 1903 1686110208 1048576
     172.19.62.138 172.19.55.5 1903 1687158784 1048576
     172.19.62.138 172.19.55.5 1903 1688207360 1048576
     172.19.62.138 172.19.55.5 1903 1689255936 1048576
     172.19.62.138 172.19.55.5 1903 1690304512 1048576
     172.19.62.138 172.19.55.5 1903 1691353088 1048576
     172.19.62.138 172.19.55.5 1903 1692401664 1048576
     172.19.62.138 172.19.55.5 1903 1693450240 1048576
     172.19.62.138 172.19.55.5 1903 1694498816 1048576
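To see the distribution of RPC transfer sizes at a glance, the tshark field output can be tallied with awk. This is an illustrative sketch, not from the deck: the embedded sample lines mimic the capture above (columns: src, dst, object id, offset, length); in practice you would pipe the tshark command itself into the awk stage.

```shell
# Count RPCs by transfer size (field 5 of the tshark output).
# Sample data is inlined via a heredoc so the sketch is self-contained.
cat <<'EOF' | awk '{count[$5]++} END {for (len in count) print len, count[len]}' | sort -n
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
EOF
# → 1048576 3
# → 2097152 1
```

On the real capture, a healthy 4M-RPC ramp-up would show the 4194304 bucket dominating; here almost everything stays at 1048576.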
  4. The Root Cause: even with a large 64MB IO size, all IO serviced from readahead is 1MB in size.

     [rroy@rroy-vm-wireshark ~]$ tshark -r buffered_32node_4thread_64mIO.cap.gz -Tfields \
         -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
         -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
         -R lustre.niobuf_remote | grep 288 | head -n 20
     172.19.62.138 172.19.55.4 2288 0 4194304
     172.19.62.138 172.19.55.4 2288 4194304 4194304
     172.19.62.138 172.19.55.4 2288 8388608 4194304
     172.19.62.138 172.19.55.4 2288 12582912 4194304
     172.19.62.138 172.19.55.4 2288 16777216 4194304
     172.19.62.138 172.19.55.4 2288 20971520 4194304
     172.19.62.138 172.19.55.4 2288 25165824 4194304
     172.19.62.138 172.19.55.4 2288 29360128 4194304
     172.19.62.138 172.19.55.4 2288 33554432 4194304
     172.19.62.138 172.19.55.4 2288 37748736 4194304
     172.19.62.138 172.19.55.4 2288 41943040 4194304
     172.19.62.138 172.19.55.4 2288 46137344 4194304
     172.19.62.138 172.19.55.4 2288 50331648 4194304
     172.19.62.138 172.19.55.4 2288 54525952 4194304
     172.19.62.138 172.19.55.4 2288 58720256 4194304
     172.19.62.138 172.19.55.4 2288 62914560 4194304
     172.19.62.138 172.19.55.4 2288 67108864 1048576
     172.19.62.138 172.19.55.4 2288 68157440 1048576
     172.19.62.138 172.19.55.4 2288 69206016 1048576
     172.19.62.138 172.19.55.4 2288 70254592 1048576
  5. The Source of the Problem: /lustre/llite/rw.c

     #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)

     And right above that line…

     /* RAS_INCREASE_STEP should be (1UL << (inode->i_blkbits - PAGE_CACHE_SHIFT)).
      * Temporarily set RAS_INCREASE_STEP to 1MB. After 4MB RPC is enabled
      * by default, this should be adjusted corresponding with max_read_ahead_mb
      * and max_read_ahead_per_file_mb otherwise the readahead budget can be used
      * up quickly which will affect read performance significantly. See LU-2816 */
  6. The Solution: set the increase step to the same value as the RPC size.

     < #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
     > #define RAS_INCREASE_STEP(inode) (PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)
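The effect of this one-line change can be checked with shell arithmetic. A minimal sketch, assuming 4 KiB pages (PAGE_CACHE_SHIFT = 12, the common x86_64 case) and the 4 MiB max BRW size used in the deck: the readahead window now grows by 1024 pages (4 MiB) per step instead of 256 pages (1 MiB), so readahead can ramp up to full-size RPCs.

```shell
# Readahead window growth per step, in pages, as computed by RAS_INCREASE_STEP.
# Assumptions (not from the source): 4 KiB pages, 4 MiB PTLRPC_MAX_BRW_SIZE.
PAGE_CACHE_SHIFT=12
ONE_MB_BRW_SIZE=$((1 << 20))
PTLRPC_MAX_BRW_SIZE=$((4 << 20))

echo "old step: $((ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)) pages"     # old step: 256 pages
echo "new step: $((PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)) pages" # new step: 1024 pages
```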
  7. Results

     INCREASE_STEP  RA MB  File RA MB  Clients  PPN  IO Size  Read Average
     1MB            40     40          32       1    1M       6928.02
     4MB            40     40          32       1    1M       8629.80
     1MB            160    640         32       1    1M       7137.50
     4MB            160    640         32       1    1M       9528.45

     IOR -r -v -F -b 131072m -t 1m -i 3 -m -k -D 60
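For context, the relative improvement in these results works out to roughly 25% at the smaller readahead budget and 34% at the larger one. A quick awk check, arithmetic only, using the read averages above:

```shell
# Relative read-bandwidth improvement of the 4MB step over the 1MB step,
# for the two readahead configurations (RA MB / File RA MB) in the table.
awk 'BEGIN {
  printf "RA 40/40:   +%.1f%%\n", (8629.80 - 6928.02) / 6928.02 * 100
  printf "RA 160/640: +%.1f%%\n", (9528.45 - 7137.50) / 7137.50 * 100
}'
# → RA 40/40:   +24.6%
# → RA 160/640: +33.5%
```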
  8. Conclusion
  9. Conclusion and More Information
     ›  Buffered reads can be improved significantly when 4M RPCs are in use.
     ›  Seagate implemented a parameter to address the issue:
        lctl set_param -n llite.*.read_ahead_step 4
        https://github.com/Xyratex/lustre-stable/commit/2395f8e0e7e963aec43deb07d719e9229884758c
     ›  LU-7140 tracks the upstream work: https://jira.hpdd.intel.com/browse/LU-7140
  10. Thank You
  11. Questions?
  12. About Seagate
  13. Seagate Cloud Systems & Silicon Group (OEM, HPC, Scale-Out Systems)
      ›  2+ million enclosures
      ›  17+ Petabytes shipped
      ›  Drive variety (HDD, SAS, SATA, SSD, hybrid)
      ›  Enclosures, controllers
      ›  Customer-driven partnership
      ›  Services: logistics, fulfillment, warranty, design, supply chain
      ›  Purpose-engineered to optimize capacity and performance
      ›  40% fewer racks required
      ›  >1 TB/sec file system performance
      ›  Solutions for object storage
      ›  Reference architectures for open source and software-defined storage
      ›  Private cloud appliances for backup and recovery
      ›  Modular, scalable components for DIY customers
  14. Powering the Fastest HPC Sites: Awards, Award-Winning ClusterStor Architecture