Debugging Slow Buffered Reads to the Lustre File System

February 2016
By Robert Roy, Senior Staff Engineer

The Problem
Direct IO reads are faster than buffered IO reads
Seagate CS9000 with 4M RPCs:
  Buffered reads: ~3.5 GB/s per OST
  O_DIRECT reads: ~4.5 GB/s per OST
  Buffered writes: ~4.5 GB/s per OST
More clients do not produce more bandwidth
  Suggests a server-side problem
The data path on the server side is the same for O_DIRECT and buffered IO
  Suggests a client-side problem
Buffered IO uses the page cache, which is populated by readahead
  Client-side readahead is the suspect

The Root Cause
Readahead requests never ramp up to 4M RPCs
[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_1node_1thread.cap.gz -Tfields \
    -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
    -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
    -R lustre.niobuf_remote | head -10
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
172.19.62.138 172.19.55.5 1903 5242880 2097152
172.19.62.138 172.19.55.5 1903 7340032 1048576
172.19.62.138 172.19.55.5 1903 8388608 1048576
172.19.62.138 172.19.55.5 1903 9437184 2097152
172.19.62.138 172.19.55.5 1903 11534336 1048576
172.19.62.138 172.19.55.5 1903 12582912 1048576
...
172.19.62.138 172.19.55.5 1903 1685061632 1048576
172.19.62.138 172.19.55.5 1903 1686110208 1048576
172.19.62.138 172.19.55.5 1903 1687158784 1048576
172.19.62.138 172.19.55.5 1903 1688207360 1048576
172.19.62.138 172.19.55.5 1903 1689255936 1048576
172.19.62.138 172.19.55.5 1903 1690304512 1048576
172.19.62.138 172.19.55.5 1903 1691353088 1048576
172.19.62.138 172.19.55.5 1903 1692401664 1048576
172.19.62.138 172.19.55.5 1903 1693450240 1048576
172.19.62.138 172.19.55.5 1903 1694498816 1048576
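
The capture above can be summarized by counting how often each RPC size occurs. This is a minimal sketch, assuming the five-column tshark output shown above (source IP, destination IP, object ID, offset, length) has been saved as text. The function name `rpc_size_histogram` is hypothetical and not part of any Lustre or Wireshark tooling.

```python
from collections import Counter

MIB = 1024 * 1024

def rpc_size_histogram(lines):
    """Count read RPCs by size in MiB from tshark field output.

    Each line is expected to hold: src_ip dst_ip object_id offset length.
    Lines that do not have exactly five fields (such as "...") are skipped.
    """
    sizes = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        length = int(fields[4])
        sizes[length // MIB] += 1
    return dict(sizes)

# A few lines copied from the capture above.
sample = """\
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
172.19.62.138 172.19.55.5 1903 5242880 2097152
...
172.19.62.138 172.19.55.5 1903 1694498816 1048576
""".splitlines()

print(rpc_size_histogram(sample))  # {1: 4, 2: 2}
```

On this sample no request reaches 4 MiB, which matches the slide's point that readahead never ramps up to the 4M RPC size.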

The Root Cause
Even with a large 64MB IO size, all IO serviced from readahead is 1MB in size
[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_32node_4thread_64mIO.cap.gz -Tfields \
    -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
    -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
    -R lustre.niobuf_remote | grep 288 | head -n 20
172.19.62.138 172.19.55.4 2288 0 4194304
172.19.62.138 172.19.55.4 2288 4194304 4194304
172.19.62.138 172.19.55.4 2288 8388608 4194304
172.19.62.138 172.19.55.4 2288 12582912 4194304
172.19.62.138 172.19.55.4 2288 16777216 4194304
172.19.62.138 172.19.55.4 2288 20971520 4194304
172.19.62.138 172.19.55.4 2288 25165824 4194304
172.19.62.138 172.19.55.4 2288 29360128 4194304
172.19.62.138 172.19.55.4 2288 33554432 4194304
172.19.62.138 172.19.55.4 2288 37748736 4194304
172.19.62.138 172.19.55.4 2288 41943040 4194304
172.19.62.138 172.19.55.4 2288 46137344 4194304
172.19.62.138 172.19.55.4 2288 50331648 4194304
172.19.62.138 172.19.55.4 2288 54525952 4194304
172.19.62.138 172.19.55.4 2288 58720256 4194304
172.19.62.138 172.19.55.4 2288 62914560 4194304
172.19.62.138 172.19.55.4 2288 67108864 1048576
172.19.62.138 172.19.55.4 2288 68157440 1048576
172.19.62.138 172.19.55.4 2288 69206016 1048576
172.19.62.138 172.19.55.4 2288 70254592 1048576
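
In the capture above, the first 64 MiB (the application's IO size) arrives as sixteen 4 MiB RPCs, and the requests then drop to 1 MiB. The sketch below locates that transition in a list of (offset, length) pairs. It is only an illustration: the helper name `first_size_change` is hypothetical, and the records are generated to follow the same pattern as the capture rather than parsed from the capture file.

```python
MIB = 1024 * 1024

def first_size_change(records):
    """Return (offset, old_len, new_len) at the first change in RPC length.

    records is an iterable of (offset, length) pairs in capture order.
    Returns None if every RPC has the same length.
    """
    previous = None
    for offset, length in records:
        if previous is not None and length != previous:
            return offset, previous, length
        previous = length
    return None

# Same pattern as the capture: sixteen 4 MiB RPCs, then 1 MiB RPCs.
records = [(i * 4 * MIB, 4 * MIB) for i in range(16)]
records += [(64 * MIB + i * MIB, MIB) for i in range(4)]

offset, old_len, new_len = first_size_change(records)
print(offset // MIB, old_len // MIB, new_len // MIB)  # 64 4 1
```

The change lands exactly at the 64 MiB boundary, which is consistent with the slide's reading that requests past the first application IO are being serviced by readahead at 1 MiB.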

The Source of the Problem
/lustre/llite/rw.c
#define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
And right above that line…
/* RAS_INCREASE_STEP should be (1UL << (inode->i_blkbits - PAGE_CACHE_SHIFT)).
 * Temporarily set RAS_INCREASE_STEP to 1MB. After 4MB RPC is enabled
 * by default, this should be adjusted corresponding with max_read_ahead_mb
 * and max_read_ahead_per_file_mb otherwise the readahead budget can be used
 * up quickly which will affect read performance significantly. See LU-2816 */

The Solution
Set the increase step to the same value as the RPC size
< #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
> #define RAS_INCREASE_STEP(inode) (PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)
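
The effect of the one-line change can be worked through numerically. The sketch below mirrors the macro's arithmetic in Python. The constant values are assumptions for illustration (4 KiB pages, a 1 MiB `ONE_MB_BRW_SIZE`, and a build where `PTLRPC_MAX_BRW_SIZE` is 4 MiB); the real definitions live in the Lustre source and may differ by version and architecture. The helper `increase_step_pages` is hypothetical.

```python
# Assumed values for illustration only; check the Lustre headers for real ones.
PAGE_CACHE_SHIFT = 12            # assumes 4 KiB pages
ONE_MB_BRW_SIZE = 1 << 20        # 1 MiB
PTLRPC_MAX_BRW_SIZE = 4 << 20    # assumes a 4 MiB maximum bulk RPC size

def increase_step_pages(brw_size, page_shift=PAGE_CACHE_SHIFT):
    """Mirror the macro: a byte count shifted down to a page count."""
    return brw_size >> page_shift

old_step = increase_step_pages(ONE_MB_BRW_SIZE)
new_step = increase_step_pages(PTLRPC_MAX_BRW_SIZE)
print(old_step, new_step)  # 256 1024
```

Under these assumptions the readahead increase step grows from 256 pages (1 MiB) to 1024 pages (4 MiB), so one step covers a full 4M RPC.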

Results
INCREASE_STEP  RA File  RA MB  Clients  PPN  IO Size  Read Average
1MB            40       40     32       1    1M       6928.02
4MB            40       40     32       1    1M       8629.80
1MB            160      640    32       1    1M       7137.50
4MB            160      640    32       1    1M       9528.45
IOR command:
IOR -r -v -F -b 131072m -t 1m -i 3 -m -k -D 60
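
The relative gain from the larger increase step can be computed directly from the four Read Average values in the results. This is a small sketch of that arithmetic; the helper name `percent_gain` is hypothetical, and the numbers are copied from the results as reported, without assuming a unit.

```python
results = [
    # (RA File, RA MB, read average with 1MB step, read average with 4MB step)
    (40, 40, 6928.02, 8629.80),
    (160, 640, 7137.50, 9528.45),
]

def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

for ra_file, ra_total, step_1mb, step_4mb in results:
    gain = percent_gain(step_1mb, step_4mb)
    print(f"RA File {ra_file}, RA MB {ra_total}: {gain:.1f}% higher read average")
```

With the smaller readahead settings the gain is about 24.6%, and with the larger settings about 33.5%.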

Conclusion and More Information
Buffered reads can be improved significantly when 4M RPCs are in use (roughly 25% to 33% higher read averages in the IOR results above)
Seagate implemented a parameter to address the issue:
  lctl set_param -n llite.*.read_ahead_step 4
  https://github.com/Xyratex/lustre-stable/commit/2395f8e0e7e963aec43deb07d719e9229884758c
LU-7140 tracks the upstream work:
  https://jira.hpdd.intel.com/browse/LU-7140

Seagate Cloud Systems & Silicon Group
OEM, HPC, Scale-Out Systems
›  2+ million enclosures
›  17+ petabytes shipped
›  Drive variety (HDD, SAS, SATA, SSD, hybrid)
›  Enclosures, controllers
›  Customer-driven partnership
›  Services: logistics, fulfillment, warranty, design, supply chain
›  Purpose-engineered to optimize capacity and performance
›  40% fewer racks required
›  >1TB/sec file system performance
›  Solutions for object storage
›  Reference architectures for open source and software-defined storage
›  Private cloud appliances for backup and recovery
›  Modular, scalable components for DIY customers

Powering the Fastest HPC Sites
Awards
Award-Winning ClusterStor Architecture
